Goal: support s3://bucket/key.jsonl, gs://bucket/key.parquet, https://example.com/data.csv as path values across all file-based sources, transparent to the SQL surface.
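A minimal sketch of how a path value could be classified by scheme before picking a backend. The `Location` enum and `parse_location` function are hypothetical names for illustration, not existing dbfy code, and only the three schemes named in the goal are assumed to be supported.

```rust
/// Where a configured path value points. Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Location {
    /// No "://" in the value: treat it as a local filesystem path.
    Local(String),
    S3 { bucket: String, key: String },
    Gcs { bucket: String, key: String },
    /// Kept as the full URL, since there is no bucket/key split for HTTPS.
    Https(String),
}

/// Classify a path value by its scheme. Schemes are matched case-insensitively.
/// Any scheme other than s3, gs, or https is rejected (an assumption of this sketch).
pub fn parse_location(path: &str) -> Result<Location, String> {
    let Some((scheme, rest)) = path.split_once("://") else {
        return Ok(Location::Local(path.to_string()));
    };
    match scheme.to_ascii_lowercase().as_str() {
        "s3" => split_bucket_key(rest).map(|(bucket, key)| Location::S3 { bucket, key }),
        "gs" => split_bucket_key(rest).map(|(bucket, key)| Location::Gcs { bucket, key }),
        "https" => Ok(Location::Https(path.to_string())),
        other => Err(format!("unsupported scheme: {other}")),
    }
}

/// Split "bucket/key" at the first slash; both parts must be non-empty.
fn split_bucket_key(rest: &str) -> Result<(String, String), String> {
    match rest.split_once('/') {
        Some((bucket, key)) if !bucket.is_empty() && !key.is_empty() => {
            Ok((bucket.to_string(), key.to_string()))
        }
        _ => Err(format!("expected bucket/key, got {rest:?}")),
    }
}
```

Because local paths fall through unchanged, existing YAML configs would keep working as they do today under this scheme.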
Scope
- Parquet remote (~30 min, the easy win): register object_store backends on the SessionContext before calling register_parquet. The existing ListingTable path already supports URL-style paths, so we just need the right object store registered for the scheme. Auth: read from standard env vars (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc).
- rows-file remote (the real refactor, ~1.5 days): the rows-file provider is currently sync, mmap-friendly, with disk-resident indexes. To support remote files we need to:
  - replace std::fs::File::open + seek + read_exact in reader.rs with object_store::ObjectStore::get_range
  - replace std::fs::read in indexer.rs with get (or stream via get + BodyExt)
  - replace inode + mtime in invalidation.rs with ETag / Last-Modified
  - decide where indexes live for remote files: a) next to the data file (write back via put), b) local cache keyed by (url, etag), c) require explicit "index file" path in YAML
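To make option b) and the ETag / Last-Modified invalidation concrete, here is a rough sketch using only the standard library. `RemoteVersion`, `is_stale`, and `index_cache_path` are hypothetical names, the `.idx` extension is an assumption, and FNV-1a stands in for whatever hash the real implementation would choose.

```rust
use std::path::{Path, PathBuf};

/// Version markers for a remote object, as reported by the store.
/// Hypothetical type for this sketch; values are compared as opaque strings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteVersion {
    pub etag: Option<String>,
    pub last_modified: Option<String>,
}

impl RemoteVersion {
    /// `self` is the version an index was built against; `current` is what
    /// the store reports now. Compare ETags when both sides have one,
    /// otherwise fall back to Last-Modified. With nothing to compare,
    /// report stale so the index is rebuilt.
    pub fn is_stale(&self, current: &RemoteVersion) -> bool {
        match (&self.etag, &current.etag) {
            (Some(a), Some(b)) => a != b,
            _ => match (&self.last_modified, &current.last_modified) {
                (Some(a), Some(b)) => a != b,
                _ => true,
            },
        }
    }
}

/// 64-bit FNV-1a. Chosen here because its output is fixed by the algorithm,
/// whereas std's DefaultHasher does not specify one, which matters for
/// file names that persist on disk.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Local index path keyed by (url, etag): a new ETag yields a new file name,
/// so a stale index is never picked up by accident.
pub fn index_cache_path(cache_dir: &Path, url: &str, etag: &str) -> PathBuf {
    let mut key = Vec::with_capacity(url.len() + etag.len() + 1);
    key.extend_from_slice(url.as_bytes());
    key.push(0); // NUL separator between the url and etag parts of the key
    key.extend_from_slice(etag.as_bytes());
    cache_dir.join(format!("{:016x}.idx", fnv1a64(&key)))
}
```

One consequence of keying on the ETag is that old index files accumulate as the remote object changes, so this option would also need some eviction policy, which the sketch leaves out.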
Workaround today
Download the file locally then point dbfy at the local path. Works but loses streaming + costs disk space.
Why it matters
"Read JSONL straight from S3" is the request that keeps showing up. The moment we close this, dbfy goes from "DataFusion-on-local-files" to "DataFusion-on-anything-with-a-URL".
Driven by the SOTA-async plan from 2026-05-02 (after streaming Postgres+LDAP shipped in commit 80ae24d).