Source: remote file backends (S3 / GCS / HTTP) for rows-file + Parquet #14

@frhack

Goal: support s3://bucket/key.jsonl, gs://bucket/key.parquet, https://example.com/data.csv as path values across all file-based sources, transparent to the SQL surface.
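A rough sketch of how a path value could be routed by scheme, std-only Rust. The `Backend` enum and `backend_for` function are hypothetical names for illustration, not existing dbfy code; anything without a recognised scheme is assumed to be a local path.

```rust
/// Which backend a `path` value would be routed to.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Local,
    S3,
    Gcs,
    Http,
}

/// Classify a path by its URL scheme. Paths without a recognised
/// scheme (including plain filesystem paths) are treated as local.
pub fn backend_for(path: &str) -> Backend {
    match path.split_once("://") {
        Some((scheme, _rest)) => match scheme.to_ascii_lowercase().as_str() {
            "s3" => Backend::S3,
            "gs" => Backend::Gcs,
            "http" | "https" => Backend::Http,
            // Unknown schemes fall back to local in this sketch.
            _ => Backend::Local,
        },
        None => Backend::Local,
    }
}
```

Doing the dispatch in one place would keep the SQL surface unchanged: sources keep taking a single `path` string, and only the loader needs to know which store to use.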

Scope

  • Parquet remote (~30 min, the easy win): register object_store backends on the SessionContext before calling register_parquet. The existing ListingTable path already supports URL-style paths; we just need the right object store registered for the scheme. Auth: read from standard env vars (AWS_ACCESS_KEY_ID, GOOGLE_APPLICATION_CREDENTIALS, etc.).
  • rows-file remote (the real refactor, ~1.5 days): the rows-file provider is currently sync, mmap-friendly, with disk-resident indexes. To support remote files we need:
    • replace std::fs::File::open + seek + read_exact in reader.rs with object_store::ObjectStore::get_range
    • replace std::fs::read in indexer.rs with get (or stream the body via GetResult::into_stream)
    • replace inode + mtime in invalidation.rs with ETag / Last-Modified
    • decide where indexes live for remote files:
      • a) next to the data file (write back via put)
      • b) local cache keyed by (url, etag)
      • c) require explicit "index file" path in YAML
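A minimal std-only Rust sketch of two pieces from the list above: the ETag / Last-Modified staleness check that would stand in for inode + mtime, and option (b), a local cache name derived from (url, etag). All names here (`RemoteVersion`, `is_stale`, `index_cache_name`) are hypothetical, and the FNV-1a hash is just one possible choice of stable hash, not a decision.

```rust
/// Version markers reported by a remote store for one object.
/// Hypothetical type; field names are illustrative.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteVersion {
    pub etag: Option<String>,
    pub last_modified: Option<String>,
}

/// Returns true when an index built against `cached` should be rebuilt
/// for an object that now reports `current`.
pub fn is_stale(cached: &RemoteVersion, current: &RemoteVersion) -> bool {
    match (&cached.etag, &current.etag) {
        // Prefer ETag when both sides have one.
        (Some(a), Some(b)) => a != b,
        // Otherwise fall back to Last-Modified. A missing value on
        // either side is treated as stale, to stay on the safe side.
        _ => match (&cached.last_modified, &current.last_modified) {
            (Some(a), Some(b)) => a != b,
            _ => true,
        },
    }
}

/// 64-bit FNV-1a. Chosen here only because its output is stable across
/// runs and Rust releases, which `DefaultHasher` does not guarantee.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Option (b): derive a local cache file name from (url, etag).
pub fn index_cache_name(url: &str, etag: &str) -> String {
    let mut key = Vec::with_capacity(url.len() + 1 + etag.len());
    key.extend_from_slice(url.as_bytes());
    key.push(0); // separator so ("ab", "c") and ("a", "bc") hash differently
    key.extend_from_slice(etag.as_bytes());
    format!("{:016x}.idx", fnv1a(&key))
}
```

With this shape a changed ETag yields a different cache file name, so a stale index is simply never looked up; cleaning up old entries would still need a separate policy.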

Workaround today

Download the file locally then point dbfy at the local path. Works but loses streaming + costs disk space.

Why it matters

"Read JSONL straight from S3" is the request that keeps showing up — the moment we close this, dbfy goes from "Datafusion-on-local-files" to "Datafusion-on-anything-with-an-URL".

Driven by the SOTA-async plan from 2026-05-02 (after streaming Postgres+LDAP shipped in commit 80ae24d).
