I would like to propose splitting the current iceberg library into two lower-level targets:
- iceberg-core
- iceberg-data
Motivation
Today the iceberg library appears to include both:
- metadata / planning / model-layer functionality
  - schema, types, partition spec, sort order
  - table metadata, snapshots, transactions, updates
  - manifests
  - expressions
  - catalog abstractions and in-memory catalog
  - general utilities and file abstraction APIs
- data access / execution-layer functionality
  - data file readers and writers
  - delete file readers and writers
  - delete loading and filtering
  - merge-on-read execution
  - Puffin reader/writer support
These are conceptually different layers, and separating them would make the project structure clearer and easier to evolve.
In particular, the data path is likely to grow independently over time as support for more read/write behaviors, delete handling, Puffin, and execution-oriented features expands. Splitting it out would help reduce the conceptual scope of the core library and make target responsibilities more explicit.
Proposed direction
iceberg-core
This target would contain the metadata/model/planning layer, including things such as:
- schema / type / partition / sort / transform
- table / snapshot / metadata / requirements / updates / transactions
- manifest handling
- expressions
- catalog abstractions and in-memory catalog
- generic utilities
- file format declarations and file I/O abstractions
- possibly the abstract reader/writer interfaces, depending on the final boundary decision
iceberg-data
This target would contain data-file-oriented logic, including:
- data writer and reader logic
- delete file writer and reader logic
- DeleteLoader
- delete filter logic
- merge-on-read reader/execution logic
- Puffin reader and writer support
- supporting delete/data execution structures that are primarily used by these paths
Compatibility
This is a breaking change. We could keep iceberg as an aggregate (umbrella) compatibility target that links both new targets, but that does not seem like a wise decision at the moment.
Why this seems feasible
The repository already has some useful structure that suggests this split is practical:
- the build already distinguishes iceberg, iceberg-bundle, and iceberg-rest
- source layout already separates areas like data/, deletes/, puffin/, manifest/, expression/, update/, etc.
- there are already format-agnostic reader/writer abstractions and factory registration points, which should help define a stable boundary
So this looks less like a brand new architecture and more like making an existing separation more explicit at the build and module level.
Main design question
The main point that likely needs discussion is the precise boundary between planning/core and execution/data.
In particular:
- should the abstract file_reader / file_writer interfaces stay in iceberg-core, with iceberg-data building on top of them?
- or should all reader/writer-related APIs move to iceberg-data?
My initial preference is to keep the abstract, format-agnostic interfaces in iceberg-core, and move the higher-level data, delete, Puffin, and merge-on-read (MOR) logic into iceberg-data. That seems like the cleanest layering, but I would be interested in feedback.
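To make the preferred layering a bit more concrete, here is a rough C++ sketch of what I have in mind. Every name in it (Reader, ReaderRegistry, DeleteFilteringReader, VectorReader, the sketch namespace) is hypothetical and chosen only for illustration; none of it is the existing API, and rows are modelled as plain ints to keep it short. The point is only the dependency direction: the data layer depends on the abstract interface and the registration point, never the other way around.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// ---- Would live in iceberg-core (all names here are hypothetical) ----

// Format-agnostic reader interface. A "row" is just an int in this sketch.
class Reader {
 public:
  virtual ~Reader() = default;
  // Returns the next row, or std::nullopt once the input is exhausted.
  virtual std::optional<int> Next() = 0;
};

using ReaderFactory =
    std::function<std::unique_ptr<Reader>(const std::string& path)>;

// Registration point: format backends register a factory under a format name.
class ReaderRegistry {
 public:
  void Register(const std::string& format, ReaderFactory factory) {
    assert(factory && "factory must be callable");
    factories_[format] = std::move(factory);
  }

  // Returns nullptr when no backend is registered for the format.
  std::unique_ptr<Reader> Open(const std::string& format,
                               const std::string& path) const {
    auto it = factories_.find(format);
    if (it == factories_.end()) return nullptr;
    return it->second(path);
  }

 private:
  std::map<std::string, ReaderFactory> factories_;
};

// ---- Would live in iceberg-data ----

// Wraps any core Reader and skips rows whose 0-based position is deleted.
class DeleteFilteringReader : public Reader {
 public:
  DeleteFilteringReader(std::unique_ptr<Reader> inner,
                        std::set<std::int64_t> deleted_positions)
      : inner_(std::move(inner)), deleted_(std::move(deleted_positions)) {}

  std::optional<int> Next() override {
    while (auto row = inner_->Next()) {
      // position_ counts rows of the underlying file, deleted or not.
      if (deleted_.count(position_++) == 0) return row;
    }
    return std::nullopt;
  }

 private:
  std::unique_ptr<Reader> inner_;
  std::set<std::int64_t> deleted_;
  std::int64_t position_ = 0;
};

// ---- Stand-in for a format backend (in-memory, for the sketch only) ----

class VectorReader : public Reader {
 public:
  explicit VectorReader(std::vector<int> rows) : rows_(std::move(rows)) {}

  std::optional<int> Next() override {
    if (index_ >= rows_.size()) return std::nullopt;
    return rows_[index_++];
  }

 private:
  std::vector<int> rows_;
  std::size_t index_ = 0;
};

}  // namespace sketch
```

With this shape, a format backend only needs to register a factory with the core registry, and iceberg-data can wrap whatever Reader the registry returns without knowing which format produced it. If all reader/writer APIs moved to iceberg-data instead, the registry and the interface would move with them and format backends would have to depend on the data target.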
If this direction sounds reasonable, I’d be happy to work on this.