[DISCUSS] Split the current iceberg library into iceberg-core and iceberg-data #627

@wgtmac

Description

I would like to propose splitting the current iceberg library into two lower-level targets:

  • iceberg-core
  • iceberg-data

Motivation

Today the iceberg library appears to include both:

  1. metadata / planning / model-layer functionality

    • schema, types, partition spec, sort order
    • table metadata, snapshots, transactions, updates
    • manifests
    • expressions
    • catalog abstractions and in-memory catalog
    • general utilities and file abstraction APIs
  2. data access / execution-layer functionality

    • data file readers and writers
    • delete file readers and writers
    • delete loading and filtering
    • merge-on-read execution
    • Puffin reader/writer support

These are conceptually different layers, and separating them would make the project structure clearer and easier to evolve.

In particular, the data path is likely to grow independently over time as support for more read/write behaviors, delete handling, Puffin, and execution-oriented features expands. Splitting it out would help reduce the conceptual scope of the core library and make target responsibilities more explicit.

Proposed direction

iceberg-core

This target would contain the metadata/model/planning layer, including things such as:

  • schema / type / partition / sort / transform
  • table / snapshot / metadata / requirements / updates / transactions
  • manifest handling
  • expressions
  • catalog abstractions and memory catalog
  • generic utilities
  • file format declarations and file I/O abstractions
  • possibly the abstract reader/writer interfaces, depending on the final boundary decision

iceberg-data

This target would contain data-file-oriented logic, including:

  • data writer and reader logic
  • delete file writer and reader logic
  • DeleteLoader
  • delete filter logic
  • merge-on-read reader/execution logic
  • Puffin reader and writer support
  • supporting delete/data execution structures that are primarily used by these paths

Compatibility

This is a breaking change. We could keep iceberg as an aggregate (umbrella) compatibility target that links both new targets, but that does not seem like a wise choice at this point.

Why this seems feasible

The repository already has some useful structure that suggests this split is practical:

  • the build already distinguishes iceberg, iceberg-bundle, and iceberg-rest
  • source layout already separates areas like data/, deletes/, puffin/, manifest/, expression/, update/, etc.
  • there are already format-agnostic reader/writer abstractions and factory registration points, which should help define a stable boundary

So this looks less like a brand-new architecture and more like making an existing separation explicit at the build and module level.
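To illustrate why a factory registration point helps define the boundary, here is a minimal sketch. All names (`sketch::Reader`, `ReaderRegistry`, `ParquetReader`, `FileFormat`) are hypothetical and are not taken from the actual iceberg headers; the assumption is only that a format-agnostic interface plus a registry could live in one target, while concrete readers register themselves from another.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not the actual iceberg API.
namespace sketch {

enum class FileFormat { kParquet, kAvro };

// Format-agnostic reader interface (a candidate for the core side).
class Reader {
 public:
  virtual ~Reader() = default;
  virtual std::string FormatName() const = 0;
};

using ReaderFactory = std::function<std::unique_ptr<Reader>()>;

// Registration point: format-specific code registers a factory,
// and callers look readers up by format without knowing the concrete type.
class ReaderRegistry {
 public:
  void Register(FileFormat format, ReaderFactory factory) {
    factories_[format] = std::move(factory);
  }

  std::unique_ptr<Reader> Open(FileFormat format) const {
    auto it = factories_.find(format);
    if (it == factories_.end()) {
      throw std::runtime_error("no reader registered for this format");
    }
    return it->second();
  }

 private:
  std::map<FileFormat, ReaderFactory> factories_;
};

// Stand-in for a concrete reader that would live outside the core target.
class ParquetReader : public Reader {
 public:
  std::string FormatName() const override { return "parquet"; }
};

}  // namespace sketch
```

With this shape, the code that calls `Open` only ever sees the abstract `Reader` type, which is what would let the registry and the concrete readers sit in different build targets.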

Main design question

The main point that likely needs discussion is the precise boundary between planning/core and execution/data.

In particular:

  • should the abstract file_reader / file_writer interfaces stay in iceberg-core, with iceberg-data building on top of them?
  • or should all reader/writer-related APIs move to iceberg-data?

My initial preference is to keep the abstract, format-agnostic interfaces in iceberg-core, and move the higher-level data/delete/Puffin/MOR logic into iceberg-data. That seems like the cleanest layering, but I would be interested in feedback.
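As a rough sketch of that preferred layering (not the actual code), the example below puts an abstract reader in a namespace standing in for iceberg-core and a delete-filtering reader in a namespace standing in for iceberg-data. Every name here (`core_sketch::FileReader`, `data_sketch::DeleteFilteringReader`, `VectorReader`) is hypothetical, and rows are reduced to row positions to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Stand-in for iceberg-core: only the abstract, format-agnostic interface.
namespace core_sketch {

class FileReader {
 public:
  virtual ~FileReader() = default;
  // Returns the next row position, or std::nullopt at end of file.
  virtual std::optional<int64_t> Next() = 0;
};

}  // namespace core_sketch

// Stand-in for iceberg-data: higher-level logic built on the core interface.
namespace data_sketch {

// Merge-on-read style wrapper: skips positions found in a position-delete set.
class DeleteFilteringReader : public core_sketch::FileReader {
 public:
  DeleteFilteringReader(std::unique_ptr<core_sketch::FileReader> base,
                        std::set<int64_t> deleted)
      : base_(std::move(base)), deleted_(std::move(deleted)) {}

  std::optional<int64_t> Next() override {
    while (auto pos = base_->Next()) {
      if (deleted_.count(*pos) == 0) return pos;  // keep live rows only
    }
    return std::nullopt;
  }

 private:
  std::unique_ptr<core_sketch::FileReader> base_;
  std::set<int64_t> deleted_;
};

// Toy reader used in place of a real format-specific implementation.
class VectorReader : public core_sketch::FileReader {
 public:
  explicit VectorReader(std::vector<int64_t> positions)
      : positions_(std::move(positions)) {}

  std::optional<int64_t> Next() override {
    if (index_ >= positions_.size()) return std::nullopt;
    return positions_[index_++];
  }

 private:
  std::vector<int64_t> positions_;
  std::size_t index_ = 0;
};

}  // namespace data_sketch
```

The point of the sketch is the dependency direction: `data_sketch` includes and extends `core_sketch`, while nothing in `core_sketch` refers back to the data layer. Moving all reader/writer APIs to iceberg-data instead would remove `core_sketch::FileReader` from the core side entirely.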

If this direction sounds reasonable, I’d be happy to work on this.
