Skip to content

daft-engine/daft-lance

Repository files navigation

daft-lance

Lance integration for Daft.

Install

# Install just the daft-lance extension
pip install daft-lance

# Install daft with the daft-lance extension
pip install 'daft[lance]'

Usage

Compaction

from daft_lance import compact_files

compact_files("s3://bucket/my_dataset")

Scalar Indexing

from daft_lance import create_scalar_index

create_scalar_index("s3://bucket/my_dataset", column="name", index_type="INVERTED")

Column Merging

from daft_lance import merge_columns_df

merge_columns_df(df, "s3://bucket/my_dataset")

Migration

The migration only requires replacing daft.io.lance with daft_lance.

# See changes in current directory and all subdirectories
find . -type f -name "*.py" -exec sed 's/daft\.io\.lance/daft_lance/g' {} +

# Apply the changes
find . -type f -name "*.py" -exec sed -i 's/daft\.io\.lance/daft_lance/g' {} +

Blob Support

The daft_lance extension supports Lance BLOB V2 by reading descriptors into the following daft datatype. Note that daft.read_lance will NOT materialize Lance BLOB V2 bytes.

{
  kind: uint8,
  position: uint64,
  size: uint64,
  blob_id: uint32,
  blob_uri: string,
}

To materialize blobs, read the dataset with row IDs enabled and call take_blobs:

import lance
import daft
from daft_lance import take_blobs

ds = lance.dataset("s3://bucket/my_dataset")
df = daft.read_lance(ds.uri, default_scan_options={"with_row_id": True})
df = take_blobs(df, ds, "blob_column")

# each value is a lance.Blob — call .read() to fetch bytes
blobs = df.select("blob_column").to_pydict()["blob_column"]
data = blobs[0].read()

To write binary columns as Lance Blob V2, use the blob_columns opt-in:

import daft

df = daft.from_pydict({"id": [1, 2, 3], "data": [b"...", b"...", b"..."]})
df.write_lance("s3://bucket/my_dataset", blob_columns=["data"]).collect()

Development

Requires uv.

uv sync
uv run pytest tests/ -v

License

Apache-2.0

About

A lightweight Daft extension for reading, writing, compacting, indexing, and managing Lance datasets

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages