From 6e2241bda2a993a144caa95acff4bc4dec96961d Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 11:53:43 -0400 Subject: [PATCH 01/18] docs: publish SKILL.md on the docs site via myst include Adds a new `skill` page that embeds the repo-root `SKILL.md` through the myst `{include}` directive, so the agent-facing guide lives on the published docs site without duplication. The page is wired into the User Guide toctree. Implements PR 4a of the plan in #1394. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/index.rst | 1 + docs/source/skill.md | 22 ++++++++++++++++++++++ 2 files changed, 23 insertions(+) create mode 100644 docs/source/skill.md diff --git a/docs/source/index.rst b/docs/source/index.rst index 134d41cb6..85faaeff8 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -78,6 +78,7 @@ Example user-guide/configuration user-guide/sql user-guide/upgrade-guides + skill .. _toc.contributor_guide: diff --git a/docs/source/skill.md b/docs/source/skill.md new file mode 100644 index 000000000..15891d9de --- /dev/null +++ b/docs/source/skill.md @@ -0,0 +1,22 @@ + + +```{include} ../../SKILL.md +:start-line: 4 +``` From c7cdc63db28e832fcac4f43ffa8498bf9e76d0cf Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 11:54:32 -0400 Subject: [PATCH 02/18] docs: publish llms.txt at docs site root Adds `docs/source/llms.txt` in llmstxt.org schema: a short description plus categorized links to the agent skill, user guide pages, DataFrame API reference, and example queries. `html_extra_path` in `conf.py` copies it verbatim to the published site root so it resolves at `https://datafusion.apache.org/python/llms.txt`. Implements PR 4b of the plan in #1394. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- dev/release/rat_exclude_files.txt | 3 ++- docs/source/conf.py | 4 ++++ docs/source/llms.txt | 35 +++++++++++++++++++++++++++++++ 3 files changed, 41 insertions(+), 1 deletion(-) create mode 100644 docs/source/llms.txt diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt index a7a497dab..50fe94e8d 100644 --- a/dev/release/rat_exclude_files.txt +++ b/dev/release/rat_exclude_files.txt @@ -49,4 +49,5 @@ benchmarks/tpch/create_tables.sql **/.cargo/config.toml uv.lock examples/tpch/answers_sf1/*.tbl -SKILL.md \ No newline at end of file +SKILL.md +docs/source/llms.txt \ No newline at end of file diff --git a/docs/source/conf.py b/docs/source/conf.py index 01813b032..b2e9bb8c3 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -129,6 +129,10 @@ def setup(sphinx) -> None: # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ["_static"] +# Copy agent-facing files (llms.txt) verbatim to the site root so they +# resolve at conventional URLs like `https://.../python/llms.txt`. +html_extra_path = ["llms.txt"] + html_logo = "_static/images/2x_bgwhite_original.png" html_css_files = ["theme_overrides.css"] diff --git a/docs/source/llms.txt b/docs/source/llms.txt new file mode 100644 index 000000000..36555524c --- /dev/null +++ b/docs/source/llms.txt @@ -0,0 +1,35 @@ +# DataFusion in Python + +> Apache DataFusion Python is a Python binding for Apache DataFusion, an in-process, Arrow-native query engine. It exposes a SQL interface and a lazy DataFrame API over PyArrow and any Arrow C Data Interface source. This file points agents and LLM-based tools at the most useful entry points for writing DataFusion Python code. + +## Agent Guide + +- [SKILL.md (agent skill)](https://datafusion.apache.org/python/skill.html): idiomatic DataFrame API patterns, SQL-to-DataFrame mappings, common pitfalls, and the full `functions` catalog. 
Primary source of truth for writing datafusion-python code. + +## User Guide + +- [Introduction](https://datafusion.apache.org/python/user-guide/introduction.html): install, the Pokemon quick start, Jupyter tips. +- [Basics](https://datafusion.apache.org/python/user-guide/basics.html): `SessionContext`, `DataFrame`, and `Expr` at a glance. +- [Data sources](https://datafusion.apache.org/python/user-guide/data-sources.html): Parquet, CSV, JSON, Arrow, Pandas, Polars, and Python objects. +- [DataFrame operations](https://datafusion.apache.org/python/user-guide/dataframe/index.html): the lazy query-building interface. +- [Common operations](https://datafusion.apache.org/python/user-guide/common-operations/index.html): select, filter, join, aggregate, window, expressions, and functions. +- [SQL](https://datafusion.apache.org/python/user-guide/sql.html): running SQL against registered tables. +- [Configuration](https://datafusion.apache.org/python/user-guide/configuration.html): session and runtime options. + +## DataFrame API reference + +- [`datafusion.dataframe.DataFrame`](https://datafusion.apache.org/python/autoapi/datafusion/dataframe/index.html): the lazy DataFrame builder (`select`, `filter`, `aggregate`, `join`, `sort`, `limit`, set operations). +- [`datafusion.expr`](https://datafusion.apache.org/python/autoapi/datafusion/expr/index.html): expression tree nodes (`Expr`, `Window`, `WindowFrame`, `GroupingSet`). +- [`datafusion.functions`](https://datafusion.apache.org/python/autoapi/datafusion/functions/index.html): 290+ scalar, aggregate, and window functions. +- [`datafusion.context.SessionContext`](https://datafusion.apache.org/python/autoapi/datafusion/context/index.html): session entry point, data loading, SQL execution. 
+ +## Examples + +- [TPC-H queries (GitHub)](https://github.com/apache/datafusion-python/tree/main/examples/tpch): canonical translations of TPC-H Q01–Q22 to idiomatic DataFrame code, each with reference SQL embedded in the module docstring. +- [Other examples (GitHub)](https://github.com/apache/datafusion-python/tree/main/examples): UDF/UDAF/UDWF, Substrait, Pandas/Polars interop, S3 reads. + +## Optional + +- [Contributor guide](https://datafusion.apache.org/python/contributor-guide/introduction.html): building from source, extending the Python bindings. +- [Upgrade guides](https://datafusion.apache.org/python/user-guide/upgrade-guides.html): migration notes between releases. +- [Upstream Rust `DataFusion`](https://datafusion.apache.org/): the underlying query engine. From 23b3be79d16ef14ee7f35499806508ebebc7cd35 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 11:55:52 -0400 Subject: [PATCH 03/18] docs: add write-dataframe-code contributor skill Adds `.ai/skills/write-dataframe-code/SKILL.md`, a contributor-facing skill for agents working on this repo. It layers on top of the user-facing repo-root SKILL.md with: - a TPC-H pattern index mapping idiomatic API usages to the query file that demonstrates them, - an ad-hoc plan-comparison workflow for checking DataFrame translations against a reference SQL query via `optimized_logical_plan()`, and - the project-specific docstring and aggregate/window documentation conventions that CLAUDE.md already enforces for contributors. Implements PR 4c of the plan in #1394. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/write-dataframe-code/SKILL.md | 152 +++++++++++++++++++++++ 1 file changed, 152 insertions(+) create mode 100644 .ai/skills/write-dataframe-code/SKILL.md diff --git a/.ai/skills/write-dataframe-code/SKILL.md b/.ai/skills/write-dataframe-code/SKILL.md new file mode 100644 index 000000000..d32ee486b --- /dev/null +++ b/.ai/skills/write-dataframe-code/SKILL.md @@ -0,0 +1,152 @@ + + +--- +name: write-dataframe-code +description: Contributor-facing guidance for writing idiomatic datafusion-python DataFrame code inside the repo — examples, docstrings, tests, and benchmark queries. Use when adding or reviewing Python code in this project that builds DataFrames or expressions. Composes on top of the user-facing guide at the repo-root SKILL.md. +argument-hint: [area] (e.g., "tpch", "docstrings", "plan-comparison") +--- + +# Writing DataFrame Code in datafusion-python + +This skill is for contributors working **on** the datafusion-python project +(examples, tests, docstrings, benchmark queries). The primary reference for +**how** to write DataFrame and expression code — imports, data loading, the +DataFrame API, idiomatic patterns, common pitfalls, and the function +catalog — is the repo-root [`SKILL.md`](../../SKILL.md). Read that first. + +This file layers on contributor-specific extras: + +1. The TPC-H pattern index — which example to use as a template for which API. +2. The plan-comparison workflow — a diagnostic for checking a DataFrame + translation against a reference SQL query. +3. Docstring conventions enforced by this project (already summarized in + `CLAUDE.md`; repeated here so the rule is on-hand while writing examples). + +## TPC-H pattern index + +`examples/tpch/q01..q22*.py` is the largest collection of idiomatic DataFrame +code in the repo. Each query file pairs a DataFrame translation with the +canonical TPC-H reference SQL embedded in the module docstring. 
When adding +a new example or demo, pick the query that already exercises the pattern +rather than re-deriving from scratch. + +| Pattern | Canonical TPC-H example | +|---|---| +| Simple filter + aggregate + sort | `q01_pricing_summary_report.py` | +| Multi-table join with date-range filter | `q03_shipping_priority.py` | +| `DISTINCT` via `.select(...).distinct()` | `q04_order_priority_checking.py` | +| Multi-hop region/nation/customer join | `q05_local_supplier_volume.py` | +| `F.in_list(col, [...])` in place of CASE/array tricks | `q07_volume_shipping.py`, `q12_ship_mode_order_priority.py` | +| Searched `F.when(...).otherwise(...)` against SQL `CASE WHEN` | `q08_market_share.py` | +| Reusing computed expressions as variables | `q09_product_type_profit_measure.py` | +| Window function in place of correlated scalar subquery | `q02_minimum_cost_supplier.py`, `q11_important_stock_identification.py`, `q15_top_supplier.py`, `q17_small_quantity_order.py`, `q22_global_sales_opportunity.py` | +| `F.regexp_like(col, pattern)` for matching | `q16_part_supplier_relationship.py` | +| Compound disjunctive predicate (OR of per-brand conditions) | `q19_discounted_revenue.py` | +| Semi/anti joins for `EXISTS` / `NOT EXISTS` | `q21_suppliers_kept_orders_waiting.py` | +| `F.starts_with(...)` for prefix matching | `q20_potential_part_promotion.py` | + +The queries are correctness-gated against `examples/tpch/answers_sf1/` via +`examples/tpch/_tests.py` at scale factor 1. + +## Plan-comparison diagnostic workflow + +When translating a SQL query to DataFrame form — TPC-H, a benchmark, or an +answer to a user question — the answer-file comparison proves *correctness* +but does not prove the translation is *equivalent at the plan level*. The +optimizer usually smooths over surface differences (filter pushdown, join +reordering, predicate simplification), so two surface-different builders that +resolve to the same optimized plan are effectively identical queries. 
+ +Use this ad-hoc diagnostic when you suspect a DataFrame translation is doing +more work than the SQL form: + +```python +from datafusion import SessionContext + +ctx = SessionContext() +# register the tables the SQL query expects +# ... + +sql_plan = ctx.sql(reference_sql).optimized_logical_plan() +df_plan = dataframe_under_test.optimized_logical_plan() + +if sql_plan == df_plan: + print("Plans match exactly.") +else: + print("=== SQL plan ===") + print(sql_plan.display_indent()) + print("=== DataFrame plan ===") + print(df_plan.display_indent()) +``` + +- `LogicalPlan.__eq__` compares structurally. +- `LogicalPlan.display_indent()` is the readable form for eyeballing diffs. +- `DataFrame.optimized_logical_plan()` is the optimizer output — use it, not + the unoptimized plan, because trivial differences (e.g. column order in a + projection) will otherwise be reported as mismatches. + +This is **a diagnostic, not a gate**. Answer-file comparison is the +correctness gate. A plan-level mismatch does not mean the DataFrame form is +wrong — it means the two forms are not literally the same plan, which is +sometimes fine (e.g. the DataFrame form forces a particular partitioning the +SQL form leaves to the optimizer). + +## Docstring conventions + +Every Python function added or modified in this project must include a +docstring with at least one doctest-verified example. Pre-commit and the +`pytest --doctest-modules` default in `pyproject.toml` will enforce that +examples actually execute. + +Rules (also in `CLAUDE.md`): + +- Examples must run under the doctest harness. The `conftest.py` injects + `dfn` (the `datafusion` module), `col`, `lit`, `F` (functions), `pa` + (pyarrow), and `np` (numpy) so you do not need to import them inside + examples. +- Optional parameters: write a second example that passes the optional + argument **by keyword** (`step=dfn.lit(3)`) so the reader sees which + parameter is being demonstrated. 
+- Reuse input data across examples for the same function so the effect of + each optional argument is visible against a constant baseline. +- Alias functions (one function that just wraps another — for example + `list_sort` forwarding to `array_sort`) only need a one-line description + and a `See Also` reference to the primary function. They do not need their + own example. + +## Aggregate and window function documentation + +When adding or updating an aggregate or window function, update the matching +site page: + +- Aggregate functions → `docs/source/user-guide/common-operations/aggregations.rst` +- Window functions → `docs/source/user-guide/common-operations/windows.rst` + +Add the function to the function list at the bottom of the page and, if the +function exposes a non-obvious option, add a short usage example. + +## Related + +- Repo-root [`SKILL.md`](../../SKILL.md) — primary DataFrame API guide + (users and agents). +- `.ai/skills/check-upstream/` — audit upstream Apache DataFusion features + and flag what the Python bindings do not yet expose. +- `.ai/skills/audit-skill-md/` — audit the repo-root `SKILL.md` against the + current public Python API and flag drift. From 35b7893cb2cc4948e2b5765ada9c0e82a2b6959a Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 11:56:55 -0400 Subject: [PATCH 04/18] docs: add audit-skill-md skill MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds `.ai/skills/audit-skill-md/SKILL.md`, a contributor skill that cross-references the repo-root `SKILL.md` against the current public Python API (functions module, DataFrame, Expr, SessionContext, and package-root re-exports). Reports two classes of drift: - new APIs exposed by the Python surface that are not yet covered in the user-facing guide, and - stale mentions in the guide that no longer exist in the public API. The skill is diff-only — it produces a report the user reviews before any edit to `SKILL.md`. 
Complements `check-upstream/`, which audits in the opposite direction (upstream Rust features not yet exposed). Implements PR 4d of the plan in #1394. Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/audit-skill-md/SKILL.md | 177 +++++++++++++++++++++++++++++ 1 file changed, 177 insertions(+) create mode 100644 .ai/skills/audit-skill-md/SKILL.md diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md new file mode 100644 index 000000000..9d093b6e3 --- /dev/null +++ b/.ai/skills/audit-skill-md/SKILL.md @@ -0,0 +1,177 @@ + + +--- +name: audit-skill-md +description: Cross-reference the repo-root SKILL.md against the current public Python API (DataFrame, Expr, SessionContext, functions module) and report new APIs that need coverage and stale mentions that no longer exist. Use after upstream syncs or any PR that changes the public Python surface. +argument-hint: [area] (e.g., "functions", "dataframe", "expr", "context", "all") +--- + +# Audit SKILL.md Against the Python Public API + +This skill keeps the repo-root `SKILL.md` (the agent-facing DataFrame API +guide) aligned with the actual Python surface exposed by the package. It is +a **diff-only audit** — it does not auto-edit `SKILL.md`. The output is a +report the user reviews and then asks the agent to act on. + +Run this whenever the public Python API changes — most commonly: + +- after an upstream DataFusion sync PR adds new functions or methods, +- after a PR that adds or removes a `DataFrame`, `Expr`, or `SessionContext` + method, +- as a pre-release gate before cutting a new datafusion-python version. + +The companion skill [`check-upstream`](../check-upstream/SKILL.md) reports +upstream APIs that are **not yet** exposed in the Python bindings. This skill +reports APIs that **are** exposed but are missing or misspelled in the +user-facing guide. + +## Areas to Check + +`$ARGUMENTS` selects a subset. If empty or `all`, audit every area. + +### 1. 
Scalar / aggregate / window functions + +**Source of truth:** `python/datafusion/functions.py` — the `__all__` list. +Only symbols in `__all__` are part of the public surface; helpers not listed +there are implementation details. + +**Procedure:** + +1. Load `python/datafusion/functions.py`, extract the `__all__` list. +2. Parse `SKILL.md`, collect every function reference — patterns to look for: + - Inline `F.(...)`, `F.` references. + - Bare backticked names in the "Available Functions (Categorized)" + section (`sum`, `avg`, ...). +3. Cross-reference: + - **In `__all__` but not mentioned in `SKILL.md`** → new API needing + coverage. Flag unless it is an alias documented through a `See Also` + in the primary function's docstring (see "Alias handling" below). + - **Mentioned in `SKILL.md` but not in `__all__`** → stale reference, has + been renamed or removed. + +### 2. `DataFrame` methods + +**Source of truth:** `python/datafusion/dataframe.py` — public methods on the +`DataFrame` class. A method is public if its name does not begin with an +underscore. + +**Procedure:** + +1. Import `DataFrame` and collect `dir(DataFrame)`, filtering to names that + do not start with `_`. +2. Parse `SKILL.md` for method references — patterns: + - `df.(`, `.(`, and backticked bare names in prose. + - The method tables in "Core Abstractions" and the pitfalls/idiomatic + patterns sections. +3. Flag: + - **Public method, no mention in `SKILL.md`** → candidate addition. + Weight the flag by whether the method would change how a user writes a + query (e.g. `with_column`, `join`, `aggregate` are high-value; a new + `explain_analyze_format` is low-value). + - **Mentioned in `SKILL.md`, no longer a public method** → stale. + +### 3. `Expr` methods and attributes + +**Source of truth:** `python/datafusion/expr.py` — the `Expr` class. Also +include `Window`, `WindowFrame`, and `GroupingSet` if they are re-exported +from `datafusion.expr`. + +**Procedure:** same as for `DataFrame`. 
Pay particular attention to operator +dunder methods mentioned in `SKILL.md` — the "Common Pitfalls" section +already covers `&`, `|`, `~`, `==`, the comparison operators, and arithmetic +operators on `Expr`. If a new operator is added (e.g. a new `__matmul__`), +it probably warrants a pitfall or pattern note. + +### 4. `SessionContext` methods + +**Source of truth:** `python/datafusion/context.py` — the `SessionContext` +class. + +**Procedure:** same as for `DataFrame`. The high-value methods in `SKILL.md` +are the data-loading methods (`read_parquet`, `read_csv`, `read_json`, +`from_pydict`, `from_pylist`, `from_pandas`) and the SQL entry points +(`sql`, `register_*`, `table`). New additions in those families are +worth flagging for a sentence in the data-loading section. + +### 5. Re-exports at package root + +**Source of truth:** `python/datafusion/__init__.py` — the top-level +`from ... import ...` statements and `__all__`. A symbol re-exported at the +package root is part of the "import" examples in `SKILL.md` even if it +lives in a submodule. + +**Procedure:** verify every name in the top-level `__all__` resolves. Flag +any new re-export that is not already mentioned in the "Import Conventions" +or "Core Abstractions" section. + +## Alias handling + +Many functions in the `functions` module are aliases — for example +`list_sort` aliases `array_sort`, and `character_length` aliases `length`. +The convention in this project is that alias function docstrings carry only +a one-line description and a `See Also` pointing at the primary function +(see `CLAUDE.md`). Do not flag an alias as missing from `SKILL.md` as long +as its primary function is already covered, unless the alias uses a name +that a user would reasonably reach for first (e.g. SQL-standard names). 
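The `__all__` cross-reference in area 1 can be scripted with nothing but the
standard library. A minimal sketch, with `audit_functions` and the sample
inputs purely illustrative; a real run would feed it the actual `__all__`
from `python/datafusion/functions.py` and the full `SKILL.md` text:

```python
import re

def audit_functions(public_names, skill_text):
    """Return (needs_coverage, stale_mentions) for the functions area.

    Mentions are harvested from the two patterns the guide uses:
    inline ``F.name(...)`` calls and bare backticked names.
    """
    mentioned = set(re.findall(r"\bF\.(\w+)\s*\(", skill_text))
    mentioned |= set(re.findall(r"`(\w+)`", skill_text))
    public = set(public_names)
    # public-but-unmentioned -> needs coverage; mentioned-but-gone -> stale.
    # Backticked words that were never functions surface as false "stale"
    # hits, so a real run should post-filter against prior __all__ lists.
    return sorted(public - mentioned), sorted(mentioned - public)

# Toy inputs standing in for functions.__all__ and the guide text.
guide = "Aggregate with `sum` or F.avg(col); `old_fn` was removed upstream."
needs_coverage, stale = audit_functions(["sum", "avg", "array_sort"], guide)
print(needs_coverage)  # ['array_sort']
print(stale)           # ['old_fn']
```

The same shape extends to areas 2-5: swap the mention regexes for
`df.name(` / `.name(` patterns, and swap the public list for the
underscore-filtered `dir(...)` of the class under audit.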
+ +## Output Format + +Produce a report of this shape: + +``` +## SKILL.md Audit Report + +### Summary +- Functions checked: N +- DataFrame methods checked: N +- Expr members checked: N +- SessionContext methods checked: N +- Package-root re-exports checked: N + +### New APIs needing coverage in SKILL.md +- `functions.new_fn` — brief description. Suggested section: "String". +- `DataFrame.with_catalog` — brief description. Suggested section: "Core Abstractions". + +### Stale mentions in SKILL.md +- `functions.old_fn` — referenced in "Available Functions" but no longer in `__all__`. Likely renamed to `new_fn` in . +- `DataFrame.show_limit` — referenced in a pitfall; method removed in favor of `DataFrame.show(num=...)`. + +### Informational +- Alias `list_sort` covered transitively via `array_sort` — no action needed. +``` + +If every area is clean, state that explicitly ("All audited areas are in +sync. No action required."). An audit report that elides the summary line +is harder to scan in a release checklist. + +## When to edit SKILL.md + +This skill does not auto-edit. After reporting, wait for the user to +confirm which gaps are worth filling. New APIs often need a natural home +chosen by a human — the categorized function list and the pitfalls section +both have opinionated structure that an automated edit will not respect. + +## Related + +- Repo-root [`SKILL.md`](../../SKILL.md) — the file this skill audits. +- `.ai/skills/check-upstream/` — the complementary audit against upstream + Rust APIs not yet exposed in Python. +- `.ai/skills/write-dataframe-code/` — how to write idiomatic DataFrame + code in this repo. 
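Once the per-area diffs are collected, assembling the report in the shape
the skill's "Output Format" section specifies is mechanical. A minimal
sketch: the helper name, the counts, and the sample findings below are all
illustrative, not output from a real audit run:

```python
def render_report(summary, needs_coverage, stale):
    """Assemble the markdown audit report from collected findings."""
    lines = ["## SKILL.md Audit Report", "", "### Summary"]
    lines += [f"- {label}: {count}" for label, count in summary.items()]
    lines += ["", "### New APIs needing coverage in SKILL.md"]
    # Fall back to an explicit "- none" when an area comes back clean.
    lines += [f"- `{name}` — {note}" for name, note in needs_coverage] or ["- none"]
    lines += ["", "### Stale mentions in SKILL.md"]
    lines += [f"- `{name}` — {note}" for name, note in stale] or ["- none"]
    return "\n".join(lines)

# Illustrative counts and findings only.
report = render_report(
    {"Functions checked": 290, "DataFrame methods checked": 50},
    [("functions.new_fn", 'Suggested section: "String".')],
    [],
)
print(report)
```

Keeping the renderer dumb (it only formats what it is handed) preserves the
skill's diff-only contract: judgment about which gaps matter stays with the
human reading the report.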
From a3f19a9960b2281c9677c6be2a1b6763bc90e414 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 11:59:41 -0400 Subject: [PATCH 05/18] docs: enrich RST pages with demos relocated from TPC-H rewrite Moves the illustrative patterns that #1504 removed from the TPC-H examples into the common-operations docs, where they serve as pattern-focused teaching material without cluttering the TPC-H translations: - expressions.rst gains a "Testing membership in a list" section comparing `|`-compound filters, `in_list`, and `array_position` + `make_array`, plus a "Conditional expressions" section contrasting switched and searched `case`. - udf-and-udfa.rst gains a "When not to use a UDF" subsection showing the compound-OR predicate that replaces a Python-side UDF for disjunctive bucket filters (the Q19 case). - aggregations.rst gains a "Building per-group arrays" subsection covering `array_agg(filter=..., distinct=True)` with `array_length`/`array_element` for the single-value-per-group pattern (the Q21 case). - Adds `examples/array-operations.py`, a runnable end-to-end walkthrough of the membership and array_agg patterns. Implements PR 4e of the plan in #1394. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../common-operations/aggregations.rst | 56 ++++++++++ .../common-operations/expressions.rst | 92 ++++++++++++++++ .../common-operations/udf-and-udfa.rst | 61 ++++++++++ examples/README.md | 1 + examples/array-operations.py | 104 ++++++++++++++++++ 5 files changed, 314 insertions(+) create mode 100644 examples/array-operations.py diff --git a/docs/source/user-guide/common-operations/aggregations.rst b/docs/source/user-guide/common-operations/aggregations.rst index de24a2ba5..b713bc038 100644 --- a/docs/source/user-guide/common-operations/aggregations.rst +++ b/docs/source/user-guide/common-operations/aggregations.rst @@ -163,6 +163,62 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")]) +Building per-group arrays +^^^^^^^^^^^^^^^^^^^^^^^^^ + +:py:func:`~datafusion.functions.array_agg` collects the values within each +group into a list. Combined with ``distinct=True`` and the ``filter`` +argument, it lets you ask two questions of the same group in one pass — +"what are all the values?" and "what are the values that satisfy some +condition?". + +Suppose each row records a line item with the supplier that fulfilled it and +a flag for whether that supplier met the commit date. We want to identify +orders where exactly one supplier failed, among two or more suppliers in +total: + +.. 
ipython:: python + + from datafusion import SessionContext, col, lit, functions as f + + ctx = SessionContext() + df = ctx.from_pydict( + { + "order_id": [1, 1, 1, 2, 2, 3], + "supplier_id": [100, 101, 102, 200, 201, 300], + "failed": [False, True, False, False, False, True], + }, + ) + + grouped = df.aggregate( + [col("order_id")], + [ + f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"), + f.array_agg( + col("supplier_id"), + filter=col("failed"), + distinct=True, + ).alias("failed_suppliers"), + ], + ) + + grouped.filter( + (f.array_length(col("failed_suppliers")) == lit(1)) + & (f.array_length(col("all_suppliers")) > lit(1)) + ).select( + col("order_id"), + f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"), + ) + +Two aspects of the pattern are worth calling out: + +- ``filter=`` on an aggregate narrows the rows contributing to *that* + aggregate only. Filtering the DataFrame before the aggregate would have + dropped whole groups that no longer had any rows. +- :py:func:`~datafusion.functions.array_length` tests group size without + another aggregate pass, and :py:func:`~datafusion.functions.array_element` + extracts a single value when you have proven the array has length one. + Grouping Sets ------------- diff --git a/docs/source/user-guide/common-operations/expressions.rst b/docs/source/user-guide/common-operations/expressions.rst index 7848b4ee7..aeb6e2ed1 100644 --- a/docs/source/user-guide/common-operations/expressions.rst +++ b/docs/source/user-guide/common-operations/expressions.rst @@ -146,6 +146,98 @@ This function returns a new array with the elements repeated. In this example, the `repeated_array` column will contain `[[1, 2, 3], [1, 2, 3]]`. +Testing membership in a list +---------------------------- + +A common need is filtering rows where a column equals *any* of a small set of +values. DataFusion offers three forms; they differ in readability and in how +they scale: + +1. 
A compound boolean using ``|`` across explicit equalities. +2. :py:func:`~datafusion.functions.in_list`, which accepts a list of + expressions and tests equality against all of them in one call. +3. A trick with :py:func:`~datafusion.functions.array_position` and + :py:func:`~datafusion.functions.make_array`, which returns the 1-based + index of the value in a constructed array, or null if it is not present. + +.. ipython:: python + + from datafusion import SessionContext, col, lit + from datafusion import functions as f + + ctx = SessionContext() + df = ctx.from_pydict({"shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"]}) + + # Option 1: compound boolean. Fine for two values; awkward past three. + df.filter((col("shipmode") == lit("MAIL")) | (col("shipmode") == lit("SHIP"))) + + # Option 2: in_list. Preferred for readability as the set grows. + df.filter(f.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")])) + + # Option 3: array_position / make_array. Useful when you already have the + # set as an array column and want "is in that array" semantics. + df.filter( + ~f.array_position( + f.make_array(lit("MAIL"), lit("SHIP")), col("shipmode") + ).is_null() + ) + +Use ``in_list`` as the default. It is explicit, readable, and matches the +semantics users expect from SQL's ``IN (...)``. Reach for the +``array_position`` form only when the membership set is itself an array +column rather than a literal list. + +Conditional expressions +----------------------- + +DataFusion provides :py:func:`~datafusion.functions.case` for the SQL +``CASE`` expression in both its switched and searched forms, along with +:py:func:`~datafusion.functions.when` as a standalone builder for the +searched form. + +**Switched CASE** (one expression compared against several literal values): + +.. 
ipython:: python + + df = ctx.from_pydict( + {"priority": ["1-URGENT", "2-HIGH", "3-MEDIUM", "5-LOW"]}, + ) + + df.select( + col("priority"), + f.case(col("priority")) + .when(lit("1-URGENT"), lit(1)) + .when(lit("2-HIGH"), lit(1)) + .otherwise(lit(0)) + .alias("is_high_priority"), + ) + +**Searched CASE** (an independent boolean predicate per branch). Use this +form whenever a branch tests more than simple equality — for example, +checking whether a joined column is ``NULL`` to gate a computed value: + +.. ipython:: python + + df = ctx.from_pydict( + {"volume": [10.0, 20.0, 30.0], "supplier_id": [1, None, 2]}, + ) + + df.select( + col("volume"), + col("supplier_id"), + f.when(col("supplier_id").is_not_null(), col("volume")) + .otherwise(lit(0.0)) + .alias("attributed_volume"), + ) + +This searched-CASE pattern is idiomatic for "attribute the measure to the +matching side of a left join, otherwise contribute zero" — a shape that +appears in TPC-H Q08 and similar market-share calculations. + +If a switched CASE has only two or three branches that test equality, an +``in_list`` filter combined with :py:meth:`~datafusion.expr.Expr.otherwise` +is often simpler than the full ``case`` builder. + Structs ------- diff --git a/docs/source/user-guide/common-operations/udf-and-udfa.rst b/docs/source/user-guide/common-operations/udf-and-udfa.rst index f669721a3..a84a8b646 100644 --- a/docs/source/user-guide/common-operations/udf-and-udfa.rst +++ b/docs/source/user-guide/common-operations/udf-and-udfa.rst @@ -101,6 +101,67 @@ write Rust based UDFs and to expose them to Python. There is an example in the `DataFusion blog `_ describing how to do this. +When not to use a UDF +^^^^^^^^^^^^^^^^^^^^^ + +A UDF is the right tool when the computation genuinely cannot be expressed +with built-in functions. It is often the *wrong* tool for a compound +predicate that happens to be easier to write in Python. 
The optimizer +cannot push a UDF through joins or filters, so a Python-side predicate +prevents otherwise obvious rewrites and forces a per-row Python callback. + +Consider a filter that selects rows falling into one of three brand-specific +buckets, each with its own containers, quantity range, and size range: + +.. code-block:: python + + # Anti-pattern: the predicate is a plain disjunction, but hidden inside a UDF. + def is_of_interest(brand, container, quantity, size): + result = [] + for b, c, q, s in zip(brand, container, quantity, size): + b = b.as_py() + if b == "Brand#12": + result.append(c.as_py() in ("SM CASE", "SM BOX") and 1 <= q.as_py() <= 11 and 1 <= s.as_py() <= 5) + elif b == "Brand#23": + result.append(c.as_py() in ("MED BAG", "MED BOX") and 10 <= q.as_py() <= 20 and 1 <= s.as_py() <= 10) + else: + result.append(False) + return pa.array(result) + + df = df.filter(udf_is_of_interest(col("brand"), col("container"), col("quantity"), col("size"))) + +The native equivalent keeps the bucket definitions as plain Python data +(a dict) and builds an ``Expr`` from them. The optimizer sees a disjunction +of simple predicates it can analyze and push down: + +.. 
code-block:: python + + from functools import reduce + from operator import or_ + from datafusion import col, lit, functions as f + + items_of_interest = { + "Brand#12": {"containers": ["SM CASE", "SM BOX"], "min_qty": 1, "max_size": 5}, + "Brand#23": {"containers": ["MED BAG", "MED BOX"], "min_qty": 10, "max_size": 10}, + } + + def brand_clause(brand, spec): + return ( + (col("brand") == lit(brand)) + & f.in_list(col("container"), [lit(c) for c in spec["containers"]]) + & (col("quantity") >= lit(spec["min_qty"])) + & (col("quantity") <= lit(spec["min_qty"] + 10)) + & (col("size") >= lit(1)) + & (col("size") <= lit(spec["max_size"])) + ) + + predicate = reduce(or_, (brand_clause(b, s) for b, s in items_of_interest.items())) + df = df.filter(predicate) + +Reach for a UDF when the per-row computation is not expressible as a tree +of built-in functions. When it *is* expressible, build the ``Expr`` tree +directly. + Aggregate Functions ------------------- diff --git a/examples/README.md b/examples/README.md index 0ef194afe..3024c782f 100644 --- a/examples/README.md +++ b/examples/README.md @@ -37,6 +37,7 @@ Here is a direct link to the file used in the examples: - [Query a Parquet file using the DataFrame API](./dataframe-parquet.py) - [Run a SQL query and store the results in a Pandas DataFrame](./sql-to-pandas.py) - [Query PyArrow Data](./query-pyarrow-data.py) +- [Array operations: membership tests, array_agg patterns, array inspection](./array-operations.py) ### Running User-Defined Python Code diff --git a/examples/array-operations.py b/examples/array-operations.py new file mode 100644 index 000000000..884f93974 --- /dev/null +++ b/examples/array-operations.py @@ -0,0 +1,104 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +"""Array operations in DataFusion Python. + +Runnable reference for the idiomatic array-building and array-inspection +patterns. No external data is required -- the example constructs all inputs +through ``from_pydict``. + +Topics covered: + +- ``F.make_array`` to build a literal array expression. +- ``F.array_position`` and ``F.in_list`` for membership tests. +- ``F.array_length`` and ``F.array_element`` for inspecting an aggregated + array. +- ``F.array_agg(distinct=True, filter=...)`` for building two related arrays + per group in one pass, and filtering groups by array size afterwards. + +Run with:: + + python examples/array-operations.py +""" + +from datafusion import SessionContext, col, lit +from datafusion import functions as F + +ctx = SessionContext() + + +# --------------------------------------------------------------------------- +# 1. Membership tests: in_list vs. array_position / make_array +# --------------------------------------------------------------------------- + +shipments = ctx.from_pydict( + { + "order_id": [1, 2, 3, 4, 5], + "shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"], + } +) + +print("\n== in_list: is shipmode one of {MAIL, SHIP}? 
==") +shipments.filter(F.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")])).show() + +print("\n== array_position / make_array: same question via a literal array ==") +shipments.filter( + ~F.array_position(F.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")).is_null() +).show() + + +# --------------------------------------------------------------------------- +# 2. array_agg with filter to inspect groups of two related arrays +# --------------------------------------------------------------------------- +# +# Input represents line items per order, each fulfilled by one supplier. The +# `failed` column marks whether the supplier met the commit date. We want to +# find orders with multiple suppliers where exactly one of them failed, and +# report that single failing supplier. + +line_items = ctx.from_pydict( + { + "order_id": [1, 1, 1, 2, 2, 3, 3, 3, 3], + "supplier_id": [100, 101, 102, 200, 201, 300, 301, 302, 303], + "failed": [False, True, False, False, False, True, False, False, False], + } +) + +grouped = line_items.aggregate( + [col("order_id")], + [ + F.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"), + F.array_agg( + col("supplier_id"), + filter=col("failed"), + distinct=True, + ).alias("failed_suppliers"), + ], +) + +print("\n== per-order supplier arrays ==") +grouped.sort(col("order_id").sort()).show() + +print("\n== orders with >1 supplier and exactly one failure ==") +singled_out = grouped.filter( + (F.array_length(col("failed_suppliers")) == lit(1)) + & (F.array_length(col("all_suppliers")) > lit(1)) +).select( + col("order_id"), + F.array_element(col("failed_suppliers"), lit(1)).alias("bad_supplier"), +) +singled_out.sort(col("order_id").sort()).show() From e4614993d4310bdd396c7703ddf9c7177a3f5e3c Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:00:55 -0400 Subject: [PATCH 06/18] docs: wire new contributor skills and plan-comparison diagnostic into AGENTS.md - List the three contributor skills 
(`check-upstream`, `write-dataframe-code`, `audit-skill-md`) under the Skills section so agents know what tools they have before starting work. - Document the plan-comparison diagnostic workflow (comparing `ctx.sql(...).optimized_logical_plan()` against a DataFrame's `optimized_logical_plan()` via `LogicalPlan.__eq__`) for translating SQL queries to DataFrame form. Points at the full write-up in the `write-dataframe-code` skill rather than duplicating it. `CLAUDE.md` is a symlink to `AGENTS.md`, so the change lands in both. Implements PR 4f of the plan in #1394. Co-Authored-By: Claude Opus 4.7 (1M context) --- AGENTS.md | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index 7d3262710..1790cf021 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -33,6 +33,35 @@ Skills follow the [Agent Skills](https://agentskills.io) open standard. Each ski - `SKILL.md` — The skill definition with YAML frontmatter (name, description, argument-hint) and detailed instructions. - Additional supporting files as needed. +Currently available skills: + +- [`check-upstream`](.ai/skills/check-upstream/SKILL.md) — audit upstream + Apache DataFusion features (functions, DataFrame ops, SessionContext + methods, FFI types) not yet exposed in the Python bindings. +- [`write-dataframe-code`](.ai/skills/write-dataframe-code/SKILL.md) — + contributor-facing guide for writing idiomatic DataFrame code inside this + repo (TPC-H pattern index, plan-comparison diagnostic, docstring + conventions). Layers on top of the user-facing [`SKILL.md`](SKILL.md). +- [`audit-skill-md`](.ai/skills/audit-skill-md/SKILL.md) — cross-reference + the repo-root `SKILL.md` against the current public Python API and report + new APIs needing coverage and stale mentions. Run after upstream syncs. 
+ +## Plan-comparison diagnostic + +When translating a SQL query to a DataFrame — TPC-H, a benchmark, or an +answer to a user question — correctness is gated by the answer-file +comparison in `examples/tpch/_tests.py`, but plan-level equivalence is a +separate question. Two surface-different DataFrame forms that resolve to +the same optimized logical plan are effectively the same query. + +As an ad-hoc check, compare `ctx.sql(reference_sql).optimized_logical_plan()` +against the DataFrame's `optimized_logical_plan()`. Use `LogicalPlan.__eq__` +for structural equality and `LogicalPlan.display_indent()` for readable +diffs. This is a diagnostic, not a gate — a mismatch does not mean the +DataFrame form is wrong, only that the two forms are not literally the same +plan. The [`write-dataframe-code`](.ai/skills/write-dataframe-code/SKILL.md) +skill has the full workflow. + ## Pull Requests Every pull request must follow the template in From 6336e0084aa4571777cc01e056ace35e03c24765 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:02:52 -0400 Subject: [PATCH 07/18] docs: rename aggregations.rst demo df to orders_df to avoid clobbering state The "Building per-group arrays" block added in the previous commit reassigned `df` and `ctx` mid-page, which then broke the Grouping Sets examples further down that share the Pokemon `df` binding (`col_type_1` etc. no longer resolved). Rename the demo DataFrame to `orders_df` and drop the redundant `ctx = SessionContext()` so the shared state from the top of the page stays intact. Verified with `sphinx-build -W --keep-going` against the full docs tree. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/user-guide/common-operations/aggregations.rst | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/docs/source/user-guide/common-operations/aggregations.rst b/docs/source/user-guide/common-operations/aggregations.rst index b713bc038..a902fab5c 100644 --- a/docs/source/user-guide/common-operations/aggregations.rst +++ b/docs/source/user-guide/common-operations/aggregations.rst @@ -179,10 +179,7 @@ total: .. ipython:: python - from datafusion import SessionContext, col, lit, functions as f - - ctx = SessionContext() - df = ctx.from_pydict( + orders_df = ctx.from_pydict( { "order_id": [1, 1, 1, 2, 2, 3], "supplier_id": [100, 101, 102, 200, 201, 300], @@ -190,7 +187,7 @@ total: }, ) - grouped = df.aggregate( + grouped = orders_df.aggregate( [col("order_id")], [ f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"), From dbd83cf8d6270d954b8004d0dca78d09a09f18ee Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:12:07 -0400 Subject: [PATCH 08/18] docs: replace raw SKILL.md include with a human-written AI-assistants page The previous approach embedded the repo-root `SKILL.md` on the docs site via a myst `{include}`. That file is written for agents -- dense, skill-formatted, and not suited to a human browsing the User Guide. It also relied on a fragile `:start-line:` offset to strip YAML frontmatter. Replace it with `docs/source/ai-coding-assistants.md`, a short human-readable page that mirrors the README section added in #1503: what the skill is, how to install it via `npx skills` or a manual pointer, and what kinds of things it covers. `SKILL.md` stays at the repo root as the single source of truth; agents fetch the raw GitHub URL directly. `llms.txt` is updated to point its Agent Guide entry at `raw.githubusercontent.com/.../SKILL.md` and to include the new human-readable page as a secondary link. 
The User Guide toctree now references `ai-coding-assistants` in place of the removed `skill` stub. Verified with `sphinx-build -W --keep-going` against the full docs tree. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/ai-coding-assistants.md | 82 +++++++++++++++++++++++++++++ docs/source/index.rst | 2 +- docs/source/llms.txt | 3 +- docs/source/skill.md | 22 -------- 4 files changed, 85 insertions(+), 24 deletions(-) create mode 100644 docs/source/ai-coding-assistants.md delete mode 100644 docs/source/skill.md diff --git a/docs/source/ai-coding-assistants.md b/docs/source/ai-coding-assistants.md new file mode 100644 index 000000000..501d05fbb --- /dev/null +++ b/docs/source/ai-coding-assistants.md @@ -0,0 +1,82 @@ + + +# Using DataFusion with AI Coding Assistants + +If you write DataFusion Python code with an AI coding assistant — Claude +Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, or any other +agent that supports skill discovery — this project ships machine-readable +guidance so the assistant produces idiomatic code rather than guessing from +its training data. + +## What is published + +- **[`SKILL.md`](https://github.com/apache/datafusion-python/blob/main/SKILL.md)** — + a dense, skill-oriented reference covering imports, data loading, + DataFrame operations, expression building, SQL-to-DataFrame mappings, + idiomatic patterns, and common pitfalls. Follows the + [Agent Skills](https://agentskills.io) open standard. +- **[`llms.txt`](llms.txt)** — an entry point for LLM-based tools + following the [llmstxt.org](https://llmstxt.org) convention. Categorized + links to the skill, user guide, API reference, and examples. + +Both files live at stable URLs so an agent can discover them without a +checkout of the repo. 
+ +## Installing the skill + +**Preferred:** run + +```shell +npx skills add apache/datafusion-python +``` + +This installs the skill in any supported agent on your machine (Claude +Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, and others). +The command writes the pointer into the agent's configuration so that any +project you open that uses DataFusion Python picks up the skill +automatically. + +**Manual:** if you are not using the `skills` registry, paste this single +line into your project's `AGENTS.md` or `CLAUDE.md`: + +``` +For DataFusion Python code, see https://github.com/apache/datafusion-python/blob/main/SKILL.md +``` + +Most assistants resolve that pointer the first time they see a +DataFusion-related prompt in the project. + +## What the skill covers + +Writing DataFusion Python code has a handful of conventions that are easy +for a model to miss — bitwise `&` / `|` / `~` instead of Python +`and` / `or` / `not`, the lazy-DataFrame immutability model, how window +functions replace SQL correlated subqueries, the `case` / `when` builder +syntax, and the `in_list` / `array_position` options for membership +tests. The skill enumerates each of these with short, copyable examples. + +It is *not* a replacement for this user guide. Think of it as a distilled +reference the assistant keeps open while it writes code for you. + +## If you are an agent author + +The skill file and `llms.txt` are the two supported integration points. +Both are versioned along with the release and follow open standards — no +project-specific handshake is required. diff --git a/docs/source/index.rst b/docs/source/index.rst index 85faaeff8..0e2b065c1 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -78,7 +78,7 @@ Example user-guide/configuration user-guide/sql user-guide/upgrade-guides - skill + ai-coding-assistants .. 
_toc.contributor_guide: diff --git a/docs/source/llms.txt b/docs/source/llms.txt index 36555524c..4d6680426 100644 --- a/docs/source/llms.txt +++ b/docs/source/llms.txt @@ -4,7 +4,8 @@ ## Agent Guide -- [SKILL.md (agent skill)](https://datafusion.apache.org/python/skill.html): idiomatic DataFrame API patterns, SQL-to-DataFrame mappings, common pitfalls, and the full `functions` catalog. Primary source of truth for writing datafusion-python code. +- [SKILL.md (agent skill, raw)](https://raw.githubusercontent.com/apache/datafusion-python/main/SKILL.md): idiomatic DataFrame API patterns, SQL-to-DataFrame mappings, common pitfalls, and the full `functions` catalog. Primary source of truth for writing datafusion-python code. +- [Using DataFusion with AI coding assistants](https://datafusion.apache.org/python/ai-coding-assistants.html): human-readable guide for installing the skill and manual setup pointers. ## User Guide diff --git a/docs/source/skill.md b/docs/source/skill.md deleted file mode 100644 index 15891d9de..000000000 --- a/docs/source/skill.md +++ /dev/null @@ -1,22 +0,0 @@ - - -```{include} ../../SKILL.md -:start-line: 4 -``` From 5edc8e90ce2f73c67b51a2a19da20b875bd7a6fe Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:24:27 -0400 Subject: [PATCH 09/18] docs: drop redundant assistants list in ai-coding-assistants intro The introduction and the "Installing the skill" section both enumerated the same set of supported assistants. Drop the intro copy; the list that matters is next to `npx skills add`, where it answers "what does this command actually configure?" 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/ai-coding-assistants.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/docs/source/ai-coding-assistants.md b/docs/source/ai-coding-assistants.md index 501d05fbb..6554c5be3 100644 --- a/docs/source/ai-coding-assistants.md +++ b/docs/source/ai-coding-assistants.md @@ -19,11 +19,9 @@ # Using DataFusion with AI Coding Assistants -If you write DataFusion Python code with an AI coding assistant — Claude -Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, or any other -agent that supports skill discovery — this project ships machine-readable -guidance so the assistant produces idiomatic code rather than guessing from -its training data. +If you write DataFusion Python code with an AI coding assistant, this +project ships machine-readable guidance so the assistant produces +idiomatic code rather than guessing from its training data. ## What is published From a892c02d763fa9b3a6d21c1b5c4ad84f8a551d20 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:28:06 -0400 Subject: [PATCH 10/18] docs: convert ai-coding-assistants page from markdown to rst, shorten title Every other page in `docs/source/user-guide` and the top-level `docs/source` is written in reStructuredText; the lone `.md` page was an inconsistency. Rewrite in rst so the ASF header matches the rest of the tree, cross-references can use `:py:func:` roles if we ever add any, and myst is no longer required just to render this one page. Also shorten the page title from "Using DataFusion with AI Coding Assistants" to "Using AI Coding Assistants" -- it already sits under the DataFusion user guide so the product name is redundant. Verified with `sphinx-build -W --keep-going`. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/ai-coding-assistants.md | 80 --------------------------- docs/source/ai-coding-assistants.rst | 82 ++++++++++++++++++++++++++++ 2 files changed, 82 insertions(+), 80 deletions(-) delete mode 100644 docs/source/ai-coding-assistants.md create mode 100644 docs/source/ai-coding-assistants.rst diff --git a/docs/source/ai-coding-assistants.md b/docs/source/ai-coding-assistants.md deleted file mode 100644 index 6554c5be3..000000000 --- a/docs/source/ai-coding-assistants.md +++ /dev/null @@ -1,80 +0,0 @@ - - -# Using DataFusion with AI Coding Assistants - -If you write DataFusion Python code with an AI coding assistant, this -project ships machine-readable guidance so the assistant produces -idiomatic code rather than guessing from its training data. - -## What is published - -- **[`SKILL.md`](https://github.com/apache/datafusion-python/blob/main/SKILL.md)** — - a dense, skill-oriented reference covering imports, data loading, - DataFrame operations, expression building, SQL-to-DataFrame mappings, - idiomatic patterns, and common pitfalls. Follows the - [Agent Skills](https://agentskills.io) open standard. -- **[`llms.txt`](llms.txt)** — an entry point for LLM-based tools - following the [llmstxt.org](https://llmstxt.org) convention. Categorized - links to the skill, user guide, API reference, and examples. - -Both files live at stable URLs so an agent can discover them without a -checkout of the repo. - -## Installing the skill - -**Preferred:** run - -```shell -npx skills add apache/datafusion-python -``` - -This installs the skill in any supported agent on your machine (Claude -Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, and others). -The command writes the pointer into the agent's configuration so that any -project you open that uses DataFusion Python picks up the skill -automatically. 
- -**Manual:** if you are not using the `skills` registry, paste this single -line into your project's `AGENTS.md` or `CLAUDE.md`: - -``` -For DataFusion Python code, see https://github.com/apache/datafusion-python/blob/main/SKILL.md -``` - -Most assistants resolve that pointer the first time they see a -DataFusion-related prompt in the project. - -## What the skill covers - -Writing DataFusion Python code has a handful of conventions that are easy -for a model to miss — bitwise `&` / `|` / `~` instead of Python -`and` / `or` / `not`, the lazy-DataFrame immutability model, how window -functions replace SQL correlated subqueries, the `case` / `when` builder -syntax, and the `in_list` / `array_position` options for membership -tests. The skill enumerates each of these with short, copyable examples. - -It is *not* a replacement for this user guide. Think of it as a distilled -reference the assistant keeps open while it writes code for you. - -## If you are an agent author - -The skill file and `llms.txt` are the two supported integration points. -Both are versioned along with the release and follow open standards — no -project-specific handshake is required. diff --git a/docs/source/ai-coding-assistants.rst b/docs/source/ai-coding-assistants.rst new file mode 100644 index 000000000..7c12cb43b --- /dev/null +++ b/docs/source/ai-coding-assistants.rst @@ -0,0 +1,82 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. 
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Using AI Coding Assistants
+==========================
+
+If you write DataFusion Python code with an AI coding assistant, this
+project ships machine-readable guidance so the assistant produces
+idiomatic code rather than guessing from its training data.
+
+What is published
+-----------------
+
+- `SKILL.md <https://github.com/apache/datafusion-python/blob/main/SKILL.md>`_ —
+  a dense, skill-oriented reference covering imports, data loading,
+  DataFrame operations, expression building, SQL-to-DataFrame mappings,
+  idiomatic patterns, and common pitfalls. Follows the
+  `Agent Skills <https://agentskills.io>`_ open standard.
+- `llms.txt <llms.txt>`_ — an entry point for LLM-based tools following the
+  `llmstxt.org <https://llmstxt.org>`_ convention. Categorized links to the
+  skill, user guide, API reference, and examples.
+
+Both files live at stable URLs so an agent can discover them without a
+checkout of the repo.
+
+Installing the skill
+--------------------
+
+**Preferred:** run
+
+.. code-block:: shell
+
+   npx skills add apache/datafusion-python
+
+This installs the skill in any supported agent on your machine (Claude
+Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, and others).
+The command writes the pointer into the agent's configuration so that any
+project you open that uses DataFusion Python picks up the skill
+automatically.
+
+**Manual:** if you are not using the ``skills`` registry, paste this
+single line into your project's ``AGENTS.md`` or ``CLAUDE.md``::
+
+   For DataFusion Python code, see https://github.com/apache/datafusion-python/blob/main/SKILL.md
+
+Most assistants resolve that pointer the first time they see a
+DataFusion-related prompt in the project.
+ +What the skill covers +--------------------- + +Writing DataFusion Python code has a handful of conventions that are easy +for a model to miss — bitwise ``&`` / ``|`` / ``~`` instead of Python +``and`` / ``or`` / ``not``, the lazy-DataFrame immutability model, how +window functions replace SQL correlated subqueries, the ``case`` / +``when`` builder syntax, and the ``in_list`` / ``array_position`` options +for membership tests. The skill enumerates each of these with short, +copyable examples. + +It is *not* a replacement for this user guide. Think of it as a distilled +reference the assistant keeps open while it writes code for you. + +If you are an agent author +-------------------------- + +The skill file and ``llms.txt`` are the two supported integration +points. Both are versioned along with the release and follow open +standards — no project-specific handshake is required. From 2022588ab52af8c91562c8c6d2f102699a2177ab Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:39:45 -0400 Subject: [PATCH 11/18] docs: drop audit-skill-md skill The skill as written pushed for every public method to be mentioned in `SKILL.md`, which is the wrong goal. `SKILL.md` is a distilled agent guide of idiomatic patterns and pitfalls, not an API reference -- autoapi-generated docs and module docstrings already provide full per-method coverage. An audit pressing for 100% method coverage would bloat the skill file into a stale copy of that reference. The two checks with actual value (stale mentions in `SKILL.md`, and drift between `functions.__all__` and the categorized function list) are small enough to be ad-hoc greps at release time and do not warrant a dedicated skill. Also remove references to the skill from `AGENTS.md` and the `write-dataframe-code` skill's "Related" section. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/audit-skill-md/SKILL.md | 177 ----------------------- .ai/skills/write-dataframe-code/SKILL.md | 2 - AGENTS.md | 3 - 3 files changed, 182 deletions(-) delete mode 100644 .ai/skills/audit-skill-md/SKILL.md diff --git a/.ai/skills/audit-skill-md/SKILL.md b/.ai/skills/audit-skill-md/SKILL.md deleted file mode 100644 index 9d093b6e3..000000000 --- a/.ai/skills/audit-skill-md/SKILL.md +++ /dev/null @@ -1,177 +0,0 @@ - - ---- -name: audit-skill-md -description: Cross-reference the repo-root SKILL.md against the current public Python API (DataFrame, Expr, SessionContext, functions module) and report new APIs that need coverage and stale mentions that no longer exist. Use after upstream syncs or any PR that changes the public Python surface. -argument-hint: [area] (e.g., "functions", "dataframe", "expr", "context", "all") ---- - -# Audit SKILL.md Against the Python Public API - -This skill keeps the repo-root `SKILL.md` (the agent-facing DataFrame API -guide) aligned with the actual Python surface exposed by the package. It is -a **diff-only audit** — it does not auto-edit `SKILL.md`. The output is a -report the user reviews and then asks the agent to act on. - -Run this whenever the public Python API changes — most commonly: - -- after an upstream DataFusion sync PR adds new functions or methods, -- after a PR that adds or removes a `DataFrame`, `Expr`, or `SessionContext` - method, -- as a pre-release gate before cutting a new datafusion-python version. - -The companion skill [`check-upstream`](../check-upstream/SKILL.md) reports -upstream APIs that are **not yet** exposed in the Python bindings. This skill -reports APIs that **are** exposed but are missing or misspelled in the -user-facing guide. - -## Areas to Check - -`$ARGUMENTS` selects a subset. If empty or `all`, audit every area. - -### 1. Scalar / aggregate / window functions - -**Source of truth:** `python/datafusion/functions.py` — the `__all__` list. 
-Only symbols in `__all__` are part of the public surface; helpers not listed
-there are implementation details.
-
-**Procedure:**
-
-1. Load `python/datafusion/functions.py`, extract the `__all__` list.
-2. Parse `SKILL.md`, collect every function reference — patterns to look for:
-   - Inline `F.<name>(...)`, `F.<name>` references.
-   - Bare backticked names in the "Available Functions (Categorized)"
-     section (`sum`, `avg`, ...).
-3. Cross-reference:
-   - **In `__all__` but not mentioned in `SKILL.md`** → new API needing
-     coverage. Flag unless it is an alias documented through a `See Also`
-     in the primary function's docstring (see "Alias handling" below).
-   - **Mentioned in `SKILL.md` but not in `__all__`** → stale reference, has
-     been renamed or removed.
-
-### 2. `DataFrame` methods
-
-**Source of truth:** `python/datafusion/dataframe.py` — public methods on the
-`DataFrame` class. A method is public if its name does not begin with an
-underscore.
-
-**Procedure:**
-
-1. Import `DataFrame` and collect `dir(DataFrame)`, filtering to names that
-   do not start with `_`.
-2. Parse `SKILL.md` for method references — patterns:
-   - `df.<method>(`, `.<method>(`, and backticked bare names in prose.
-   - The method tables in "Core Abstractions" and the pitfalls/idiomatic
-     patterns sections.
-3. Flag:
-   - **Public method, no mention in `SKILL.md`** → candidate addition.
-     Weight the flag by whether the method would change how a user writes a
-     query (e.g. `with_column`, `join`, `aggregate` are high-value; a new
-     `explain_analyze_format` is low-value).
-   - **Mentioned in `SKILL.md`, no longer a public method** → stale.
-
-### 3. `Expr` methods and attributes
-
-**Source of truth:** `python/datafusion/expr.py` — the `Expr` class. Also
-include `Window`, `WindowFrame`, and `GroupingSet` if they are re-exported
-from `datafusion.expr`.
-
-**Procedure:** same as for `DataFrame`.
Pay particular attention to operator -dunder methods mentioned in `SKILL.md` — the "Common Pitfalls" section -already covers `&`, `|`, `~`, `==`, the comparison operators, and arithmetic -operators on `Expr`. If a new operator is added (e.g. a new `__matmul__`), -it probably warrants a pitfall or pattern note. - -### 4. `SessionContext` methods - -**Source of truth:** `python/datafusion/context.py` — the `SessionContext` -class. - -**Procedure:** same as for `DataFrame`. The high-value methods in `SKILL.md` -are the data-loading methods (`read_parquet`, `read_csv`, `read_json`, -`from_pydict`, `from_pylist`, `from_pandas`) and the SQL entry points -(`sql`, `register_*`, `table`). New additions in those families are -worth flagging for a sentence in the data-loading section. - -### 5. Re-exports at package root - -**Source of truth:** `python/datafusion/__init__.py` — the top-level -`from ... import ...` statements and `__all__`. A symbol re-exported at the -package root is part of the "import" examples in `SKILL.md` even if it -lives in a submodule. - -**Procedure:** verify every name in the top-level `__all__` resolves. Flag -any new re-export that is not already mentioned in the "Import Conventions" -or "Core Abstractions" section. - -## Alias handling - -Many functions in the `functions` module are aliases — for example -`list_sort` aliases `array_sort`, and `character_length` aliases `length`. -The convention in this project is that alias function docstrings carry only -a one-line description and a `See Also` pointing at the primary function -(see `CLAUDE.md`). Do not flag an alias as missing from `SKILL.md` as long -as its primary function is already covered, unless the alias uses a name -that a user would reasonably reach for first (e.g. SQL-standard names). 
- -## Output Format - -Produce a report of this shape: - -``` -## SKILL.md Audit Report - -### Summary -- Functions checked: N -- DataFrame methods checked: N -- Expr members checked: N -- SessionContext methods checked: N -- Package-root re-exports checked: N - -### New APIs needing coverage in SKILL.md -- `functions.new_fn` — brief description. Suggested section: "String". -- `DataFrame.with_catalog` — brief description. Suggested section: "Core Abstractions". - -### Stale mentions in SKILL.md -- `functions.old_fn` — referenced in "Available Functions" but no longer in `__all__`. Likely renamed to `new_fn` in . -- `DataFrame.show_limit` — referenced in a pitfall; method removed in favor of `DataFrame.show(num=...)`. - -### Informational -- Alias `list_sort` covered transitively via `array_sort` — no action needed. -``` - -If every area is clean, state that explicitly ("All audited areas are in -sync. No action required."). An audit report that elides the summary line -is harder to scan in a release checklist. - -## When to edit SKILL.md - -This skill does not auto-edit. After reporting, wait for the user to -confirm which gaps are worth filling. New APIs often need a natural home -chosen by a human — the categorized function list and the pitfalls section -both have opinionated structure that an automated edit will not respect. - -## Related - -- Repo-root [`SKILL.md`](../../SKILL.md) — the file this skill audits. -- `.ai/skills/check-upstream/` — the complementary audit against upstream - Rust APIs not yet exposed in Python. -- `.ai/skills/write-dataframe-code/` — how to write idiomatic DataFrame - code in this repo. diff --git a/.ai/skills/write-dataframe-code/SKILL.md b/.ai/skills/write-dataframe-code/SKILL.md index d32ee486b..6927623e1 100644 --- a/.ai/skills/write-dataframe-code/SKILL.md +++ b/.ai/skills/write-dataframe-code/SKILL.md @@ -148,5 +148,3 @@ function exposes a non-obvious option, add a short usage example. (users and agents). 
- `.ai/skills/check-upstream/` — audit upstream Apache DataFusion features and flag what the Python bindings do not yet expose. -- `.ai/skills/audit-skill-md/` — audit the repo-root `SKILL.md` against the - current public Python API and flag drift. diff --git a/AGENTS.md b/AGENTS.md index 1790cf021..ab3eafcdc 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -42,9 +42,6 @@ Currently available skills: contributor-facing guide for writing idiomatic DataFrame code inside this repo (TPC-H pattern index, plan-comparison diagnostic, docstring conventions). Layers on top of the user-facing [`SKILL.md`](SKILL.md). -- [`audit-skill-md`](.ai/skills/audit-skill-md/SKILL.md) — cross-reference - the repo-root `SKILL.md` against the current public Python API and report - new APIs needing coverage and stale mentions. Run after upstream syncs. ## Plan-comparison diagnostic From 4f73bcd44b6ca925b4c18d9b6f0e70e3b7abdce2 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:42:44 -0400 Subject: [PATCH 12/18] docs: drop write-dataframe-code skill A separate PR covers the same contributor-facing material (TPC-H pattern index, plan-comparison workflow, docstring conventions), so this skill is redundant. Remove the skill directory and the corresponding references in `AGENTS.md`, including the plan-comparison section that pointed at it. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .ai/skills/write-dataframe-code/SKILL.md | 150 ----------------------- AGENTS.md | 20 --- 2 files changed, 170 deletions(-) delete mode 100644 .ai/skills/write-dataframe-code/SKILL.md diff --git a/.ai/skills/write-dataframe-code/SKILL.md b/.ai/skills/write-dataframe-code/SKILL.md deleted file mode 100644 index 6927623e1..000000000 --- a/.ai/skills/write-dataframe-code/SKILL.md +++ /dev/null @@ -1,150 +0,0 @@ - - ---- -name: write-dataframe-code -description: Contributor-facing guidance for writing idiomatic datafusion-python DataFrame code inside the repo — examples, docstrings, tests, and benchmark queries. Use when adding or reviewing Python code in this project that builds DataFrames or expressions. Composes on top of the user-facing guide at the repo-root SKILL.md. -argument-hint: [area] (e.g., "tpch", "docstrings", "plan-comparison") ---- - -# Writing DataFrame Code in datafusion-python - -This skill is for contributors working **on** the datafusion-python project -(examples, tests, docstrings, benchmark queries). The primary reference for -**how** to write DataFrame and expression code — imports, data loading, the -DataFrame API, idiomatic patterns, common pitfalls, and the function -catalog — is the repo-root [`SKILL.md`](../../SKILL.md). Read that first. - -This file layers on contributor-specific extras: - -1. The TPC-H pattern index — which example to use as a template for which API. -2. The plan-comparison workflow — a diagnostic for checking a DataFrame - translation against a reference SQL query. -3. Docstring conventions enforced by this project (already summarized in - `CLAUDE.md`; repeated here so the rule is on-hand while writing examples). - -## TPC-H pattern index - -`examples/tpch/q01..q22*.py` is the largest collection of idiomatic DataFrame -code in the repo. Each query file pairs a DataFrame translation with the -canonical TPC-H reference SQL embedded in the module docstring. 
When adding -a new example or demo, pick the query that already exercises the pattern -rather than re-deriving from scratch. - -| Pattern | Canonical TPC-H example | -|---|---| -| Simple filter + aggregate + sort | `q01_pricing_summary_report.py` | -| Multi-table join with date-range filter | `q03_shipping_priority.py` | -| `DISTINCT` via `.select(...).distinct()` | `q04_order_priority_checking.py` | -| Multi-hop region/nation/customer join | `q05_local_supplier_volume.py` | -| `F.in_list(col, [...])` in place of CASE/array tricks | `q07_volume_shipping.py`, `q12_ship_mode_order_priority.py` | -| Searched `F.when(...).otherwise(...)` against SQL `CASE WHEN` | `q08_market_share.py` | -| Reusing computed expressions as variables | `q09_product_type_profit_measure.py` | -| Window function in place of correlated scalar subquery | `q02_minimum_cost_supplier.py`, `q11_important_stock_identification.py`, `q15_top_supplier.py`, `q17_small_quantity_order.py`, `q22_global_sales_opportunity.py` | -| `F.regexp_like(col, pattern)` for matching | `q16_part_supplier_relationship.py` | -| Compound disjunctive predicate (OR of per-brand conditions) | `q19_discounted_revenue.py` | -| Semi/anti joins for `EXISTS` / `NOT EXISTS` | `q21_suppliers_kept_orders_waiting.py` | -| `F.starts_with(...)` for prefix matching | `q20_potential_part_promotion.py` | - -The queries are correctness-gated against `examples/tpch/answers_sf1/` via -`examples/tpch/_tests.py` at scale factor 1. - -## Plan-comparison diagnostic workflow - -When translating a SQL query to DataFrame form — TPC-H, a benchmark, or an -answer to a user question — the answer-file comparison proves *correctness* -but does not prove the translation is *equivalent at the plan level*. The -optimizer usually smooths over surface differences (filter pushdown, join -reordering, predicate simplification), so two surface-different builders that -resolve to the same optimized plan are effectively identical queries. 
- -Use this ad-hoc diagnostic when you suspect a DataFrame translation is doing -more work than the SQL form: - -```python -from datafusion import SessionContext - -ctx = SessionContext() -# register the tables the SQL query expects -# ... - -sql_plan = ctx.sql(reference_sql).optimized_logical_plan() -df_plan = dataframe_under_test.optimized_logical_plan() - -if sql_plan == df_plan: - print("Plans match exactly.") -else: - print("=== SQL plan ===") - print(sql_plan.display_indent()) - print("=== DataFrame plan ===") - print(df_plan.display_indent()) -``` - -- `LogicalPlan.__eq__` compares structurally. -- `LogicalPlan.display_indent()` is the readable form for eyeballing diffs. -- `DataFrame.optimized_logical_plan()` is the optimizer output — use it, not - the unoptimized plan, because trivial differences (e.g. column order in a - projection) will otherwise be reported as mismatches. - -This is **a diagnostic, not a gate**. Answer-file comparison is the -correctness gate. A plan-level mismatch does not mean the DataFrame form is -wrong — it means the two forms are not literally the same plan, which is -sometimes fine (e.g. the DataFrame form forces a particular partitioning the -SQL form leaves to the optimizer). - -## Docstring conventions - -Every Python function added or modified in this project must include a -docstring with at least one doctest-verified example. Pre-commit and the -`pytest --doctest-modules` default in `pyproject.toml` will enforce that -examples actually execute. - -Rules (also in `CLAUDE.md`): - -- Examples must run under the doctest harness. The `conftest.py` injects - `dfn` (the `datafusion` module), `col`, `lit`, `F` (functions), `pa` - (pyarrow), and `np` (numpy) so you do not need to import them inside - examples. -- Optional parameters: write a second example that passes the optional - argument **by keyword** (`step=dfn.lit(3)`) so the reader sees which - parameter is being demonstrated. 
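The keyword-argument rule above is easiest to see in a docstring skeleton. A minimal sketch with a hypothetical `repeat_value` helper (not a real datafusion function), showing a first plain example and a second that passes the optional argument by keyword, runnable under the doctest harness:

```python
def repeat_value(value, times=2):
    """Repeat ``value`` into a list.

    Example (default behaviour)::

        >>> repeat_value("a")
        ['a', 'a']

    Passing the optional argument by keyword makes it obvious which
    parameter the second example demonstrates::

        >>> repeat_value("a", times=3)
        ['a', 'a', 'a']
    """
    return [value] * times


if __name__ == "__main__":
    import doctest

    doctest.testmod()
```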
-- Reuse input data across examples for the same function so the effect of - each optional argument is visible against a constant baseline. -- Alias functions (one function that just wraps another — for example - `list_sort` forwarding to `array_sort`) only need a one-line description - and a `See Also` reference to the primary function. They do not need their - own example. - -## Aggregate and window function documentation - -When adding or updating an aggregate or window function, update the matching -site page: - -- Aggregate functions → `docs/source/user-guide/common-operations/aggregations.rst` -- Window functions → `docs/source/user-guide/common-operations/windows.rst` - -Add the function to the function list at the bottom of the page and, if the -function exposes a non-obvious option, add a short usage example. - -## Related - -- Repo-root [`SKILL.md`](../../SKILL.md) — primary DataFrame API guide - (users and agents). -- `.ai/skills/check-upstream/` — audit upstream Apache DataFusion features - and flag what the Python bindings do not yet expose. diff --git a/AGENTS.md b/AGENTS.md index ab3eafcdc..6f27bdb0c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -38,26 +38,6 @@ Currently available skills: - [`check-upstream`](.ai/skills/check-upstream/SKILL.md) — audit upstream Apache DataFusion features (functions, DataFrame ops, SessionContext methods, FFI types) not yet exposed in the Python bindings. -- [`write-dataframe-code`](.ai/skills/write-dataframe-code/SKILL.md) — - contributor-facing guide for writing idiomatic DataFrame code inside this - repo (TPC-H pattern index, plan-comparison diagnostic, docstring - conventions). Layers on top of the user-facing [`SKILL.md`](SKILL.md). - -## Plan-comparison diagnostic - -When translating a SQL query to a DataFrame — TPC-H, a benchmark, or an -answer to a user question — correctness is gated by the answer-file -comparison in `examples/tpch/_tests.py`, but plan-level equivalence is a -separate question. 
Two surface-different DataFrame forms that resolve to -the same optimized logical plan are effectively the same query. - -As an ad-hoc check, compare `ctx.sql(reference_sql).optimized_logical_plan()` -against the DataFrame's `optimized_logical_plan()`. Use `LogicalPlan.__eq__` -for structural equality and `LogicalPlan.display_indent()` for readable -diffs. This is a diagnostic, not a gate — a mismatch does not mean the -DataFrame form is wrong, only that the two forms are not literally the same -plan. The [`write-dataframe-code`](.ai/skills/write-dataframe-code/SKILL.md) -skill has the full workflow. ## Pull Requests From bd54032d7333ffeeb2abf726a0b2eab3de366f94 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Fri, 24 Apr 2026 12:55:00 -0400 Subject: [PATCH 13/18] docs: show Parquet pushdown plan diff in "When not to use a UDF" The previous version of the section asserted that a UDF predicate blocks optimizer rewrites but did not show evidence. Replace the two `code-block` examples with an executable walkthrough that writes a small Parquet file, runs the same filter two ways, and prints the physical plan for each. The native-expression plan renders with three annotations on the `DataSourceExec` node that the UDF plan does not have: - `predicate=brand@1 = A AND qty@2 >= 150` pushed into the scan - `pruning_predicate=... brand_min@0 <= A AND ... qty_max@4 >= 150` for row-group pruning via Parquet footer min/max stats - `required_guarantees=[brand in (A)]` for bloom-filter / dictionary skipping The UDF form keeps only `predicate=brand_qty_filter(...)`: the scan has to materialize every row group and call the Python callback. The disjunctive-OR rewrite (previously the main example) stays at the end as the idiomatic alternative for multi-bucket filters. Verified with `sphinx-build -W --keep-going`. 
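The row-group pruning the commit message describes can be modeled in a few lines of plain Python. This is a toy model, not DataFusion's implementation: a row group whose footer min/max statistics cannot satisfy `brand = 'A' AND qty >= 150` is skipped without reading any of its rows.

```python
# Toy model of Parquet row-group pruning via footer min/max statistics.
# Each "row group" carries only its per-column min/max; the pruning
# predicate decides whether the group *might* contain a matching row.
row_groups = [
    {"brand_min": "A", "brand_max": "B", "qty_min": 0, "qty_max": 120},
    {"brand_min": "A", "brand_max": "A", "qty_min": 150, "qty_max": 400},
    {"brand_min": "C", "brand_max": "D", "qty_min": 200, "qty_max": 900},
]


def might_match(stats) -> bool:
    # brand = 'A' is possible only if 'A' lies inside [brand_min, brand_max];
    # qty >= 150 is possible only if the group's qty_max reaches 150.
    return stats["brand_min"] <= "A" <= stats["brand_max"] and stats["qty_max"] >= 150


kept = [i for i, stats in enumerate(row_groups) if might_match(stats)]
print(kept)  # [1] — only the second group survives pruning
```

A UDF predicate offers the scan no such statistics-level test, which is why every row group must be materialized and handed to the Python callback.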
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../common-operations/udf-and-udfa.rst | 123 +++++++++++++----- 1 file changed, 90 insertions(+), 33 deletions(-) diff --git a/docs/source/user-guide/common-operations/udf-and-udfa.rst b/docs/source/user-guide/common-operations/udf-and-udfa.rst index a84a8b646..48249bb0d 100644 --- a/docs/source/user-guide/common-operations/udf-and-udfa.rst +++ b/docs/source/user-guide/common-operations/udf-and-udfa.rst @@ -104,35 +104,96 @@ describing how to do this. When not to use a UDF ^^^^^^^^^^^^^^^^^^^^^ -A UDF is the right tool when the computation genuinely cannot be expressed -with built-in functions. It is often the *wrong* tool for a compound -predicate that happens to be easier to write in Python. The optimizer -cannot push a UDF through joins or filters, so a Python-side predicate -prevents otherwise obvious rewrites and forces a per-row Python callback. +A UDF is the right tool when the per-row computation genuinely cannot be +expressed with built-in functions. It is often the *wrong* tool for a +predicate that happens to be easier to write in Python. A UDF is opaque +to the optimizer, which means filters expressed as UDFs lose several +rewrites that the engine applies to filters built from native +expressions. The most visible of these is **Parquet predicate pushdown**: +a native predicate can prune entire row groups using the min/max +statistics in the Parquet footer, while a UDF predicate cannot. + +The following example writes a small Parquet file, then filters it two +ways: first with a native expression, then with a UDF that computes the +same result. The filter itself is simple on purpose so we can compare +the plans side by side. -Consider a filter that selects rows falling into one of three brand-specific -buckets, each with its own containers, quantity range, and size range: +.. ipython:: python -.. 
code-block:: python + import tempfile, os + import pyarrow as pa + import pyarrow.parquet as pq + from datafusion import SessionContext, col, lit, udf + + tmpdir = tempfile.mkdtemp() + parquet_path = os.path.join(tmpdir, "items.parquet") + pq.write_table( + pa.table({ + "id": list(range(100)), + "brand": ["A", "B", "C", "D"] * 25, + "qty": [i * 10 for i in range(100)], + }), + parquet_path, + ) + + ctx = SessionContext() + items = ctx.read_parquet(parquet_path) + +**Native-expression predicate.** The filter is a plain boolean tree +over column references and literals, so the optimizer can analyze it: - # Anti-pattern: the predicate is a plain disjunction, but hidden inside a UDF. - def is_of_interest(brand, container, quantity, size): - result = [] - for b, c, q, s in zip(brand, container, quantity, size): - b = b.as_py() - if b == "Brand#12": - result.append(c.as_py() in ("SM CASE", "SM BOX") and 1 <= q.as_py() <= 11 and 1 <= s.as_py() <= 5) - elif b == "Brand#23": - result.append(c.as_py() in ("MED BAG", "MED BOX") and 10 <= q.as_py() <= 20 and 1 <= s.as_py() <= 10) - else: - result.append(False) - return pa.array(result) - - df = df.filter(udf_is_of_interest(col("brand"), col("container"), col("quantity"), col("size"))) - -The native equivalent keeps the bucket definitions as plain Python data -(a dict) and builds an ``Expr`` from them. The optimizer sees a disjunction -of simple predicates it can analyze and push down: +.. ipython:: python + + native_filtered = items.filter( + (col("brand") == lit("A")) & (col("qty") >= lit(150)) + ) + print(native_filtered.execution_plan().display_indent()) + +Notice the ``DataSourceExec`` line. It carries three annotations the +optimizer computed from the predicate: + +- ``predicate=brand@1 = A AND qty@2 >= 150`` — the filter is pushed + into the Parquet scan itself, so the scan only reads matching rows. +- ``pruning_predicate=... brand_min@0 <= A AND A <= brand_max@1 ... 
+ qty_max@4 >= 150`` — the scan prunes whole row groups by consulting + the Parquet min/max statistics in the footer *before* reading any + column data. +- ``required_guarantees=[brand in (A)]`` — the scan uses this when a + bloom filter or dictionary is available to skip pages. + +**UDF predicate.** Now wrap the same logic in a Python UDF: + +.. ipython:: python + + def brand_qty_filter(brand_arr: pa.Array, qty_arr: pa.Array) -> pa.Array: + return pa.array([ + b.as_py() == "A" and q.as_py() >= 150 + for b, q in zip(brand_arr, qty_arr) + ]) + + pred_udf = udf( + brand_qty_filter, [pa.string(), pa.int64()], pa.bool_(), "stable", + ) + udf_filtered = items.filter(pred_udf(col("brand"), col("qty"))) + print(udf_filtered.execution_plan().display_indent()) + +The ``DataSourceExec`` now carries only ``predicate=brand_qty_filter(...)``. +There is no ``pruning_predicate`` and no ``required_guarantees``: the +scan has to materialize every row group and hand each row to the +Python callback just to decide whether to keep it. + +At small scale the cost difference is invisible; on a Parquet file with +many row groups, or data whose min/max statistics line up well with +the predicate, the native form can skip most of the file. The UDF form +reads all of it. + +**Takeaway.** Reach for a UDF when the per-row computation is genuinely +not expressible as a tree of built-in functions (custom numerical work, +external lookups, complex business rules). When it *is* expressible — +even if the native form is a little more verbose — build the ``Expr`` +tree directly so the optimizer can see through it. For disjunctive +predicates the idiom is to produce one clause per bucket and combine +them with ``|``: .. 
code-block:: python @@ -140,12 +201,12 @@ of simple predicates it can analyze and push down: from operator import or_ from datafusion import col, lit, functions as f - items_of_interest = { + buckets = { "Brand#12": {"containers": ["SM CASE", "SM BOX"], "min_qty": 1, "max_size": 5}, "Brand#23": {"containers": ["MED BAG", "MED BOX"], "min_qty": 10, "max_size": 10}, } - def brand_clause(brand, spec): + def bucket_clause(brand, spec): return ( (col("brand") == lit(brand)) & f.in_list(col("container"), [lit(c) for c in spec["containers"]]) @@ -155,13 +216,9 @@ of simple predicates it can analyze and push down: & (col("size") <= lit(spec["max_size"])) ) - predicate = reduce(or_, (brand_clause(b, s) for b, s in items_of_interest.items())) + predicate = reduce(or_, (bucket_clause(b, s) for b, s in buckets.items())) df = df.filter(predicate) -Reach for a UDF when the per-row computation is not expressible as a tree -of built-in functions. When it *is* expressible, build the ``Expr`` tree -directly. - Aggregate Functions ------------------- From cd3d0d3121ab762470636a98093f7931b0cd5823 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sat, 25 Apr 2026 09:45:51 -0400 Subject: [PATCH 14/18] docs: rework "subsets within a group" aggregation example Rename the section from "Building per-group arrays" to "Comparing subsets within a group" so the heading matches the content. Rewrite the intro to lead with the problem (compare full group vs filtered subset), reframe the worked example around partially failed orders, and replace the trailing bullet list with a one-line walkthrough of the result. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../common-operations/aggregations.rst | 51 +++++++++---------- 1 file changed, 24 insertions(+), 27 deletions(-) diff --git a/docs/source/user-guide/common-operations/aggregations.rst b/docs/source/user-guide/common-operations/aggregations.rst index a902fab5c..f59b62ab4 100644 --- a/docs/source/user-guide/common-operations/aggregations.rst +++ b/docs/source/user-guide/common-operations/aggregations.rst @@ -163,27 +163,32 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")]) -Building per-group arrays -^^^^^^^^^^^^^^^^^^^^^^^^^ - -:py:func:`~datafusion.functions.array_agg` collects the values within each -group into a list. Combined with ``distinct=True`` and the ``filter`` -argument, it lets you ask two questions of the same group in one pass — -"what are all the values?" and "what are the values that satisfy some -condition?". +Comparing subsets within a group +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Sometimes you need to compare the full membership of a group against a +subset that meets some condition — for example, "which groups have at least +one failure, but not every member failed?". The ``filter`` argument on an +aggregate restricts the rows that contribute to *that* aggregate without +dropping the group, so a single pass can produce both the full set and the +filtered subset side by side. Pairing +:py:func:`~datafusion.functions.array_agg` with ``distinct=True`` and +``filter=`` is a compact way to express this: collect the distinct values +of the group, collect the distinct values that satisfy the condition, then +compare the two arrays. Suppose each row records a line item with the supplier that fulfilled it and a flag for whether that supplier met the commit date. 
We want to identify -orders where exactly one supplier failed, among two or more suppliers in -total: +*partially failed* orders — orders where at least one supplier failed but +not every supplier failed: .. ipython:: python orders_df = ctx.from_pydict( { - "order_id": [1, 1, 1, 2, 2, 3], - "supplier_id": [100, 101, 102, 200, 201, 300], - "failed": [False, True, False, False, False, True], + "order_id": [1, 1, 1, 2, 2, 3, 4, 4], + "supplier_id": [100, 101, 102, 200, 201, 300, 400, 401], + "failed": [False, True, False, False, False, True, True, True], }, ) @@ -200,21 +205,13 @@ total: ) grouped.filter( - (f.array_length(col("failed_suppliers")) == lit(1)) - & (f.array_length(col("all_suppliers")) > lit(1)) - ).select( - col("order_id"), - f.array_element(col("failed_suppliers"), lit(1)).alias("the_one_bad_supplier"), - ) - -Two aspects of the pattern are worth calling out: + (f.array_length(col("failed_suppliers")) > lit(0)) + & (f.array_length(col("failed_suppliers")) < f.array_length(col("all_suppliers"))) + ).select(col("order_id"), col("failed_suppliers")) -- ``filter=`` on an aggregate narrows the rows contributing to *that* - aggregate only. Filtering the DataFrame before the aggregate would have - dropped whole groups that no longer had any rows. -- :py:func:`~datafusion.functions.array_length` tests group size without - another aggregate pass, and :py:func:`~datafusion.functions.array_element` - extracts a single value when you have proven the array has length one. +Order 1 is partial (one of three suppliers failed). Order 2 is excluded +because no supplier failed, order 3 because its only supplier failed, and +order 4 because both of its suppliers failed. 
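To make the selection concrete outside DataFusion, the same partial-failure rule can be checked in plain Python over the rows from the example above (same data, no Arrow involved; this restates the semantics, it is not how the engine computes it):

```python
from collections import defaultdict

# Same rows as the ipython example: (order_id, supplier_id, failed).
rows = [
    (1, 100, False), (1, 101, True), (1, 102, False),
    (2, 200, False), (2, 201, False),
    (3, 300, True),
    (4, 400, True), (4, 401, True),
]

failed_by_order = defaultdict(set)
all_by_order = defaultdict(set)
for order_id, supplier_id, failed in rows:
    all_by_order[order_id].add(supplier_id)
    if failed:
        failed_by_order[order_id].add(supplier_id)

# "Partially failed": at least one failure, but not every supplier failed.
partial = sorted(
    o for o in all_by_order
    if 0 < len(failed_by_order[o]) < len(all_by_order[o])
)
print(partial)  # [1]
```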
Grouping Sets ------------- From d3054a1b8711e8f272445beeacf7db2bc657df16 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sat, 25 Apr 2026 11:26:43 -0400 Subject: [PATCH 15/18] docs: clarify "When not to use a UDF" intro Rewrite the opening of the section to make three things clearer: the contrast is with native DataFusion expressions (not Python in general), some predicates genuinely feel easier to write as a Python loop and that tension is worth acknowledging, and predicate pushdown is a table-provider mechanism rather than a Parquet-only feature. Parquet stays as the concrete demo. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../common-operations/udf-and-udfa.rst | 23 +++++++++++++------ 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/docs/source/user-guide/common-operations/udf-and-udfa.rst b/docs/source/user-guide/common-operations/udf-and-udfa.rst index 48249bb0d..59c47b595 100644 --- a/docs/source/user-guide/common-operations/udf-and-udfa.rst +++ b/docs/source/user-guide/common-operations/udf-and-udfa.rst @@ -105,13 +105,22 @@ When not to use a UDF ^^^^^^^^^^^^^^^^^^^^^ A UDF is the right tool when the per-row computation genuinely cannot be -expressed with built-in functions. It is often the *wrong* tool for a -predicate that happens to be easier to write in Python. A UDF is opaque -to the optimizer, which means filters expressed as UDFs lose several -rewrites that the engine applies to filters built from native -expressions. The most visible of these is **Parquet predicate pushdown**: -a native predicate can prune entire row groups using the min/max -statistics in the Parquet footer, while a UDF predicate cannot. +expressed with DataFusion's built-in expressions. 
It is often the *wrong* +tool for a predicate that *can* be written as an ``Expr`` tree but feels +easier to write as a Python function — for example, a filter that keeps +a row if it matches any one of several rule sets, where each rule set +checks its own combination of columns (the worked example at the end of +this section keeps a row when it matches any one of several brand-specific +rules). Looping over the rules in Python and returning a boolean per row +reads naturally and is tempting to wrap in a UDF, but a UDF is opaque to +the optimizer: filters expressed as UDFs lose several rewrites that the +engine applies to filters built from native expressions. The most visible +of these is **predicate pushdown into the table provider**: a native +predicate can be handed to the source so it skips data before it is read, +while a UDF predicate cannot. The example below uses Parquet, where +pushdown prunes whole row groups using the min/max statistics in the +footer, but the same mechanism applies to any table provider that +advertises filter support — including custom providers. The following example writes a small Parquet file, then filters it two ways: first with a native expression, then with a UDF that computes the From ed21f8d67ee80b1329252be0d6d4fdfd3a4ff25f Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sat, 25 Apr 2026 11:29:32 -0400 Subject: [PATCH 16/18] docs: move ai-coding-assistants under user-guide/ The page was sitting at the top level of docs/source/ while every other page in the USER GUIDE toctree lives under docs/source/user-guide/. Move the file, update the toctree entry, and update the absolute URL in llms.txt to match the new path. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/index.rst | 2 +- docs/source/llms.txt | 2 +- docs/source/{ => user-guide}/ai-coding-assistants.rst | 0 3 files changed, 2 insertions(+), 2 deletions(-) rename docs/source/{ => user-guide}/ai-coding-assistants.rst (100%) diff --git a/docs/source/index.rst b/docs/source/index.rst index 0e2b065c1..0007cc41a 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -78,7 +78,7 @@ Example user-guide/configuration user-guide/sql user-guide/upgrade-guides - ai-coding-assistants + user-guide/ai-coding-assistants .. _toc.contributor_guide: diff --git a/docs/source/llms.txt b/docs/source/llms.txt index 4d6680426..76e7359a5 100644 --- a/docs/source/llms.txt +++ b/docs/source/llms.txt @@ -5,7 +5,7 @@ ## Agent Guide - [SKILL.md (agent skill, raw)](https://raw.githubusercontent.com/apache/datafusion-python/main/SKILL.md): idiomatic DataFrame API patterns, SQL-to-DataFrame mappings, common pitfalls, and the full `functions` catalog. Primary source of truth for writing datafusion-python code. -- [Using DataFusion with AI coding assistants](https://datafusion.apache.org/python/ai-coding-assistants.html): human-readable guide for installing the skill and manual setup pointers. +- [Using DataFusion with AI coding assistants](https://datafusion.apache.org/python/user-guide/ai-coding-assistants.html): human-readable guide for installing the skill and manual setup pointers. 
## User Guide diff --git a/docs/source/ai-coding-assistants.rst b/docs/source/user-guide/ai-coding-assistants.rst similarity index 100% rename from docs/source/ai-coding-assistants.rst rename to docs/source/user-guide/ai-coding-assistants.rst From 1b764b7f0bb3c91bdcb5e6907807c921d3ed12a0 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sat, 25 Apr 2026 11:39:34 -0400 Subject: [PATCH 17/18] docs: replace AGENTS.md skill list with discovery instructions A static skill list in AGENTS.md goes stale as new skills are added (it already missed the make-pythonic skill that was merged separately). Replace the enumerated list with a pointer telling agents to list .ai/skills/ and read each SKILL.md frontmatter, so the catalog never has to be hand-maintained. Co-Authored-By: Claude Opus 4.7 (1M context) --- AGENTS.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 6f27bdb0c..a6d27155a 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -33,11 +33,9 @@ Skills follow the [Agent Skills](https://agentskills.io) open standard. Each ski - `SKILL.md` — The skill definition with YAML frontmatter (name, description, argument-hint) and detailed instructions. - Additional supporting files as needed. -Currently available skills: - -- [`check-upstream`](.ai/skills/check-upstream/SKILL.md) — audit upstream - Apache DataFusion features (functions, DataFrame ops, SessionContext - methods, FFI types) not yet exposed in the Python bindings. +To discover what skills are available, list `.ai/skills/` and read each +`SKILL.md`. The frontmatter `name` and `description` fields summarize the +skill's purpose. 
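The discovery step the new AGENTS.md text describes reduces to a small frontmatter parse. A stdlib-only sketch (the regex and the flat `key: value` handling are assumptions; a real agent would list `.ai/skills/*/SKILL.md` and apply this to each file):

```python
import re


def parse_frontmatter(skill_md: str) -> dict:
    """Extract key: value pairs from a SKILL.md YAML frontmatter block."""
    match = re.match(r"\A---\n(.*?)\n---", skill_md, flags=re.S)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = """---
name: check-upstream
description: Audit upstream DataFusion features not yet exposed in Python.
---

# Check Upstream
"""
meta = parse_frontmatter(sample)
print(meta["name"], "-", meta["description"])
```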
## Pull Requests From 5fb8146421bc7340d0e7fc267d2a3053c9af0214 Mon Sep 17 00:00:00 2001 From: Tim Saucer Date: Sat, 25 Apr 2026 11:49:36 -0400 Subject: [PATCH 18/18] docs: fix broken llms.txt link and stale otherwise xref - ai-coding-assistants.rst: use absolute https://datafusion.apache.org/python/llms.txt URL; the relative `llms.txt` resolved to /python/user-guide/llms.txt and 404'd because html_extra_path publishes the file at the site root. - expressions.rst: drop the broken `:py:meth:~datafusion.expr.Expr.otherwise` xref (otherwise lives on CaseBuilder, not Expr) and spell the recommended replacement as `f.when(f.in_list(...), value).otherwise(default)`. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/source/user-guide/ai-coding-assistants.rst | 2 +- docs/source/user-guide/common-operations/expressions.rst | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/user-guide/ai-coding-assistants.rst b/docs/source/user-guide/ai-coding-assistants.rst index 7c12cb43b..96266dbd6 100644 --- a/docs/source/user-guide/ai-coding-assistants.rst +++ b/docs/source/user-guide/ai-coding-assistants.rst @@ -30,7 +30,7 @@ What is published DataFrame operations, expression building, SQL-to-DataFrame mappings, idiomatic patterns, and common pitfalls. Follows the `Agent Skills <https://agentskills.io>`_ open standard. -- `llms.txt <llms.txt>`_ — an entry point for LLM-based tools following the +- `llms.txt <https://datafusion.apache.org/python/llms.txt>`_ — an entry point for LLM-based tools following the `llmstxt.org <https://llmstxt.org>`_ convention. Categorized links to the skill, user guide, API reference, and examples.
diff --git a/docs/source/user-guide/common-operations/expressions.rst b/docs/source/user-guide/common-operations/expressions.rst index aeb6e2ed1..ae1ccc0dc 100644 --- a/docs/source/user-guide/common-operations/expressions.rst +++ b/docs/source/user-guide/common-operations/expressions.rst @@ -234,9 +234,9 @@ This searched-CASE pattern is idiomatic for "attribute the measure to the matching side of a left join, otherwise contribute zero" — a shape that appears in TPC-H Q08 and similar market-share calculations. -If a switched CASE has only two or three branches that test equality, an -``in_list`` filter combined with :py:meth:`~datafusion.expr.Expr.otherwise` -is often simpler than the full ``case`` builder. +If a switched CASE only groups several equality matches into one bucket, +``f.when(f.in_list(col(...), [...]), value).otherwise(default)`` is often +simpler than the full ``case`` builder. Structs -------
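The rewritten sentence describes collapsing several equality branches of a switched CASE into one membership test. In plain Python the same shape is a membership check with a default, which is what the `f.when(f.in_list(...), value).otherwise(default)` expression builds row by row (a per-row sketch of the semantics, not DataFusion code; the ship-mode bucket is a hypothetical example in the spirit of TPC-H Q12):

```python
# A switched CASE like
#   CASE mode WHEN 'AIR' THEN 1 WHEN 'AIR REG' THEN 1 ELSE 0 END
# groups two equality matches into one bucket. The in_list form of the
# same logic, rendered per row in plain Python:
HIGH_PRIORITY_MODES = {"AIR", "AIR REG"}


def priority(mode: str) -> int:
    # f.when(f.in_list(col("mode"), [...]), lit(1)).otherwise(lit(0))
    return 1 if mode in HIGH_PRIORITY_MODES else 0


print([priority(m) for m in ["AIR", "MAIL", "AIR REG"]])  # [1, 0, 1]
```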