Changes from all commits
19 commits
6e2241b
docs: publish SKILL.md on the docs site via myst include
timsaucer Apr 24, 2026
c7cdc63
docs: publish llms.txt at docs site root
timsaucer Apr 24, 2026
23b3be7
docs: add write-dataframe-code contributor skill
timsaucer Apr 24, 2026
35b7893
docs: add audit-skill-md skill
timsaucer Apr 24, 2026
a3f19a9
docs: enrich RST pages with demos relocated from TPC-H rewrite
timsaucer Apr 24, 2026
e461499
docs: wire new contributor skills and plan-comparison diagnostic into…
timsaucer Apr 24, 2026
6336e00
docs: rename aggregations.rst demo df to orders_df to avoid clobberin…
timsaucer Apr 24, 2026
dbd83cf
docs: replace raw SKILL.md include with a human-written AI-assistants…
timsaucer Apr 24, 2026
5edc8e9
docs: drop redundant assistants list in ai-coding-assistants intro
timsaucer Apr 24, 2026
a892c02
docs: convert ai-coding-assistants page from markdown to rst, shorten…
timsaucer Apr 24, 2026
2022588
docs: drop audit-skill-md skill
timsaucer Apr 24, 2026
4f73bcd
docs: drop write-dataframe-code skill
timsaucer Apr 24, 2026
bd54032
docs: show Parquet pushdown plan diff in "When not to use a UDF"
timsaucer Apr 24, 2026
cd3d0d3
docs: rework "subsets within a group" aggregation example
timsaucer Apr 25, 2026
d3054a1
docs: clarify "When not to use a UDF" intro
timsaucer Apr 25, 2026
ed21f8d
docs: move ai-coding-assistants under user-guide/
timsaucer Apr 25, 2026
b489ca4
Merge branch 'main' into feat/docsite-agent-improvements
timsaucer Apr 25, 2026
1b764b7
docs: replace AGENTS.md skill list with discovery instructions
timsaucer Apr 25, 2026
5fb8146
docs: fix broken llms.txt link and stale otherwise xref
timsaucer Apr 25, 2026
4 changes: 4 additions & 0 deletions AGENTS.md
@@ -33,6 +33,10 @@ Skills follow the [Agent Skills](https://agentskills.io) open standard. Each ski
- `SKILL.md` — The skill definition with YAML frontmatter (name, description, argument-hint) and detailed instructions.
- Additional supporting files as needed.

To discover what skills are available, list `.ai/skills/` and read each
`SKILL.md`. The frontmatter `name` and `description` fields summarize the
skill's purpose.

## Pull Requests

Every pull request must follow the template in
3 changes: 2 additions & 1 deletion dev/release/rat_exclude_files.txt
@@ -49,4 +49,5 @@ benchmarks/tpch/create_tables.sql
**/.cargo/config.toml
uv.lock
examples/tpch/answers_sf1/*.tbl
SKILL.md
SKILL.md
docs/source/llms.txt
4 changes: 4 additions & 0 deletions docs/source/conf.py
@@ -129,6 +129,10 @@ def setup(sphinx) -> None:
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

# Copy agent-facing files (llms.txt) verbatim to the site root so they
# resolve at conventional URLs like `https://.../python/llms.txt`.
html_extra_path = ["llms.txt"]

html_logo = "_static/images/2x_bgwhite_original.png"

html_css_files = ["theme_overrides.css"]
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -78,6 +78,7 @@ Example
user-guide/configuration
user-guide/sql
user-guide/upgrade-guides
user-guide/ai-coding-assistants


.. _toc.contributor_guide:
36 changes: 36 additions & 0 deletions docs/source/llms.txt
@@ -0,0 +1,36 @@
# DataFusion in Python

> Apache DataFusion Python is a Python binding for Apache DataFusion, an in-process, Arrow-native query engine. It exposes a SQL interface and a lazy DataFrame API over PyArrow and any Arrow C Data Interface source. This file points agents and LLM-based tools at the most useful entry points for writing DataFusion Python code.

## Agent Guide

- [SKILL.md (agent skill, raw)](https://raw.githubusercontent.com/apache/datafusion-python/main/SKILL.md): idiomatic DataFrame API patterns, SQL-to-DataFrame mappings, common pitfalls, and the full `functions` catalog. Primary source of truth for writing datafusion-python code.
- [Using DataFusion with AI coding assistants](https://datafusion.apache.org/python/user-guide/ai-coding-assistants.html): human-readable guide for installing the skill and manual setup pointers.

## User Guide

- [Introduction](https://datafusion.apache.org/python/user-guide/introduction.html): install, the Pokemon quick start, Jupyter tips.
- [Basics](https://datafusion.apache.org/python/user-guide/basics.html): `SessionContext`, `DataFrame`, and `Expr` at a glance.
- [Data sources](https://datafusion.apache.org/python/user-guide/data-sources.html): Parquet, CSV, JSON, Arrow, Pandas, Polars, and Python objects.
- [DataFrame operations](https://datafusion.apache.org/python/user-guide/dataframe/index.html): the lazy query-building interface.
- [Common operations](https://datafusion.apache.org/python/user-guide/common-operations/index.html): select, filter, join, aggregate, window, expressions, and functions.
- [SQL](https://datafusion.apache.org/python/user-guide/sql.html): running SQL against registered tables.
- [Configuration](https://datafusion.apache.org/python/user-guide/configuration.html): session and runtime options.

## DataFrame API reference

- [`datafusion.dataframe.DataFrame`](https://datafusion.apache.org/python/autoapi/datafusion/dataframe/index.html): the lazy DataFrame builder (`select`, `filter`, `aggregate`, `join`, `sort`, `limit`, set operations).
- [`datafusion.expr`](https://datafusion.apache.org/python/autoapi/datafusion/expr/index.html): expression tree nodes (`Expr`, `Window`, `WindowFrame`, `GroupingSet`).
- [`datafusion.functions`](https://datafusion.apache.org/python/autoapi/datafusion/functions/index.html): 290+ scalar, aggregate, and window functions.
- [`datafusion.context.SessionContext`](https://datafusion.apache.org/python/autoapi/datafusion/context/index.html): session entry point, data loading, SQL execution.

## Examples

- [TPC-H queries (GitHub)](https://github.com/apache/datafusion-python/tree/main/examples/tpch): canonical translations of TPC-H Q01–Q22 to idiomatic DataFrame code, each with reference SQL embedded in the module docstring.
- [Other examples (GitHub)](https://github.com/apache/datafusion-python/tree/main/examples): UDF/UDAF/UDWF, Substrait, Pandas/Polars interop, S3 reads.

## Optional

- [Contributor guide](https://datafusion.apache.org/python/contributor-guide/introduction.html): building from source, extending the Python bindings.
- [Upgrade guides](https://datafusion.apache.org/python/user-guide/upgrade-guides.html): migration notes between releases.
- [Upstream Rust `DataFusion`](https://datafusion.apache.org/): the underlying query engine.
82 changes: 82 additions & 0 deletions docs/source/user-guide/ai-coding-assistants.rst
@@ -0,0 +1,82 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

Using AI Coding Assistants
==========================

If you write DataFusion Python code with an AI coding assistant, this
project ships machine-readable guidance so the assistant produces
idiomatic code rather than guessing from its training data.

What is published
-----------------

- `SKILL.md <https://github.com/apache/datafusion-python/blob/main/SKILL.md>`_ —
a dense, skill-oriented reference covering imports, data loading,
DataFrame operations, expression building, SQL-to-DataFrame mappings,
idiomatic patterns, and common pitfalls. Follows the
`Agent Skills <https://agentskills.io>`_ open standard.
- `llms.txt <https://datafusion.apache.org/python/llms.txt>`_ — an entry point for LLM-based tools following the
`llmstxt.org <https://llmstxt.org>`_ convention. Categorized links to the
skill, user guide, API reference, and examples.

Both files live at stable URLs so an agent can discover them without
cloning the repository.

Installing the skill
--------------------

**Preferred:** run

.. code-block:: shell

npx skills add apache/datafusion-python

This installs the skill in any supported agent on your machine (Claude
Code, Cursor, Windsurf, Cline, Codex, Copilot, Gemini CLI, and others).
The command writes the pointer into the agent's configuration so that any
project you open that uses DataFusion Python picks up the skill
automatically.

**Manual:** if you are not using the ``skills`` registry, paste this
single line into your project's ``AGENTS.md`` or ``CLAUDE.md``::

For DataFusion Python code, see https://github.com/apache/datafusion-python/blob/main/SKILL.md

Most assistants resolve that pointer the first time they see a
DataFusion-related prompt in the project.

What the skill covers
---------------------

Writing DataFusion Python code has a handful of conventions that are easy
for a model to miss — bitwise ``&`` / ``|`` / ``~`` instead of Python
``and`` / ``or`` / ``not``, the lazy-DataFrame immutability model, how
window functions replace SQL correlated subqueries, the ``case`` /
``when`` builder syntax, and the ``in_list`` / ``array_position`` options
for membership tests. The skill enumerates each of these with short,
copyable examples.

It is *not* a replacement for this user guide. Think of it as a distilled
reference the assistant keeps open while it writes code for you.

If you are an agent author
--------------------------

The skill file and ``llms.txt`` are the two supported integration
points. Both are versioned along with the release and follow open
standards — no project-specific handshake is required.
50 changes: 50 additions & 0 deletions docs/source/user-guide/common-operations/aggregations.rst
@@ -163,6 +163,56 @@ Suppose we want to find the speed values for only Pokemon that have low Attack v
f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


Comparing subsets within a group
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes you need to compare the full membership of a group against a
subset that meets some condition — for example, "which groups have at least
one failure, but not every member failed?". The ``filter`` argument on an
aggregate restricts the rows that contribute to *that* aggregate without
dropping the group, so a single pass can produce both the full set and the
filtered subset side by side. Pairing
:py:func:`~datafusion.functions.array_agg` with ``distinct=True`` and
``filter=`` is a compact way to express this: collect the distinct values
of the group, collect the distinct values that satisfy the condition, then
compare the two arrays.

Suppose each row records a line item with the supplier that fulfilled it and
a flag for whether that supplier met the commit date. We want to identify
*partially failed* orders — orders where at least one supplier failed but
not every supplier failed:

.. ipython:: python

orders_df = ctx.from_pydict(
{
"order_id": [1, 1, 1, 2, 2, 3, 4, 4],
"supplier_id": [100, 101, 102, 200, 201, 300, 400, 401],
"failed": [False, True, False, False, False, True, True, True],
},
)

grouped = orders_df.aggregate(
[col("order_id")],
[
f.array_agg(col("supplier_id"), distinct=True).alias("all_suppliers"),
f.array_agg(
col("supplier_id"),
filter=col("failed"),
distinct=True,
).alias("failed_suppliers"),
],
)

grouped.filter(
(f.array_length(col("failed_suppliers")) > lit(0))
& (f.array_length(col("failed_suppliers")) < f.array_length(col("all_suppliers")))
).select(col("order_id"), col("failed_suppliers"))

Order 1 is partial (one of three suppliers failed). Order 2 is excluded
because no supplier failed, order 3 because its only supplier failed, and
order 4 because both of its suppliers failed.

Grouping Sets
-------------

92 changes: 92 additions & 0 deletions docs/source/user-guide/common-operations/expressions.rst
@@ -146,6 +146,98 @@ This function returns a new array with the elements repeated.
In this example, the `repeated_array` column will contain `[[1, 2, 3], [1, 2, 3]]`.


Testing membership in a list
----------------------------

A common need is filtering rows where a column equals *any* of a small set of
values. DataFusion offers three forms; they differ in readability and in how
they scale:

1. A compound boolean using ``|`` across explicit equalities.
2. :py:func:`~datafusion.functions.in_list`, which accepts a list of
expressions and tests equality against all of them in one call.
3. A trick with :py:func:`~datafusion.functions.array_position` and
:py:func:`~datafusion.functions.make_array`, which returns the 1-based
index of the value in a constructed array, or null if it is not present.

.. ipython:: python

from datafusion import SessionContext, col, lit
from datafusion import functions as f

ctx = SessionContext()
df = ctx.from_pydict({"shipmode": ["MAIL", "SHIP", "AIR", "TRUCK", "RAIL"]})

# Option 1: compound boolean. Fine for two values; awkward past three.
df.filter((col("shipmode") == lit("MAIL")) | (col("shipmode") == lit("SHIP")))

# Option 2: in_list. Preferred for readability as the set grows.
df.filter(f.in_list(col("shipmode"), [lit("MAIL"), lit("SHIP")]))

# Option 3: array_position / make_array. Useful when you already have the
# set as an array column and want "is in that array" semantics.
df.filter(
~f.array_position(
f.make_array(lit("MAIL"), lit("SHIP")), col("shipmode")
).is_null()
)

Use ``in_list`` as the default. It is explicit, readable, and matches the
semantics users expect from SQL's ``IN (...)``. Reach for the
``array_position`` form only when the membership set is itself an array
column rather than a literal list.

Conditional expressions
-----------------------

DataFusion provides :py:func:`~datafusion.functions.case` for the SQL
``CASE`` expression in both its switched and searched forms, along with
:py:func:`~datafusion.functions.when` as a standalone builder for the
searched form.

**Switched CASE** (one expression compared against several literal values):

.. ipython:: python

df = ctx.from_pydict(
{"priority": ["1-URGENT", "2-HIGH", "3-MEDIUM", "5-LOW"]},
)

df.select(
col("priority"),
f.case(col("priority"))
.when(lit("1-URGENT"), lit(1))
.when(lit("2-HIGH"), lit(1))
.otherwise(lit(0))
.alias("is_high_priority"),
)

**Searched CASE** (an independent boolean predicate per branch). Use this
form whenever a branch tests more than simple equality — for example,
checking whether a joined column is ``NULL`` to gate a computed value:

.. ipython:: python

df = ctx.from_pydict(
{"volume": [10.0, 20.0, 30.0], "supplier_id": [1, None, 2]},
)

df.select(
col("volume"),
col("supplier_id"),
f.when(col("supplier_id").is_not_null(), col("volume"))
.otherwise(lit(0.0))
.alias("attributed_volume"),
)

This searched-CASE pattern is idiomatic for "attribute the measure to the
matching side of a left join, otherwise contribute zero" — a shape that
appears in TPC-H Q08 and similar market-share calculations.

If a switched CASE only groups several equality matches into one bucket,
``f.when(f.in_list(col(...), [...]), value).otherwise(default)`` is often
simpler than the full ``case`` builder.

Structs
-------
