Skip to content

tpch examples: rewrite queries idiomatically and embed reference SQL#1504

Merged
timsaucer merged 5 commits intoapache:mainfrom
timsaucer:feat/tpch-idiomatic-rewrites
Apr 24, 2026
Merged

tpch examples: rewrite queries idiomatically and embed reference SQL#1504
timsaucer merged 5 commits intoapache:mainfrom
timsaucer:feat/tpch-idiomatic-rewrites

Conversation

@timsaucer
Copy link
Copy Markdown
Member

@timsaucer timsaucer commented Apr 24, 2026

Which issue does this PR close?

Relates to #1394.

Rationale for this change

The TPC-H examples under examples/tpch/ serve as the canonical hands-on reference for how to write DataFusion Python DataFrame code. Before this PR: Q20 had a bug where a filter was computed and discarded (df.filter(...) without assignment); several queries used non-idiomatic constructs (switched CASE on boolean subjects, array_position(make_array(...)) in place of in_list, 0-based substring tricks, a pyarrow UDF re-implementing a disjunctive predicate, aggregate([col], []) in place of distinct(), etc.); the reference SQL was not embedded in the files, so readers had to cross-reference benchmarks/tpch/queries/ to see the intended query; and where reference SQL was embedded, it used different TPC-H substitution parameters than the DataFrame code, so the two expressions described different queries.

What changes are included in this PR?

Four commits, grouped by concern:

  1. tpch examples: add reference SQL to each query, fix Q20 — append the canonical TPC-H reference SQL to each q01..q22 module docstring; fix the missing assignment on Q20's excess-quantity filter.
  2. tpch examples: rewrite non-idiomatic queries in idiomatic DataFrame form — rewrite Q04, Q07, Q08, Q12, Q19, Q20, Q21 using the DataFrame-native pattern (semi/anti joins for EXISTS/NOT EXISTS, searched F.when for CASE WHEN, F.in_list for IN, compound predicates in place of a pyarrow UDF, etc.).
  3. tpch examples: align reference SQL constants with DataFrame queries — update the embedded SQL in 21 of 22 docstrings so the substitution parameters match the DataFrame code (which is validated at scale factor 1 against answers_sf1/). Interval units (month, year) are preserved where the problem-statement text reads "quarter", "year", or "month".
  4. tpch examples: apply SKILL.md idioms across all 22 queries — sweep all 22 queries for SKILL.md idioms: auto-wrap on comparison RHS, plain-name group/sort keys, drop how="inner", collapse chained .filter() calls, F.count_star() for SQL count(*), F.starts_with / F.in_list / searched F.when. Q16 also picks up the secondary sort keys (p_brand, p_type, p_size) that the TPC-H spec requires but the original DataFrame omitted.

All 22 answer-file comparisons under examples/tpch/_tests.py pass.

Are there any user-facing changes?

No public API changes. The examples/tpch/ directory is a teaching aid shipped in the source tree, not in the wheel, so the visible effect is limited to readers of the examples.

timsaucer and others added 4 commits April 24, 2026 08:50
- Append the canonical TPC-H reference SQL (from benchmarks/tpch/queries/)
  to each q01..q22 module docstring so readers can compare the DataFrame
  translation against the SQL at a glance.
- Fix Q20: `df = df.filter(col("ps_availqty") > lit(0.5) * col("total_sold"))`
  was missing the assignment so the filter was dropped from the pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite the seven TPC-H example queries that did not demonstrate the
idiomatic DataFrame pattern. The remaining queries (Q02/Q11/Q15/Q17/Q22,
which use window functions in place of correlated subqueries) already are
idiomatic and are left unchanged.

- Q04: replace `.aggregate([col("l_orderkey")], [])` with
  `.select("l_orderkey").distinct()`, which is the natural way to express
  "reduce to one row per order" on a DataFrame.
- Q07: remove the CASE-as-filter on `n_name` and use
  `F.in_list(col("n_name"), [nation_1, nation_2])` instead. Drops a
  comment block that admitted the filter form was simpler.
- Q08: rewrite the switched CASE `F.case(...).when(lit(False), ...)` as a
  searched `F.when(col(...).is_not_null(), ...).otherwise(...)`. That
  mirrors the reference SQL's `case when ... then ... else 0 end` shape.
- Q12: replace `array_position(make_array(...), col)` with
  `F.in_list(col("l_shipmode"), [...])`. Same semantics, without routing
  through array construction / array search.
- Q19: remove the pyarrow UDF that re-implemented a disjunctive predicate
  in Python. Build the same predicate in DataFusion by OR-combining one
  `in_list` + range-filter expression per brand. Keeps the per-brand
  constants in the existing `items_of_interest` dict.
- Q20: use `F.starts_with` instead of an explicit substring slice. Replace
  the inner-join + `select(...).distinct()` tail with a semi join against
  a precomputed set of excess-quantity suppliers so the supplier columns
  are preserved without deduplication after the fact.
- Q21: replace the `array_agg` / `array_length` / `array_element` pipeline
  with two semi joins. One semi join keeps orders with more than one
  distinct supplier (stand-in for the reference SQL's `exists` subquery),
  the other keeps orders with exactly one late supplier (stand-in for the
  `not exists` subquery).

All 22 answer-file comparisons and 22 plan-comparison diagnostics still
pass (`pytest examples/tpch/_tests.py`: 44 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reference SQL embedded in each q01..q22 module docstring was carried
over verbatim from ``benchmarks/tpch/queries/`` and uses a different set
of TPC-H substitution parameters than the DataFrame examples
(answer-file-validated at scale factor 1). Update each reference SQL to
use the substitution parameters the DataFrame uses, so both expressions
describe the same query and would produce the same results against the
same data.

Constants aligned:

- Q01: ``90 days`` cutoff (DataFrame ``DAYS_BEFORE_FINAL = 90``).
- Q02: ``p_size = 15``, ``p_type like '%BRASS'``, ``r_name = 'EUROPE'``.
- Q04: base date ``1993-07-01`` (``3 month`` interval preserved per the
  "quarter of a year" wording).
- Q05: ``r_name = 'ASIA'``.
- Q06: ``l_discount between 0.06 - 0.01 and 0.06 + 0.01``.
- Q07: nations ``'FRANCE'`` / ``'GERMANY'``.
- Q08: ``r_name = 'AMERICA'``, ``p_type = 'ECONOMY ANODIZED STEEL'``,
  inner-case ``nation = 'BRAZIL'``.
- Q09: ``p_name like '%green%'``.
- Q10: base date ``1993-10-01`` (``3 month`` interval preserved).
- Q11: ``n_name = 'GERMANY'``.
- Q12: ship modes ``('MAIL', 'SHIP')``, base date ``1994-01-01``.
- Q13: ``o_comment not like '%special%requests%'``.
- Q14: base date ``1995-09-01``.
- Q15: base date ``1996-01-01``.
- Q16: ``p_brand <> 'Brand#45'``, ``p_type not like 'MEDIUM POLISHED%'``,
  sizes ``(49, 14, 23, 45, 19, 3, 36, 9)``.
- Q17: ``p_brand = 'Brand#23'``, ``p_container = 'MED BOX'``.
- Q18: ``sum(l_quantity) > 300``.
- Q19: brands ``Brand#12`` / ``Brand#23`` / ``Brand#34`` with the matching
  minimum quantities (1, 10, 20).
- Q20: ``p_name like 'forest%'``, base date ``1994-01-01``,
  ``n_name = 'CANADA'``.
- Q21: ``n_name = 'SAUDI ARABIA'``.
- Q22: country codes ``('13', '31', '23', '29', '30', '18', '17')``.

Interval units (month / year) are preserved where the problem-statement
text reads "given quarter", "given year", "given month". Q01 keeps the
literal "days" unit because the TPC-H problem statement itself describes
the cutoff in days.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep every q01..q22 example for idiomatic DataFrame style as described in
the repo-root SKILL.md:

- ``col("x") == "s"`` in place of ``col("x") == lit("s")`` on comparison
  right-hand sides (auto-wrap applies).
- Plain-name strings in ``select``/``aggregate``/``sort`` group/sort key
  lists when the key is a bare column.
- Drop redundant ``how="inner"`` and single-element ``left_on``/``right_on``
  list wrapping on equi-joins.
- Collapse chained ``.filter(a).filter(b)`` runs into ``.filter(a, b)``
  and chained ``.with_column`` runs into ``.with_columns(a=..., b=...)``.
- ``df.sort_by(...)`` or plain-name ``df.sort(...)`` when no null-placement
  override is needed.
- ``F.count_star()`` in place of ``F.count(col("x"))`` whenever the SQL
  reads ``count(*)``.
- ``F.starts_with(col, lit(prefix))`` and ``~F.starts_with(...)`` in place
  of substring-prefix equality/inequality tricks.
- ``F.in_list(col, [lit(...)])`` in place of ``~F.array_position(...).
  is_null()`` and in place of disjunctions of equality comparisons.
- Searched ``F.when(cond, x).otherwise(y)`` in place of switched
  ``F.case(bool_expr).when(lit(True/False), x).end()`` forms.
- Semi-joins as the DataFrame form of ``EXISTS`` (Q04); anti-joins as
  ``NOT EXISTS`` (Q22 was already using this idiom).
- Whole-frame window aggregates as the DataFrame stand-in for a SQL
  scalar subquery (Q11/Q15/Q17/Q22).

Individual query fixes of note:

- Q16 — add the secondary sort keys (``p_brand``, ``p_type``, ``p_size``)
  that the TPC-H spec requires but the original DataFrame omitted.
- Q22 — drop a stray ``df.show()`` mid-pipeline; replace the 0-based
  substring slice with ``F.left(col("c_phone"), lit(2))``.
- Q14 — rewrite the promo/non-promo factor split as a searched CASE inside
  ``F.sum(...)`` so the DataFrame expression matches the reference SQL
  shape exactly.

All 22 answer-file comparisons still pass at scale factor 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer marked this pull request as draft April 24, 2026 13:36
@timsaucer timsaucer marked this pull request as ready for review April 24, 2026 13:55
…ndling

Additional sweep of the TPC-H DataFrame examples informed by comparing
against a fresh set of SKILL.md-only generations under
``examples/tpch/agentic_queries/``:

- Q02: ``F.ends_with(col("p_type"), lit(TYPE_OF_INTEREST))`` in place of
  ``F.strpos(col, lit) > 0``. The reference SQL is ``p_type like '%BRASS'``,
  which is an ends_with check, not contains. ``F.strpos > 0`` returned the
  correct rows on TPC-H data by coincidence but is semantically wrong.
- Q09: ``F.contains(col("p_name"), lit(part_color))`` in place of
  ``F.strpos(col, lit) > 0``. The SQL is ``p_name like '%green%'``.
- Q08, Q12, Q14: use the ``filter`` keyword on ``F.sum`` / ``F.count`` —
  the DataFrame form of SQL ``sum(...) FILTER (WHERE ...)`` — instead of
  wrapping the aggregate input in ``F.when(cond, x).otherwise(0)``. Q08
  also reorganises to inner-join the supplier's nation onto the regional
  sales, which removes the previous left-join + ``F.when(is_not_null, ...)``
  dance.
- Q15: compute the grand maximum revenue as a separate scalar aggregate
  and ``join_on(...)`` on equality, instead of the whole-frame window
  ``F.max`` + filter shape. Simpler plan, same result.
- Q16: ``F.regexp_like(col, pattern)`` in place of
  ``F.regexp_match(col, pattern).is_not_null()``.
- Q04, Q05, Q06, Q07, Q08, Q10, Q12, Q14, Q15, Q20: store both the start
  and the end of the date window as plain ``datetime.date`` objects and
  compare with ``lit(end_date)``, instead of carrying the start date +
  ``pa.month_day_nano_interval`` and adding them at query-build time.
  Drops unused ``pyarrow`` imports from the files that no longer need
  Arrow scalars.

All 22 answer-file comparisons still pass at scale factor 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer
Copy link
Copy Markdown
Member Author

Since this is just examples, I'm not going to bother anyone for a review. These were generated by an agent and every one reviewed by me.

@timsaucer timsaucer merged commit 0357716 into apache:main Apr 24, 2026
21 checks passed
@timsaucer timsaucer deleted the feat/tpch-idiomatic-rewrites branch April 24, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant