tpch examples: rewrite queries idiomatically and embed reference SQL #1504
timsaucer merged 5 commits into apache:main on Apr 24, 2026
- Append the canonical TPC-H reference SQL (from benchmarks/tpch/queries/)
to each q01..q22 module docstring so readers can compare the DataFrame
translation against the SQL at a glance.
- Fix Q20: `df = df.filter(col("ps_availqty") > lit(0.5) * col("total_sold"))`
was missing the assignment so the filter was dropped from the pipeline.
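The Q20 bug is easy to reproduce in miniature, because DataFrame-style methods return a new object instead of mutating the receiver. The sketch below is a plain-Python stand-in, not DataFusion code: the `Frame` class and its `filter` method are hypothetical and exist only to model that behavior.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Frame:
    """Hypothetical immutable stand-in for a lazy DataFrame."""

    rows: tuple

    def filter(self, predicate: Callable[[dict], bool]) -> "Frame":
        # Returns a new Frame; the receiver is left untouched.
        return Frame(tuple(r for r in self.rows if predicate(r)))


def excess(row: dict) -> bool:
    # Same shape as the Q20 predicate: ps_availqty > 0.5 * total_sold
    return row["ps_availqty"] > 0.5 * row["total_sold"]


rows = (
    {"ps_availqty": 10, "total_sold": 100},  # 10 > 50 is false
    {"ps_availqty": 80, "total_sold": 100},  # 80 > 50 is true
)

buggy = Frame(rows)
buggy.filter(excess)  # result discarded: the filter never takes effect

fixed = Frame(rows)
fixed = fixed.filter(excess)  # result rebound: the filter joins the pipeline

print(len(buggy.rows), len(fixed.rows))
```

Because the discarded call raises no error, this kind of mistake only shows up when the results are compared against a known answer.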
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite the seven TPC-H example queries that did not demonstrate the
idiomatic DataFrame pattern. The queries that use window functions in
place of correlated subqueries (Q02/Q11/Q15/Q17/Q22) are already
idiomatic and are left unchanged.
- Q04: replace `.aggregate([col("l_orderkey")], [])` with
`.select("l_orderkey").distinct()`, which is the natural way to express
"reduce to one row per order" on a DataFrame.
- Q07: remove the CASE-as-filter on `n_name` and use
`F.in_list(col("n_name"), [nation_1, nation_2])` instead. Drops a
comment block that admitted the filter form was simpler.
- Q08: rewrite the switched CASE `F.case(...).when(lit(False), ...)` as a
searched `F.when(col(...).is_not_null(), ...).otherwise(...)`. That
mirrors the reference SQL's `case when ... then ... else 0 end` shape.
- Q12: replace `array_position(make_array(...), col)` with
`F.in_list(col("l_shipmode"), [...])`. Same semantics, without routing
through array construction / array search.
- Q19: remove the pyarrow UDF that re-implemented a disjunctive predicate
in Python. Build the same predicate in DataFusion by OR-combining one
`in_list` + range-filter expression per brand. Keeps the per-brand
constants in the existing `items_of_interest` dict.
- Q20: use `F.starts_with` instead of an explicit substring slice. Replace
the inner-join + `select(...).distinct()` tail with a semi join against
a precomputed set of excess-quantity suppliers so the supplier columns
are preserved without deduplication after the fact.
- Q21: replace the `array_agg` / `array_length` / `array_element` pipeline
with two semi joins. One semi join keeps orders with more than one
distinct supplier (stand-in for the reference SQL's `exists` subquery),
the other keeps orders with exactly one late supplier (stand-in for the
`not exists` subquery).
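The Q19 change (one predicate per brand, OR-combined) can be modelled in plain Python. This is a sketch of the pattern, not the example's code: the layout of `items_of_interest` is assumed, the helper names are hypothetical, and only the brand, container and quantity conditions are modelled. The brands and minimum quantities are the ones this PR uses; the container lists and the 10-unit quantity window follow the TPC-H Q19 text.

```python
from functools import reduce
from typing import Callable

Predicate = Callable[[dict], bool]

# Assumed layout; the real items_of_interest dict in the example may differ.
items_of_interest = {
    "Brand#12": {
        "min_quantity": 1,
        "containers": ["SM CASE", "SM BOX", "SM PACK", "SM PKG"],
    },
    "Brand#23": {
        "min_quantity": 10,
        "containers": ["MED BAG", "MED BOX", "MED PKG", "MED PACK"],
    },
    "Brand#34": {
        "min_quantity": 20,
        "containers": ["LG CASE", "LG BOX", "LG PACK", "LG PKG"],
    },
}


def brand_predicate(brand: str, params: dict) -> Predicate:
    """Build the per-brand condition: brand match, container list, quantity range."""
    low = params["min_quantity"]
    high = low + 10
    containers = set(params["containers"])

    def pred(row: dict) -> bool:
        return (
            row["p_brand"] == brand
            and row["p_container"] in containers
            and low <= row["l_quantity"] <= high
        )

    return pred


def either(left: Predicate, right: Predicate) -> Predicate:
    # OR-combine two predicates into one.
    return lambda row: left(row) or right(row)


# Fold the per-brand predicates into a single disjunction.
q19_predicate = reduce(
    either,
    (brand_predicate(b, p) for b, p in items_of_interest.items()),
)

sample = {"p_brand": "Brand#12", "p_container": "SM BOX", "l_quantity": 5}
print(q19_predicate(sample))
```

In the DataFrame version the same fold would combine expressions rather than Python callables, so the whole predicate stays inside the query engine instead of running row by row in a UDF.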
All 22 answer-file comparisons and 22 plan-comparison diagnostics still
pass (`pytest examples/tpch/_tests.py`: 44 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reference SQL embedded in each q01..q22 module docstring was carried
over verbatim from ``benchmarks/tpch/queries/`` and uses a different set
of TPC-H substitution parameters than the DataFrame examples
(answer-file-validated at scale factor 1). Update each reference SQL to
use the substitution parameters the DataFrame uses, so both expressions
describe the same query and would produce the same results against the
same data.
Constants aligned:
- Q01: ``90 days`` cutoff (DataFrame ``DAYS_BEFORE_FINAL = 90``).
- Q02: ``p_size = 15``, ``p_type like '%BRASS'``, ``r_name = 'EUROPE'``.
- Q04: base date ``1993-07-01`` (``3 month`` interval preserved per the
"quarter of a year" wording).
- Q05: ``r_name = 'ASIA'``.
- Q06: ``l_discount between 0.06 - 0.01 and 0.06 + 0.01``.
- Q07: nations ``'FRANCE'`` / ``'GERMANY'``.
- Q08: ``r_name = 'AMERICA'``, ``p_type = 'ECONOMY ANODIZED STEEL'``,
inner-case ``nation = 'BRAZIL'``.
- Q09: ``p_name like '%green%'``.
- Q10: base date ``1993-10-01`` (``3 month`` interval preserved).
- Q11: ``n_name = 'GERMANY'``.
- Q12: ship modes ``('MAIL', 'SHIP')``, base date ``1994-01-01``.
- Q13: ``o_comment not like '%special%requests%'``.
- Q14: base date ``1995-09-01``.
- Q15: base date ``1996-01-01``.
- Q16: ``p_brand <> 'Brand#45'``, ``p_type not like 'MEDIUM POLISHED%'``,
sizes ``(49, 14, 23, 45, 19, 3, 36, 9)``.
- Q17: ``p_brand = 'Brand#23'``, ``p_container = 'MED BOX'``.
- Q18: ``sum(l_quantity) > 300``.
- Q19: brands ``Brand#12`` / ``Brand#23`` / ``Brand#34`` with the matching
minimum quantities (1, 10, 20).
- Q20: ``p_name like 'forest%'``, base date ``1994-01-01``,
``n_name = 'CANADA'``.
- Q21: ``n_name = 'SAUDI ARABIA'``.
- Q22: country codes ``('13', '31', '23', '29', '30', '18', '17')``.
Interval units (month / year) are preserved where the problem-statement
text reads "given quarter", "given year", "given month". Q01 keeps the
literal "days" unit because the TPC-H problem statement itself describes
the cutoff in days.
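For the month-based windows listed above, the end of the window is the base date advanced by whole calendar months. A minimal standard-library sketch of that arithmetic follows. `add_months` is a hypothetical helper (not part of DataFusion or the examples), and the half-open `start <= d < end` comparison is assumed from the usual shape of the TPC-H reference queries.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Advance a date by whole calendar months, clamping the day if needed."""
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# Q04: base date 1993-07-01, three-month window.
q04_start = date(1993, 7, 1)
q04_end = add_months(q04_start, 3)

# Q10: base date 1993-10-01, three-month window (crosses a year boundary).
q10_start = date(1993, 10, 1)
q10_end = add_months(q10_start, 3)


def in_window(d: date, start: date, end: date) -> bool:
    # Half-open interval: the end date itself is excluded.
    return start <= d < end


print(q04_end, q10_end)
```

A month interval is not a fixed number of days, which is why the unit matters: the Q04 window spans 92 days, whereas Q01's cutoff is a literal day count.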
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep every q01..q22 example for idiomatic DataFrame style as described in
the repo-root SKILL.md:
- ``col("x") == "s"`` in place of ``col("x") == lit("s")`` on comparison
right-hand sides (auto-wrap applies).
- Plain-name strings in ``select``/``aggregate``/``sort`` group/sort key
lists when the key is a bare column.
- Drop redundant ``how="inner"`` and single-element ``left_on``/``right_on``
list wrapping on equi-joins.
- Collapse chained ``.filter(a).filter(b)`` runs into ``.filter(a, b)``
and chained ``.with_column`` runs into ``.with_columns(a=..., b=...)``.
- ``df.sort_by(...)`` or plain-name ``df.sort(...)`` when no null-placement
override is needed.
- ``F.count_star()`` in place of ``F.count(col("x"))`` whenever the SQL
reads ``count(*)``.
- ``F.starts_with(col, lit(prefix))`` and ``~F.starts_with(...)`` in place
of substring-prefix equality/inequality tricks.
- ``F.in_list(col, [lit(...)])`` in place of
``~F.array_position(...).is_null()`` and in place of disjunctions of
equality comparisons.
- Searched ``F.when(cond, x).otherwise(y)`` in place of switched
``F.case(bool_expr).when(lit(True/False), x).end()`` forms.
- Semi-joins as the DataFrame form of ``EXISTS`` (Q04); anti-joins as
``NOT EXISTS`` (Q22 was already using this idiom).
- Whole-frame window aggregates as the DataFrame stand-in for a SQL
scalar subquery (Q11/Q15/Q17/Q22).
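The correspondence between semi/anti joins and ``EXISTS`` / ``NOT EXISTS`` can be illustrated with a plain-Python model. The helpers `semi_join` and `anti_join` below are hypothetical and only mimic the row-selection semantics; they are not DataFusion calls, and the sample rows are made up.

```python
orders = [{"o_orderkey": 1}, {"o_orderkey": 2}, {"o_orderkey": 3}]

# Order 1 matches two line items, order 3 matches one, order 2 matches none.
late_lineitems = [{"l_orderkey": 1}, {"l_orderkey": 1}, {"l_orderkey": 3}]


def semi_join(left, right, left_key, right_key):
    """Keep each left row at most once if any right row matches (EXISTS)."""
    keys = {r[right_key] for r in right}
    return [row for row in left if row[left_key] in keys]


def anti_join(left, right, left_key, right_key):
    """Keep each left row only if no right row matches (NOT EXISTS)."""
    keys = {r[right_key] for r in right}
    return [row for row in left if row[left_key] not in keys]


# An inner join, by contrast, emits one row per matching pair.
inner = [
    (o, l)
    for o in orders
    for l in late_lineitems
    if o["o_orderkey"] == l["l_orderkey"]
]

exists_rows = semi_join(orders, late_lineitems, "o_orderkey", "l_orderkey")
not_exists_rows = anti_join(orders, late_lineitems, "o_orderkey", "l_orderkey")

print(len(inner), exists_rows, not_exists_rows)
```

The inner join yields three rows here because order 1 matches twice, so it would need a follow-up `distinct()`; the semi join returns each qualifying order once and keeps only the left-hand columns.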
Individual query fixes of note:
- Q16 — add the secondary sort keys (``p_brand``, ``p_type``, ``p_size``)
that the TPC-H spec requires but the original DataFrame omitted.
- Q22 — drop a stray ``df.show()`` mid-pipeline; replace the 0-based
substring slice with ``F.left(col("c_phone"), lit(2))``.
- Q14 — rewrite the promo/non-promo factor split as a searched CASE inside
``F.sum(...)`` so the DataFrame expression matches the reference SQL
shape exactly.
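Why the Q16 secondary sort keys matter can be shown with a small plain-Python model. The rows below are made up for illustration; the ordering (supplier count descending, then ``p_brand``, ``p_type``, ``p_size``) is the one the TPC-H Q16 text specifies.

```python
rows = [
    {"p_brand": "Brand#11", "p_type": "PROMO BRUSHED TIN", "p_size": 9, "supplier_cnt": 4},
    {"p_brand": "Brand#11", "p_type": "ECONOMY PLATED BRASS", "p_size": 14, "supplier_cnt": 4},
    {"p_brand": "Brand#41", "p_type": "SMALL POLISHED STEEL", "p_size": 3, "supplier_cnt": 8},
    {"p_brand": "Brand#11", "p_type": "ECONOMY PLATED BRASS", "p_size": 3, "supplier_cnt": 4},
]

# Primary key descending, then the three tie-breakers ascending.
ordered = sorted(
    rows,
    key=lambda r: (-r["supplier_cnt"], r["p_brand"], r["p_type"], r["p_size"]),
)

for r in ordered:
    print(r["supplier_cnt"], r["p_brand"], r["p_type"], r["p_size"])
```

Three of the four rows tie on ``supplier_cnt``, so sorting on that key alone would leave their relative order up to the engine; the tie-breakers make the output deterministic.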
All 22 answer-file comparisons still pass at scale factor 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndling
Additional sweep of the TPC-H DataFrame examples informed by comparing
against a fresh set of SKILL.md-only generations under
``examples/tpch/agentic_queries/``:
- Q02: ``F.ends_with(col("p_type"), lit(TYPE_OF_INTEREST))`` in place of
``F.strpos(col, lit) > 0``. The reference SQL is ``p_type like '%BRASS'``,
which is an ends_with check, not contains. ``F.strpos > 0`` returned the
correct rows on TPC-H data by coincidence but is semantically wrong.
- Q09: ``F.contains(col("p_name"), lit(part_color))`` in place of
``F.strpos(col, lit) > 0``. The SQL is ``p_name like '%green%'``.
- Q08, Q12, Q14: use the ``filter`` keyword on ``F.sum`` / ``F.count`` —
the DataFrame form of SQL ``sum(...) FILTER (WHERE ...)`` — instead of
wrapping the aggregate input in ``F.when(cond, x).otherwise(0)``. Q08
also reorganises to inner-join the supplier's nation onto the regional
sales, which removes the previous left-join + ``F.when(is_not_null, ...)``
dance.
- Q15: compute the grand maximum revenue as a separate scalar aggregate
and ``join_on(...)`` on equality, instead of the whole-frame window
``F.max`` + filter shape. Simpler plan, same result.
- Q16: ``F.regexp_like(col, pattern)`` in place of
``F.regexp_match(col, pattern).is_not_null()``.
- Q04, Q05, Q06, Q07, Q08, Q10, Q12, Q14, Q15, Q20: store both the start
and the end of the date window as plain ``datetime.date`` objects and
compare with ``lit(end_date)``, instead of carrying the start date +
``pa.month_day_nano_interval`` and adding them at query-build time.
Drops unused ``pyarrow`` imports from the files that no longer need
Arrow scalars.
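The difference between the ``strpos(...) > 0`` check and the ``LIKE`` patterns in Q02 and Q09 can be modelled with plain Python strings. This is a sketch of the semantics only: `strpos` below is a hypothetical stand-in for the SQL function (1-based position, 0 when absent), and the second sample value is made up rather than a real TPC-H part type.

```python
def strpos(haystack: str, needle: str) -> int:
    """Model of SQL strpos: 1-based index of the first match, 0 if absent."""
    return haystack.find(needle) + 1


real_type = "PROMO BURNISHED BRASS"  # BRASS is the final word
made_up_type = "BRASS PLATED TIN"    # hypothetical value with BRASS up front

for p_type in (real_type, made_up_type):
    contains = strpos(p_type, "BRASS") > 0   # behaves like '%BRASS%'
    ends_with = p_type.endswith("BRASS")     # behaves like '%BRASS'
    print(p_type, contains, ends_with)

# Q09's pattern '%green%' really is a contains check.
print("green" in "forest green khaki")
```

The two checks agree on the first value and disagree on the second, so a position test matches ``'%BRASS'`` only for data in which the word never appears anywhere but the end of the string.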
All 22 answer-file comparisons still pass at scale factor 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Since this is just examples, I'm not going to bother anyone for a review. These were generated by an agent and every one was reviewed by me.
Which issue does this PR close?
Relates to #1394.
Rationale for this change
The TPC-H examples under `examples/tpch/` serve as the canonical hands-on reference for how to write DataFusion Python DataFrame code. Before this PR:
- Q20 had a bug where a filter was computed and discarded (`df.filter(...)` without assignment).
- Several queries used non-idiomatic constructs (switched CASE on boolean subjects, `array_position(make_array(...))` in place of `in_list`, 0-based substring tricks, a pyarrow UDF re-implementing a disjunctive predicate, `aggregate([col], [])` in place of `distinct()`, etc.).
- The reference SQL was not embedded in the files, so readers had to cross-reference `benchmarks/tpch/queries/` to see the intended query.
- Where reference SQL was embedded, it used different TPC-H substitution parameters than the DataFrame code, so the two expressions described different queries.
What changes are included in this PR?
Four commits, grouped by concern:
- `tpch examples: add reference SQL to each query, fix Q20`: append the canonical TPC-H reference SQL to each `q01..q22` module docstring; fix the missing assignment on Q20's excess-quantity filter.
- `tpch examples: rewrite non-idiomatic queries in idiomatic DataFrame form`: rewrite Q04, Q07, Q08, Q12, Q19, Q20, Q21 using the DataFrame-native pattern (semi/anti joins for EXISTS/NOT EXISTS, searched `F.when` for `CASE WHEN`, `F.in_list` for `IN`, compound predicates in place of a pyarrow UDF, etc.).
- `tpch examples: align reference SQL constants with DataFrame queries`: update the embedded SQL in 21 of 22 docstrings so the substitution parameters match the DataFrame code (which is validated at scale factor 1 against `answers_sf1/`). Interval units (month, year) are preserved where the problem-statement text reads "quarter", "year", or "month".
- `tpch examples: apply SKILL.md idioms across all 22 queries`: sweep all 22 queries for `SKILL.md` idioms: auto-wrap on comparison RHS, plain-name group/sort keys, drop `how="inner"`, collapse chained `.filter()` calls, `F.count_star()` for SQL `count(*)`, `F.starts_with` / `F.in_list` / searched `F.when`. Q16 also picks up the secondary sort keys (`p_brand`, `p_type`, `p_size`) that the TPC-H spec requires but the original DataFrame omitted.
All 22 answer-file comparisons under `examples/tpch/_tests.py` pass.
Are there any user-facing changes?
No public API changes. The `examples/tpch/` directory is a teaching aid shipped in the source tree, not in the wheel, so the visible effect is limited to readers of the examples.