diff --git a/.claude/skills/add-cel-feature/SKILL.md b/.claude/skills/add-cel-feature/SKILL.md new file mode 100644 index 0000000..2e09431 --- /dev/null +++ b/.claude/skills/add-cel-feature/SKILL.md @@ -0,0 +1,115 @@ +--- +name: add-cel-feature +description: Adds a new CEL function, operator, macro, or comprehension form to pycel2sql by extending Converter visit methods in src/pycel2sql/_converter.py, deciding whether a new Dialect ABC abstract method is needed, implementing it across all six dialects when SQL diverges, and adding parametrized test cases. Use when wiring a new CEL string method (toLowerCase, isAlpha), timestamp method (toISOString), free-function helper, operator overload, or comprehension form across PostgreSQL, MySQL, SQLite, DuckDB, BigQuery, and Apache Spark. +--- + +# Add CEL Feature + +Adding a new CEL surface area (function, operator, macro, comprehension) is the second-most-common contribution shape after adding a dialect. The work is the same shape every time — this skill captures it. + +Recent examples in pycel2sql's history: per-dialect `format()` dispatch (PR #8), `validate_schema` parameter (PR #1), `json_variables` / `column_aliases` / `param_start_index` options (PR #8). Each followed the pattern below. + +## Quick start + +1. **Classify the feature** — function call (e.g. `string.lowerAscii()`), operator (e.g. `&&`), macro (e.g. `has(x.y)`), or comprehension (e.g. `list.exists(v, pred)`). +2. **Identify the visitor** in `_converter.py` that owns the surface area — see [references/converter-file-map.md](references/converter-file-map.md). +3. **Decide whether a new `Dialect` ABC method is needed** — see "New method or inline?" below. +4. **If a new method is needed**, walk the dialect-method checklist at `.claude/skills/add-sql-dialect/references/dialect-method-checklist.md` (shared between both skills — open it directly). +5. **Add a test case** to the appropriate `tests/test_*.py` with parametrized coverage across `ALL_DIALECTS` from `tests/conftest.py`. +6. **Run** `uv run pytest tests/ --ignore=tests/integration -v` and `uv run ruff check src/ tests/`. + +## New method or inline? + +The `Dialect` ABC at `src/pycel2sql/dialect/_base.py` is already large (~50 abstract methods). Add a new method only when needed; otherwise inline the SQL in the visitor. + +| Situation | Verdict | +|---|---| +| The SQL is identical across all six dialects | Inline in the visitor (e.g. `LOWER(x)` for `lowerAscii`, `MOD(a, b)` for `%`). | +| Any dialect needs different syntax | New `Dialect` ABC `@abstractmethod`. | +| A subset of dialects can't support the feature at all | New method + each unsupported dialect raises `UnsupportedDialectFeatureError` in its implementation. | +| The feature is JSON-related, regex-related, or array-related | Look for an existing `write_…` method first — JSON / array / regex method coverage is broad. | + +If you add an `@abstractmethod`, the **`Dialect` ABC enforces it at instantiation time** — calling `()` with any abstract method missing raises `TypeError: Can't instantiate abstract class`. CI's `tests/conftest.py` instantiates every dialect in `ALL_DIALECTS`, so a missing method is caught on the first test collection. mypy strict mode also flags missing methods. + +## Where features live + +`_converter.py` is large (~80 KB / ~2000 lines) and uses Lark `Interpreter` visitor methods. The grammar rule names map to method names — `member_dot_arg` for method calls (`obj.method(args)`), `ident_arg` for free-function calls (`func(args)`), `addition_add` for `+`, `relation_eq` for `==`, etc. + +For a method-call feature like `string.lowerAscii()`: +- Entry point is `member_dot_arg` (`_converter.py:607`). +- The method name dispatches via the long if/elif chain at lines 619–691. +- `LOWER(...)` is universal SQL → inline (no new `Dialect` method). + +For a method-call feature like `string.format([args])`: +- Entry point is also `member_dot_arg`. +- SQL diverges — Postgres/BigQuery use `FORMAT`, SQLite/DuckDB use `printf`, Spark uses `format_string`, MySQL has no equivalent → new `Dialect.write_format()` method (added in PR #8). Each dialect implements its own form; MySQL raises. + +For a free-function feature like `has(x.y)`: +- Entry point is `ident_arg` (`_converter.py:757`). +- Dispatch is via the `func_name` if/elif at lines 763–820. +- `has()` is dialect-divergent (Postgres JSONB `?` operator vs others' `IS NOT NULL`) → uses `Dialect.write_json_existence`. + +For an operator like `+`: +- Entry point is `addition` (`_converter.py:415`). +- Operator-name dispatch via `op_name` (`addition_add` / `addition_sub`). +- Cross-dialect handling routes through `Dialect.write_string_concat` for string contexts, plain `+` for numeric. + +For a macro like `exists` / `all` / `filter` / `map`: +- Entry point is `member_dot_arg`, but the chain immediately routes to `_visit_comprehension` (`_converter.py:615`). +- Comprehensions use the `Dialect` `write_unnest` / `write_array_subquery_open` / `write_array_subquery_expr_close` triple plus per-method scaffolding. + +For the file-by-file map of which CEL surface area each section of `_converter.py` owns, see [references/converter-file-map.md](references/converter-file-map.md). + +## Parser-level vs visitor-level + +CEL features that pycel2sql can support are limited by what `cel-python`'s Lark grammar parses. If the grammar doesn't recognise a syntax (e.g. some CEL-extension functions), the feature can't be added at the visitor level — `_parser.parse(cel_expr)` will fail before reaching the converter. Test the parse first: + +```python +from celpy.celparser import CELParser +CELParser().parse('"hello".isAlpha()') # raises CELParseError if unsupported +``` + +If parsing fails, the feature requires upstream work in `cel-python`, not in pycel2sql. + +## Test coverage + +Add a representative test to the appropriate file in `tests/`: + +| CEL surface | Test file | +|---|---| +| String methods | `tests/test_string_functions.py` | +| Timestamps / durations | `tests/test_timestamps.py` | +| JSON access / `has()` | `tests/test_json.py` | +| Comprehensions (`exists`, `all`, `filter`, `map`) | `tests/test_comprehensions.py` | +| Type casts (`int`, `string`, `bool`, etc.) | `tests/test_convert.py` (general) or a dialect-specific file | +| Universal SQL (operators, comparisons) | `tests/test_dialect_parametrized.py` | +| Parameterized output | `tests/test_parameterized.py` | +| New conversion options | `tests/test_options.py` | + +When the SQL is identical across dialects, parametrize over `ALL_DIALECTS` from `tests/conftest.py`: + +```python +import pytest +from tests.conftest import ALL_DIALECTS + +@pytest.mark.parametrize("dialect", ALL_DIALECTS) +def test_lower_ascii(dialect): + assert convert("name.lowerAscii()", dialect=dialect) == "LOWER(name)" +``` + +When the SQL diverges, write per-dialect cases (mirror `tests/test_string_functions.py::TestFormatPerDialect`). + +## Verification + +```bash +uv run ruff check src/ tests/ +uv run mypy src/pycel2sql/ # pre-existing lark errors expected +uv run pytest tests/ --ignore=tests/integration -v + +python .claude/skills/skill-authoring/scripts/lint_skill.py .claude/skills/add-cel-feature/ +``` + +## References + +- [references/converter-file-map.md](references/converter-file-map.md) — which section of `_converter.py` owns which CEL surface area, with line-number anchors. +- The `Dialect` ABC method checklist lives in the `add-sql-dialect` skill at `.claude/skills/add-sql-dialect/references/dialect-method-checklist.md` — when you add a new `@abstractmethod`, open and update that file too. diff --git a/.claude/skills/add-cel-feature/references/converter-file-map.md b/.claude/skills/add-cel-feature/references/converter-file-map.md new file mode 100644 index 0000000..beb4017 --- /dev/null +++ b/.claude/skills/add-cel-feature/references/converter-file-map.md @@ -0,0 +1,174 @@ +# Converter File Map + +A map of which CEL surface area each section of `src/pycel2sql/_converter.py` owns. Line numbers are approximate (the file evolves) — search by method name to find the current location. + +## Contents + +- Top-level entry +- Logical operators +- Comparison operators +- `in` operator +- Arithmetic +- Unary +- Member access (dot / index / dot-arg) +- Identifiers and free-function calls +- Literals +- Compound literals (lists, maps) +- String methods (contains, startsWith, endsWith, etc.) +- Size, has, type cast +- Timestamp / duration +- Comprehensions + +## Top-level entry + +| Method | Owns | +|---|---| +| `expr` | `?:` ternary; otherwise pass-through to single child. | +| `visit` | Override of Lark `Interpreter.visit` to handle bare `Token` children. | +| `_visit_child` | Depth-tracking visitor; every internal recursion goes through this. | + +## Logical operators + +| Method | Owns | +|---|---| +| `conditionalor` | `lhs \|\| rhs` → `lhs OR rhs`. | +| `conditionaland` | `lhs && rhs` → `lhs AND rhs`. | + +## Comparison operators + +The `relation` method dispatches to operator-specific helpers. Each `relation_` method handles null-aware emission (`IS NULL` / `IS NOT NULL` / `IS TRUE` / etc.) and JSON-text-extraction numeric coercion. + +| Method | Owns | +|---|---| +| `relation` | Dispatches to per-operator helpers. | +| `relation_eq`, `relation_ne` | `==`, `!=`. Handles null/bool comparisons specially. | +| `relation_lt`, `relation_le`, `relation_gt`, `relation_ge` | `<`, `<=`, `>`, `>=`. | +| `relation_in` | `x in [list]` / `x in arr`; routes through `_visit_in` and `Dialect.write_array_membership`. | + +## `in` operator + +| Method | Owns | +|---|---| +| `_visit_in` | Routes `x in arr` to `Dialect.write_array_membership`. Currently does not yet route JSON-array membership through the dedicated `Dialect.write_json_array_membership` method (those abstract methods exist in the ABC but no call site invokes them yet). | + +## Arithmetic + +| Method | Owns | +|---|---| +| `addition` | `+`, `-` dispatch. **Critical**: timestamp-arithmetic check happens BEFORE string-concat check (temporal functions contain string literals; CLAUDE.md notes this). | +| `addition_add`, `addition_sub` | Trivial pass-throughs (the work is in `addition`). | +| `multiplication` | `*`, `/`, `%` dispatch. `%` becomes `MOD(a, b)` (universal SQL). | +| `multiplication_mul`, `multiplication_div`, `multiplication_mod` | Trivial pass-throughs. | + +## Unary + +| Method | Owns | +|---|---| +| `unary`, `unary_not`, `unary_neg` | `!x` → `NOT x`; `-x` → `-x`. | + +## Member access + +The most-trafficked area. Three entry points: + +| Method | Owns | +|---|---| +| `member` | Pass-through. | +| `member_dot` | `obj.field` access. Routes through `_emit_json_path` for JSON-path access (schema-declared JSON or `json_variables`); otherwise emits plain `.field`. Applies `validate_field_name`. | +| `member_dot_arg` | `obj.method(args)` — the **method-call dispatcher**. The long if/elif chain checks `method_name` and routes to the appropriate `_visit_` helper. New string/timestamp/comprehension methods land here. | +| `member_index` | `obj[idx]` and `obj["key"]`. String literal index → JSON path (when root is a `json_variable`) or `.key` plain access. Integer index → `Dialect.write_list_index_const`. Dynamic index → `Dialect.write_list_index`. | +| `member_object` | Type construction `T{f: v}` (rare). | + +When adding a new CEL **method** like `string.toLowerCase()`, the entry point is the `member_dot_arg` if/elif chain. + +## Identifiers and free-function calls + +| Method | Owns | +|---|---| +| `primary` | Pass-through. | +| `ident` | Bare identifier emission. Applies `column_aliases` (PR #8) before `validate_field_name`. | +| `ident_arg` | `func(args)` — the **free-function-call dispatcher**. Long if/elif chain on `func_name`. Owns `has`, `size`, `matches`, `int`/`uint`/`double`/`string`/`bool`/`bytes`/`timestamp`/`duration` casts, `now()`, `getInterval()`, `dyn()`. | +| `dot_ident_arg`, `dot_ident` | `.name` (rare CEL syntax). | +| `paren_expr` | `(expr)` parenthesisation. | + +When adding a new CEL **free function** like `format(...)` or `has(...)`, the entry point is the `ident_arg` if/elif chain. + +## Literals + +| Method | Owns | +|---|---| +| `literal` | All scalar literals — strings, ints, uints, floats, bools, null, bytes. Routes through dialect-specific `write_string_literal` / `write_bytes_literal`. Parameterized mode replaces literals with placeholders via `_add_param`. | + +## Compound literals + +| Method | Owns | +|---|---| +| `list_lit` | `[1, 2, 3]` array literals — calls `Dialect.write_array_literal_open` / `write_array_literal_close`. | +| `map_lit` | `{a: 1, b: 2}` map/struct literals — calls `Dialect.write_struct_open` / `write_struct_close`. | +| `exprlist` | Comma-separated expression list (used inside list/map/format args). | +| `mapinits` | Map initialiser list. | +| `fieldinits` | Struct initialiser list. | + +## String methods + +Each method lives in a `_visit_` helper. Most route through a `Dialect.write_` method to absorb dialect divergence. + +| Helper | CEL surface | Notes | +|---|---|---| +| `_visit_contains` | `s.contains(needle)` | → `Dialect.write_contains`. | +| `_visit_starts_with` | `s.startsWith(prefix)` | → `LIKE 'prefix%'` + dialect-specific ESCAPE clause. | +| `_visit_ends_with` | `s.endsWith(suffix)` | → `LIKE '%suffix'`. | +| `_visit_matches_method` | `s.matches(pattern)` | → `Dialect.convert_regex` + `write_regex_match`. | +| `_visit_matches_func` | `matches(s, pattern)` | Same as above, function form. | +| `_visit_char_at` | `s.charAt(idx)` | → `SUBSTRING(s, idx + 1, 1)`. | +| `_visit_index_of` | `s.indexOf(n)` / `s.indexOf(n, off)` | Inline `POSITION(...)` (universal). | +| `_visit_last_index_of` | `s.lastIndexOf(n)` | Inline. | +| `_visit_substring` | `s.substring(start)` / `s.substring(start, end)` | `SUBSTRING(...)`. | +| `_visit_replace` | `s.replace(old, new)` | `REPLACE(...)`. | +| `_visit_split` | `s.split(d)` / `s.split(d, n)` | → `Dialect.write_split` / `write_split_with_limit`. | +| `_visit_join` | `arr.join(d)` | → `Dialect.write_join`. | +| `_visit_format` | `s.format([args])` | → `Dialect.write_format` (added in PR #8). The `%d`/`%f` → `%s` normalization happens unconditionally before dispatch. | + +## Size, has, type cast + +| Helper | CEL surface | Notes | +|---|---|---| +| `_visit_size_method` | `x.size()` | Routes to `Dialect.write_array_length` for arrays, plain `LENGTH(x)` for strings. | +| `_visit_size_func` | `size(x)` | Same as above, function form. | +| `_visit_has` | `has(a.b)` / `has(jsonvar.k)` | Routes to `Dialect.write_json_existence` for JSON paths; otherwise emits `... IS NOT NULL`. The `json_variables` audit (PR #8) added the JSON-variable branch. | +| `_visit_type_cast` | `int(x)`, `uint(x)`, `double(x)`, `string(x)`, `bool(x)`, `bytes(x)` | Special-case for `int(timestamp)` → `Dialect.write_epoch_extract`. Other casts emit `CAST(x AS )` via `Dialect.write_type_name`. | + +## Timestamp / duration + +| Helper | CEL surface | Notes | +|---|---|---| +| `_visit_timestamp_func` | `timestamp("...")` | → `Dialect.write_timestamp_cast`. | +| `_visit_duration_func` | `duration("...")` | Parses Go-style duration string (`24h`, `30m`, `1500ms`). Limits: only h/m/s/ms/us/ns recognised. → `Dialect.write_duration`. | +| `_visit_interval_func` | `interval(value, unit)` | Dynamic-value variant. → `Dialect.write_interval`. | +| `_visit_datetime_constructor` | `date(...)`, `time(...)`, etc. | Constructor-style date/time helpers. | +| `_visit_current_datetime` | `now()`, `today()` | → `CURRENT_TIMESTAMP` / `CURRENT_DATE` (universal). | +| `_visit_timestamp_extract` | `t.getFullYear()`, `t.getMonth()`, `t.getDayOfWeek()`, etc. | Routes to `Dialect.write_extract`. **Critical**: DOW maps differ (Sunday=0 vs Sunday=1) — handled per-dialect in `write_extract`. | + +## Comprehensions + +| Helper | CEL surface | +|---|---| +| `_visit_comprehension` | Top-level dispatch on macro name. | +| `_visit_comp_all` | `list.all(v, pred)` | +| `_visit_comp_exists` | `list.exists(v, pred)` | +| `_visit_comp_exists_one` | `list.exists_one(v, pred)` | +| `_visit_comp_map` | `list.map(v, transform)` | +| `_visit_comp_map_filter` | `list.map(v, pred, transform)` | +| `_visit_comp_filter` | `list.filter(v, pred)` | + +All comprehension helpers use the `Dialect.write_unnest` + `write_array_subquery_open` + `write_array_subquery_expr_close` triple. Spark's variant (`(SELECT collect_list(...))`) needs the close to emit `)` because `collect_list` is an aggregator wrapper; other dialects close with empty. + +## Common patterns + +When adding a new CEL feature, the workflow is almost always: + +1. Find the entry point in `member_dot_arg` (for methods) or `ident_arg` (for free functions). +2. Add an `elif method_name == "": self._visit_(obj, args); return`. +3. Add a `_visit_` helper near the existing string-method helpers (or wherever fits semantically). +4. Decide universal vs dialect-divergent (see SKILL.md "New method or inline?"). +5. Add a parametrized test in the appropriate file under `tests/`. +6. If a new `Dialect` `@abstractmethod` is added, update every dialect file AND the dialect-method-checklist in the `add-sql-dialect` skill.