Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 115 additions & 0 deletions .claude/skills/add-cel-feature/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
---
name: add-cel-feature
description: Adds a new CEL function, operator, macro, or comprehension form to pycel2sql by extending Converter visit methods in src/pycel2sql/_converter.py, deciding whether a new Dialect ABC abstract method is needed, implementing it across all six dialects when SQL diverges, and adding parametrized test cases. Use when wiring a new CEL string method (toLowerCase, isAlpha), timestamp method (toISOString), free-function helper, operator overload, or comprehension form across PostgreSQL, MySQL, SQLite, DuckDB, BigQuery, and Apache Spark.
---

# Add CEL Feature

Adding a new CEL surface area (function, operator, macro, comprehension) is the second-most-common contribution shape after adding a dialect. The work is the same shape every time — this skill captures it.

Recent examples in pycel2sql's history: per-dialect `format()` dispatch (PR #8), `validate_schema` parameter (PR #1), `json_variables` / `column_aliases` / `param_start_index` options (PR #8). Each followed the pattern below.

## Quick start

1. **Classify the feature** — function call (e.g. `string.lowerAscii()`), operator (e.g. `&&`), macro (e.g. `has(x.y)`), or comprehension (e.g. `list.exists(v, pred)`).
2. **Identify the visitor** in `_converter.py` that owns the surface area — see [references/converter-file-map.md](references/converter-file-map.md).
3. **Decide whether a new `Dialect` ABC method is needed** — see "New method or inline?" below.
4. **If a new method is needed**, walk the dialect-method checklist at `.claude/skills/add-sql-dialect/references/dialect-method-checklist.md` (shared between both skills — open it directly).
5. **Add a test case** to the appropriate `tests/test_*.py` with parametrized coverage across `ALL_DIALECTS` from `tests/conftest.py`.
6. **Run** `uv run pytest tests/ --ignore=tests/integration -v` and `uv run ruff check src/ tests/`.

## New method or inline?

The `Dialect` ABC at `src/pycel2sql/dialect/_base.py` is already large (~50 abstract methods). Add a new method only when needed; otherwise inline the SQL in the visitor.

| Situation | Verdict |
|---|---|
| The SQL is identical across all six dialects | Inline in the visitor (e.g. `LOWER(x)` for `lowerAscii`, `MOD(a, b)` for `%`). |
| Any dialect needs different syntax | New `Dialect` ABC `@abstractmethod`. |
| A subset of dialects can't support the feature at all | New method + each unsupported dialect raises `UnsupportedDialectFeatureError` in its implementation. |
| The feature is JSON-related, regex-related, or array-related | Look for an existing `write_…` method first — JSON / array / regex method coverage is broad. |

If you add an `@abstractmethod`, the **`Dialect` ABC enforces it at instantiation time** — calling `<Dialect>()` with any abstract method missing raises `TypeError: Can't instantiate abstract class`. CI's `tests/conftest.py` instantiates every dialect in `ALL_DIALECTS`, so a missing method is caught on the first test collection. mypy strict mode also flags missing methods.

## Where features live

`_converter.py` is large (~80 KB / ~2000 lines) and uses Lark `Interpreter` visitor methods. The grammar rule names map to method names — `member_dot_arg` for method calls (`obj.method(args)`), `ident_arg` for free-function calls (`func(args)`), `addition_add` for `+`, `relation_eq` for `==`, etc.

For a method-call feature like `string.lowerAscii()`:
- Entry point is `member_dot_arg` (`_converter.py:607`).
- The method name dispatches via the long if/elif chain at lines 619–691.
- `LOWER(...)` is universal SQL → inline (no new `Dialect` method).

For a method-call feature like `string.format([args])`:
- Entry point is also `member_dot_arg`.
- SQL diverges — Postgres/BigQuery use `FORMAT`, SQLite/DuckDB use `printf`, Spark uses `format_string`, MySQL has no equivalent → new `Dialect.write_format()` method (added in PR #8). Each dialect implements its own form; MySQL raises.

For a free-function feature like `has(x.y)`:
- Entry point is `ident_arg` (`_converter.py:757`).
- Dispatch is via the `func_name` if/elif at lines 763–820.
- `has()` is dialect-divergent (Postgres JSONB `?` operator vs others' `IS NOT NULL`) → uses `Dialect.write_json_existence`.

For an operator like `+`:
- Entry point is `addition` (`_converter.py:415`).
- Operator-name dispatch via `op_name` (`addition_add` / `addition_sub`).
- Cross-dialect handling routes through `Dialect.write_string_concat` for string contexts, plain `+` for numeric.

For a macro like `exists` / `all` / `filter` / `map`:
- Entry point is `member_dot_arg`, but the chain immediately routes to `_visit_comprehension` (`_converter.py:615`).
- Comprehensions use the `Dialect` `write_unnest` / `write_array_subquery_open` / `write_array_subquery_expr_close` triple plus per-method scaffolding.

For the file-by-file map of which CEL surface area each section of `_converter.py` owns, see [references/converter-file-map.md](references/converter-file-map.md).

## Parser-level vs visitor-level

CEL features that pycel2sql can support are limited by what `cel-python`'s Lark grammar parses. If the grammar doesn't recognise a syntax (e.g. some CEL-extension functions), the feature can't be added at the visitor level — `_parser.parse(cel_expr)` will fail before reaching the converter. Test the parse first:

```python
from celpy.celparser import CELParser
CELParser().parse('"hello".isAlpha()') # raises CELParseError if unsupported
```

If parsing fails, the feature requires upstream work in `cel-python`, not in pycel2sql.

## Test coverage

Add a representative test to the appropriate file in `tests/`:

| CEL surface | Test file |
|---|---|
| String methods | `tests/test_string_functions.py` |
| Timestamps / durations | `tests/test_timestamps.py` |
| JSON access / `has()` | `tests/test_json.py` |
| Comprehensions (`exists`, `all`, `filter`, `map`) | `tests/test_comprehensions.py` |
| Type casts (`int`, `string`, `bool`, etc.) | `tests/test_convert.py` (general) or a dialect-specific file |
| Universal SQL (operators, comparisons) | `tests/test_dialect_parametrized.py` |
| Parameterized output | `tests/test_parameterized.py` |
| New conversion options | `tests/test_options.py` |

When the SQL is identical across dialects, parametrize over `ALL_DIALECTS` from `tests/conftest.py`:

```python
import pytest
from tests.conftest import ALL_DIALECTS

@pytest.mark.parametrize("dialect", ALL_DIALECTS)
def test_lower_ascii(dialect):
assert convert("name.lowerAscii()", dialect=dialect) == "LOWER(name)"
```

When the SQL diverges, write per-dialect cases (mirror `tests/test_string_functions.py::TestFormatPerDialect`).

## Verification

```bash
uv run ruff check src/ tests/
uv run mypy src/pycel2sql/ # pre-existing lark errors expected
uv run pytest tests/ --ignore=tests/integration -v

python .claude/skills/skill-authoring/scripts/lint_skill.py .claude/skills/add-cel-feature/
```

## References

- [references/converter-file-map.md](references/converter-file-map.md) — which section of `_converter.py` owns which CEL surface area, with line-number anchors.
- The `Dialect` ABC method checklist lives in the `add-sql-dialect` skill at `.claude/skills/add-sql-dialect/references/dialect-method-checklist.md` — when you add a new `@abstractmethod`, open and update that file too.
174 changes: 174 additions & 0 deletions .claude/skills/add-cel-feature/references/converter-file-map.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,174 @@
# Converter File Map

A map of which CEL surface area each section of `src/pycel2sql/_converter.py` owns. Line numbers are approximate (the file evolves) — search by method name to find the current location.

## Contents

- Top-level entry
- Logical operators
- Comparison operators
- `in` operator
- Arithmetic
- Unary
- Member access (dot / index / dot-arg)
- Identifiers and free-function calls
- Literals
- Compound literals (lists, maps)
- String methods (contains, startsWith, endsWith, etc.)
- Size, has, type cast
- Timestamp / duration
- Comprehensions

## Top-level entry

| Method | Owns |
|---|---|
| `expr` | `?:` ternary; otherwise pass-through to single child. |
| `visit` | Override of Lark `Interpreter.visit` to handle bare `Token` children. |
| `_visit_child` | Depth-tracking visitor; every internal recursion goes through this. |

## Logical operators

| Method | Owns |
|---|---|
| `conditionalor` | `lhs \|\| rhs` → `lhs OR rhs`. |
| `conditionaland` | `lhs && rhs` → `lhs AND rhs`. |

## Comparison operators

The `relation` method dispatches to operator-specific helpers. Each `relation_<op>` method handles null-aware emission (`IS NULL` / `IS NOT NULL` / `IS TRUE` / etc.) and JSON-text-extraction numeric coercion.

| Method | Owns |
|---|---|
| `relation` | Dispatches to per-operator helpers. |
| `relation_eq`, `relation_ne` | `==`, `!=`. Handles null/bool comparisons specially. |
| `relation_lt`, `relation_le`, `relation_gt`, `relation_ge` | `<`, `<=`, `>`, `>=`. |
| `relation_in` | `x in [list]` / `x in arr`; routes through `_visit_in` and `Dialect.write_array_membership`. |

## `in` operator

| Method | Owns |
|---|---|
| `_visit_in` | Routes `x in arr` to `Dialect.write_array_membership`. Currently does not yet route JSON-array membership through the dedicated `Dialect.write_json_array_membership` method (those abstract methods exist in the ABC but no call site invokes them yet). |

## Arithmetic

| Method | Owns |
|---|---|
| `addition` | `+`, `-` dispatch. **Critical**: timestamp-arithmetic check happens BEFORE string-concat check (temporal functions contain string literals; CLAUDE.md notes this). |
| `addition_add`, `addition_sub` | Trivial pass-throughs (the work is in `addition`). |
| `multiplication` | `*`, `/`, `%` dispatch. `%` becomes `MOD(a, b)` (universal SQL). |
| `multiplication_mul`, `multiplication_div`, `multiplication_mod` | Trivial pass-throughs. |

## Unary

| Method | Owns |
|---|---|
| `unary`, `unary_not`, `unary_neg` | `!x` → `NOT x`; `-x` → `-x`. |

## Member access

The most-trafficked area. Three entry points:

| Method | Owns |
|---|---|
| `member` | Pass-through. |
| `member_dot` | `obj.field` access. Routes through `_emit_json_path` for JSON-path access (schema-declared JSON or `json_variables`); otherwise emits plain `.field`. Applies `validate_field_name`. |
| `member_dot_arg` | `obj.method(args)` — the **method-call dispatcher**. The long if/elif chain checks `method_name` and routes to the appropriate `_visit_<feature>` helper. New string/timestamp/comprehension methods land here. |
| `member_index` | `obj[idx]` and `obj["key"]`. String literal index → JSON path (when root is a `json_variable`) or `.key` plain access. Integer index → `Dialect.write_list_index_const`. Dynamic index → `Dialect.write_list_index`. |
| `member_object` | Type construction `T{f: v}` (rare). |

When adding a new CEL **method** like `string.toLowerCase()`, the entry point is the `member_dot_arg` if/elif chain.

## Identifiers and free-function calls

| Method | Owns |
|---|---|
| `primary` | Pass-through. |
| `ident` | Bare identifier emission. Applies `column_aliases` (PR #8) before `validate_field_name`. |
| `ident_arg` | `func(args)` — the **free-function-call dispatcher**. Long if/elif chain on `func_name`. Owns `has`, `size`, `matches`, `int`/`uint`/`double`/`string`/`bool`/`bytes`/`timestamp`/`duration` casts, `now()`, `getInterval()`, `dyn()`. |
| `dot_ident_arg`, `dot_ident` | `.name` (rare CEL syntax). |
| `paren_expr` | `(expr)` parenthesisation. |

When adding a new CEL **free function** like `format(...)` or `has(...)`, the entry point is the `ident_arg` if/elif chain.

## Literals

| Method | Owns |
|---|---|
| `literal` | All scalar literals — strings, ints, uints, floats, bools, null, bytes. Routes through dialect-specific `write_string_literal` / `write_bytes_literal`. Parameterized mode replaces literals with placeholders via `_add_param`. |

## Compound literals

| Method | Owns |
|---|---|
| `list_lit` | `[1, 2, 3]` array literals — calls `Dialect.write_array_literal_open` / `write_array_literal_close`. |
| `map_lit` | `{a: 1, b: 2}` map/struct literals — calls `Dialect.write_struct_open` / `write_struct_close`. |
| `exprlist` | Comma-separated expression list (used inside list/map/format args). |
| `mapinits` | Map initialiser list. |
| `fieldinits` | Struct initialiser list. |

## String methods

Each method lives in a `_visit_<name>` helper. Most route through a `Dialect.write_<name>` method to absorb dialect divergence.

| Helper | CEL surface | Notes |
|---|---|---|
| `_visit_contains` | `s.contains(needle)` | → `Dialect.write_contains`. |
| `_visit_starts_with` | `s.startsWith(prefix)` | → `LIKE 'prefix%'` + dialect-specific ESCAPE clause. |
| `_visit_ends_with` | `s.endsWith(suffix)` | → `LIKE '%suffix'`. |
| `_visit_matches_method` | `s.matches(pattern)` | → `Dialect.convert_regex` + `write_regex_match`. |
| `_visit_matches_func` | `matches(s, pattern)` | Same as above, function form. |
| `_visit_char_at` | `s.charAt(idx)` | → `SUBSTRING(s, idx + 1, 1)`. |
| `_visit_index_of` | `s.indexOf(n)` / `s.indexOf(n, off)` | Inline `POSITION(...)` (universal). |
| `_visit_last_index_of` | `s.lastIndexOf(n)` | Inline. |
| `_visit_substring` | `s.substring(start)` / `s.substring(start, end)` | `SUBSTRING(...)`. |
| `_visit_replace` | `s.replace(old, new)` | `REPLACE(...)`. |
| `_visit_split` | `s.split(d)` / `s.split(d, n)` | → `Dialect.write_split` / `write_split_with_limit`. |
| `_visit_join` | `arr.join(d)` | → `Dialect.write_join`. |
| `_visit_format` | `s.format([args])` | → `Dialect.write_format` (added in PR #8). The `%d`/`%f` → `%s` normalization happens unconditionally before dispatch. |

## Size, has, type cast

| Helper | CEL surface | Notes |
|---|---|---|
| `_visit_size_method` | `x.size()` | Routes to `Dialect.write_array_length` for arrays, plain `LENGTH(x)` for strings. |
| `_visit_size_func` | `size(x)` | Same as above, function form. |
| `_visit_has` | `has(a.b)` / `has(jsonvar.k)` | Routes to `Dialect.write_json_existence` for JSON paths; otherwise emits `... IS NOT NULL`. The `json_variables` audit (PR #8) added the JSON-variable branch. |
| `_visit_type_cast` | `int(x)`, `uint(x)`, `double(x)`, `string(x)`, `bool(x)`, `bytes(x)` | Special-case for `int(timestamp)` → `Dialect.write_epoch_extract`. Other casts emit `CAST(x AS <dialect_type>)` via `Dialect.write_type_name`. |

## Timestamp / duration

| Helper | CEL surface | Notes |
|---|---|---|
| `_visit_timestamp_func` | `timestamp("...")` | → `Dialect.write_timestamp_cast`. |
| `_visit_duration_func` | `duration("...")` | Parses Go-style duration string (`24h`, `30m`, `1500ms`). Limits: only h/m/s/ms/us/ns recognised. → `Dialect.write_duration`. |
| `_visit_interval_func` | `interval(value, unit)` | Dynamic-value variant. → `Dialect.write_interval`. |
| `_visit_datetime_constructor` | `date(...)`, `time(...)`, etc. | Constructor-style date/time helpers. |
| `_visit_current_datetime` | `now()`, `today()` | → `CURRENT_TIMESTAMP` / `CURRENT_DATE` (universal). |
| `_visit_timestamp_extract` | `t.getFullYear()`, `t.getMonth()`, `t.getDayOfWeek()`, etc. | Routes to `Dialect.write_extract`. **Critical**: DOW maps differ (Sunday=0 vs Sunday=1) — handled per-dialect in `write_extract`. |

## Comprehensions

| Helper | CEL surface |
|---|---|
| `_visit_comprehension` | Top-level dispatch on macro name. |
| `_visit_comp_all` | `list.all(v, pred)` |
| `_visit_comp_exists` | `list.exists(v, pred)` |
| `_visit_comp_exists_one` | `list.exists_one(v, pred)` |
| `_visit_comp_map` | `list.map(v, transform)` |
| `_visit_comp_map_filter` | `list.map(v, pred, transform)` |
| `_visit_comp_filter` | `list.filter(v, pred)` |

All comprehension helpers use the `Dialect.write_unnest` + `write_array_subquery_open` + `write_array_subquery_expr_close` triple. Spark's variant (`(SELECT collect_list(...))`) needs the close to emit `)` because `collect_list` is an aggregator wrapper; other dialects close with empty.

## Common patterns

When adding a new CEL feature, the workflow is almost always:

1. Find the entry point in `member_dot_arg` (for methods) or `ident_arg` (for free functions).
2. Add an `elif method_name == "<name>": self._visit_<name>(obj, args); return`.
3. Add a `_visit_<name>` helper near the existing string-method helpers (or wherever fits semantically).
4. Decide universal vs dialect-divergent (see SKILL.md "New method or inline?").
5. Add a parametrized test in the appropriate file under `tests/`.
6. If a new `Dialect` `@abstractmethod` is added, update every dialect file AND the dialect-method-checklist in the `add-sql-dialect` skill.
Loading