114 changes: 114 additions & 0 deletions .claude/skills/add-sql-dialect/SKILL.md
---
name: add-sql-dialect
description: Adds a new SQL dialect to pycel2sql by creating src/pycel2sql/dialect/<name>.py (subclass of the Dialect ABC), registering in the DialectName enum and the get_dialect() factory, threading new test cases through every parametrized test class, and updating the README badge grid and dialect-comparison tables. Use when porting a new database backend (Trino, Snowflake, ClickHouse, MS SQL, Athena, Oracle) or any new analytics engine.
---

# Add SQL Dialect

Adding a SQL dialect is the largest contribution shape in this repo (~1500-line PR, ~18 file touches in lockstep). The pattern is well-established — six dialects already follow it (PostgreSQL, MySQL, SQLite, DuckDB, BigQuery, Apache Spark). This skill captures the procedure so the engineer can follow the template instead of reverse-engineering the layout from existing dialects.

## Quick start

```bash
# 1. Pick the closest analogue (see "Picking the analogue" below).
# 2. Scaffold by copying — the script stubs every Dialect ABC method.
python .claude/skills/add-sql-dialect/scripts/scaffold_dialect.py duckdb cockroach Cockroach
# ^^^^^^^ template
# ^^^^^^^^^ folder/identifier
# ^^^^^^^^^ class prefix

# 3. Then:
# a. Fill SQL bodies in src/pycel2sql/dialect/cockroach.py (replace NotImplementedError stubs).
#   b. Add COCKROACH = "cockroach" (after SPARK = "spark") to DialectName in dialect/_base.py.
# c. Register CockroachDialect in dialect/__init__.py (_REGISTRY + __all__).
# d. Export from src/pycel2sql/__init__.py.
# e. Add cockroach_dialect fixture + CockroachDialect() to ALL_DIALECTS in tests/conftest.py.
# f. Add to tests/test_dialect_parametrized.py ALL_DIALECTS list.
# g. Create tests/test_cockroach.py mirroring tests/test_duckdb.py shape.
# h. Update README badge grid + dialect count + comparison table; bump CLAUDE.md.
# i. Run: uv run ruff check src/ tests/ && uv run pytest tests/ --ignore=tests/integration
```
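
Steps b–d wire the class into the enum, the registry, and the package export. The sketch below is a self-contained illustration of that wiring, not repo code: `DialectName`, `_REGISTRY`, and `get_dialect()` are the names referenced above, but the exact shapes in `dialect/_base.py` and `dialect/__init__.py` may differ, so mirror the six existing entries rather than this sketch.

```python
from enum import Enum


class Dialect:                      # stand-in for the real ABC in dialect/_base.py
    pass


class CockroachDialect(Dialect):    # the new class from dialect/cockroach.py
    pass


class DialectName(str, Enum):
    SPARK = "spark"                 # existing members elided above this line
    COCKROACH = "cockroach"         # step (b): append the new member after SPARK


# step (c): dialect/__init__.py registry (shape assumed: DialectName -> class)
_REGISTRY: dict[DialectName, type[Dialect]] = {
    DialectName.COCKROACH: CockroachDialect,
}


def get_dialect(name: DialectName) -> Dialect:
    # Factory named in the skill description; this signature is an assumption.
    return _REGISTRY[name]()


assert isinstance(get_dialect(DialectName.COCKROACH), CockroachDialect)
```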

## Picking the analogue

Decide by which existing dialect's syntax shape your target most resembles:

| Feature | Operator-style → use DuckDB | Function-style → use BigQuery |
|---|---|---|
| Regex match | `target ~ 'p'` (Postgres, DuckDB) | `REGEXP_CONTAINS(target, 'p')` (BigQuery); `target RLIKE 'p'` (Spark) |
| JSON access | `b->>'f'` (Postgres, DuckDB, MySQL) | `JSON_VALUE(b, '$.f')` (BigQuery); `get_json_object(b, '$.f')` (Spark); `json_extract(b, '$.f')` (SQLite) |
| Array literal | `ARRAY[…]` (Postgres); `[…]` (DuckDB, BigQuery) | `array(…)` (Spark) |
| Array index | 1-indexed (Postgres, DuckDB) | 0-indexed (BigQuery via `OFFSET`, Spark direct) |
| Param placeholder | `$N` (Postgres, DuckDB) | `?` (MySQL, SQLite, Spark) or `@pN` (BigQuery) |
| Cast to numeric | `::numeric` postfix (Postgres) | `+ 0` arithmetic coercion (MySQL, SQLite, Spark); `CAST(... AS FLOAT64)` (BigQuery) |
| Format function | `FORMAT('...', ...)` (Postgres, BigQuery) | `printf('...', ...)` (SQLite, DuckDB); `format_string('...', ...)` (Spark); raises (MySQL) |

For the full Dialect-method-by-method matrix across the existing six dialects, see [references/dialect-method-checklist.md](references/dialect-method-checklist.md). When in doubt, copy DuckDB and patch — its layout is the cleanest.

## Critical surface

These methods on the `Dialect` ABC (`src/pycel2sql/dialect/_base.py`) are where dialects diverge most. Plan how to implement them before writing any code; a sketch of two of them follows the list:

- `write_regex_match` — operator vs function call vs `RLIKE`.
- `write_json_field_access` — operator (`->>`) vs function wrapper (`JSON_VALUE`, `get_json_object`); whether intermediate vs final access uses different forms (Postgres `->` vs `->>`; Spark uses the same function for both).
- `write_array_literal_open` / `write_array_literal_close` — `ARRAY[`, `[`, `array(`.
- `write_list_index` / `write_list_index_const` — 0-indexed vs 1-indexed; bare `[i]` vs `[OFFSET(i)]` vs `+ 1`.
- `write_param_placeholder` — `$N`, `?`, `@pN`. Positional `?` dialects ignore the index argument.
- `write_extract` for DOW — Sunday=1 (BigQuery, Spark) vs Sunday=0 (Postgres/DuckDB convention) — adjust by `(dayofweek(t) - 1)` etc.
- `write_cast_to_numeric` — postfix `::TYPE` vs arithmetic coercion `+ 0` vs `CAST(... AS NUMERIC)`.
- `write_json_array_elements` — must be a **set-returning expression** (used in `FROM <here> AS iter`); use the engine's `EXPLODE` / `UNNEST` / `json_each` / `from_json` form.
- `write_json_array_membership` / `write_nested_json_array_membership` — must produce a valid RHS for `lhs = ` (subquery form, like SQLite's `(SELECT value FROM json_each(...))`). If your engine cannot construct a boolean predicate without the candidate element, raise `UnsupportedDialectFeatureError` (mirrors `SparkDialect`).
- `write_format` — per-dialect format() dispatch added in PR #8. Pick `FORMAT(...)`, `printf(...)`, `format_string(...)`, or raise.
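
A minimal sketch of two of these for a hypothetical operator-style dialect is below. The signatures follow the method checklist; the writer `w` is assumed to expose a plain `write(str)` method and the `write_*` callbacks to emit their sub-expression into it. Check `_base.py` for the real protocol before copying.

```python
from collections.abc import Callable


class HypotheticalDialect:  # the real class would subclass the Dialect ABC
    def write_regex_match(
        self,
        w,
        write_target: Callable[[], None],
        pattern: str,
        case_insensitive: bool,
    ) -> None:
        # Operator-style (Postgres shape): target ~ 'p', or ~* when case-insensitive.
        write_target()
        w.write(" ~* " if case_insensitive else " ~ ")
        w.write("'" + pattern.replace("'", "''") + "'")

    def write_cast_to_numeric(self, w, write_expr: Callable[[], None]) -> None:
        # Postfix-cast style (Postgres shape): expr::numeric
        write_expr()
        w.write("::numeric")
```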

## Capabilities methods are not just informational

The four `supports_*()` methods on `Dialect` drive Converter routing. Set them honestly:

```python
def supports_native_arrays(self) -> bool: return True
def supports_jsonb(self) -> bool: return False # Postgres-style JSONB only
```

## Optional: IndexAdvisor

Implement the `IndexAdvisor` Protocol (in `dialect/_base.py`) only if the engine has user-controllable indexes (BTREE, GIN, ART, CLUSTERING). Skip for storage-layer-driven engines like Spark (Delta Z-order, Iceberg sort) — `get_index_advisor()` returns `None` for non-`IndexAdvisor` dialects, which gives an empty recommendation list (the right semantic for "no SQL-level recommendations").

## Doc refresh

When the implementation is green, refresh:

- `README.md` — bump dialect count (currently "Six SQL dialects"), add badge after the existing six in the badge grid, add column to the comparison table near the placeholder list, add row to the introspect-supported list (only if you also add an introspect module under `src/pycel2sql/introspect/`).
- `CLAUDE.md` — bump the dialect count near line 7, add a bullet under "Dialect Differences", append `dialect/<name>.py` to the dialect-files list.

The full file-by-file checklist is in [references/test-files.md](references/test-files.md).

## Verification

```bash
# Lint
uv run ruff check src/ tests/

# Type check (lark generic-arg notes are pre-existing — see CLAUDE.md)
uv run mypy src/pycel2sql/

# Unit tests — must pass for the new dialect plus all six existing ones
uv run pytest tests/ --ignore=tests/integration -v

# Optional integration (if you add Docker fixtures in tests/integration/conftest.py)
uv pip install -e ".[integration]"
uv run pytest tests/integration/ -v -k <dialect>

# Skill lint
python .claude/skills/skill-authoring/scripts/lint_skill.py .claude/skills/add-sql-dialect/
```

The Dialect ABC is enforced at instantiation time — calling `<New>Dialect()` with any abstract method missing raises `TypeError: Can't instantiate abstract class`. CI's `tests/conftest.py` instantiates every dialect in `ALL_DIALECTS`, so a missing method is caught immediately.
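
A stripped-down illustration of that enforcement (plain `abc` behaviour, not repo code):

```python
from abc import ABC, abstractmethod


class Dialect(ABC):                      # stand-in with a single abstract method
    @abstractmethod
    def write_param_placeholder(self, w, param_index: int) -> None: ...


class IncompleteDialect(Dialect):
    pass                                 # forgot to implement the abstract method


try:
    IncompleteDialect()
except TypeError as exc:
    print(exc)  # "Can't instantiate abstract class IncompleteDialect ..."
```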

## Scripts

- **Run** `python .claude/skills/add-sql-dialect/scripts/scaffold_dialect.py <template> <new-name> <NewClassPrefix>` — copies an existing dialect file, renames the class to `<NewClassPrefix>Dialect`, replaces every method body with a `raise NotImplementedError(...)` stub, and prints the list of files created plus the next manual steps. Does not register the dialect anywhere — that's left to the engineer to do consciously.

## References

- [references/dialect-method-checklist.md](references/dialect-method-checklist.md) — every method on the `Dialect` ABC grouped by category, with one-line "what to emit" guidance per method drawn from the six existing implementations.
- [references/test-files.md](references/test-files.md) — exhaustive file-by-file checklist for a new dialect (code, tests, docs).
118 changes: 118 additions & 0 deletions .claude/skills/add-sql-dialect/references/dialect-method-checklist.md
# Dialect Method Checklist

Every abstract method on the `Dialect` ABC (`src/pycel2sql/dialect/_base.py`), grouped by category, with one-line "what to emit" guidance per method drawn from the six existing implementations.

## Contents

- Literals
- Operators
- Type casting
- Arrays
- JSON
- Timestamps
- String functions
- Comprehensions
- Regex
- Struct
- Validation
- Capabilities

## Literals

| Method | What to emit | Examples |
|---|---|---|
| `write_string_literal(w, value)` | Single-quoted string with `''` escaping (or `\\'` for BigQuery). | Postgres/DuckDB/MySQL: `'foo''bar'`. BigQuery: `'foo\'bar'`. |
| `write_bytes_literal(w, value)` | Hex-encoded byte literal in the engine's preferred form. | Postgres: `'\\x...'`. SQLite/Spark: `X'...'`. BigQuery: `b"..."` form. |
| `write_param_placeholder(w, param_index)` | Numbered or positional placeholder. | Postgres/DuckDB: `$N`. BigQuery: `@pN`. MySQL/SQLite/Spark: `?` (index ignored). |
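
A sketch of the two escaping conventions in the table. `w` is assumed to be a text writer with a `write(str)` method, and whether the real BigQuery implementation also escapes backslashes is an assumption.

```python
def write_string_literal_quote_doubling(w, value: str) -> None:
    # Postgres/DuckDB/MySQL/SQLite shape: double every embedded single quote.
    w.write("'" + value.replace("'", "''") + "'")


def write_string_literal_backslash(w, value: str) -> None:
    # BigQuery shape: backslash-escape quotes; escape backslashes first (assumption).
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    w.write("'" + escaped + "'")
```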

## Operators

| Method | What to emit |
|---|---|
| `write_string_concat(w, write_lhs, write_rhs)` | Engine's concat form. Postgres/DuckDB: `lhs \|\| rhs`. MySQL: `CONCAT(lhs, rhs)`. SQLite: `lhs \|\| rhs`. BigQuery: `CONCAT(lhs, rhs)`. Spark: `concat(lhs, rhs)`. |
| `write_regex_match(w, write_target, pattern, case_insensitive)` | Operator or function call. Postgres: `target ~ 'p'` / `~* 'p'`. DuckDB: `regexp_matches(target, 'p')`. MySQL: `target REGEXP 'p'`. BigQuery: `REGEXP_CONTAINS(target, 'p')`. Spark: `target RLIKE 'p'`. SQLite: raises (no portable regex). |
| `write_like_escape(w)` | The trailing `ESCAPE` clause for `LIKE`. Postgres/DuckDB: ` ESCAPE '\\'`. SQLite: ` ESCAPE '\\'`. MySQL: ` ESCAPE '\\\\'`. BigQuery: empty (no ESCAPE supported). Spark: ` ESCAPE '\\\\'`. |
| `write_array_membership(w, write_elem, write_array)` | `elem` membership in array. Postgres: `elem = ANY(array)`. DuckDB: `elem = ANY(array)`. BigQuery: `elem IN UNNEST(array)`. Spark: `array_contains(array, elem)` — note arg-order swap. MySQL/SQLite: emit through JSON-array path (no native arrays). |
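
The arg-order swap called out for Spark's `array_contains` is easy to get backwards; a sketch of both shapes, with the same writer/callback assumptions as above:

```python
def write_array_membership_any(w, write_elem, write_array) -> None:
    # Postgres/DuckDB shape: elem = ANY(array)
    write_elem()
    w.write(" = ANY(")
    write_array()
    w.write(")")


def write_array_membership_spark(w, write_elem, write_array) -> None:
    # Spark shape: array_contains(array, elem), array first and element second.
    w.write("array_contains(")
    write_array()
    w.write(", ")
    write_elem()
    w.write(")")
```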

## Type casting

| Method | What to emit |
|---|---|
| `write_cast_to_numeric(w, write_expr)` | Force string→number coercion. Postgres: `expr::numeric`. DuckDB: `expr::DOUBLE`. BigQuery: `CAST(expr AS FLOAT64)`. MySQL/SQLite/Spark: `expr + 0` (arithmetic coercion). |
| `write_type_name(w, cel_type_name)` | Engine type name for explicit casts. Postgres: lowercase (`bigint`, `double precision`). MySQL: uppercase (`SIGNED`, `DOUBLE`). BigQuery: `BIGNUMERIC`/`FLOAT64`. Spark: `BIGINT`/`DOUBLE`/`STRING`. |
| `write_epoch_extract(w, write_expr)` | `int(timestamp)` → epoch seconds. Postgres: `EXTRACT(EPOCH FROM expr)::bigint`. DuckDB: `EXTRACT(EPOCH FROM expr)::BIGINT`. MySQL: `UNIX_TIMESTAMP(expr)`. BigQuery: `UNIX_SECONDS(expr)`. Spark: `UNIX_TIMESTAMP(expr)`. SQLite: `CAST(strftime('%s', expr) AS INTEGER)`. |
| `write_timestamp_cast(w, write_expr)` | `timestamp(string)`. Postgres/DuckDB: `CAST(expr AS TIMESTAMPTZ)`. MySQL: `CAST(expr AS DATETIME)`. BigQuery: `CAST(expr AS TIMESTAMP)`. Spark: `CAST(expr AS TIMESTAMP)`. SQLite: `datetime(expr)`. |

## Arrays

| Method | What to emit |
|---|---|
| `write_array_literal_open(w)` / `write_array_literal_close(w)` | Open/close array literal. Postgres: `ARRAY[` / `]`. DuckDB/BigQuery: `[` / `]`. Spark: `array(` / `)`. MySQL: `JSON_ARRAY(` / `)`. SQLite: `json_array(` / `)`. |
| `write_array_length(w, dimension, write_expr)` | Length, NULL-safe. Wrap in `COALESCE(..., 0)` — every existing dialect does this. Multi-dim raises `UnsupportedDialectFeatureError` for engines without portable multi-dim length (Spark). |
| `write_list_index(w, write_array, write_index)` | Dynamic index. 1-indexed engines (Postgres, DuckDB, MySQL, SQLite): emit `arr[idx + 1]`. 0-indexed (BigQuery): `arr[OFFSET(idx)]`. Spark: `arr[idx]` (0-indexed direct). |
| `write_list_index_const(w, write_array, index)` | Constant-int index — same shapes as `write_list_index` with the integer baked in. |
| `write_empty_typed_array(w, type_name)` | Empty typed array literal for `split(s, d, 0)` etc. Postgres: `ARRAY[]::<type>[]`. DuckDB: `[]::<type>[]`. BigQuery: `ARRAY<<type>>[]`. Spark: `CAST(array() AS ARRAY<<type>>)`. |
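
A sketch of the index-base adjustment for dynamic indexing (same writer/callback assumptions):

```python
def write_list_index_one_based(w, write_array, write_index) -> None:
    # Postgres/DuckDB shape: CEL indexes from 0, SQL arrays from 1, so add 1.
    write_array()
    w.write("[")
    write_index()
    w.write(" + 1]")


def write_list_index_offset(w, write_array, write_index) -> None:
    # BigQuery shape: OFFSET() keeps 0-based indexing.
    write_array()
    w.write("[OFFSET(")
    write_index()
    w.write(")]")
```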

## JSON

| Method | What to emit |
|---|---|
| `write_json_field_access(w, write_base, field_name, is_final)` | Access a JSON field. Postgres/DuckDB: `base->'field'` (intermediate) / `base->>'field'` (final). MySQL: `base->>'$.field'` (always text). BigQuery: `JSON_QUERY(base, '$.field')` / `JSON_VALUE(base, '$.field')`. Spark: `get_json_object(base, '$.field')` (single function for both). SQLite: `json_extract(base, '$.field')`. |
| `write_json_existence(w, is_jsonb, field_name, write_base)` | `has(base.field)`. Postgres JSONB: `base ? 'field'`. Postgres JSON: `base->>'field' IS NOT NULL`. Others: `<extract> IS NOT NULL`. |
| `write_json_array_elements(w, is_jsonb, as_text, write_expr)` | Set-returning expression for `FROM <here>` in comprehensions. Postgres: `jsonb_array_elements_text(expr)`. DuckDB: `json_each(expr)` style. BigQuery: `UNNEST(JSON_QUERY_ARRAY(expr))`. Spark: `EXPLODE(from_json(expr, 'ARRAY<STRING>'))`. SQLite: `json_each(expr)`. |
| `write_json_array_length(w, write_expr)` | NULL-safe length of a JSON array column. **Wrap in `COALESCE(..., 0)`** — every dialect does this; the BigQuery wrap was added in PR #8 to match. |
| `write_json_array_membership(w, json_func, write_expr)` | RHS for `lhs = <subquery>` in comprehensions. SQLite: `(SELECT value FROM json_each(expr))`. Spark: raises (no portable boolean-predicate form available without candidate element). |
| `write_nested_json_array_membership(w, write_expr)` | Same as above but for nested chains. |
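
A sketch of the intermediate-vs-final split for an operator-style dialect (Postgres/DuckDB shape; signature from the table, writer interface assumed):

```python
def write_json_field_access(w, write_base, field_name: str, is_final: bool) -> None:
    write_base()
    # -> keeps JSON for further chaining; ->> extracts text at the end of the chain.
    w.write("->>" if is_final else "->")
    w.write("'" + field_name + "'")
```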

## Timestamps

| Method | What to emit |
|---|---|
| `write_duration(w, value, unit)` | Constant duration literal. Postgres/DuckDB: `INTERVAL 'N unit'`. MySQL/SQLite: dialect-specific INTERVAL syntax. Spark: `INTERVAL N unit`. BigQuery: `INTERVAL N unit`. |
| `write_interval(w, write_value, unit)` | Dynamic-value INTERVAL. Same shapes as above with the value coming from a callback. |
| `write_extract(w, part, write_expr, write_tz)` | `EXTRACT(part FROM expr)`. **DOW special case**: Sunday=1 (BigQuery, Spark) vs Sunday=0 (Postgres convention). Adjust with `(dayofweek(expr) - 1)` (Spark) or modulo arithmetic (BigQuery). |
| `write_timestamp_arithmetic(w, op, write_ts, write_dur)` | `timestamp +/- duration`. Postgres/DuckDB: `ts op dur`. BigQuery: `TIMESTAMP_ADD(ts, dur)` / `TIMESTAMP_SUB(...)`. MySQL: `DATE_ADD(...)` / `DATE_SUB(...)`. SQLite: `datetime(ts, '<sign>N unit')`. Spark: `ts op dur`. |

## String functions

| Method | What to emit |
|---|---|
| `write_contains(w, write_haystack, write_needle)` | `haystack.contains(needle)` → boolean. Postgres: `POSITION(needle IN haystack) > 0`. DuckDB: `CONTAINS(haystack, needle)`. MySQL: `LOCATE(needle, haystack) > 0`. BigQuery: `STRPOS(haystack, needle) > 0`. Spark: `LOCATE(needle, haystack) > 0`. SQLite: `INSTR(haystack, needle) > 0`. |
| `write_split(w, write_str, write_delim)` | Split into array. Postgres: `STRING_TO_ARRAY(s, d)`. DuckDB: `STRING_SPLIT(s, d)`. BigQuery: `SPLIT(s, d)`. Spark: `split(s, d)`. MySQL: `JSON_ARRAY(s)` (cannot split into a SQL array; emits a single-element JSON array). SQLite: raises. |
| `write_split_with_limit(w, write_str, write_delim, limit)` | 3-arg split. Spark/Postgres-style: `split(s, d, limit)` or 2-arg + slice. BigQuery: `SPLIT(...)` with `WHERE OFFSET < limit`. |
| `write_join(w, write_array, write_delim)` | Array → string. Postgres/DuckDB: `ARRAY_TO_STRING(arr, delim, '')`. BigQuery: `ARRAY_TO_STRING(arr, delim)`. Spark: `array_join(arr, delim)`. MySQL: `JSON_UNQUOTE(arr)` (no-op fallback). SQLite: raises. |
| `write_format(w, fmt_string, write_args)` | `string.format([args])`. Postgres/BigQuery: `FORMAT('fmt', ...)`. SQLite/DuckDB: `printf('fmt', ...)`. Spark: `format_string('fmt', ...)`. MySQL: raises `UnsupportedDialectFeatureError` (no equivalent). |
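
A sketch of the printf-style form and the raising form. Treating `write_args` as an iterable of zero-arg callbacks is an assumption, as is the exception's constructor; `UnsupportedDialectFeatureError` itself is the exception named elsewhere in this skill.

```python
class UnsupportedDialectFeatureError(Exception):
    """Stand-in for the real exception exported by pycel2sql."""


def write_format_printf(w, fmt_string: str, write_args) -> None:
    # SQLite/DuckDB shape: printf('fmt', arg1, arg2, ...)
    w.write("printf('" + fmt_string.replace("'", "''") + "'")
    for write_arg in write_args:        # assumed: iterable of zero-arg callbacks
        w.write(", ")
        write_arg()
    w.write(")")


def write_format_unsupported(w, fmt_string: str, write_args) -> None:
    # MySQL shape: no equivalent function, so refuse loudly.
    raise UnsupportedDialectFeatureError("format() is not supported by this dialect")
```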

## Comprehensions

| Method | What to emit |
|---|---|
| `write_unnest(w, write_source)` | Set-returning expression for `FROM <here>`. Postgres/DuckDB/BigQuery: `UNNEST(source)`. Spark: `EXPLODE(source)`. MySQL: `JSON_TABLE(source, '$[*]' COLUMNS(...))`. SQLite: `json_each(source)`. |
| `write_array_subquery_open(w)` | Opens an `ARRAY(SELECT ...)` wrapper for `map()` / `filter()`. Postgres/DuckDB: `ARRAY(SELECT `. BigQuery: `ARRAY(SELECT `. Spark: `(SELECT collect_list(` (different — `collect_list` aggregator). MySQL/SQLite: subquery scaffolding. |
| `write_array_subquery_expr_close(w)` | Closes the inner expression before the FROM clause. Postgres/DuckDB: emit nothing (no-op). Spark: `)` (closes `collect_list`). |

## Regex

| Method | What to emit |
|---|---|
| `convert_regex(re2_pattern)` | Validate RE2 pattern + convert to engine-native form. Returns `(pattern, case_insensitive)`. Postgres/DuckDB/Spark: passthrough after ReDoS validators. MySQL: convert to MySQL POSIX form. SQLite: not called (regex unsupported). |

## Struct

| Method | What to emit |
|---|---|
| `write_struct_open(w)` / `write_struct_close(w)` | Struct/record literal opener and closer. Postgres: `ROW(` / `)`. DuckDB: `{` / `}` (struct literal). BigQuery: `STRUCT(` / `)`. Spark: `struct(` / `)`. MySQL/SQLite: `JSON_OBJECT(` / `)` or similar. |

## Validation

| Method | What to emit |
|---|---|
| `max_identifier_length()` | Engine's identifier length limit. Postgres/MySQL: 63/64. BigQuery: 1024. Spark: 128. SQLite: no limit (returns 0). |
| `validate_field_name(name)` | Raise `InvalidFieldNameError` for invalid names. Should check empty, length, regex, reserved-keyword set. |
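
A sketch of `validate_field_name` following the checks the table lists; the regex, the reserved-word subset, and the exception signature are all assumptions, so check the existing dialects for the real rules.

```python
import re


class InvalidFieldNameError(ValueError):
    """Stand-in for the real exception raised by pycel2sql dialects."""


_IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_RESERVED = {"select", "from", "where"}   # illustrative subset, not the real set


def validate_field_name(name: str, max_length: int = 63) -> None:
    if not name:
        raise InvalidFieldNameError("field name is empty")
    if len(name) > max_length:
        raise InvalidFieldNameError(f"field name longer than {max_length} characters")
    if not _IDENTIFIER_RE.match(name):
        raise InvalidFieldNameError(f"invalid characters in field name: {name!r}")
    if name.lower() in _RESERVED:
        raise InvalidFieldNameError(f"field name is a reserved keyword: {name!r}")
```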

## Capabilities

| Method | What to emit |
|---|---|
| `supports_native_arrays()` | True for Postgres/DuckDB/BigQuery/Spark; False for MySQL/SQLite (use JSON arrays). |
| `supports_jsonb()` | True for Postgres only (JSONB-specific behaviour like `?` operator). False everywhere else. |