Fix CRLF pipe-table line ending in tree-sitter scanner#139
Fix CRLF pipe-table line ending in tree-sitter scanner#139
Conversation
Line 2259 had a typo: `(lookahead == '\n' || lookahead != '\r')` —
logically equivalent to `lookahead != '\r'` and never matching `\r`.
Paired with the PIPE_TABLE_LINE_ENDING check on line 2256, the intent
was:
2256: lookahead is NOT a newline (content follows) -> emit
PIPE_TABLE_LINE_ENDING (table continues with another row).
2259: lookahead IS a newline (blank line follows) -> emit
LINE_ENDING (table closes).
The `!=` should be `==`. With LF input the first clause matched `\n`
and the table closed correctly by accident. With CRLF input the
lookahead after consuming the row-ending `\r\n` is `\r` (start of
the blank line), which the buggy expression rejected — so the table
absorbed the following paragraph as a malformed body row instead of
closing.
This propagated through pampa to a structurally wrong Pandoc AST for
any qmd document with a pipe table followed by content on a Windows
checkout (default core.autocrlf=true) — affecting every output
format, since the AST is upstream of every writer.
The 5 pipe-table corpus tests that fail on Windows now pass: Example
201 (GFM), Pipe table can precede content (two variants), Pipe table
can follow content, Pipe table with blank-line caption.
The buggy block was introduced in 79489c4 on this fork. All other
newline checks in scanner.c already use the paired `'\n' || '\r'`
form; line 2259 was the only outlier.
Builds the CRLF input in-process from the LF input rather than reading from a fixture, so Linux CI exercises CRLF too. The existing pipe-table corpus tests miss this on Linux because their `.txt` fixtures are checked out as LF, which masks regressions in scanner.c that only manifest on `\r`.
|
Thank you, that's a really good catch. We have dedicated tree-sitter tests too, but since this is whitespace related, I imagine that we would need a new strategy for those. What do you think about something like a script that converts all unix line breaks to windows linebreaks in the tree-sitter test files and then runs again? I think I would like the parses and errors to stay exactly the same. The source locations would change, but the tree-sitter test suite isn't checking source locations. |
|
Thanks for the feedback. I'll look at this option. |
Step 4 of `cargo xtask verify` now re-runs `tree-sitter test` against a temp copy of the grammar with the corpus converted to CRLF line endings. This locks in the scanner-level CRLF fix from c4524a1 and prevents future grammar/scanner changes from silently regressing CRLF behavior on Linux CI (where corpus files are checked out as LF). Source locations differ between the LF and CRLF runs, but tree-sitter test does not compare them — corpus expectations are bare S-expressions. Skippable via `--skip-treesitter-crlf-tests`. Implementation lives in crates/xtask/src/treesitter_crlf.rs: - to_crlf: idempotent LF→CRLF normalization with 5 unit tests - run_parity_check: copies grammar dir to tempdir, converts every test/corpus/**/*.txt to CRLF, then invokes `tree-sitter test` Verified end-to-end against the markdown grammar (451/451 corpus tests pass under CRLF). Regression-tested by reverting the scanner fix: parity run drops to 446/451, exits non-zero. Design and implementation plan: claude-notes/plans/2026-04-29-tree-sitter-crlf-parity-design.md
|
Done in 8f61f58. Step 4 of 451/451 corpus tests pass under CRLF on the fixed scanner. Reverting the one-character scanner fix drops it to 446/451 and fails the verify, so the parity check does catch the regression it's meant to. Scope is the markdown grammar only — no CI workflow changes here. If a corpus test ever turns out to legitimately depend on bare-LF semantics, we could exclude it case-by-case; none of the current 451 do. About doctemplate and
|
|
@cscheid just asking for review so that you can verify I got your idea right. |
When a Windows checkout (default
core.autocrlf=true) authors a qmd document with a pipe table followed by content, the table absorbs the trailing paragraph as a malformed body row instead of closing at the blank line. Surfaces as 5 failing pipe-table tests undertree-sitter teston Windows; Linux CI passes even with the bug present.Root Cause
In
crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c, line 2259 had a typo. Paired with line 2256, the two checks are meant to be opposites:PIPE_TABLE_LINE_ENDING(table continues with another row).LINE_ENDING(table closes).The expression
(lookahead == '\n' || lookahead != '\r')is logicallylookahead != '\r'and never matches\r. With LF the first clause matched\nand the table closed correctly by accident. With CRLF, the lookahead after consuming\r\nis\r(start of the blank line); the buggy expression rejected it, so fall-through emittedPIPE_TABLE_LINE_ENDINGvia line 2287 instead — extending the table past the blank line.The buggy block was introduced in 79489c4 on this fork. All other newline checks in
scanner.calready use the paired'\n' || '\r'form.Fix
Change
!=to==. The 5 originally-failing pipe-table corpus tests now pass on Windows: Example 201 (GFM), Pipe table can precede content (two variants), Pipe table can follow content, Pipe table with blank-line caption.Tests
Two layers of regression coverage:
pipe_table_crlf_matches_lfthat builds the CRLF input in-process and asserts the parsed S-expression matches the LF parse. The existing corpus tests miss this on Linux because their.txtfixtures are checked out as LF; building CRLF in-process gives Linux CI regression coverage for the specific bug.cargo xtask verify. After the LF corpus passes, it copies the grammar to a tempdir, converts everytest/corpus/**/*.txtto CRLF, and re-runstree-sitter test. Locks in full-corpus CRLF parity for any future grammar/scanner change. Skippable via--skip-treesitter-crlf-tests. 451/451 corpus tests pass under CRLF; reverting the scanner fix correctly drops it to 446/451.Fixes #138