Fix CRLF pipe-table line ending in tree-sitter scanner by cderv · Pull Request #139 · quarto-dev/q2

cderv · 2026-04-28T15:29:00Z

When a Windows checkout (default core.autocrlf=true) authors a qmd document with a pipe table followed by content, the table absorbs the trailing paragraph as a malformed body row instead of closing at the blank line. Surfaces as 5 failing pipe-table tests under tree-sitter test on Windows; Linux CI passes even with the bug present.

Root Cause

In crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c, line 2259 had a typo. Paired with line 2256, the two checks are meant to be opposites:

2256: lookahead is NOT a newline (content follows) -> emit PIPE_TABLE_LINE_ENDING (table continues with another row).
2259: lookahead IS a newline (blank line follows) -> emit LINE_ENDING (table closes).

The expression (lookahead == '\n' || lookahead != '\r') is logically lookahead != '\r' and never matches \r. With LF the first clause matched \n and the table closed correctly by accident. With CRLF, the lookahead after consuming \r\n is \r (start of the blank line); the buggy expression rejected it, so fall-through emitted PIPE_TABLE_LINE_ENDING via line 2287 instead — extending the table past the blank line.

The buggy block was introduced in 79489c4 on this fork. All other newline checks in scanner.c already use the paired '\n' || '\r' form.

Fix

Change != to ==. The 5 originally-failing pipe-table corpus tests now pass on Windows: Example 201 (GFM), Pipe table can precede content (two variants), Pipe table can follow content, Pipe table with blank-line caption.

Tests

Two layers of regression coverage:

A Rust unit test pipe_table_crlf_matches_lf that builds the CRLF input in-process and asserts the parsed S-expression matches the LF parse. The existing corpus tests miss this on Linux because their .txt fixtures are checked out as LF; building CRLF in-process gives Linux CI regression coverage for the specific bug.
A new CRLF parity sub-step inside step 4 of cargo xtask verify. After the LF corpus passes, it copies the grammar to a tempdir, converts every test/corpus/**/*.txt to CRLF, and re-runs tree-sitter test. Locks in full-corpus CRLF parity for any future grammar/scanner change. Skippable via --skip-treesitter-crlf-tests. 451/451 corpus tests pass under CRLF; reverting the scanner fix correctly drops it to 446/451.

Fixes #138

Line 2259 had a typo: `(lookahead == '\n' || lookahead != '\r')` — logically equivalent to `lookahead != '\r'` and never matching `\r`. Paired with the PIPE_TABLE_LINE_ENDING check on line 2256, the intent was: 2256: lookahead is NOT a newline (content follows) -> emit PIPE_TABLE_LINE_ENDING (table continues with another row). 2259: lookahead IS a newline (blank line follows) -> emit LINE_ENDING (table closes). The `!=` should be `==`. With LF input the first clause matched `\n` and the table closed correctly by accident. With CRLF input the lookahead after consuming the row-ending `\r\n` is `\r` (start of the blank line), which the buggy expression rejected — so the table absorbed the following paragraph as a malformed body row instead of closing. This propagated through pampa to a structurally wrong Pandoc AST for any qmd document with a pipe table followed by content on a Windows checkout (default core.autocrlf=true) — affecting every output format, since the AST is upstream of every writer. The 5 pipe-table corpus tests that fail on Windows now pass: Example 201 (GFM), Pipe table can precede content (two variants), Pipe table can follow content, Pipe table with blank-line caption. The buggy block was introduced in 79489c4 on this fork. All other newline checks in scanner.c already use the paired `'\n' || '\r'` form; line 2259 was the only outlier.

Builds the CRLF input in-process from the LF input rather than reading from a fixture, so Linux CI exercises CRLF too. The existing pipe-table corpus tests miss this on Linux because their `.txt` fixtures are checked out as LF, which masks regressions in scanner.c that only manifest on `\r`.

cscheid · 2026-04-28T19:35:11Z

Thank you, that's a really good catch.

We have dedicated tree-sitter tests too, but since this is whitespace related, I imagine that we would need a new strategy for those.

What do you think about something like a script that converts all unix line breaks to windows linebreaks in the tree-sitter test files and then runs again? I think I would like the parses and errors to stay exactly the same. The source locations would change, but the tree-sitter test suite isn't checking source locations.

cderv · 2026-04-29T07:44:58Z

Thanks for the feedback. I'll look at this option.

Step 4 of `cargo xtask verify` now re-runs `tree-sitter test` against a temp copy of the grammar with the corpus converted to CRLF line endings. This locks in the scanner-level CRLF fix from c4524a1 and prevents future grammar/scanner changes from silently regressing CRLF behavior on Linux CI (where corpus files are checked out as LF). Source locations differ between the LF and CRLF runs, but tree-sitter test does not compare them — corpus expectations are bare S-expressions. Skippable via `--skip-treesitter-crlf-tests`. Implementation lives in crates/xtask/src/treesitter_crlf.rs: - to_crlf: idempotent LF→CRLF normalization with 5 unit tests - run_parity_check: copies grammar dir to tempdir, converts every test/corpus/**/*.txt to CRLF, then invokes `tree-sitter test` Verified end-to-end against the markdown grammar (451/451 corpus tests pass under CRLF). Regression-tested by reverting the scanner fix: parity run drops to 446/451, exits non-zero. Design and implementation plan: claude-notes/plans/2026-04-29-tree-sitter-crlf-parity-design.md

cderv · 2026-04-29T09:33:04Z

Done in 8f61f58.

Step 4 of cargo xtask verify now re-runs tree-sitter test against a temp copy of the grammar with the corpus converted to CRLF. Skippable via --skip-treesitter-crlf-tests.

https://github.com/quarto-dev/q2/blob/8f61f587e03569758a0308e0f3ceeee7bd4bf88f/crates/xtask/src/treesitter_crlf.rs

451/451 corpus tests pass under CRLF on the fixed scanner. Reverting the one-character scanner fix drops it to 446/451 and fails the verify, so the parity check does catch the regression it's meant to.

Scope is the markdown grammar only — no CI workflow changes here. If a corpus test ever turns out to legitimately depend on bare-LF semantics, we could exclude it case-by-case; none of the current 451 do.

About doctemplate and `.gitattributes`

The doctemplate grammar has * text=auto eol=lf in its .gitattributes, which forces LF on checkout everywhere, including Windows. That's why a Windows checkout doesn't naturally exercise CRLF for that grammar today — git rewrites the corpus to LF before tree-sitter sees it.

Two ways CRLF coverage could come in for doctemplate later:

Drop eol=lf and let text=auto apply normal Windows CRLF conversion on checkout. Once we have Windows CI running tree-sitter test, that path would catch CRLF regressions there for free.
Or: extend the parity runner in this PR to the doctemplate grammar too. Same mechanism, different grammar dir.

Not doing either here — doctemplate's scanner has no \r handling because newlines aren't scanner-relevant in that grammar, so the CRLF risk is negligible. Flagging it so it's not lost.

cderv · 2026-04-29T11:06:11Z

@cscheid just asking for review so that you can verify I got your idea right.

cderv added 2 commits April 28, 2026 17:17

cderv requested a review from cscheid April 29, 2026 11:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CRLF pipe-table line ending in tree-sitter scanner#139

Fix CRLF pipe-table line ending in tree-sitter scanner#139
cderv wants to merge 3 commits intomainfrom
fix/windows-tree-sitter-crlf

cderv commented Apr 28, 2026 •

edited

Loading

Uh oh!

cscheid commented Apr 28, 2026

Uh oh!

cderv commented Apr 29, 2026

Uh oh!

cderv commented Apr 29, 2026 •

edited

Loading

Uh oh!

cderv commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cderv commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause

Fix

Tests

Uh oh!

cscheid commented Apr 28, 2026

Uh oh!

cderv commented Apr 29, 2026

Uh oh!

cderv commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

About doctemplate and .gitattributes

Uh oh!

cderv commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cderv commented Apr 28, 2026 •

edited

Loading

cderv commented Apr 29, 2026 •

edited

Loading

About doctemplate and `.gitattributes`