Skip to content

Fix CRLF pipe-table line ending in tree-sitter scanner#139

Open
cderv wants to merge 3 commits intomainfrom
fix/windows-tree-sitter-crlf
Open

Fix CRLF pipe-table line ending in tree-sitter scanner#139
cderv wants to merge 3 commits intomainfrom
fix/windows-tree-sitter-crlf

Conversation

@cderv
Copy link
Copy Markdown
Member

@cderv cderv commented Apr 28, 2026

When a Windows checkout (default core.autocrlf=true) authors a qmd document with a pipe table followed by content, the table absorbs the trailing paragraph as a malformed body row instead of closing at the blank line. Surfaces as 5 failing pipe-table tests under tree-sitter test on Windows; Linux CI passes even with the bug present.

Root Cause

In crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c, line 2259 had a typo. Paired with line 2256, the two checks are meant to be opposites:

  • 2256: lookahead is NOT a newline (content follows) -> emit PIPE_TABLE_LINE_ENDING (table continues with another row).
  • 2259: lookahead IS a newline (blank line follows) -> emit LINE_ENDING (table closes).

The expression (lookahead == '\n' || lookahead != '\r') is logically lookahead != '\r' and never matches \r. With LF the first clause matched \n and the table closed correctly by accident. With CRLF, the lookahead after consuming \r\n is \r (start of the blank line); the buggy expression rejected it, so fall-through emitted PIPE_TABLE_LINE_ENDING via line 2287 instead — extending the table past the blank line.

The buggy block was introduced in 79489c4 on this fork. All other newline checks in scanner.c already use the paired '\n' || '\r' form.

Fix

Change != to ==. The 5 originally-failing pipe-table corpus tests now pass on Windows: Example 201 (GFM), Pipe table can precede content (two variants), Pipe table can follow content, Pipe table with blank-line caption.

Tests

Two layers of regression coverage:

  • A Rust unit test pipe_table_crlf_matches_lf that builds the CRLF input in-process and asserts the parsed S-expression matches the LF parse. The existing corpus tests miss this on Linux because their .txt fixtures are checked out as LF; building CRLF in-process gives Linux CI regression coverage for the specific bug.
  • A new CRLF parity sub-step inside step 4 of cargo xtask verify. After the LF corpus passes, it copies the grammar to a tempdir, converts every test/corpus/**/*.txt to CRLF, and re-runs tree-sitter test. Locks in full-corpus CRLF parity for any future grammar/scanner change. Skippable via --skip-treesitter-crlf-tests. 451/451 corpus tests pass under CRLF; reverting the scanner fix correctly drops it to 446/451.

Fixes #138

cderv added 2 commits April 28, 2026 17:17
Line 2259 had a typo: `(lookahead == '\n' || lookahead != '\r')` —
logically equivalent to `lookahead != '\r'` and never matching `\r`.
Paired with the PIPE_TABLE_LINE_ENDING check on line 2256, the intent
was:

  2256: lookahead is NOT a newline (content follows) -> emit
        PIPE_TABLE_LINE_ENDING (table continues with another row).
  2259: lookahead IS a newline (blank line follows) -> emit
        LINE_ENDING (table closes).

The `!=` should be `==`. With LF input the first clause matched `\n`
and the table closed correctly by accident. With CRLF input the
lookahead after consuming the row-ending `\r\n` is `\r` (start of
the blank line), which the buggy expression rejected — so the table
absorbed the following paragraph as a malformed body row instead of
closing.

This propagated through pampa to a structurally wrong Pandoc AST for
any qmd document with a pipe table followed by content on a Windows
checkout (default core.autocrlf=true) — affecting every output
format, since the AST is upstream of every writer.

The 5 pipe-table corpus tests that fail on Windows now pass: Example
201 (GFM), Pipe table can precede content (two variants), Pipe table
can follow content, Pipe table with blank-line caption.

The buggy block was introduced in 79489c4 on this fork. All other
newline checks in scanner.c already use the paired `'\n' || '\r'`
form; line 2259 was the only outlier.
Builds the CRLF input in-process from the LF input rather than reading
from a fixture, so Linux CI exercises CRLF too. The existing pipe-table
corpus tests miss this on Linux because their `.txt` fixtures are
checked out as LF, which masks regressions in scanner.c that only
manifest on `\r`.
@cscheid
Copy link
Copy Markdown
Member

cscheid commented Apr 28, 2026

Thank you, that's a really good catch.

We have dedicated tree-sitter tests too, but since this is whitespace related, I imagine that we would need a new strategy for those.

What do you think about something like a script that converts all unix line breaks to windows linebreaks in the tree-sitter test files and then runs again? I think I would like the parses and errors to stay exactly the same. The source locations would change, but the tree-sitter test suite isn't checking source locations.

@cderv
Copy link
Copy Markdown
Member Author

cderv commented Apr 29, 2026

Thanks for the feedback. I'll look at this option.

Step 4 of `cargo xtask verify` now re-runs `tree-sitter test` against
a temp copy of the grammar with the corpus converted to CRLF line
endings. This locks in the scanner-level CRLF fix from c4524a1 and
prevents future grammar/scanner changes from silently regressing CRLF
behavior on Linux CI (where corpus files are checked out as LF).

Source locations differ between the LF and CRLF runs, but tree-sitter
test does not compare them — corpus expectations are bare S-expressions.

Skippable via `--skip-treesitter-crlf-tests`.

Implementation lives in crates/xtask/src/treesitter_crlf.rs:
- to_crlf: idempotent LF→CRLF normalization with 5 unit tests
- run_parity_check: copies grammar dir to tempdir, converts every
  test/corpus/**/*.txt to CRLF, then invokes `tree-sitter test`

Verified end-to-end against the markdown grammar (451/451 corpus tests
pass under CRLF). Regression-tested by reverting the scanner fix:
parity run drops to 446/451, exits non-zero.

Design and implementation plan:
claude-notes/plans/2026-04-29-tree-sitter-crlf-parity-design.md
@cderv
Copy link
Copy Markdown
Member Author

cderv commented Apr 29, 2026

Done in 8f61f58.

Step 4 of cargo xtask verify now re-runs tree-sitter test against a temp copy of the grammar with the corpus converted to CRLF. Skippable via --skip-treesitter-crlf-tests.

https://github.com/quarto-dev/q2/blob/8f61f587e03569758a0308e0f3ceeee7bd4bf88f/crates/xtask/src/treesitter_crlf.rs

451/451 corpus tests pass under CRLF on the fixed scanner. Reverting the one-character scanner fix drops it to 446/451 and fails the verify, so the parity check does catch the regression it's meant to.

Scope is the markdown grammar only — no CI workflow changes here. If a corpus test ever turns out to legitimately depend on bare-LF semantics, we could exclude it case-by-case; none of the current 451 do.

About doctemplate and .gitattributes

The doctemplate grammar has * text=auto eol=lf in its .gitattributes, which forces LF on checkout everywhere, including Windows. That's why a Windows checkout doesn't naturally exercise CRLF for that grammar today — git rewrites the corpus to LF before tree-sitter sees it.

Two ways CRLF coverage could come in for doctemplate later:

  • Drop eol=lf and let text=auto apply normal Windows CRLF conversion on checkout. Once we have Windows CI running tree-sitter test, that path would catch CRLF regressions there for free.
  • Or: extend the parity runner in this PR to the doctemplate grammar too. Same mechanism, different grammar dir.

Not doing either here — doctemplate's scanner has no \r handling because newlines aren't scanner-relevant in that grammar, so the CRLF risk is negligible. Flagging it so it's not lost.

@cderv cderv requested a review from cscheid April 29, 2026 11:05
@cderv
Copy link
Copy Markdown
Member Author

cderv commented Apr 29, 2026

@cscheid just asking for review so that you can verify I got your idea right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pipe tables with CRLF line endings absorb the following block

2 participants