Rewrite lexer and parser by Schamper · Pull Request #146 · fox-it/dissect.cstruct

Schamper · 2026-03-03T16:19:09Z

Closes #85, partially #142, and will make #86 and #138 a lot easier to implement. Fixes #149.

This PR will (finally) replace the shoddy C syntax parser I originally wrote many moons ago, when I discovered the existence of re.Scanner and ran with it. This PR aims to add a somewhat decent lexer and separate parser. I'm still not a compsci 1337coder, so this is just what I came up with (with some help) and definitely not a textbook implementation. All feedback is welcome.

New lexer
New C syntax parser that utilizes the new lexer
Expression parser re-uses the new lexer
Reworked how sizeof works in the expression parser, and added offsetof
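
As a rough illustration of what a hand-written lexer (as opposed to one built on re.Scanner) can look like, here is a minimal sketch. The `Token` class, the token kinds and the `tokenize` function are hypothetical names chosen for this example; they are not taken from the PR's lexer.py, which is certainly more complete.

```python
from dataclasses import dataclass
from typing import Iterator

# Hypothetical punctuation set for illustration only.
PUNCTUATION = set("{}[]();,*=:.")


@dataclass(frozen=True)
class Token:
    kind: str   # "IDENT", "NUMBER" or "PUNCT"
    value: str
    pos: int    # offset into the source, handy for error messages


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens by scanning the source one character at a time."""
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            yield Token("IDENT", source[start:i], start)
        elif ch.isdigit():
            start = i
            # Also consumes suffixes such as the "x10" in "0x10".
            while i < len(source) and source[i].isalnum():
                i += 1
            yield Token("NUMBER", source[start:i], start)
        elif ch in PUNCTUATION:
            yield Token("PUNCT", ch, i)
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")


tokens = list(tokenize("struct a { uint32 x; };"))
```

Because the scanner is plain Python rather than one large regular expression, adding a new token kind is a matter of adding another branch, and it does not rely on re.Scanner at all.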

The new parser has made changing parsing behavior a lot easier. As such, this PR already makes the following changes:

The new parser is slightly stricter, requiring proper semicolon endings for example. We'll need to fix this in any dissect code that has this.
An important semantic change is how named nested structures are handled. In my infinite wisdom, I originally figured that named nested structures do not "exist" in the top level scope. That's not true, so now named nested structures get properly registered with the cstruct instance:

struct a {
    struct b {
        ...
    };
};

// Will register both `a` and `b`

Another important change is how we deal with struct { ... } name;. We used to parse this first as an anonymous struct, then capture name as the structure type name. That's not strictly correct: name is a variable of an anonymous struct, so we now treat it as such. We don't error on this; instead we silently ignore name and skip ahead until we reach a ;
typedef enum ... is now allowed
Probably some other things I'm forgetting
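
The two struct-related changes above can be sketched with a toy recursive parser. This is only an illustration under stated assumptions: `parse_structs` is a hypothetical helper, it tokenizes with a simple regular expression instead of the PR's lexer, it ignores field contents entirely, and it assumes well-formed input.

```python
import re


def parse_structs(source: str) -> list[str]:
    """Return the names of all defined structs, including nested ones."""
    # Only identifiers, braces and semicolons matter for this sketch.
    tokens = re.findall(r"[A-Za-z_]\w*|[{};]", source)
    registered: list[str] = []
    pos = 0

    def parse_struct() -> None:
        nonlocal pos
        pos += 1  # consume "struct"
        name = None
        if tokens[pos] != "{":  # a tag before "{" names the struct
            name = tokens[pos]
            pos += 1
        if tokens[pos] == "{":
            if name is not None:
                registered.append(name)
            pos += 1
            while tokens[pos] != "}":
                if tokens[pos] == "struct":
                    parse_struct()  # nested definitions are registered too
                else:
                    pos += 1  # field tokens are irrelevant here
            pos += 1  # consume "}"
        # Anything between "}" and ";" is a variable name, not a type name,
        # so it is skipped instead of being registered.
        while tokens[pos] != ";":
            pos += 1
        pos += 1  # consume ";"

    while pos < len(tokens):
        if tokens[pos] == "struct":
            parse_struct()
        else:
            pos += 1
    return registered


nested = parse_structs("struct a { struct b { ... }; };")
anonymous = parse_structs("struct { int x; } name;")
```

With this sketch, `nested` contains both `a` and `b`, while `anonymous` is empty because `name` is treated as a variable and skipped.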

This probably warrants a major version bump, so maybe good to pair this with #114, #144 and what we discussed in #142.

codecov · 2026-03-03T16:25:39Z

Codecov Report

❌ Patch coverage is 0% with 761 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (29652dd) to head (3a37ab0).

Files with missing lines	Patch %	Lines
dissect/cstruct/lexer.py	0.00%	341 Missing ⚠️
dissect/cstruct/parser.py	0.00%	293 Missing ⚠️
dissect/cstruct/expression.py	0.00%	111 Missing ⚠️
dissect/cstruct/utils.py	0.00%	12 Missing ⚠️
dissect/cstruct/cstruct.py	0.00%	2 Missing ⚠️
dissect/cstruct/exceptions.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@          Coverage Diff           @@
##            main    #146    +/-   ##
======================================
  Coverage   0.00%   0.00%            
======================================
  Files         21      22     +1     
  Lines       2470    2582   +112     
======================================
- Misses      2470    2582   +112

Flag	Coverage Δ
unittests	`0.00% <0.00%> (ø)`

codspeed-hq · 2026-03-03T16:27:44Z

Merging this PR will improve performance by 11.81%

⚠️ Different runtime environments detected: some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

⚡ 8 improved benchmarks
❌ 1 regressed benchmark
✅ 3 untouched benchmarks
🆕 2 new benchmarks

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
❌	`test_benchmark_expression_evaluate`	81.9 µs	127.7 µs	-35.89%
⚡	`test_benchmark_attribute_access`	15.3 µs	11.3 µs	+35.53%
⚡	`test_benchmark_expression_parse`	345.4 µs	264.2 µs	+30.73%
⚡	`test_benchmark_basic[compiled]`	81.2 µs	73.3 µs	+10.7%
⚡	`test_benchmark_expression_parse_and_evaluate`	391.6 µs	353.7 µs	+10.73%
⚡	`test_benchmark_getattr_constants`	17 µs	13.6 µs	+25.48%
⚡	`test_benchmark_getattr_typedefs`	27.5 µs	24.3 µs	+13.42%
⚡	`test_benchmark_getattr_types`	27 µs	23.8 µs	+13.69%
🆕	`test_benchmark_lexer`	N/A	2.4 ms	N/A
⚡	`test_benchmark_lexer_and_parser`	15.7 ms	13 ms	+21.21%
🆕	`test_benchmark_parser`	N/A	10.6 ms	N/A

Comparing rewrite-parser (3a37ab0) with main (29652dd)

Schamper · 2026-03-03T16:30:24Z

@sMezaOrellana would be interested in your thoughts on these changes!

twiggler

Architecture looks ok, found 2 possible issues TBC

Migrated-from: fox-it/dissect.cstruct#146

twiggler · 2026-04-16T14:40:28Z

This PR has been migrated to the dissect monorepo: twiggler/dissect-monorepo-test#5

The original diff and commit history have been preserved on the migrate/dissect.cstruct/pr-146 branch.

Schamper · 2026-04-16T21:08:57Z

@twiggler I was a bit sad that your message was just this test, as I thought maybe you left more comments 😉.

Migrated-from: fox-it/dissect.cstruct#146

twiggler

LGTM

Since it is an impactful rewrite I asked @Miauwkeru to QA

Miauwkeru · 2026-04-28T10:50:39Z

I did some checking on our other projects to see what goes wrong, and found some regressions where it crashed unexpectedly.

It cannot evaluate a ternary operation, which occurs in dissect.vmfs: evaluation raises a LexerError because the lexer doesn't understand the ? token. While I am aware that those definitions don't function yet, it probably shouldn't crash when trying to evaluate them.

#define func(x) ( x ? 1 : 0 )

Without a ternary it makes it into a string:

# define func(x)    ( x == 0)
>>> c_struct.func
' ( x ) ( x = = 0)'

Another issue I found was when it tried to define a flag in dissect.apfs. Here it couldn't evaluate the . after INODE in the definition.

flag INODE {
  ...
};

#define APFS_INODE_PINNED_MASK            INODE.PINNED_TO_MAIN | INODE.PINNED_TO_TIER2

I'll look through the code now to see whether I can find some additional issues.
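
To make the second regression concrete, here is a minimal sketch of resolving dotted flag members inside an OR expression. It uses Python's `enum.IntFlag` rather than cstruct's own flag type, the member values are made up because the real definition is elided above, and `evaluate_or_expression` is a hypothetical helper that only understands `|`.

```python
import enum
import re


class INODE(enum.IntFlag):
    # Hypothetical values; the real flag definition is not shown in the thread.
    PINNED_TO_MAIN = 0x1
    PINNED_TO_TIER2 = 0x2


def evaluate_or_expression(expr: str, scope: dict[str, type[enum.IntFlag]]) -> int:
    """Evaluate `A.X | A.Y | 0x4`-style expressions against known flag types."""
    result = 0
    for term in expr.split("|"):
        term = term.strip()
        match = re.fullmatch(r"([A-Za-z_]\w*)\.([A-Za-z_]\w*)", term)
        if match:
            # A dotted name is a member lookup on a registered flag type.
            type_name, member = match.groups()
            result |= int(scope[type_name][member])
        else:
            # Anything else is assumed to be an integer literal.
            result |= int(term, 0)
    return result


mask = evaluate_or_expression(
    "INODE.PINNED_TO_MAIN | INODE.PINNED_TO_TIER2", {"INODE": INODE}
)
```

The point of the sketch is that the `.` has to be recognized as member access on a known type before the `|` operands can be combined.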

Schamper · 2026-04-29T12:51:47Z

@Miauwkeru fixed.

Co-authored-by: Copilot <copilot@github.com>

Miauwkeru · 2026-05-07T14:48:12Z

Only found one more thing that might be an oddity:

#define ADCRYPT_MAGIC ADCRYPT\00

This now fails to parse due to the \00 at the end, which is fixed by using quotes. I don't know if this was an intended change, though.

Schamper · 2026-05-07T15:46:37Z

Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI)

Miauwkeru · 2026-05-11T09:09:28Z

> Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI)

I think it makes most sense to go the C route. In the example I gave:

#define ADCRYPT_MAGIC ADCRYPT\00

C wouldn't compile it as it would expect ADCRYPT to be another definition or something else it can resolve. (Besides not knowing what to do with the \0). So I think it would be better if we require strings to be explicitly quoted.

What do you think? @Schamper @JSCU-CNI

Schamper · 2026-05-11T09:43:04Z

I was leaning in that direction too.

JSCU-CNI · 2026-05-13T12:11:03Z

Agreed. Something like #define ADCRYPT_MAGIC "ADCRYPT\00" or #define ADCRYPT_MAGIC b"ADCRYPT\x00" would make sense.

Schamper · 2026-05-15T13:39:01Z

I changed how #define values are handled a bit in 3a37ab0. Feedback is welcome.

Both ways now actually work.

Miauwkeru · 2026-05-18T09:28:21Z

+    #define RAW somevalue
+    #define STR "hello"
+    #define BYTES b"world"
+    #define NULLRAW ADCRYPT\00


Don't we want it to be explicitly quoted so that it fails on this kind of definition?

Schamper · 2026-05-18

My "new approach" (which is basically, don't tokenize anything after #define NAME, but just take its raw value until the end of the line) allows this to work again. I think as long as it's unit tested, it should be fine to keep in.

The reason why I slightly prefer this new approach is that in the parser, we get a more "true" representation of what the define value actually is, including spacing and such. The downside is that the parser now has to deal a little bit with some basic string parsing.
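
The approach described above (take everything after #define NAME as raw text, then interpret it in the parser) could look roughly like the sketch below. `parse_define` and the `DEFINE` pattern are hypothetical, and the interpretation rules are assumptions: quoted values are read with Python's literal syntax, integers are parsed with base auto-detection, and anything else is kept as the raw string.

```python
import ast
import re

# Hypothetical pattern: capture the name, then the rest of the line untokenized.
DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)[ \t]*(.*)$")


def parse_define(line: str):
    """Split a #define line into a name and an interpreted value."""
    match = DEFINE.match(line)
    if match is None:
        raise ValueError(f"not a #define line: {line!r}")
    name, raw = match.group(1), match.group(2).strip()

    # Assumption: "..." and b"..." values follow Python's literal rules.
    if re.fullmatch(r'b?"(?:[^"\\]|\\.)*"', raw):
        return name, ast.literal_eval(raw)
    try:
        return name, int(raw, 0)  # handles 16, 0x10, 0b1, ...
    except ValueError:
        return name, raw  # unquoted text stays as the raw string


examples = [
    parse_define("#define RAW somevalue"),
    parse_define('#define STR "hello"'),
    parse_define('#define BYTES b"world"'),
    parse_define("#define NULLRAW ADCRYPT\\00"),
]
```

In this sketch the unquoted `ADCRYPT\00` survives as literal text including the backslash, whereas the quoted form would turn `\00` into an actual NUL character. Whether the PR makes the same distinction is not stated in the thread.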

Schamper force-pushed the rewrite-parser branch from 1bb3aac to 5f43faa Compare March 3, 2026 16:24

Schamper mentioned this pull request Mar 17, 2026

ValueError: Cannot use capturing groups in re.Scanner on Python 3.15 #149

Open

Schamper force-pushed the rewrite-parser branch from 5f43faa to 16e6b8b Compare March 24, 2026 13:21

twiggler requested changes Apr 1, 2026

Comment thread dissect/cstruct/expression.py

Comment thread dissect/cstruct/parser.py Outdated

Schamper requested a review from twiggler April 13, 2026 16:01

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 16, 2026

Rewrite lexer and parser

82b6163

Migrated-from: fox-it/dissect.cstruct#146

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 16, 2026

Process review feedback

054f448

Migrated-from: fox-it/dissect.cstruct#146

twiggler mentioned this pull request Apr 16, 2026

[migrated] Rewrite lexer and parser twiggler/dissect-monorepo-test#5

Closed

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Rewrite lexer and parser

db65c16

Migrated-from: fox-it/dissect.cstruct#146

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Process review feedback

90c6fe4

Migrated-from: fox-it/dissect.cstruct#146

twiggler mentioned this pull request Apr 20, 2026

[migrated] Rewrite lexer and parser twiggler/dissect-monorepo-test#6

Closed

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Rewrite lexer and parser

065d903

Migrated-from: fox-it/dissect.cstruct#146

twiggler pushed a commit to twiggler/dissect-monorepo-test that referenced this pull request Apr 20, 2026

Process review feedback

14d3ab1

Migrated-from: fox-it/dissect.cstruct#146

twiggler mentioned this pull request Apr 20, 2026

[migrated] Rewrite lexer and parser twiggler/dissect-monorepo-test#7

Draft

twiggler requested changes Apr 22, 2026

Comment thread dissect/cstruct/utils.py Outdated

Comment thread dissect/cstruct/parser.py Outdated

Comment thread dissect/cstruct/cstruct.py Outdated

Comment thread dissect/cstruct/parser.py Outdated

Comment thread dissect/cstruct/lexer.py Outdated

Schamper requested a review from twiggler April 22, 2026 12:59

twiggler requested a review from Miauwkeru April 23, 2026 11:26

twiggler previously approved these changes Apr 23, 2026

Miauwkeru requested changes Apr 28, 2026

Comment thread dissect/cstruct/lexer.py Outdated

Comment thread tests/test_lexer.py

Comment thread dissect/cstruct/lexer.py

Comment thread tests/test_lexer.py Outdated

Schamper dismissed twiggler’s stale review via f95b499 April 29, 2026 09:29

Schamper requested a review from Miauwkeru April 29, 2026 12:51

Miauwkeru reviewed May 7, 2026

Comment thread tests/test_parser.py

Rewrite lexer and parser

086f9ed

Schamper and others added 5 commits May 7, 2026 13:23

Process review feedback

74c5102

Process review feedback

b2da764

Process review feedback

f0be7db

Co-authored-by: Copilot <copilot@github.com>

Process feedback

577794b

Co-authored-by: Copilot <copilot@github.com>

Address review feedback

0d4ec4d

Schamper force-pushed the rewrite-parser branch from f989504 to 0d4ec4d Compare May 7, 2026 11:23

Schamper requested a review from Miauwkeru May 7, 2026 11:23

Schamper added 4 commits May 7, 2026 13:33

Merge _read_while and _read_until

81799a4

Fix docs error

7dd2608

Different approach for conditional reading

7ebe2f7

Fix linter

ef6c734

Change how #define values are handled

3a37ab0

Miauwkeru reviewed May 18, 2026

Schamper May 18, 2026
