Rewrite lexer and parser#146
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #146 +/- ##
======================================
Coverage 0.00% 0.00%
======================================
Files 21 22 +1
Lines 2470 2582 +112
======================================
- Misses 2470 2582 +112
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Merging this PR will improve performance by 11.81%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ❌ | test_benchmark_expression_evaluate |
81.9 µs | 127.7 µs | -35.89% |
| ⚡ | test_benchmark_attribute_access |
15.3 µs | 11.3 µs | +35.53% |
| ⚡ | test_benchmark_expression_parse |
345.4 µs | 264.2 µs | +30.73% |
| ⚡ | test_benchmark_basic[compiled] |
81.2 µs | 73.3 µs | +10.7% |
| ⚡ | test_benchmark_expression_parse_and_evaluate |
391.6 µs | 353.7 µs | +10.73% |
| ⚡ | test_benchmark_getattr_constants |
17 µs | 13.6 µs | +25.48% |
| ⚡ | test_benchmark_getattr_typedefs |
27.5 µs | 24.3 µs | +13.42% |
| ⚡ | test_benchmark_getattr_types |
27 µs | 23.8 µs | +13.69% |
| 🆕 | test_benchmark_lexer |
N/A | 2.4 ms | N/A |
| ⚡ | test_benchmark_lexer_and_parser |
15.7 ms | 13 ms | +21.21% |
| 🆕 | test_benchmark_parser |
N/A | 10.6 ms | N/A |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing rewrite-parser (3a37ab0) with main (29652dd)
|
@sMezaOrellana would be interested in your thoughts on these changes! |
5f43faa to
16e6b8b
Compare
twiggler
left a comment
There was a problem hiding this comment.
Architecure looks ok, found 2 possible issues TBC
Migrated-from: fox-it/dissect.cstruct#146
Migrated-from: fox-it/dissect.cstruct#146
|
This PR has been migrated to the dissect monorepo: twiggler/dissect-monorepo-test#5 The original diff and commit history have been preserved on the |
|
@twiggler I was a bit sad that your message was just this test, as I thought maybe you left more comments 😉. |
Migrated-from: fox-it/dissect.cstruct#146
Migrated-from: fox-it/dissect.cstruct#146
Migrated-from: fox-it/dissect.cstruct#146
Migrated-from: fox-it/dissect.cstruct#146
twiggler
left a comment
There was a problem hiding this comment.
LGTM
Since it is an impactful rewrite I asked @Miauwkeru to QA
|
I did some checking on our other projects to see what goes wrong, and found some regressions regarding to where it crashed unexpectedly.
#define func(x) ( x ? 1 : 0 )Without a ternary it makes it into a string: # define func(x) ( x == 0)
>>> c_struct.func
' ( x ) ( x = = 0)`
flag INODE {
...
};
#define APFS_INODE_PINNED_MASK INODE.PINNED_TO_MAIN | INODE.PINNED_TO_TIER2I'll look through the code now to see whether I can find some additional issues. |
|
@Miauwkeru fixed. |
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
|
Only found one more thing that might be an oddity: This now fails to parse due to the |
|
Not necessarily intended. What do you think would be reasonable behavior in this case? (also ping @JSCU-CNI) |
I think it makes most sense to go the C route. In the example I gave:
C wouldn't compile it as it would expect ADCRYPT to be another definition or something else it can resolve. (Besides not knowing what to do with the |
|
I was leaning in that direction too. |
|
Agreed. Something like |
|
I changed a bit how Both ways now actually work. |
| #define RAW somevalue | ||
| #define STR "hello" | ||
| #define BYTES b"world" | ||
| #define NULLRAW ADCRYPT\00 |
There was a problem hiding this comment.
Don't we want it to be explicitly quoted so that it fails on this kind of definition?
There was a problem hiding this comment.
My "new approach" (which is basically, don't tokenize anything after #define NAME, but just take its raw value until the end of the line) allows this to work again. I think as long as it's unit tested, it should be fine to keep in.
The reason why I slightly prefer this new approach is so that in the parser, we get a more "true" representation of that the define value actually is, including spacing and such. The downside being that the parser now has to deal a little bit with some basic string parsing.
Closes #85, partially #142, and will make #86 and #138 a lot easier to implement. Fixes #149.
This PR will (finally) replace the shoddy C syntax parser I originally wrote many moons ago, when I discovered the existence of
re.Scannerand ran with it. This PR aims to add a somewhat decent lexer and separate parser. I'm still not a compsci 1337coder, so this is just what I came up with (with some help) and definitely not a textbook implementation. All feedback is welcome.sizeofworks in the expression parser, and addedoffsetofThe new parser has made changing parsing behavior a lot easier. As such, this PR already makes the following changes:
The new parser is slightly stricter, requiring proper semicolon endings for example. We'll need to fix this in any dissect code that has this.
An important semantic change is how named nested structures are handled. In my infinite wisdom, I originally figured that named nested structures do not "exist" in the top level scope. That's not true, so now named nested structures get properly registered with the cstruct instance:
struct { ... } name;. We used to parse this first as an anonymous struct, then capturenameas the structure type name. That's not strictly correct,nameis a variable of an anonymous unnamed struct, so we now treat it as such. We don't error on this, but rather we silently ignorenameand skip until we reach a;typedef enum ...is now allowedThis probably warrants a major version bump, so maybe good to pair this with #114, #144 and what we discussed in #142.