Skip to content

Generated parse() re-decodes grammar tables on every call #630

@avityuk

Description

@avityuk

Hi! First, thanks for grmtools — I'm looking at using it in a project
that parses a lot of small inputs in a tight loop, and while profiling I
noticed something I wanted to flag.

The generated parse() from lrpar_mod! calls
::lrpar::ctbuilder::_reconstitute(__GRM_DATA, __STABLE_DATA) on every
invocation, which does two bincode::decode_from_slice calls to rebuild
the YaccGrammar and StateTable. For a single whole-buffer parse this
is invisible, but if you call parse() many times — e.g. parsing many
small chunks, or parsing inside a hot loop — that decode ends up
dominating. RTParserBuilder::new only borrows the tables, so the work
is pure repeat overhead.

In addition, the per-call setup in lrpar::parser has a couple of small
redundancies:

  • Parser::token_cost is stored as Box<&dyn Fn> — a heap allocation
    around an already-borrowed reference, on every parse.
  • parse_map / parse_actions collect the lexer iterator into a
    Vec<Result<_, _>> and then iterate it again to build the lexeme
    vec, instead of a single collect::<Result<Vec<_>, _>>().
  • The iter_tidxs cost-positive assert! loop runs in release builds,
    even though it's really a builder-side contract.

On the calc_actions example, with 200k iterations of a few short
expressions in release mode, I'm seeing roughly ~2.06 µs/parse → ~0.80 µs/parse
(about a 2.5× speedup) once these are addressed. Most of that is the
table decode; the rest is the small setup items.

I have a working patch and would be happy to send a PR. Two things I'd
appreciate guidance on before I do:

  1. Public API. To cache the tables across calls, generated code
    needs a concrete type to put in a OnceLock. Naming
    (YaccGrammar<S>, StateTable<S>) directly would force every
    downstream crate to depend on lrtable, which seems like a
    regression. My current approach is to introduce an opaque
    lrpar::ParserTables<S> newtype with grm() / stable()
    accessors, re-exported at the crate root, so generated code only
    names ::lrpar::ParserTables. Does that shape sound reasonable, or
    would you prefer it #[doc(hidden)], or somewhere else entirely?

  2. _reconstitute signature. It's already #[doc(hidden)], but its
    return type would change from a tuple to ParserTables<S>. Since
    generated code is regenerated against the matching lrpar version,
    I think this is fine, but happy to keep the old function and add a
    new one if you'd rather not touch it.

Happy to open the PR as-is, or to iterate on the design first — whatever
works best for you. Thanks again!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions