Skip to content

Faster inlining#9148

Open
abadams wants to merge 7 commits into
mainfrom
abadams/faster_inlining
Open

Faster inlining#9148
abadams wants to merge 7 commits into
mainfrom
abadams/faster_inlining

Conversation

@abadams
Copy link
Copy Markdown
Member

@abadams abadams commented May 20, 2026

There are two places in the compiler where we inline Func bodies into caller Stmts or Exprs.

The first is schedule_functions, where they are inlined one at a time in realization order into the Stmt being built. Each requires a walk over the IR so far, so this is O(N^2) in the number of Funcs.

Bounds inference also inlines bodies into the Exprs in the function call dag in order to compute bounds relationships. It avoids the O(N^2) cost by having its own inliner that does all the functions at once. Unfortunately this has a pernicious problem to do with how RemoveLets in CSE works. CSE first throws away any existing lets into to get the invariant that the same Expr node always has the same value. This isn't true if there are lets. RemoveLets is a graph mutator... sort of. When it encounters a let it doesn't know if a previously cached Expr should be substituted in the same way, because a dependency of a dependency of a dependency might refer to the let just introduced, and thus the Expr takes on a different meaning with the let, so it drops the cache. There's no easy way to address this, so RemoveLets just expands the IR tree pessimistically, not sharing copies across lets. The lets created by inline_functions hit the worst case of this and create exponential runtime and IR size post RemoveLets. Global value numbering cleans up the exponential IR size, so the CSE output is good, but the exponential runtime cost was already paid.

Both issues can be solved by inlining functions in small batches instead of either one at a time or all at once. 8 seems to be a sweet spot that gives you a good discount on the O(N^2) term while avoiding the worst of the exponential issue. This PR deduplicates the separate inliners, and speeds up both bounds inference and schedule functions modestly for complex pipelines, More importantly, it tames exponential compile times for pathological ones (see the new test). Some production pipelines more closely resemble the pathological case than our apps.

abadams and others added 6 commits May 19, 2026 16:08
ScheduleFunctions used to call inline_function once per inlinable Func,
which is O(N) walks of the IR. BoundsInference used a separate Inliner
that inlined every Func at once via one CSE invocation, which can be
exponentially expensive in deep inline chains because RemoveLets inside
CSE re-walks shared subtrees under nested Lets.

Neither extreme is great; the sweet spot is to batch a small constant
number of Funcs per CSE invocation, with intermediate CSE flattening
the working IR between batches. Empirically K ≈ 8 is a flat optimum
across a wide range of input sizes (the optimum drifts only as log N).

This change:

- Unifies BoundsInference's Inliner and the one in Inline.h.
- Adds Inliner::do_inlining(Expr/Stmt) which processes the to_inline
  set by iterative deepening: passes raise an active_limit through the
  add() sequence, only inlining entries below the current limit.
- Each Entry caches both its qualified body and the lowest order_id
  of any inlinable Call still inside it, so the cache is re-processed
  exactly when the current limit makes new things eligible.
- The outer loop jumps the limit directly to the lowest pending
  order_id rather than stepping by batch_size, so an Expr that only
  references far-out entries doesn't do useless intermediate passes.
- ScheduleFunctions collects consecutive inlinable groups into one
  inline_functions call, flushing before each realization so
  validate_schedule sees an up-to-date 's'.

For best performance, callers should add() in consumer-first
(reverse-topological) order. Any add() order is correct, just slower.

On a Fibonacci-shaped stress test (test/correctness/deep_inline_chain
at n=200), schedule_functions and computation bounds inference each go
from ~4s to ~0.13s. On the apps suite the speedup is more modest --
typical apps have only a few funcs in any single batch, so most of the
savings come from amortizing the per-call walk of 's' across a batch
rather than from the iterative deepening loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The test was previously a smoke test -- it just compile_jit'd a deeply
inlined pipeline and printed Success! if nothing crashed. The comment
described it as Fibonacci-shaped which was no longer accurate.

Update the comment to describe what the test actually exercises (a chain
of compute_inline Funcs where each one references its last 10
predecessors through a per-level LUT), what failure modes it guards
against (hangs, crashes, wrong values), and rewrite the body to compute
the expected output value in plain C++ alongside the IR build so a
silently-incorrect inliner result will fail the test with "Mismatch"
rather than print "Success!".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove unused includes (<chrono>, <iostream>, <set>) left over from
earlier instrumentation in Inline.cpp, and an unused <set> from Inline.h.
Also trim two comments in visit(Call) that no longer add information.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
do_inlining and operator() were synonyms. Collapse to operator(), which
is the natural entry point for an IRMutator-derived class anyway.

Also drop the ScheduleFunctions comment that leaked Inliner internals
(per-CSE batching is an Inliner implementation detail).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 81.57895% with 28 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3bb5d0c). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/Inline.cpp 85.57% 6 Missing and 9 partials ⚠️
src/ScheduleFunctions.cpp 66.66% 6 Missing and 7 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9148   +/-   ##
=======================================
  Coverage        ?   69.30%           
=======================================
  Files           ?      255           
  Lines           ?    78280           
  Branches        ?    18737           
=======================================
  Hits            ?    54251           
  Misses          ?    18519           
  Partials        ?     5510           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

1. error_hoist_storage_without_compute_at: the new ScheduleFunctions
   short-circuit for inlinable groups skipped validate_schedule, which is
   where the hoist_storage-without-compute_at error lives. Calling the
   full validate_schedule on inlinable funcs is also wrong though: it
   walks 's' for call sites via ComputeLegalSchedules, which fails for
   batched inline chains whose inner call sites haven't been exposed in
   's' yet (e.g. repeat_edge called from a not-yet-flushed downsampled).
   So we run only the schedule-property checks directly here.

2. truncated_pyramid/fft: the qualified-body cache only propagated its
   lowest_pending_order_id up to the deepening loop's min_skipped when
   re-processing the cache. On a cache hit without re-processing, the
   cached body still has un-inlined Calls at order_id == lowest_pending
   but the outer loop wasn't told, so it could terminate while those
   Calls remained. Propagate lowest_pending up unconditionally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant