Faster inlining#9148
Open
abadams wants to merge 7 commits into
Open
Conversation
ScheduleFunctions used to call inline_function once per inlinable Func, which is O(N) walks of the IR. BoundsInference used a separate Inliner that inlined every Func at once via one CSE invocation, which can be exponentially expensive in deep inline chains because RemoveLets inside CSE re-walks shared subtrees under nested Lets. Neither extreme is great; the sweet spot is to batch a small constant number of Funcs per CSE invocation, with intermediate CSE flattening the working IR between batches. Empirically K ≈ 8 is a flat optimum across a wide range of input sizes (the optimum drifts only as log N). This change: - Unifies BoundsInference's Inliner and the one in Inline.h. - Adds Inliner::do_inlining(Expr/Stmt) which processes the to_inline set by iterative deepening: passes raise an active_limit through the add() sequence, only inlining entries below the current limit. - Each Entry caches both its qualified body and the lowest order_id of any inlinable Call still inside it, so the cache is re-processed exactly when the current limit makes new things eligible. - The outer loop jumps the limit directly to the lowest pending order_id rather than stepping by batch_size, so an Expr that only references far-out entries doesn't do useless intermediate passes. - ScheduleFunctions collects consecutive inlinable groups into one inline_functions call, flushing before each realization so validate_schedule sees an up-to-date 's'. For best performance, callers should add() in consumer-first (reverse-topological) order. Any add() order is correct, just slower. On a Fibonacci-shaped stress test (test/correctness/deep_inline_chain at n=200), schedule_functions and computation bounds inference each go from ~4s to ~0.13s. On the apps suite the speedup is more modest -- typical apps have only a few funcs in any single batch, so most of the savings come from amortizing the per-call walk of 's' across a batch rather than from the iterative deepening loop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The test was previously a smoke test -- it just compile_jit'd a deeply inlined pipeline and printed Success! if nothing crashed. The comment described it as Fibonacci-shaped which was no longer accurate. Update the comment to describe what the test actually exercises (a chain of compute_inline Funcs where each one references its last 10 predecessors through a per-level LUT), what failure modes it guards against (hangs, crashes, wrong values), and rewrite the body to compute the expected output value in plain C++ alongside the IR build so a silently-incorrect inliner result will fail the test with "Mismatch" rather than print "Success!". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove unused includes (<chrono>, <iostream>, <set>) left over from earlier instrumentation in Inline.cpp, and an unused <set> from Inline.h. Also trim two comments in visit(Call) that no longer add information. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
do_inlining and operator() were synonyms. Collapse to operator(), which is the natural entry point for an IRMutator-derived class anyway. Also drop the ScheduleFunctions comment that leaked Inliner internals (per-CSE batching is an Inliner implementation detail). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9148 +/- ##
=======================================
Coverage ? 69.30%
=======================================
Files ? 255
Lines ? 78280
Branches ? 18737
=======================================
Hits ? 54251
Misses ? 18519
Partials ? 5510 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
1. error_hoist_storage_without_compute_at: the new ScheduleFunctions short-circuit for inlinable groups skipped validate_schedule, which is where the hoist_storage-without-compute_at error lives. Calling the full validate_schedule on inlinable funcs is also wrong though: it walks 's' for call sites via ComputeLegalSchedules, which fails for batched inline chains whose inner call sites haven't been exposed in 's' yet (e.g. repeat_edge called from a not-yet-flushed downsampled). So we run only the schedule-property checks directly here. 2. truncated_pyramid/fft: the qualified-body cache only propagated its lowest_pending_order_id up to the deepening loop's min_skipped when re-processing the cache. On a cache hit without re-processing, the cached body still has un-inlined Calls at order_id == lowest_pending but the outer loop wasn't told, so it could terminate while those Calls remained. Propagate lowest_pending up unconditionally. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
There are two places in the compiler where we inline Func bodies into caller Stmts or Exprs.
The first is schedule_functions, where they are inlined one at a time in realization order into the Stmt being built. Each requires a walk over the IR so far, so this is O(N^2) in the number of Funcs.
Bounds inference also inlines bodies into the Exprs in the function call dag in order to compute bounds relationships. It avoids the O(N^2) cost by having its own inliner that does all the functions at once. Unfortunately this has a pernicious problem to do with how RemoveLets in CSE works. CSE first throws away any existing lets into to get the invariant that the same Expr node always has the same value. This isn't true if there are lets. RemoveLets is a graph mutator... sort of. When it encounters a let it doesn't know if a previously cached Expr should be substituted in the same way, because a dependency of a dependency of a dependency might refer to the let just introduced, and thus the Expr takes on a different meaning with the let, so it drops the cache. There's no easy way to address this, so RemoveLets just expands the IR tree pessimistically, not sharing copies across lets. The lets created by inline_functions hit the worst case of this and create exponential runtime and IR size post RemoveLets. Global value numbering cleans up the exponential IR size, so the CSE output is good, but the exponential runtime cost was already paid.
Both issues can be solved by inlining functions in small batches instead of either one at a time or all at once. 8 seems to be a sweet spot that gives you a good discount on the O(N^2) term while avoiding the worst of the exponential issue. This PR deduplicates the separate inliners, and speeds up both bounds inference and schedule functions modestly for complex pipelines, More importantly, it tames exponential compile times for pathological ones (see the new test). Some production pipelines more closely resemble the pathological case than our apps.