Faster inlining by abadams · Pull Request #9148 · halide/Halide

abadams · 2026-05-20T01:52:19Z

There are two places in the compiler where we inline Func bodies into caller Stmts or Exprs.

The first is schedule_functions, where they are inlined one at a time in realization order into the Stmt being built. Each requires a walk over the IR so far, so this is O(N^2) in the number of Funcs.

Bounds inference also inlines bodies into the Exprs in the function call DAG in order to compute bounds relationships. It avoids the O(N^2) cost by having its own inliner that does all the functions at once. Unfortunately this has a pernicious problem to do with how RemoveLets in CSE works. CSE first removes any existing lets in order to get the invariant that the same Expr node always has the same value, which isn't true if there are lets. RemoveLets is a graph mutator... sort of. When it encounters a let, it doesn't know whether a previously cached Expr should be substituted in the same way, because a dependency of a dependency of a dependency might refer to the let just introduced, so the Expr takes on a different meaning under the let. It therefore drops the cache. There's no easy way to address this, so RemoveLets just expands the IR tree pessimistically, not sharing copies across lets. The lets created by inline_functions hit the worst case of this, producing exponential runtime and exponential IR size after RemoveLets. Global value numbering cleans up the exponential IR size, so the CSE output is good, but the exponential runtime cost has already been paid.
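
To get a feel for how costly unshared expansion is, here is a toy model (not Halide code) in which expression e_n refers to both e_{n-1} and e_{n-2}, as in a Fibonacci-shaped inline chain. The sketch compares the node count of the fully expanded tree with the number of distinct nodes a shared graph needs. Both function names are hypothetical.

```cpp
#include <cstdint>

// Toy model, not Halide code: e_0 and e_1 are leaves, and
// e_n = op(e_{n-1}, e_{n-2}) for n >= 2.

// Node count when every reference is expanded into its own copy, which is
// roughly what happens when nothing is shared across lets.
uint64_t expanded_tree_nodes(int n) {
    if (n < 2) return 1;
    uint64_t prev2 = 1, prev1 = 1;  // sizes of e_0 and e_1
    for (int i = 2; i <= n; i++) {
        uint64_t cur = 1 + prev1 + prev2;  // one op node plus both subtrees
        prev2 = prev1;
        prev1 = cur;
    }
    return prev1;
}

// Node count when identical subexpressions are shared, which is what global
// value numbering recovers: one node per distinct e_i.
uint64_t shared_graph_nodes(int n) {
    return static_cast<uint64_t>(n) + 1;
}
```

For n = 10 the expanded tree has 177 nodes against 11 shared ones, and the gap grows exponentially with n.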

Both issues can be solved by inlining functions in small batches instead of either one at a time or all at once. A batch size of 8 seems to be a sweet spot that gives you a good discount on the O(N^2) term while avoiding the worst of the exponential issue. This PR deduplicates the separate inliners, and speeds up both bounds inference and schedule_functions modestly for complex pipelines. More importantly, it tames exponential compile times for pathological ones (see the new test). Some production pipelines more closely resemble the pathological case than our apps do.
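
The discount on the O(N^2) term can be illustrated with a deliberately crude cost model. It is an assumption made for illustration, not a measurement from this PR: suppose the IR grows by one unit per inlined Func, and each inlining call walks the whole IR once. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Crude cost model (an assumption, not measured): after i Funcs have been
// inlined the IR has size i, and each inlining call walks the whole IR once
// after adding its batch. Returns the total number of IR units walked.
// batch_size must be positive.
uint64_t total_walk_cost(int num_funcs, int batch_size) {
    uint64_t cost = 0;
    for (int done = 0; done < num_funcs;) {
        done = std::min(done + batch_size, num_funcs);
        cost += static_cast<uint64_t>(done);  // one walk over the IR so far
    }
    return cost;
}
```

Under this model 200 Funcs cost 20100 units when inlined one at a time and 2600 units in batches of 8. The model leaves out the RemoveLets cost, which grows with batch size, so it says nothing about why very large batches are a bad idea.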

ScheduleFunctions used to call inline_function once per inlinable Func, which is O(N) walks of the IR. BoundsInference used a separate Inliner that inlined every Func at once via one CSE invocation, which can be exponentially expensive in deep inline chains because RemoveLets inside CSE re-walks shared subtrees under nested Lets. Neither extreme is great; the sweet spot is to batch a small constant number of Funcs per CSE invocation, with intermediate CSE flattening the working IR between batches. Empirically K ≈ 8 is a flat optimum across a wide range of input sizes (the optimum drifts only as log N).

This change:

- Unifies BoundsInference's Inliner and the one in Inline.h.
- Adds Inliner::do_inlining(Expr/Stmt) which processes the to_inline set by iterative deepening: passes raise an active_limit through the add() sequence, only inlining entries below the current limit.
- Each Entry caches both its qualified body and the lowest order_id of any inlinable Call still inside it, so the cache is re-processed exactly when the current limit makes new things eligible.
- The outer loop jumps the limit directly to the lowest pending order_id rather than stepping by batch_size, so an Expr that only references far-out entries doesn't do useless intermediate passes.
- ScheduleFunctions collects consecutive inlinable groups into one inline_functions call, flushing before each realization so validate_schedule sees an up-to-date 's'.

For best performance, callers should add() in consumer-first (reverse-topological) order. Any add() order is correct, just slower.

On a Fibonacci-shaped stress test (test/correctness/deep_inline_chain at n=200), schedule_functions and computation bounds inference each go from ~4s to ~0.13s. On the apps suite the speedup is more modest -- typical apps have only a few funcs in any single batch, so most of the savings come from amortizing the per-call walk of 's' across a batch rather than from the iterative deepening loop.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
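
The deepening loop described in this commit message can be sketched with a toy model. This is not the Halide Inliner: it leaves out the qualified-body cache and CSE, represents the working IR as just the set of calls still pending, and assumes the next limit is the lowest pending order id plus the batch size. `ToyInliner` and its members are hypothetical names.

```cpp
#include <set>
#include <vector>

// Toy model, not the Halide Inliner. Entry i stands for the i-th Func passed
// to add(); deps[i] lists the entries its body calls. The dependency graph
// must be acyclic.
struct ToyInliner {
    std::vector<std::vector<int>> deps;
    int batch_size = 8;

    // Inline every pending call whose order id is below `limit`.
    // Returns the lowest order id left pending, or -1 if none remain.
    int run_pass(std::set<int> &pending, int limit) const {
        while (!pending.empty() && *pending.begin() < limit) {
            int id = *pending.begin();
            pending.erase(pending.begin());
            // Inlining entry `id` exposes the calls in its body.
            pending.insert(deps[id].begin(), deps[id].end());
        }
        return pending.empty() ? -1 : *pending.begin();
    }

    // Runs passes until nothing is pending and returns the number of passes.
    // Assumption: the next limit is the lowest pending id plus batch_size.
    int inline_all(std::set<int> pending) const {
        int passes = 0;
        int limit = batch_size;
        while (true) {
            int lowest_pending = run_pass(pending, limit);
            passes++;
            if (lowest_pending < 0) break;
            limit = lowest_pending + batch_size;  // jump instead of stepping
        }
        return passes;
    }
};
```

With a 20-entry chain added consumer-first (entry i calls entry i + 1), a root that calls entry 0 finishes in three passes. A root that calls only entry 17 finishes in two, whereas stepping the limit by batch_size (8, 16, 24) would have taken three.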

The test was previously a smoke test -- it just compile_jit'd a deeply inlined pipeline and printed Success! if nothing crashed. The comment described it as Fibonacci-shaped, which was no longer accurate. Update the comment to describe what the test actually exercises (a chain of compute_inline Funcs where each one references its last 10 predecessors through a per-level LUT), what failure modes it guards against (hangs, crashes, wrong values), and rewrite the body to compute the expected output value in plain C++ alongside the IR build so a silently-incorrect inliner result will fail the test with "Mismatch" rather than print "Success!".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove unused includes (<chrono>, <iostream>, <set>) left over from earlier instrumentation in Inline.cpp, and an unused <set> from Inline.h. Also trim two comments in visit(Call) that no longer add information.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

do_inlining and operator() were synonyms. Collapse to operator(), which is the natural entry point for an IRMutator-derived class anyway. Also drop the ScheduleFunctions comment that leaked Inliner internals (per-CSE batching is an Inliner implementation detail).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

codecov · 2026-05-20T03:16:08Z

Codecov Report

❌ Patch coverage is 81.57895% with 28 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3bb5d0c). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/Inline.cpp	85.57%	6 Missing and 9 partials ⚠️
src/ScheduleFunctions.cpp	66.66%	6 Missing and 7 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #9148   +/-   ##
=======================================
  Coverage        ?   69.30%           
=======================================
  Files           ?      255           
  Lines           ?    78280           
  Branches        ?    18737           
=======================================
  Hits            ?    54251           
  Misses          ?    18519           
  Partials        ?     5510


1. error_hoist_storage_without_compute_at: the new ScheduleFunctions short-circuit for inlinable groups skipped validate_schedule, which is where the hoist_storage-without-compute_at error lives. Calling the full validate_schedule on inlinable funcs is also wrong though: it walks 's' for call sites via ComputeLegalSchedules, which fails for batched inline chains whose inner call sites haven't been exposed in 's' yet (e.g. repeat_edge called from a not-yet-flushed downsampled). So we run only the schedule-property checks directly here.
2. truncated_pyramid/fft: the qualified-body cache only propagated its lowest_pending_order_id up to the deepening loop's min_skipped when re-processing the cache. On a cache hit without re-processing, the cached body still has un-inlined Calls at order_id == lowest_pending but the outer loop wasn't told, so it could terminate while those Calls remained. Propagate lowest_pending up unconditionally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
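
The second fix can be pictured with a minimal sketch of the cache-hit path. It is a toy model built only from the description in this commit message, not the actual Inline.cpp code; `CachedBody`, `on_cache_hit` and `outer_loop_should_continue` are hypothetical names, and INT_MAX stands in for "nothing pending".

```cpp
#include <algorithm>
#include <climits>

// Toy model, not Halide code. A cached, partially inlined body remembers the
// lowest order id of any inlinable call still inside it (INT_MAX if none).
struct CachedBody {
    int lowest_pending = INT_MAX;
};

// Visit a cached body that is NOT re-processed in this pass, and return the
// updated min_skipped that the outer deepening loop will inspect.
// propagate == false models the old behaviour, true models the fix.
int on_cache_hit(const CachedBody &entry, int min_skipped, bool propagate) {
    return propagate ? std::min(min_skipped, entry.lowest_pending)
                     : min_skipped;
}

// The outer loop runs another pass only if some call was skipped.
bool outer_loop_should_continue(int min_skipped) {
    return min_skipped != INT_MAX;
}
```

With a cached body whose lowest pending id is 12, the old behaviour leaves min_skipped at INT_MAX, so the loop stops while that call is still un-inlined. Propagating unconditionally reports 12 and the loop keeps going.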

abadams and others added 6 commits May 19, 2026 16:08

Merge remote-tracking branch 'origin/main' into abadams/faster_inlining

e14bc73

Apply pre-commit auto-fixes

d8ca397

abadams wants to merge 7 commits into main from abadams/faster_inlining
