perf: cache FunctionId2 handles in BlazeSymbolizerWrapper #549
Open
r1viollet wants to merge 3 commits into
intern_function() was called on every sample for every blaze-symbolized
frame, driving ~128 ms/min of CPU in ProfilesDictionary_insert_str even
on stable workloads where the same addresses recur.
Two caches added to BlazeSymbolizerWrapper (per-ELF-file, same lifetime
as the symbolizer instance):
- function_cache: elf_addr → [{FunctionId2, line}]
  On first symbolization of an address, intern strings and populate.
  On cache hit, write the stored handles directly, with zero dict calls.
- nosym_cache: MappingId2 → FunctionId2
  intern_function("", sopath) is identical for all frames of the same
  DSO. Cache the result on first call per mapping.
Co-authored-by: nsavoire <nsavoire@users.noreply.github.com>
ede1254 to e5b9e5f
`intern_function()` (→ 3× `ProfilesDictionary_insert_str`) was called on
every sample for every blaze-symbolized frame, driving ~128 ms/min of CPU
in `ProfilesDictionary_insert_str` even on stable workloads.
The existing `SymbolTable` / `symbol_idx` cache covers frames resolved at
unwind time (DSO/common/base/runtime lookups). Blaze frames are added with
`k_symbol_idx_null` and a `file_info_id`; `process_symbolization` runs
blaze at pprof-creation time but never stores the resulting FunctionId2 —
so every sample for the same ELF address re-interns from scratch.
**Level 1 — function identity** (shared across all call sites):
- `function_id_cache[func_start]` → `FunctionId2` for outer frames,
keyed by `blaze_sym.addr` (function start address). All call sites
within the same function share one dict handle — zero `intern_function`
calls after the first visit to any call site of that function.
- `function_id_cache[inlined_key(elf_addr, idx)]` → `FunctionId2` for
inlined frames (per call site, since inlined frames have no start addr).
**Level 2 — call site** (fast full hit for repeated addresses):
- `address_cache[elf_addr]` → `{func_start, lines[]}`. When the exact
same ELF address recurs in a later sample, write from the caches
without calling blaze or touching the dict at all.
**nosym_cache** — `MappingId2 → FunctionId2` for no-symbol frames; all
frames from the same DSO share one "empty name, sopath" function handle.
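A minimal sketch of how the two levels and `nosym_cache` could fit together, for readers who want the lookup order spelled out. This is not the code in this PR: the type aliases, `Resolved`, `TwoLevelCache`, and the `symbolize` / `intern` callbacks are hypothetical stand-ins for the blaze call and `intern_function()`, and inlined frames are left out for brevity.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; the real types come from ddprof and libdatadog.
using ElfAddress_t = std::uint64_t;
using FunctionId2 = std::uint32_t;
using MappingId2 = std::uint32_t;

// What one (hypothetical) symbolizer call yields for an outer frame.
struct Resolved {
  ElfAddress_t func_start;
  std::uint32_t line;
  std::string name;
};

class TwoLevelCache {
public:
  // Counters named after the stats described in this PR.
  std::uint64_t intern_fn_calls = 0;
  std::uint64_t addr_misses = 0;
  std::uint64_t addr_hits = 0;

  // Returns the function handle for elf_addr. `symbolize` runs only on an
  // address miss; `intern` runs only the first time a function is seen.
  template <typename Symbolize, typename Intern>
  FunctionId2 lookup(ElfAddress_t elf_addr, Symbolize &&symbolize,
                     Intern &&intern) {
    // Level 2: the exact address was seen before, so nothing else is called.
    const auto site = address_cache_.find(elf_addr);
    if (site != address_cache_.end()) {
      ++addr_hits;
      return function_id_cache_.at(site->second.func_start);
    }
    ++addr_misses;
    const Resolved sym = symbolize(elf_addr);
    // Level 1: another call site of the same function may already have
    // interned it, in which case the handle is reused.
    auto fn = function_id_cache_.find(sym.func_start);
    if (fn == function_id_cache_.end()) {
      ++intern_fn_calls;
      fn = function_id_cache_.emplace(sym.func_start, intern(sym)).first;
    }
    address_cache_.emplace(elf_addr, CallSite{sym.func_start, sym.line});
    return fn->second;
  }

  // No-symbol frames: one handle per mapping, interned on first use.
  template <typename InternNoSym>
  FunctionId2 lookup_nosym(MappingId2 mapping, InternNoSym &&intern_nosym) {
    auto it = nosym_cache_.find(mapping);
    if (it == nosym_cache_.end()) {
      ++intern_fn_calls;
      it = nosym_cache_.emplace(mapping, intern_nosym(mapping)).first;
    }
    return it->second;
  }

private:
  struct CallSite {
    ElfAddress_t func_start;
    std::uint32_t line;
  };
  std::unordered_map<ElfAddress_t, FunctionId2> function_id_cache_;
  std::unordered_map<ElfAddress_t, CallSite> address_cache_;
  std::unordered_map<MappingId2, FunctionId2> nosym_cache_;
};
```

In this sketch, two different addresses inside the same function cost two address misses but only one `intern` call, and a repeated address costs neither a symbolizer call nor an `intern` call.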
| Cycle | addr_misses | **intern_fn_calls** | addr_hits | hit rate |
|-------|-------------|---------------------|-----------|----------|
| 1 (cold) | 2771 | 115 | 5.26 M | 99.9 % |
| 2+ (warm) | ~300 | **0** | ~6 M | 100.0 % |
Zero actual dict insertions in steady state.
Three new stats visible via `--internal_stats`:
- `symbols.blaze.intern_fn_calls` — actual `intern_function()` calls
- `symbols.blaze.addr_misses` — new ELF addresses not yet in cache
- `symbols.blaze.addr_hits` — full cache hits (zero dict cost)
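As a rough illustration of how these three counters relate to the hit-rate column in the table, here is a small sketch. `BlazeCacheStats` and `hit_rate()` are hypothetical names, and the formula hits / (hits + misses) is an assumption about how the column was derived, not something taken from the code.

```cpp
#include <cstdint>

// Hypothetical per-cycle counter block; field names mirror the stat names.
struct BlazeCacheStats {
  std::uint64_t intern_fn_calls = 0;  // actual intern_function() calls
  std::uint64_t addr_misses = 0;      // ELF addresses not yet cached
  std::uint64_t addr_hits = 0;        // lookups served fully from cache

  // Assumed definition: fraction of address lookups served from cache.
  double hit_rate() const {
    const std::uint64_t total = addr_hits + addr_misses;
    return total == 0 ? 0.0
                      : static_cast<double>(addr_hits) /
                            static_cast<double>(total);
  }

  // Counters restart at every export cycle.
  void reset() { *this = BlazeCacheStats(); }
};
```

Under that assumed formula, the cold-cycle row (2771 misses against about 5.26 M hits) gives a rate just above 0.999, consistent with the 99.9 % shown.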
`symbol_idx` is set at **unwind time** and covers DSO/common/base/runtime
symbols. Blaze frames arrive with `k_symbol_idx_null`; updating `symbol_idx`
retroactively in committed `FunLoc` entries is unsafe. This cache is a
complementary layer at pprof-creation time. Longer term, feeding blaze
results back into the unwind-time cache would remove the need for this layer.
Co-authored-by: nsavoire <nsavoire@users.noreply.github.com>
ede1254 to
b5528cf
Compare
Avoids relying on the assumption that ELF virtual addresses fit in 48 bits.
A pair<ElfAddress_t, unsigned> key could in theory be packed into a single
uint64_t (ELF vaddrs are well under 48 bits in practice on both aarch64 and
x86_64), but keeping it as a std::pair removes any architectural dependency.
Separates the two caches cleanly:
- function_id_cache: ElfAddress_t (func_start) → FunctionId2 (outer frames)
- inlined_id_cache: {elf_addr, inlined_idx} → FunctionId2 (inlined frames)
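To make the trade-off concrete, here is a sketch of a pair-keyed map next to a packed key. It assumes the cache is a `std::unordered_map`, which the commit message does not state. The standard library provides no `std::hash` for `std::pair`, so a hasher has to be supplied. `InlinedKeyHash`, `pack_key`, and the type aliases are hypothetical, and the 16-bit shift in `pack_key` is only one possible packing.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the real ddprof/libdatadog types.
using ElfAddress_t = std::uint64_t;
using FunctionId2 = std::uint32_t;

// Key for inlined frames: (ELF address, index of the inlined frame).
using InlinedKey = std::pair<ElfAddress_t, unsigned>;

// Hasher for the pair key, combining both halves (boost-style hash_combine).
struct InlinedKeyHash {
  std::size_t operator()(const InlinedKey &k) const noexcept {
    const std::size_t h1 = std::hash<ElfAddress_t>{}(k.first);
    const std::size_t h2 = std::hash<unsigned>{}(k.second);
    return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
  }
};

using InlinedIdCache =
    std::unordered_map<InlinedKey, FunctionId2, InlinedKeyHash>;

// One possible packed key: unique only while addr < 2^48 and idx < 2^16.
constexpr std::uint64_t pack_key(ElfAddress_t addr, unsigned idx) {
  return (addr << 16) | idx;
}
```

With the pair key, equality compares both fields in full, so two distinct (address, index) pairs can never alias. With this particular packing, an address at or above 2^48 loses its top bits in the shift: `pack_key(1ULL << 48, 0)` equals `pack_key(0, 0)`.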
Problem
`intern_function()` (→ 3× `ProfilesDictionary_insert_str`) was called on every sample for every blaze-symbolized frame, driving ~128 ms/min of CPU in `ProfilesDictionary_insert_str` even on stable workloads.
The existing `SymbolTable` / `symbol_idx` cache covers frames resolved at unwind time (DSO/common/base/runtime lookups). Blaze frames are added with `k_symbol_idx_null` and a `file_info_id`; `process_symbolization` runs blaze at pprof-creation time but never stores the resulting `FunctionId2`, so every sample for the same ELF address re-interns from scratch.
Solution: two-level `FunctionId2` cache in `BlazeSymbolizerWrapper`
Level 1: function identity (shared across all call sites of the same function):
- `function_id_cache[func_start]` → `FunctionId2` for outer frames, keyed by `blaze_sym.addr` (function start address). All call sites within the same function share one dict handle, so there are zero `intern_function` calls after the first visit to any call site.
- `function_id_cache[inlined_key(elf_addr, idx)]` → `FunctionId2` for inlined frames (per call site, since inlined frames have no function start address).
Level 2: call site (full hit for repeated exact addresses):
- `address_cache[elf_addr]` → `{func_start, lines[]}`. When the exact same ELF address recurs, write from the caches without calling blaze or touching the dict.
`nosym_cache`: `MappingId2 → FunctionId2` for no-symbol frames. All frames from the same DSO share one `(empty_name, sopath)` function handle.
The caches live in `BlazeSymbolizerWrapper` (one per ELF file, keyed by `FileInfoId_t`). When a file is evicted by `remove_unvisited()`, both caches evict with it, leaving no stale handles.
Observed results (collatz at 999 Hz, ~10 unique functions)
Zero actual dict insertions in steady state.
Relationship to `symbol_idx`
`symbol_idx` is set at unwind time and covers DSO/common/base/runtime symbols. Blaze frames arrive with `k_symbol_idx_null`; updating `symbol_idx` retroactively in committed `FunLoc` entries is unsafe. This cache is a complementary layer at pprof-creation time. Longer term, feeding blaze results back into the unwind-time cache would remove the need for this layer entirely.
Metrics
Three new stats visible via `--internal_stats` / statsd:
- `symbols.blaze.intern_fn_calls`: actual `intern_function()` calls per cycle
- `symbols.blaze.addr_misses`: new ELF addresses not yet in `address_cache`
- `symbols.blaze.addr_hits`: full cache hits (zero dict cost)