
perf: propagate ASCII-safety through Format outputs #860

Draft
He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/format-asciisafe-propagation

He-Pin commented May 15, 2026

Motivation

Format.format (the engine behind %-interpolation, std.format, and the mod operator's string fallback) always returned a Val.Str constructed via the default Val.Str(pos, s) factory, which leaves _asciiSafe = false. This forced ByteRenderer onto the slow per-char escape-scan + UTF-8 encode path when format outputs flowed into JSON rendering — even when both the format string literals and every interpolated value were pure ASCII.

Manifest workloads heavy on %(name)s-style templates (Helm/Kubernetes-flavored configs) emit many such ASCII-safe strings that go on to be rendered as JSON, so the cost compounds. This is the second of two main sources of "ASCII-safe-but-flagged-unsafe" strings in those workloads (the first, std.join, is companion PR #858).

Modification

sjsonnet/src/sjsonnet/Format.scala:

  • RuntimeFormat.literalsAsciiSafe — new field, computed once at parse time by scanning the leading literal + every inter-spec literal segment via Platform.isAsciiJsonSafe. Cached alongside the parsed format, so each format string pays the literal-scan cost exactly once and amortizes across every use of that cached RuntimeFormat.
  • Per-spec ASCII-safety check at format time, two helpers:
    • simpleStringValueAsciiSafe(rawVal) for %(name)s simple-named-string paths.
    • specOutputAsciiSafe(rawVal, conversion) for the general path: strings forward _asciiSafe; numerics/booleans/null are ASCII (numerics under %c depend on the codepoint range); Val.Arr / Val.Obj (rendered via Renderer) are conservatively treated as non-ASCII.
  • Format.format now returns Val.Str directly. This applies to both the string-input and pre-parsed-chunks overloads, plus formatSimpleNamedString. The _asciiSafe flag is set at construction via Val.Str.asciiSafe(pos, s) when the literals and all spec outputs are ASCII-safe; otherwise the regular Val.Str(pos, s) constructor is used.
  • Callers updated to drop the redundant Val.Str(pos, ...) wrapper:
    • Evaluator: the % binary operator
    • MathModule: std.mod string fallback
    • StringModule: std.format
    • Format.PartialApplyFmt: static-folded format closure
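
The per-spec rules listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `V` hierarchy is a hypothetical, simplified stand-in for sjsonnet's `Val` types, and only the helper name `specOutputAsciiSafe` is taken from the description.

```scala
// Hypothetical, simplified stand-ins for sjsonnet's Val hierarchy (not the real classes).
sealed trait V
final case class VStr(s: String, asciiSafe: Boolean) extends V
final case class VNum(d: Double) extends V
final case class VBool(b: Boolean) extends V
case object VNull extends V
final case class VArr(items: List[V]) extends V

object SpecSafety {
  // Strings forward their flag, scalars render as ASCII, %c depends on the
  // code point, and containers are conservatively treated as unsafe.
  def specOutputAsciiSafe(v: V, conversion: Char): Boolean = v match {
    case VStr(_, safe) => safe
    case VNum(d) if conversion == 'c' =>
      val cp = d.toInt
      // Printable ASCII, excluding '"' (0x22) and '\' (0x5c).
      cp >= 0x20 && cp < 0x7f && cp != 0x22 && cp != 0x5c
    case VNum(_) | VBool(_) | VNull => true
    case VArr(_) => false
  }
}
```

In this sketch a number formatted with `%d` or `%x` is always safe, while the same number under `%c` is safe only if it maps to a printable ASCII character that needs no JSON escaping.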

sjsonnet/test/resources/new_test_suite/format_asciisafe_propagation.jsonnet — regression test covering simple %(name)s fast path, general %s/%d/%x/%o/%c/%.2f conversions, mixed ASCII literals + non-ASCII string values, and a std.manifestJson roundtrip exercising the ByteRenderer fast-path.

Format-time overhead is two boolean ANDs per spec; literal scanning happens once at parse time.
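
A simplified sketch of that split between parse-time and format-time work, under stated assumptions: `Parsed`, `resultAsciiSafe`, and the body of `isAsciiJsonSafe` are hypothetical illustrations (the real `Platform.isAsciiJsonSafe` and `RuntimeFormat` may differ), and the sketch folds a single boolean per interpolated value.

```scala
object FormatSketch {
  // Assumed predicate: printable ASCII with no '"' or '\'.
  def isAsciiJsonSafe(s: String): Boolean =
    s.forall(c => c >= 0x20 && c < 0x7f && c != '"' && c != '\\')

  // Hypothetical parsed format: the literal segments plus a flag that is
  // computed once, when the format string is parsed and cached.
  final case class Parsed(literals: Vector[String]) {
    val literalsAsciiSafe: Boolean = literals.forall(isAsciiJsonSafe)
  }

  // Format time: AND the cached flag with one boolean per interpolated value.
  def resultAsciiSafe(p: Parsed, specSafe: Seq[Boolean]): Boolean =
    specSafe.foldLeft(p.literalsAsciiSafe)(_ && _)
}
```

Reusing one `Parsed` instance across many calls means the literal scan is never repeated; only the cheap boolean fold runs per call.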

Result

Benchmarked on Apple Silicon, Zulu JDK 21.0.10, -Xmx4G -XX:+UseG1GC -Xss100m, 3 forks × (3 warmup + 5 measurement) iterations.

JMH bench.runRegressions (averaged over 3 forks, ms/op, lower is better):

| Benchmark | master | #860 | Δ |
| --- | --- | --- | --- |
| cpp_suite/large_string_template | 0.724 ± 0.038 | 0.777 ± 0.229 | CIs overlap; cleanest fork: 0.695 → 0.683, −1.7% |
| jdk17_suite/repeat_format | 0.155 ± 0.032 | 0.138 ± 0.016 | −11.0% |
| go_suite/manifestJsonEx | 0.074 ± 0.042 | 0.052 ± 0.001 | −29.7% |

The JMH mean for large_string_template is dominated by thermal/GC outliers on Apple Silicon (Fork 2's last two iterations spiked to 0.857 / 1.481 ms, while Forks 1 & 3 ran cleanly around 0.683 ms). The per-fork minimums and the cleanest fork consistently show the PR ahead, and the hyperfine runs below confirm this.

hyperfine (30 runs, 5 warmup, full-binary including JVM startup, ms, lower is better):

| Benchmark | master | #860 | Speedup |
| --- | --- | --- | --- |
| large_string_template | 278.6 ± 79.6 | 229.5 ± 2.6 | 1.21× ± 0.35 |
| repeat_format | 594.9 ± 66.7 | 580.9 ± 16.3 | 1.02× ± 0.12 |
| manifestJsonEx | 222.7 ± 3.1 | 223.8 ± 2.2 | parity (50 µs workload buried under ~220 ms JVM startup) |

Hyperfine on manifestJsonEx is dominated by JVM startup; JMH (which excludes startup) is the trustworthy signal there and shows a ~30% improvement (0.074 → 0.052 ms/op).

PR-side variance on large_string_template is dramatically tighter (±2.6 ms vs master ±79.6 ms), consistent with eliminating a noisy escape-scan path.
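
For readers checking the hyperfine speedup figures: the reported ± values appear consistent with first-order propagation of the relative standard deviations of the two means. The helper below is a hypothetical sketch of that arithmetic and is not part of the PR or of hyperfine itself.

```scala
object SpeedupSketch {
  // Ratio of two means, with the uncertainty obtained by adding the
  // relative standard deviations in quadrature (first-order propagation).
  def speedup(baseMean: Double, baseSd: Double,
              newMean: Double, newSd: Double): (Double, Double) = {
    val ratio = baseMean / newMean
    val rel = math.sqrt(math.pow(baseSd / baseMean, 2) + math.pow(newSd / newMean, 2))
    (ratio, ratio * rel)
  }
}
```

Calling `SpeedupSketch.speedup(278.6, 79.6, 229.5, 2.6)` gives roughly (1.21, 0.35), matching the large_string_template row. Almost all of the uncertainty comes from the ±79.6 ms spread on master.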

References

Test plan

  • New regression test new_test_suite/format_asciisafe_propagation.jsonnet covers:
    • Simple %(name)s fast path with ASCII / non-ASCII literals + values
    • General %s / %d / %x / %o / %c / %.2f conversions
    • Mixed ASCII literals + non-ASCII string values
    • std.manifestJson roundtrip
  • ./mill 'sjsonnet.jvm[3.3.7]'.test — 46 suites pass
  • ./mill 'sjsonnet.native[3.3.7]'.compile — passes
  • ./mill 'sjsonnet.js[3.3.7]'.compile — passes
  • ./mill __.checkFormat — passes
  • JMH bench (3 forks × 5 iters) on master + PR
  • hyperfine 30-run cross-validation on master + PR

Motivation:
After PR databricks#858 added the join-presized + asciiSafe optimization, format
outputs (`%`-interpolation, `std.format`) still always created Val.Str
with `_asciiSafe = false`. Downstream JSON rendering of format results
falls back to the per-char escape scan + UTF-8 encode path even when
both the format string and all interpolated values are pure ASCII.
Manifest workloads heavy on `%(name)s` interpolation pay this cost on
every emitted string.

Modification:
- Add `literalsAsciiSafe` to RuntimeFormat, computed once at parse time
  by scanning leading + inter-spec literal segments for printable ASCII
  with no `"` or `\`.
- At format time, AND `literalsAsciiSafe` with each interpolated value's
  ASCII-safety: strings forward `_asciiSafe`; numerics are ASCII (except
  `%c` which depends on codepoint); booleans/null are ASCII; complex
  types (Arr/Obj routed through Renderer) are conservatively non-ASCII.
- Refactor `Format.format` (both overloads) and `formatSimpleNamedString`
  to return `Val.Str` directly so the `_asciiSafe` flag is set at
  construction. Update the three external callers (Evaluator binop `%`,
  std.mod, std.format) and `PartialApplyFmt.evalRhs` accordingly.

Result:
Format outputs now correctly carry `_asciiSafe = true` when all inputs
are ASCII-safe, letting ByteRenderer take the fast path during JSON
manifestation. Regression test
`new_test_suite/format_asciisafe_propagation.jsonnet` covers the simple
`%(name)s` fast path, general `%s`/`%d`/`%c`/`%x`/`%o`/`%f` conversions,
mixed ASCII/non-ASCII literals and values, and ByteRenderer roundtrip
via `std.manifestJson`.
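
To illustrate why the flag matters at render time, here is a minimal sketch of the two string-rendering strategies. It is an assumption-laden illustration, not ByteRenderer's actual code: `RenderSketch` and its methods are hypothetical names.

```scala
import java.nio.charset.StandardCharsets

object RenderSketch {
  // Slow path: inspect every char, escape where JSON requires it, then UTF-8 encode.
  def renderSlow(s: String): Array[Byte] = {
    val sb = new java.lang.StringBuilder(s.length + 2)
    sb.append('"')
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case c if c < 0x20 => sb.append(String.format("\\u%04x", Integer.valueOf(c.toInt)))
      case c => sb.append(c)
    }
    sb.append('"')
    sb.toString.getBytes(StandardCharsets.UTF_8)
  }

  // Fast path: valid only when every char is known to be printable ASCII with
  // no '"' or '\', so each char maps to exactly one output byte.
  def renderFast(s: String): Array[Byte] = {
    val out = new Array[Byte](s.length + 2)
    out(0) = '"'.toByte
    var i = 0
    while (i < s.length) { out(i + 1) = s.charAt(i).toByte; i += 1 }
    out(s.length + 1) = '"'.toByte
    out
  }

  def render(s: String, asciiSafe: Boolean): Array[Byte] =
    if (asciiSafe) renderFast(s) else renderSlow(s)
}
```

For an ASCII-safe input both paths produce identical bytes; the fast path simply skips the per-char inspection and the encoder. A flag left at `false` is therefore always correct but slower, which is why the conservative defaults for arrays and objects are safe.
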
He-Pin marked this pull request as draft May 15, 2026 23:11