22 changes: 20 additions & 2 deletions .github/workflows/ci.yml
@@ -70,13 +70,31 @@ jobs:
target: default
arch: x86_64
emulator-boot-timeout: 600
script: ./gradlew :benchmarks:connectedBenchmarkAndroidTest
disable-animations: true
emulator-options: -no-window -no-audio -no-boot-anim -gpu swiftshader_indirect
script: |
# Wait for boot to complete
adb wait-for-device
adb shell 'while [[ -z $(getprop sys.boot_completed) ]]; do sleep 1; done'
sleep 5

# Disable animations
adb shell settings put global window_animation_scale 0
adb shell settings put global transition_animation_scale 0
adb shell settings put global animator_duration_scale 0

# Verify device is responsive
adb shell getprop ro.build.version.release

# Run tests
./gradlew :benchmarks:connectedBenchmarkAndroidTest

- name: Parse Benchmark Results
if: always()
run: |
echo "### Macrobenchmark Results" >> $GITHUB_STEP_SUMMARY
python3 benchmarks/BenchmarkResultsParser.py >> $GITHUB_STEP_SUMMARY
python3 benchmarks/BenchmarkResultsParser.py | tee -a $GITHUB_STEP_SUMMARY
exit ${PIPESTATUS[0]}

- name: Upload benchmark JSON
if: always()
260 changes: 260 additions & 0 deletions METHODOLOGY.md
@@ -0,0 +1,260 @@
# Benchmark Methodology

This document covers how benchmarks in this project are designed, what hardware conditions are
required for trustworthy results, why the build configuration is the way it is, how to read the
output metrics, and what the numbers cannot tell you.

---

## Device specification

### CI environment

CI runs macrobenchmarks on a GitHub-hosted runner using the
[`reactivecircus/android-emulator-runner`](https://github.com/ReactiveCircus/android-emulator-runner)
action:

| Property | Value |
|---|---|
| API level | 34 (Android 14) |
| Architecture | x86_64 |
| Target | default (AOSP, no Play Services) |
| Boot timeout | 600 s |
| Compilation mode | `CompilationMode.None()` — JIT only, no AOT |

Emulator results are inherently noisier than physical hardware (see [Limitations](#limitations)).
The emulator configuration intentionally suppresses the two errors the benchmark runner would
otherwise emit:

```kotlin
// benchmarks/build.gradle.kts
testInstrumentationRunnerArguments["androidx.benchmark.suppressErrors"] =
"EMULATOR,DYNAMIC_RECEIVER_NOT_EXPORTED_PERMISSION"
```

`EMULATOR` silences the "running on emulator" error. `DYNAMIC_RECEIVER_NOT_EXPORTED_PERMISSION`
silences a permissions-check false positive that appears on API 34 emulators. Neither suppression
affects what is actually measured.

### Physical device setup

Running on physical hardware reduces variance significantly. Before measuring, lock the CPU and
GPU clocks so the SoC cannot throttle or boost mid-run.

**Prerequisites:** the device must be rooted or running a userdebug/eng build. Stock consumer
devices cannot lock clocks.

```bash
# 1. Connect the device and verify adb access
adb devices

# 2. Lock clocks using the AndroidX Benchmark Gradle task
# (provided by the androidx.benchmark Gradle plugin when it is applied to the module)
./gradlew :benchmarks:lockClocks

# 3. Run the benchmarks
./gradlew :benchmarks:connectedBenchmarkAndroidTest

# 4. Unlock clocks when done (skipping this degrades battery life)
./gradlew :benchmarks:unlockClocks
```

`lockClocks` pins CPU frequency to a fixed mid-range value (not max), disables the interactive
governor, and locks the GPU where the kernel exposes a control node. The fixed frequency is
intentionally below peak so thermal headroom is preserved across a full benchmark run.

**Recommended device properties for reproducible results:**

- Disable Wi-Fi and mobile data (reduces background wakeups).
- Charge to ≥ 80 % or keep plugged in (battery saver policies alter scheduling at low charge).
- Turn off all notification delivery from other apps (`adb shell settings put global zen_mode 1`).
- Keep display on (`adb shell svc power stayon true`) — some devices throttle when the
screen is off.

---

## Why non-debuggable builds are required

All macrobenchmarks in this project run against the `benchmark` build type, defined in
`app/build.gradle.kts`:

```kotlin
create("benchmark") {
initWith(getByName("release")) // inherits minification + R8
signingConfig = signingConfigs.getByName("debug") // debug cert for CI
isDebuggable = false
}
```

`isDebuggable = false` is not optional. Debug builds carry several sources of overhead that
inflate every metric and make before/after comparisons unreliable:

| Overhead source | Effect on benchmarks |
|---|---|
| JDWP agent always attached | Adds ~5–15 ms to every cold start; unpredictable per-frame cost |
| JIT profiling hooks | Extra bookkeeping per method call; suppresses some JIT optimisations |
| `StrictMode` and debug assertions | Extra allocations and thread checks on every UI operation |
| Compose `isDebugInspectorInfoEnabled` | Turns on slot-table inspection for Layout Inspector; adds recomposition overhead |
| R8 / ProGuard disabled | Dead code not stripped; more class loading; larger DEX → slower first-frame JIT |

The benchmark runner enforces this: if `isDebuggable = true`, it emits a `DEBUG_BUILD` error and
refuses to record results (unless you add `"DEBUG_BUILD"` to `suppressErrors`, which would
invalidate the data).

The `benchmark` build type keeps debug signing so the APK can be installed on CI without a
release keystore. The signing cert has no effect on runtime performance.

---

## How to interpret frame timing metrics

`ScrollBenchmark` uses `FrameTimingMetric`, which records a distribution of frame durations over
5 iterations of 5 down-scrolls + 5 up-scrolls. The output JSON contains these fields per
benchmark:

```
frameDurationCpuMs.p50 — median frame duration (CPU time only)
frameDurationCpuMs.p90 — 90th percentile
frameDurationCpuMs.p95 — 95th percentile
frameDurationCpuMs.p99 — 99th percentile
frameOverrunMs — signed wall-clock budget overrun (hardware timestamp devices only)
jankyFrameCount — frames that exceeded the 16.67 ms / 60 fps deadline
jankyFramePercent — janky frames as a share of total frames rendered
```
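
For orientation, a frame-timing benchmark of this shape looks roughly like the sketch below. It is
a minimal illustration rather than the project's actual `ScrollBenchmark`; the package name and the
scroll-gesture details are assumptions.

```kotlin
import androidx.benchmark.macro.CompilationMode
import androidx.benchmark.macro.FrameTimingMetric
import androidx.benchmark.macro.StartupMode
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import androidx.test.uiautomator.By
import androidx.test.uiautomator.Direction
import org.junit.Rule
import org.junit.Test

class ScrollBenchmarkSketch {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun scrollAnimatedList() = benchmarkRule.measureRepeated(
        packageName = "com.example.app",          // assumption: substitute the real applicationId
        metrics = listOf(FrameTimingMetric()),
        compilationMode = CompilationMode.None(), // JIT only, matching the CI configuration
        startupMode = StartupMode.COLD,
        iterations = 5,
    ) {
        startActivityAndWait()
        val list = device.findObject(By.scrollable(true))
        // One iteration: 5 down-scrolls followed by 5 up-scrolls
        repeat(5) { list.scroll(Direction.DOWN, 1.0f) }
        repeat(5) { list.scroll(Direction.UP, 1.0f) }
    }
}
```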

### Reading the percentiles

Think of the percentile distribution as a story about different kinds of rendering problems:

**p50** reflects steady-state cost — what a typical frame costs when nothing unusual is happening.
A high p50 (> 8 ms on a 60 Hz display) means the per-frame work budget is already half-consumed
before any hiccup occurs. The optimised scroll screen targets p50 around 4–6 ms.

**p90** reflects how well the app handles light variation — minor GC pauses, occasional longer
layout passes, background service wakeups. A p90 below 10 ms means nine out of ten frames are
comfortable even under normal system noise.

**p99** is the headline regression gate in this project. It captures the worst 1 % of frames —
the frames a user would perceive as a visible stutter. The CI threshold is **16.0 ms**:

```python
# benchmarks/BenchmarkResultsParser.py
FRAME_P99_THRESHOLD_MS = 16.0
```

This is intentionally tighter than the 16.67 ms budget for 60 fps (by roughly 4 %). The reasoning:
if p99 sits right at the deadline, a single additional GC pause or thermal event pushes real-world
p99 over the cliff, and even at 16 ms the remaining headroom is well under a millisecond.

The threshold is only enforced for `scrollAnimatedList_optimized`. The unoptimized variant is
allowed to exceed it — its purpose is to confirm the baseline is genuinely slow, not to pass CI.

**p95** is not gated but is worth watching: a large gap between p90 and p95 typically signals
infrequent but expensive allocations (bitmaps, large `List` copies) rather than per-frame waste.

### `frameOverrunMs` vs `frameDurationCpuMs`

`frameDurationCpuMs` measures only CPU-side work (including RenderThread). It is available on
all devices. `frameOverrunMs` measures wall-clock overrun relative to the frame deadline and
requires hardware GPU-timestamp support (most Pixel devices, some Snapdragons). On the CI
emulator, `frameOverrunMs` is absent from the JSON; do not treat its absence as a failure.

### `jankyFrameCount` vs p99

These are complementary, not redundant. p99 tells you how bad the worst frames are.
`jankyFrameCount` tells you how many frames crossed the 16.67 ms deadline. A test can have a
low p99 but a non-zero jank count if a handful of frames spiked just barely over the deadline.
For 60 Hz content, a jank count of zero is the target; one or two janky frames per 100 is
acceptable on the emulator or on non-rooted hardware.

---

## Startup timing metrics

`StartupBenchmark` and `AppStartupBenchmark` use `StartupTimingMetric` across 10 iterations:

```
timeToInitialDisplayMs — TTID: system-measured time from process start to first frame drawn
timeToFullDisplayMs — TTFD: time until the app calls reportFullyDrawn()
```
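
A startup benchmark with this configuration looks roughly like the sketch below. This is a minimal
illustration; the package name is an assumption and the real `StartupBenchmark` may differ.

```kotlin
import androidx.benchmark.macro.CompilationMode
import androidx.benchmark.macro.StartupMode
import androidx.benchmark.macro.StartupTimingMetric
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import org.junit.Rule
import org.junit.Test

class StartupBenchmarkSketch {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun coldStartup() = benchmarkRule.measureRepeated(
        packageName = "com.example.app",          // assumption: substitute the real applicationId
        metrics = listOf(StartupTimingMetric()),
        compilationMode = CompilationMode.None(), // worst case: no AOT, matching CI
        startupMode = StartupMode.COLD,
        iterations = 10,
    ) {
        pressHome()
        startActivityAndWait()                    // TTID ends at the first drawn frame
    }
}
```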

**TTID** is reported by the system and cannot be manipulated by the app. It ends when the window
surface receives its first rendered frame — even if that frame shows only a blank background.

**TTFD** is the app-reported milestone. `MainActivity` calls `reportFullyDrawn()` after the
Compose layout pass completes and the feed `LazyColumn` is scrollable. TTFD is absent for
`StartupMode.HOT` because `onCreate()` is not called in that mode and `reportFullyDrawn()` is
never invoked.
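
One common way to wire up this milestone in Compose is `ReportDrawnWhen` from
`androidx.activity:activity-compose`, which reports fully drawn once its predicate first becomes
true. The sketch below is illustrative only: `FeedScreen` and `feedItems` are assumed names, and
the real `MainActivity` may report the milestone differently.

```kotlin
import androidx.activity.compose.ReportDrawnWhen
import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.items
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable

@Composable
fun FeedScreen(feedItems: List<String>) {       // hypothetical screen and state name
    // reportFullyDrawn() fires the first time this predicate is true after a drawn frame.
    ReportDrawnWhen { feedItems.isNotEmpty() }

    LazyColumn {
        items(feedItems) { item -> Text(item) }
    }
}
```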

The CI cold-start threshold is **800 ms TTID**:

```python
COLD_START_THRESHOLD_MS = 800
```

The optimised build targets 150–350 ms; the 800 ms gate is a wide safety margin designed to catch
regressions (e.g. an SDK accidentally moved back onto the main thread) rather than to certify
production quality.

The startup tests use `CompilationMode.None()` (JIT only, no AOT pre-compilation). This produces
the worst-case startup time — the same condition a user experiences on first install before ART
has had time to profile and compile. Baseline Profiles are generated separately via
`./gradlew :app:generateBaselineProfile` and are measured independently.
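
For completeness, a Baseline Profile generator built on `BaselineProfileRule` typically looks like
the sketch below. The package name and the recorded journey are assumptions; the generator behind
`:app:generateBaselineProfile` in this project may be structured differently.

```kotlin
import androidx.benchmark.macro.junit4.BaselineProfileRule
import org.junit.Rule
import org.junit.Test

class BaselineProfileGeneratorSketch {
    @get:Rule
    val baselineProfileRule = BaselineProfileRule()

    @Test
    fun generate() = baselineProfileRule.collect(
        packageName = "com.example.app",  // assumption: substitute the real applicationId
    ) {
        // Record the critical user journey so its classes and methods land in the profile.
        pressHome()
        startActivityAndWait()
    }
}
```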

---

## Limitations and variance expectations

### Emulator variance

CPU clock locking is not possible on the emulator. The emulator shares host CPU cores with other
processes and is subject to the host scheduler. Expect ±30–50 ms variance on startup metrics
and ±2–4 ms variance on p99 frame duration across runs. This is why:

- Startup uses 10 iterations (more samples reduce the impact of outliers).
- Scroll uses 5 iterations (frame metrics are per-frame averages over hundreds of frames, so
fewer iterations are needed for stable statistics).
- The CI threshold for cold start (800 ms) is set 3× above the measured optimised value
(~250 ms) to absorb emulator noise.

### `CompilationMode.None()` and JIT behaviour

All benchmarks in this project run with `CompilationMode.None()`. JIT compilation happens during
the benchmark run, which means the first iteration is always slower (the JIT is profiling) and
later iterations are faster (hot methods are compiled). The benchmark library accounts for this
by recording all iterations and reporting the full distribution — look at p50 and p90 across
multiple runs rather than relying on the median of a single run.

If you switch to `CompilationMode.Full()` (AOT), numbers will be lower and more consistent but
will not represent install-fresh behaviour. `CompilationMode.None()` is the right choice for
detecting regressions in production conditions.

### Thermal throttling on physical devices

Even with locked clocks, sustained benchmarks on physical hardware can trigger thermal
throttling if the device approaches its temperature limit. Signs of throttling:

- Startup times that increase monotonically across iterations (not random noise).
- Frame p99 that is higher for `scrollAnimatedList_optimized` than for `scrollAnimatedList_unoptimized`
(impossible without throttling — the unoptimized path does more work).

If you observe these patterns, let the device cool for 5–10 minutes and re-run. Plugging in
USB-C power delivery can worsen thermals on some devices; consider unplugging during the run.

### What the numbers do and do not represent

| The numbers DO reflect | The numbers DO NOT reflect |
|---|---|
| Regression introduced in the code under test | Absolute production performance on a user's device |
| Relative improvement from a specific optimisation | Performance under network I/O or database load |
| Worst-case startup before ART profiling | Performance after a user's device has profiled and compiled the app |
| Per-frame Compose rendering cost | GPU-bound rendering (these benchmarks are CPU-bound) |
| Recomposition pass count (unit test metric) | Number of composables recomposed within a single pass |

Recomposition counts in `RecompositionBenchmark` measure `Recomposer.changeCount` — the number
of complete composition passes applied, not the number of individual composables that re-ran.
One click that triggers one state change = one pass = `delta` of 1 in the optimised build.
The assertion `assertEquals(1L, delta)` verifies no cascading second pass was triggered; it
does not verify which composables were skipped within that pass. Use Layout Inspector's
recomposition highlighting to inspect per-composable skip behaviour.
1 change: 1 addition & 0 deletions app/build.gradle.kts
@@ -46,6 +46,7 @@ android {

buildFeatures {
compose = true
buildConfig = true
}
}
