operasoftware · kzajac-opera · May 18, 2026 · May 19, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -7,6 +7,19 @@ on:
     branches: [main]
 
 jobs:
+  lint-benchmark:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        working-directory: benchmarks/snapshot-efficiency
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - run: pip install -r requirements-dev.txt
+      - run: make check
+
   build-and-test:
     runs-on: ubuntu-latest
     steps:

diff --git a/README.md b/README.md
@@ -328,6 +328,10 @@ export OPERA_CLI_MCP_BIN=opera-devtools-mcp
 export OPERA_CLI_HEADED=1
 ```
 
+## Benchmarks
+
+See [`benchmarks/snapshot-efficiency/`](benchmarks/snapshot-efficiency/README.md), which measures the token cost and task-completion quality of compact snapshot output against raw MCP output and `chrome-devtools-axi`.
+
 ## Development
 
 ```sh

diff --git a/benchmarks/snapshot-efficiency/.flake8 b/benchmarks/snapshot-efficiency/.flake8
@@ -0,0 +1,5 @@
+[flake8]
+max-line-length = 120
+# E203: whitespace before ':' — conflicts with black's slice formatting
+# W503: line break before binary operator — conflicts with black
+extend-ignore = E203, W503
diff --git a/benchmarks/snapshot-efficiency/CLAUDE.md b/benchmarks/snapshot-efficiency/CLAUDE.md
@@ -0,0 +1,65 @@
+# snapshot-efficiency benchmark — Claude guidance
+
+## File roles
+
+| File | Role |
+|---|---|
+| `src/run_benchmark.py` | Entry point. Loads all three config files, resolves CLI overrides, runs the outer condition × task × repeat loop, writes artifacts and JSONL. |
+| `src/agent.py` | Browser agent loop. `run_agent()` drives the LLM turn loop; `AgentState` owns all mutable state accumulation; `AgentResult` is the immutable output. |
+| `src/judge.py` | LLM-as-judge grading. `grade()` takes a trajectory and returns `{"pass": bool, "reason": str}`. |
+| `src/tools.py` | `ToolSet` base class + `CLIToolSet` (subprocess) and `BridgeToolSet` (HTTP) subclasses. `make_tool_set(condition)` is the factory. |
+| `src/llm.py` | Thin OpenAI Responses API wrapper. `Client.call()` returns a `Turn` dataclass. |
+| `src/report.py` | Reads `results/*.jsonl`, prints and writes `results/report.md`. No external deps beyond stdlib + the results files. |
+| `src/utils.py` | `snapshot_chars(text)` — counts characters in a snapshot result, returns 0 for empty/None. |
+| `config/conditions.yaml` | Benchmark conditions: tool mode (`cli` or `bridge`), CLI binary path, bridge URL. |
+| `config/tasks.yaml` | Task prompts and grading hints. |
+| `config/models.yaml` | Agent and judge model names and reasoning effort. **The only place to change model defaults.** |
+
+## Data flow
+
+```
+run_benchmark.py
+  └── run_once()
+        ├── make_tool_set(condition)      → ToolSet (CLIToolSet or BridgeToolSet)
+        ├── run_agent(prompt, tool_set, model, reasoning_effort)
+        │     └── loop:
+        │           client.call()         → Turn
+        │           tool_set.dispatch()   → result str (side effect: browser action)
+        │           state.update(turn, turn_index, tool_results)
+        │     └── state.to_result()       → AgentResult
+        └── grade(prompt, trajectory, model, reasoning_effort, grading_hint)
+              └── Client.call()           → {"pass": bool, "reason": str}
+```
+
+## Running checks
+
+```sh
+# Install dev dependencies (once)
+pip install -r requirements-dev.txt
+
+make format      # apply black + isort (modifies files)
+make lint        # ruff + flake8 (read-only)
+make typecheck   # mypy (read-only)
+make check       # format-check + lint + typecheck — no modifications, matches CI
+```
+
+Config: `pyproject.toml` for black/isort/ruff/mypy; `.flake8` for flake8 (120-char line length throughout).
+
+## Key design decisions
+
+### No hardcoded model defaults
+`run_agent()` and `grade()` take `model` and `reasoning_effort` as required parameters; the function signatures define no defaults. All defaults live in `config/models.yaml`. The CLI flags `--model`, `--reasoning-effort`, `--judge-model`, and `--judge-reasoning-effort` override them for a single run.
+
+### AgentState owns all state mutations
+`AgentState.update(turn, turn_index, tool_results=None)` is the single place that mutates benchmark state:
+- Always: accumulates `input_tokens` and `output_tokens` from the turn
+- `tool_results=None` (final turn): sets `answer`, appends to `trajectory`
+- `tool_results` provided (tool-call turn): increments `tool_call_count`, appends to `snapshot_chars` for snapshot tools, appends to `trajectory`
+
+`run_agent()` only handles control flow and I/O (LLM calls, tool dispatch, `inputs` buffer).
+
+### SNAPSHOT_TOOLS
+`SNAPSHOT_TOOLS: frozenset[str]` in `agent.py` defines which tool names produce page snapshots worth measuring. Add a tool name here if it returns a snapshot.
+
+### ToolSet dispatch
+Both `CLIToolSet` and `BridgeToolSet` use `match/case` in `dispatch()`. The shared tool schema lives in `_CLI_SCHEMA` (module-level constant in `tools.py`), evaluated once at import time.
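A minimal sketch of the `AgentState.update()` contract described in the CLAUDE.md above, to make the two branches concrete. It is not the code in `src/agent.py`: the `Turn` fields, the trajectory entry shape, the `(tool_name, result)` tuple format, the `click` tool name in the usage note, and the choice to count one tool call per result are all assumptions made for illustration. Only `take_snapshot` is named in this diff, so `SNAPSHOT_TOOLS` here contains just that.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumption: the real set in agent.py may contain more tool names.
SNAPSHOT_TOOLS: frozenset = frozenset({"take_snapshot"})


def snapshot_chars(text: Optional[str]) -> int:
    """Count characters in a snapshot result; 0 for empty or None."""
    return len(text) if text else 0


@dataclass
class Turn:
    # Hypothetical stand-in for the Turn dataclass in src/llm.py.
    input_tokens: int
    output_tokens: int
    text: str = ""


@dataclass
class AgentState:
    input_tokens: int = 0
    output_tokens: int = 0
    tool_call_count: int = 0
    snapshot_chars: list = field(default_factory=list)
    trajectory: list = field(default_factory=list)
    answer: Optional[str] = None

    def update(self, turn, turn_index, tool_results=None):
        # Always: accumulate token usage from the turn.
        self.input_tokens += turn.input_tokens
        self.output_tokens += turn.output_tokens

        if tool_results is None:
            # Final turn: record the answer and close the trajectory.
            self.answer = turn.text
            self.trajectory.append({"turn": turn_index, "answer": turn.text})
            return

        # Tool-call turn: tool_results is assumed to be (name, result) pairs.
        for name, result in tool_results:
            self.tool_call_count += 1
            if name in SNAPSHOT_TOOLS:
                # The module-level helper, not the field of the same name.
                self.snapshot_chars.append(snapshot_chars(result))
        self.trajectory.append(
            {"turn": turn_index, "tools": [name for name, _ in tool_results]}
        )
```

Under these assumptions, a tool-call turn with one `take_snapshot` result of five characters and one `click` result adds `5` to `snapshot_chars` and `2` to `tool_call_count`; a following call without `tool_results` sets `answer`.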
diff --git a/benchmarks/snapshot-efficiency/Makefile b/benchmarks/snapshot-efficiency/Makefile
@@ -0,0 +1,23 @@
+SRC = src
+
+.PHONY: format format-check check lint typecheck
+
+# Apply formatting (local dev)
+format:
+	black $(SRC)/
+	isort $(SRC)/
+
+# Check formatting without modifying (CI)
+format-check:
+	black --check $(SRC)/
+	isort --check-only $(SRC)/
+
+lint:
+	ruff check $(SRC)/
+	flake8 $(SRC)/
+
+typecheck:
+	mypy $(SRC)/
+
+# Full validation suite — no file modifications (used in CI)
+check: format-check lint typecheck
diff --git a/benchmarks/snapshot-efficiency/README.md b/benchmarks/snapshot-efficiency/README.md
@@ -0,0 +1,185 @@
+# Snapshot Efficiency Benchmark
+
+Measures the token cost and task-completion quality of `opera-browser-cli`'s compact snapshot output against raw MCP output and alternative browser CLI tools.
+
+## What it measures
+
+Every browser agent task requires sending the current page as context to the LLM. This benchmark answers:
+
+- **Token savings** — how much does compact snapshot output reduce input token usage vs raw MCP output?
+- **Quality** — does compression affect task-completion rate?
+- **vs AXI** — how does `opera-browser-cli` compare to `chrome-devtools-axi`, an established browser CLI tool?
+
+### Conditions
+
+| ID              | Description                                                                             |
+|-----------------|-----------------------------------------------------------------------------------------|
+| `opera-compact` | `opera-browser-cli` default — compact snapshots with URL compression (our tool)         |
+| `opera-raw`     | `opera-browser-cli --raw` — uncompressed MCP output piped through our CLI               |
+| `mcp-raw`       | Raw `take_snapshot` via bridge HTTP API — no compression at all (chrome-mcp equivalent) |
+| `axi`           | `chrome-devtools-axi` CLI — external comparison baseline                                |
+
+### Tasks
+
+7 browser tasks adapted from the [axi bench-browser benchmark](https://github.com/kunchenguid/axi/tree/main/bench-browser), covering single-step reads, multi-step navigation, and complex multi-page extraction:
+
+| ID                           | Category      | Target                                   |
+|------------------------------|---------------|------------------------------------------|
+| `read_static_page`           | single-step   | example.com                              |
+| `wikipedia_fact_lookup`      | single-step   | Wikipedia — Moon infobox                 |
+| `github_repo_stars`          | single-step   | github.com/torvalds/linux                |
+| `wikipedia_table_read`       | single-step   | Wikipedia — population table             |
+| `wikipedia_link_follow`      | multi-step    | Wikipedia Ada Lovelace → Charles Babbage |
+| `wikipedia_deep_extraction`  | investigation | Wikipedia Nobel Physics laureates        |
+| `github_issue_investigation` | investigation | github.com/facebook/react/issues         |
+
+### Model
+
+Model defaults are set in [`config/models.yaml`](config/models.yaml):
+
+```yaml
+agent:
+  model: gpt-5.5
+  reasoning_effort: medium
+
+judge:
+  model: gpt-5.5
+  reasoning_effort: low
+```
+
+Both use the OpenAI Responses API (`/v1/responses`). The judge runs at lower effort since pass/fail grading is simpler than browser navigation. To use a different model for a run, pass CLI flags (see [CLI reference](#cli-reference)) — these override the config file without changing it.
+
+## Setup
+
+Requirements: Python 3.11+, `opera-browser-cli` in PATH, Opera/Chrome browser open.
+
+```sh
+cd benchmarks/snapshot-efficiency
+python -m venv .venv
+source .venv/bin/activate   # Windows: .venv\Scripts\activate
+pip install -r requirements.txt
+```
+
+For the `axi` condition, also install:
+
+```sh
+npm install -g chrome-devtools-axi
+```
+
+## Running
+
+All commands run from `benchmarks/snapshot-efficiency/` with the venv active.
+
+### Sanity check (1 run, 1 task)
+
+```sh
+OPENAI_API_KEY=<key> python src/run_benchmark.py \
+  --conditions opera-compact \
+  --tasks read_static_page \
+  --repeats 1
+```
+
+### Single condition
+
+```sh
+OPENAI_API_KEY=<key> python src/run_benchmark.py --conditions opera-compact --repeats 5
+```
+
+### All conditions except axi
+
+```sh
+OPENAI_API_KEY=<key> python src/run_benchmark.py \
+  --conditions opera-compact,opera-raw,mcp-raw \
+  --repeats 5
+```
+
+### Full matrix (requires chrome-devtools-axi)
+
+```sh
+OPENAI_API_KEY=<key> python src/run_benchmark.py --repeats 5
+```
+
+### Generate report
+
+```sh
+python src/report.py
+# → results/report.md
+```
+
+## Linting & formatting
+
+Install dev tools (separate from benchmark runtime deps):
+
+```sh
+pip install -r requirements-dev.txt
+```
+
+| Command | What it does |
+|---|---|
+| `make format` | Apply black + isort (local dev) |
+| `make lint` | ruff + flake8 |
+| `make typecheck` | mypy |
+| `make check` | format-check + lint + typecheck, read-only (same as CI) |
+
+Config lives in `pyproject.toml` (black, isort, ruff, mypy) and `.flake8`.
+All tools are configured for 120-char line length.
+
+## Source layout
+
+```
+src/
+├── run_benchmark.py   # entry point — CLI arg parsing, outer loop, artifact writing
+├── agent.py           # browser agent loop (AgentState, AgentResult, run_agent)
+├── judge.py           # LLM-as-judge pass/fail grading (grade)
+├── tools.py           # ToolSet subclasses (CLIToolSet, BridgeToolSet) + factory
+├── llm.py             # thin OpenAI Responses API wrapper (Client, Turn)
+├── report.py          # reads results/*.jsonl and writes results/report.md
+└── utils.py           # shared utilities (snapshot_chars)
+
+config/
+├── conditions.yaml    # benchmark conditions (tool mode, CLI binary, bridge URL)
+├── tasks.yaml         # task prompts and grading hints
+└── models.yaml        # agent and judge model + reasoning_effort defaults
+```
+
+## CLI reference
+
+```
+python src/run_benchmark.py [options]
+
+  --conditions             Comma-separated condition IDs (default: all four)
+  --tasks                  Comma-separated task IDs (default: all seven)
+  --repeats                Runs per condition × task (default: 5)
+  --model                  Agent model — overrides config/models.yaml
+  --reasoning-effort       Agent reasoning effort: low / medium / high — overrides config/models.yaml
+  --judge-model            Judge model — overrides config/models.yaml
+  --judge-reasoning-effort Judge reasoning effort: low / medium / high — overrides config/models.yaml
+```
+
+To permanently change the defaults, edit [`config/models.yaml`](config/models.yaml).
+
+## Results layout
+
+```
+results/
+├── opera-compact.jsonl      # one record per run
+├── opera-raw.jsonl
+├── mcp-raw.jsonl
+├── axi.jsonl
+├── report.md                # generated by report.py
+└── {condition}/{task}/run{N}/
+    ├── agent_output.json    # full trajectory + per-turn token usage
+    ├── grade.json           # pass/fail verdict + reason
+    └── result.json          # merged record (same shape as the .jsonl row)
+```
+
+## Attribution
+
+This benchmark is based on the [axi browser benchmark](https://github.com/kunchenguid/axi/tree/main/bench-browser) by [@kunchenguid](https://github.com/kunchenguid):
+
+- **Task definitions** (`config/tasks.yaml`) — adapted directly from [`bench-browser/config/tasks.yaml`](https://github.com/kunchenguid/axi/blob/main/bench-browser/config/tasks.yaml)
+- **LLM-as-judge grading approach** — adapted from [`bench-browser/src/grader.ts`](https://github.com/kunchenguid/axi/blob/main/bench-browser/src/grader.ts)
+- **Benchmark methodology** (per-condition JSONL results, trajectory capture, usage metrics) — adapted from [`bench-browser/src/runner.ts`](https://github.com/kunchenguid/axi/blob/main/bench-browser/src/runner.ts)
+- **`axi` condition** — uses [`chrome-devtools-axi`](https://github.com/kunchenguid/axi), the browser CLI tool the axi project benchmarks
+
+The original benchmark uses TypeScript + Claude Sonnet. This port uses Python + OpenAI GPT-5.5 with the Responses API.
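As an illustration of how the per-condition `.jsonl` files in the Results layout above could be aggregated, here is a small sketch. It is not `src/report.py`. The record keys `pass` and `input_tokens` are assumptions: `pass` mirrors the verdict shape returned by `grade()`, but the actual row schema is not shown in this diff.

```python
import json
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Aggregate pass rate and mean input tokens over JSONL lines.

    Assumes each line is a JSON object with a boolean "pass" and an
    integer "input_tokens" (hypothetical field names).
    """
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"runs": 0, "pass_rate": 0.0, "mean_input_tokens": 0.0}
    runs = len(records)
    return {
        "runs": runs,
        "pass_rate": sum(1 for r in records if r["pass"]) / runs,
        "mean_input_tokens": sum(r["input_tokens"] for r in records) / runs,
    }
```

For a real file, something like `summarize(Path("results/opera-compact.jsonl").read_text().splitlines())` would apply, with `Path` imported from `pathlib`. Comparing the `mean_input_tokens` of `opera-compact` against `mcp-raw` gives the token-savings figure the benchmark is after.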
diff --git a/benchmarks/snapshot-efficiency/config/conditions.yaml b/benchmarks/snapshot-efficiency/config/conditions.yaml
@@ -0,0 +1,25 @@
+conditions:
+  - id: opera-compact
+    description: opera-browser-cli default (compact snapshots, URL compression)
+    tool_mode: cli
+    cli_bin: opera-browser-cli
+    raw: false
+
+  - id: opera-raw
+    description: opera-browser-cli with --raw flag (uncompressed MCP output)
+    tool_mode: cli
+    cli_bin: opera-browser-cli
+    raw: true
+
+  - id: mcp-raw
+    description: Raw take_snapshot via bridge HTTP API, no compression layer
+    tool_mode: bridge
+    bridge_url: "http://localhost:9224"
+
+  - id: axi
+    description: chrome-devtools-axi CLI (external comparison baseline)
+    tool_mode: cli
+    cli_bin: chrome-devtools-axi
+    raw: false
+    start: "chrome-devtools-axi start"
+    stop: "chrome-devtools-axi stop"
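A sketch of how a `make_tool_set(condition)` factory might map the entries above onto tool sets, assuming each condition arrives as a plain dict parsed from this YAML. The two classes are simplified stand-ins, not the `ToolSet` subclasses in `src/tools.py`, and the error handling for an unknown `tool_mode` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CLIToolSet:
    # Stand-in: the real class drives the CLI binary via subprocess.
    cli_bin: str
    raw: bool = False


@dataclass(frozen=True)
class BridgeToolSet:
    # Stand-in: the real class talks to the bridge over HTTP.
    bridge_url: str


def make_tool_set(condition: dict):
    """Pick a tool set from a condition's tool_mode."""
    mode = condition.get("tool_mode")
    if mode == "cli":
        return CLIToolSet(
            cli_bin=condition["cli_bin"],
            raw=condition.get("raw", False),
        )
    if mode == "bridge":
        return BridgeToolSet(bridge_url=condition["bridge_url"])
    raise ValueError(
        f"unknown tool_mode {mode!r} in condition {condition.get('id')!r}"
    )
```

Failing loudly on an unrecognized `tool_mode` keeps a typo in `conditions.yaml` from silently falling back to a different condition, which would skew the comparison.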
diff --git a/benchmarks/snapshot-efficiency/config/models.yaml b/benchmarks/snapshot-efficiency/config/models.yaml
@@ -0,0 +1,7 @@
+agent:
+  model: gpt-5.5
+  reasoning_effort: medium
+
+judge:
+  model: gpt-5.5
+  reasoning_effort: low
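The override rule (CLI flags win for a single run, otherwise `config/models.yaml` applies) can be sketched as follows. This is an assumption about the shape of the logic, not the code in `src/run_benchmark.py`: the dict stands in for the parsed YAML so the example needs no YAML library, and `resolve_models` is a hypothetical helper name. The four flag names are the ones listed in the README's CLI reference.

```python
import argparse

# Stand-in for the parsed contents of config/models.yaml.
MODEL_DEFAULTS = {
    "agent": {"model": "gpt-5.5", "reasoning_effort": "medium"},
    "judge": {"model": "gpt-5.5", "reasoning_effort": "low"},
}

EFFORTS = ["low", "medium", "high"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # No default= here: None means "fall back to the config file".
    parser.add_argument("--model")
    parser.add_argument("--reasoning-effort", choices=EFFORTS)
    parser.add_argument("--judge-model")
    parser.add_argument("--judge-reasoning-effort", choices=EFFORTS)
    return parser


def resolve_models(args: argparse.Namespace, defaults: dict) -> dict:
    """Return effective settings: CLI value if given, else config value."""
    agent, judge = defaults["agent"], defaults["judge"]
    return {
        "agent": {
            "model": args.model or agent["model"],
            "reasoning_effort": args.reasoning_effort or agent["reasoning_effort"],
        },
        "judge": {
            "model": args.judge_model or judge["model"],
            "reasoning_effort": args.judge_reasoning_effort
            or judge["reasoning_effort"],
        },
    }
```

Leaving the argparse defaults as `None` keeps the YAML file as the single source of default values, which matches the "no hardcoded model defaults" decision recorded in CLAUDE.md.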