Closed
13 changes: 13 additions & 0 deletions .github/workflows/ci.yml
@@ -7,6 +7,19 @@ on:
branches: [main]

jobs:
  lint-benchmark:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: benchmarks/snapshot-efficiency
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements-dev.txt
      - run: make check

  build-and-test:
    runs-on: ubuntu-latest
    steps:
4 changes: 4 additions & 0 deletions README.md
@@ -328,6 +328,10 @@ export OPERA_CLI_MCP_BIN=opera-devtools-mcp
export OPERA_CLI_HEADED=1
```

## Benchmarks

See [`benchmarks/snapshot-efficiency/`](benchmarks/snapshot-efficiency/README.md) — measures token cost and task-completion quality of compact snapshot output vs raw MCP and `chrome-devtools-axi`.

## Development

```sh
5 changes: 5 additions & 0 deletions benchmarks/snapshot-efficiency/.flake8
@@ -0,0 +1,5 @@
[flake8]
max-line-length = 120
# E203: whitespace before ':' — conflicts with black's slice formatting
# W503: line break before binary operator — conflicts with black
extend-ignore = E203, W503
65 changes: 65 additions & 0 deletions benchmarks/snapshot-efficiency/CLAUDE.md
@@ -0,0 +1,65 @@
# snapshot-efficiency benchmark — Claude guidance

## File roles

| File | Role |
|---|---|
| `src/run_benchmark.py` | Entry point. Loads all three config files, resolves CLI overrides, runs the outer condition × task × repeat loop, writes artifacts and JSONL. |
| `src/agent.py` | Browser agent loop. `run_agent()` drives the LLM turn loop; `AgentState` owns all mutable state accumulation; `AgentResult` is the immutable output. |
| `src/judge.py` | LLM-as-judge grading. `grade()` takes a trajectory and returns `{"pass": bool, "reason": str}`. |
| `src/tools.py` | `ToolSet` base class + `CLIToolSet` (subprocess) and `BridgeToolSet` (HTTP) subclasses. `make_tool_set(condition)` is the factory. |
| `src/llm.py` | Thin OpenAI Responses API wrapper. `Client.call()` returns a `Turn` dataclass. |
| `src/report.py` | Reads `results/*.jsonl`, prints the report, and writes it to `results/report.md`. Depends only on the standard library and the results files. |
| `src/utils.py` | `snapshot_chars(text)` — counts characters in a snapshot result, returns 0 for empty/None. |
| `config/conditions.yaml` | Benchmark conditions: tool mode (`cli` or `bridge`), CLI binary path, bridge URL. |
| `config/tasks.yaml` | Task prompts and grading hints. |
| `config/models.yaml` | Agent and judge model names and reasoning effort. **The only place to change model defaults.** |
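
The `src/utils.py` row describes `snapshot_chars(text)` as a character count that returns 0 for empty or `None` input. A minimal sketch consistent with that description follows; the actual implementation is not shown here and may differ.

```python
from typing import Optional


def snapshot_chars(text: Optional[str]) -> int:
    """Return the character count of a snapshot result, or 0 when empty/None."""
    if not text:
        # Covers both None and the empty string.
        return 0
    return len(text)
```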

## Data flow

```
run_benchmark.py
  └── run_once()
        ├── make_tool_set(condition) → ToolSet (CLIToolSet or BridgeToolSet)
        ├── run_agent(prompt, tool_set, model, reasoning_effort)
        │     └── loop:
        │           client.call() → Turn
        │           tool_set.dispatch() → result str (side effect: browser action)
        │           state.update(turn, turn_index, tool_results)
        │     └── state.to_result() → AgentResult
        └── grade(prompt, trajectory, model, reasoning_effort, grading_hint)
              └── Client.call() → {"pass": bool, "reason": str}
```
```
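
The last step of the flow yields a `{"pass": bool, "reason": str}` verdict. As a hedged illustration of how a judge reply could be normalized into that shape, the sketch below uses a hypothetical helper `parse_verdict` (not a name from this repository) and assumes the judge replies with JSON text; it fails closed on anything malformed.

```python
import json
from typing import Any


def parse_verdict(raw: str) -> dict[str, Any]:
    """Normalize a judge reply into {"pass": bool, "reason": str}.

    Hypothetical helper: any malformed reply is treated as a failed run.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"pass": False, "reason": "judge reply was not valid JSON"}
    if not isinstance(data, dict) or not isinstance(data.get("pass"), bool):
        return {"pass": False, "reason": "judge reply missing boolean 'pass'"}
    return {"pass": data["pass"], "reason": str(data.get("reason", ""))}
```

Failing closed keeps a broken judge reply from being counted as a pass, at the cost of slightly understating the pass rate when the judge misbehaves.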

## Running checks

```sh
# Install dev dependencies (once)
pip install -r requirements-dev.txt

make format # apply black + isort (modifies files)
make lint # ruff + flake8 (read-only)
make typecheck # mypy (read-only)
make check # format-check + lint + typecheck — no modifications, matches CI
```

Config: `pyproject.toml` for black/isort/ruff/mypy; `.flake8` for flake8 (120-char line length throughout).

## Key design decisions

### No hardcoded model defaults
`run_agent()` and `grade()` require `model` and `reasoning_effort` as positional parameters — there are no defaults in the function signatures. All defaults live in `config/models.yaml`. CLI flags `--model`, `--reasoning-effort`, `--judge-model`, `--judge-reasoning-effort` override them for a single run.
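
A minimal sketch of that precedence rule, assuming the config has already been loaded into a plain dict. The helper name `resolve_role` is hypothetical, and the dict mirrors the `config/models.yaml` contents rather than reading the file.

```python
from typing import Optional


def resolve_role(
    config: dict, role: str, model: Optional[str], effort: Optional[str]
) -> tuple[str, str]:
    """Return (model, reasoning_effort) for a role; CLI values win when set."""
    defaults = config[role]
    return (model or defaults["model"], effort or defaults["reasoning_effort"])


# Mirrors config/models.yaml; in the real runner this would come from the file.
CONFIG = {
    "agent": {"model": "gpt-5.5", "reasoning_effort": "medium"},
    "judge": {"model": "gpt-5.5", "reasoning_effort": "low"},
}

# Equivalent to passing only --reasoning-effort high on the command line.
agent_model, agent_effort = resolve_role(CONFIG, "agent", None, "high")
```

The resolved values would then be passed explicitly to `run_agent()` and `grade()`, which is what keeps defaults out of their signatures.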

### AgentState owns all state mutations
`AgentState.update(turn, turn_index, tool_results=None)` is the single place that mutates benchmark state:
- Always: accumulates `input_tokens` and `output_tokens` from the turn
- `tool_results=None` (final turn): sets `answer`, appends to `trajectory`
- `tool_results` provided (tool-call turn): increments `tool_call_count`, appends to `snapshot_chars` for snapshot tools, appends to `trajectory`

`run_agent()` only handles control flow and I/O (LLM calls, tool dispatch, `inputs` buffer).
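
The update rules above can be sketched as follows. This is an illustration under assumptions, not the code in `agent.py`: the `Turn` fields, the contents of `SNAPSHOT_TOOLS`, the trajectory entry shape, and counting one tool call per result are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical contents; the real set is defined in agent.py.
SNAPSHOT_TOOLS: frozenset[str] = frozenset({"snapshot"})


@dataclass
class Turn:
    """Stand-in for the Turn returned by Client.call(); field names are assumed."""

    input_tokens: int
    output_tokens: int
    text: str = ""
    tool_names: list[str] = field(default_factory=list)


@dataclass
class AgentState:
    input_tokens: int = 0
    output_tokens: int = 0
    tool_call_count: int = 0
    snapshot_chars: list[int] = field(default_factory=list)
    trajectory: list[dict] = field(default_factory=list)
    answer: Optional[str] = None

    def update(
        self, turn: Turn, turn_index: int, tool_results: Optional[list[str]] = None
    ) -> None:
        # Always: accumulate token usage from the turn.
        self.input_tokens += turn.input_tokens
        self.output_tokens += turn.output_tokens
        if tool_results is None:
            # Final turn: record the answer and close out the trajectory.
            self.answer = turn.text
            self.trajectory.append({"turn": turn_index, "answer": turn.text})
            return
        # Tool-call turn: count calls and measure snapshot output only.
        self.tool_call_count += len(tool_results)
        for name, result in zip(turn.tool_names, tool_results):
            if name in SNAPSHOT_TOOLS:
                self.snapshot_chars.append(len(result))
        self.trajectory.append({"turn": turn_index, "tools": list(turn.tool_names)})
```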

### SNAPSHOT_TOOLS
`SNAPSHOT_TOOLS: frozenset[str]` in `agent.py` defines which tool names produce page snapshots worth measuring. Add a tool name here if it returns a snapshot.

### ToolSet dispatch
Both `CLIToolSet` and `BridgeToolSet` use `match/case` in `dispatch()`. The shared tool schema lives in `_CLI_SCHEMA` (module-level constant in `tools.py`), evaluated once at import time.
23 changes: 23 additions & 0 deletions benchmarks/snapshot-efficiency/Makefile
@@ -0,0 +1,23 @@
SRC = src

.PHONY: format format-check check lint typecheck

# Apply formatting (local dev)
format:
	black $(SRC)/
	isort $(SRC)/

# Check formatting without modifying (CI)
format-check:
	black --check $(SRC)/
	isort --check-only $(SRC)/

lint:
	ruff check $(SRC)/
	flake8 $(SRC)/

typecheck:
	mypy $(SRC)/

# Full validation suite — no file modifications (used in CI)
check: format-check lint typecheck
185 changes: 185 additions & 0 deletions benchmarks/snapshot-efficiency/README.md
@@ -0,0 +1,185 @@
# Snapshot Efficiency Benchmark

Measures the token cost and task-completion quality of `opera-browser-cli`'s compact snapshot output against raw MCP output and alternative browser CLI tools.

## What it measures

Every browser agent task requires sending the current page as context to the LLM. This benchmark answers:

- **Token savings** — how much does compact snapshot output reduce input token usage vs raw MCP output?
- **Quality** — does compression affect task-completion rate?
- **vs AXI** — how does `opera-browser-cli` compare to `chrome-devtools-axi`, an established browser CLI tool?
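
As a rough illustration of the first two questions, the metrics reduce to simple ratios. The function names below are hypothetical and the exact formulas used by `report.py` are not shown here, so treat this as one plausible definition.

```python
def token_savings(baseline_tokens: int, compact_tokens: int) -> float:
    """Fraction of input tokens saved relative to a baseline condition."""
    if baseline_tokens <= 0:
        # Avoid dividing by zero when the baseline has no recorded usage.
        return 0.0
    return (baseline_tokens - compact_tokens) / baseline_tokens


def pass_rate(verdicts: list[bool]) -> float:
    """Share of runs the judge marked as passing."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)
```

For example, with made-up numbers, a baseline of 1000 input tokens against 250 compact tokens gives a savings of 0.75.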

### Conditions

| ID | Description |
|-----------------|-----------------------------------------------------------------------------------------|
| `opera-compact` | `opera-browser-cli` default — compact snapshots with URL compression (our tool) |
| `opera-raw` | `opera-browser-cli --raw` — uncompressed MCP output piped through our CLI |
| `mcp-raw` | Raw `take_snapshot` via bridge HTTP API — no compression at all (chrome-mcp equivalent) |
| `axi` | `chrome-devtools-axi` CLI — external comparison baseline |

### Tasks

Seven browser tasks adapted from the [axi bench-browser benchmark](https://github.com/kunchenguid/axi/tree/main/bench-browser), covering single-step reads, multi-step navigation, and complex multi-page extraction:

| ID | Category | Target |
|------------------------------|---------------|------------------------------------------|
| `read_static_page` | single-step | example.com |
| `wikipedia_fact_lookup` | single-step | Wikipedia — Moon infobox |
| `github_repo_stars` | single-step | github.com/torvalds/linux |
| `wikipedia_table_read` | single-step | Wikipedia — population table |
| `wikipedia_link_follow` | multi-step | Wikipedia Ada Lovelace → Charles Babbage |
| `wikipedia_deep_extraction` | investigation | Wikipedia Nobel Physics laureates |
| `github_issue_investigation` | investigation | github.com/facebook/react/issues |

### Model

Model defaults are set in [`config/models.yaml`](config/models.yaml):

```yaml
agent:
  model: gpt-5.5
  reasoning_effort: medium

judge:
  model: gpt-5.5
  reasoning_effort: low
```

Both use the OpenAI Responses API (`/v1/responses`). The judge runs at lower effort since pass/fail grading is simpler than browser navigation. To use a different model for a run, pass CLI flags (see [CLI reference](#cli-reference)) — these override the config file without changing it.

## Setup

Requirements: Python 3.11+, `opera-browser-cli` in PATH, Opera/Chrome browser open.

```sh
cd benchmarks/snapshot-efficiency
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

For the `axi` condition, also install:

```sh
npm install -g chrome-devtools-axi
```

## Running

All commands run from `benchmarks/snapshot-efficiency/` with the venv active.

### Sanity check (1 run, 1 task)

```sh
OPENAI_API_KEY=<key> python src/run_benchmark.py \
--conditions opera-compact \
--tasks read_static_page \
--repeats 1
```

### Single condition

```sh
OPENAI_API_KEY=<key> python src/run_benchmark.py --conditions opera-compact --repeats 5
```

### All conditions (skipping axi if not installed)

```sh
OPENAI_API_KEY=<key> python src/run_benchmark.py \
--conditions opera-compact,opera-raw,mcp-raw \
--repeats 5
```

### Full matrix (requires chrome-devtools-axi)

```sh
OPENAI_API_KEY=<key> python src/run_benchmark.py --repeats 5
```

### Generate report

```sh
python src/report.py
# → results/report.md
```

## Linting & formatting

Install dev tools (separate from benchmark runtime deps):

```sh
pip install -r requirements-dev.txt
```

| Command | What it does |
|---|---|
| `make format` | Apply black + isort (local dev) |
| `make lint` | ruff + flake8 |
| `make typecheck` | mypy |
| `make check` | format-check + lint + typecheck, read-only (same as CI) |

Config lives in `pyproject.toml` (black, isort, ruff, mypy) and `.flake8`.
All tools are configured for 120-char line length.

## Source layout

```
src/
├── run_benchmark.py # entry point — CLI arg parsing, outer loop, artifact writing
├── agent.py # browser agent loop (AgentState, AgentResult, run_agent)
├── judge.py # LLM-as-judge pass/fail grading (grade)
├── tools.py # ToolSet subclasses (CLIToolSet, BridgeToolSet) + factory
├── llm.py # thin OpenAI Responses API wrapper (Client, Turn)
├── report.py # reads results/*.jsonl and writes results/report.md
└── utils.py # shared utilities (snapshot_chars)

config/
├── conditions.yaml # benchmark conditions (tool mode, CLI binary, bridge URL)
├── tasks.yaml # task prompts and grading hints
└── models.yaml # agent and judge model + reasoning_effort defaults
```

## CLI reference

```
python src/run_benchmark.py [options]

  --conditions              Comma-separated condition IDs (default: all four)
  --tasks                   Comma-separated task IDs (default: all seven)
  --repeats                 Runs per condition × task (default: 5)
  --model                   Agent model — overrides config/models.yaml
  --reasoning-effort        Agent reasoning effort: low / medium / high — overrides config/models.yaml
  --judge-model             Judge model — overrides config/models.yaml
  --judge-reasoning-effort  Judge reasoning effort: low / medium / high — overrides config/models.yaml
```

To permanently change the defaults, edit [`config/models.yaml`](config/models.yaml).
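
The flags listed above map naturally onto `argparse`. The sketch below is an assumption about how such a parser could look, not the parser in `run_benchmark.py`; `csv_list` and `build_parser` are hypothetical names, and `None` stands for "fall back to the config file or to all items".

```python
import argparse


def csv_list(value: str) -> list[str]:
    """Split a comma-separated flag value into a list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_benchmark.py")
    parser.add_argument("--conditions", type=csv_list, default=None)
    parser.add_argument("--tasks", type=csv_list, default=None)
    parser.add_argument("--repeats", type=int, default=5)
    for flag in ("--model", "--judge-model"):
        parser.add_argument(flag, default=None)
    for flag in ("--reasoning-effort", "--judge-reasoning-effort"):
        parser.add_argument(flag, choices=["low", "medium", "high"], default=None)
    return parser


args = build_parser().parse_args(
    ["--conditions", "opera-compact,opera-raw", "--repeats", "1"]
)
```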

## Results layout

```
results/
├── opera-compact.jsonl          # one record per run
├── opera-raw.jsonl
├── mcp-raw.jsonl
├── axi.jsonl
├── report.md                    # generated by report.py
└── {condition}/{task}/run{N}/
    ├── agent_output.json        # full trajectory + per-turn token usage
    ├── grade.json               # pass/fail verdict + reason
    └── result.json              # merged record (same shape as the .jsonl row)
```
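
Because each condition writes one JSON object per line, the per-condition files can be read back with the standard library alone. The sketch below shows one way to do that; `load_records` is a hypothetical helper, and the record fields are not assumed beyond each line being valid JSON.

```python
import json
from pathlib import Path


def load_records(results_dir: Path) -> dict[str, list[dict]]:
    """Map each condition ID (the .jsonl file stem) to its list of run records."""
    records: dict[str, list[dict]] = {}
    for path in sorted(results_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines so a trailing newline does not break parsing.
            records[path.stem] = [json.loads(line) for line in handle if line.strip()]
    return records
```

Calling `load_records(Path("results"))` from the benchmark directory would return one key per condition file, ignoring `report.md` and the per-run subdirectories.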

## Attribution

This benchmark is based on the [axi browser benchmark](https://github.com/kunchenguid/axi/tree/main/bench-browser) by [@kunchenguid](https://github.com/kunchenguid):

- **Task definitions** (`config/tasks.yaml`) — adapted directly from [`bench-browser/config/tasks.yaml`](https://github.com/kunchenguid/axi/blob/main/bench-browser/config/tasks.yaml)
- **LLM-as-judge grading approach** — adapted from [`bench-browser/src/grader.ts`](https://github.com/kunchenguid/axi/blob/main/bench-browser/src/grader.ts)
- **Benchmark methodology** (per-condition JSONL results, trajectory capture, usage metrics) — adapted from [`bench-browser/src/runner.ts`](https://github.com/kunchenguid/axi/blob/main/bench-browser/src/runner.ts)
- **`axi` condition** — uses [`chrome-devtools-axi`](https://github.com/kunchenguid/axi), the browser CLI tool the axi project benchmarks

The original benchmark uses TypeScript + Claude Sonnet. This port uses Python + OpenAI GPT-5.5 with the Responses API.
25 changes: 25 additions & 0 deletions benchmarks/snapshot-efficiency/config/conditions.yaml
@@ -0,0 +1,25 @@
conditions:
  - id: opera-compact
    description: opera-browser-cli default (compact snapshots, URL compression)
    tool_mode: cli
    cli_bin: opera-browser-cli
    raw: false

  - id: opera-raw
    description: opera-browser-cli with --raw flag (uncompressed MCP output)
    tool_mode: cli
    cli_bin: opera-browser-cli
    raw: true

  - id: mcp-raw
    description: Raw take_snapshot via bridge HTTP API, no compression layer
    tool_mode: bridge
    bridge_url: "http://localhost:9224"

  - id: axi
    description: chrome-devtools-axi CLI (external comparison baseline)
    tool_mode: cli
    cli_bin: chrome-devtools-axi
    raw: false
    start: "chrome-devtools-axi start"
    stop: "chrome-devtools-axi stop"
7 changes: 7 additions & 0 deletions benchmarks/snapshot-efficiency/config/models.yaml
@@ -0,0 +1,7 @@
agent:
  model: gpt-5.5
  reasoning_effort: medium

judge:
  model: gpt-5.5
  reasoning_effort: low