Agent-loop e2e suite for style_update, a11y_fix, error_fix (ANN-6) by kurtstohrer · Pull Request #37 · kurtstohrer/annotask

kurtstohrer · 2026-05-12T20:22:51Z

Summary

Stands up the agent round-trip plumbing as a CI-blocking artifact for the public demo and design-partner pitch. Three task types — style_update, a11y_fix, error_fix — run on React+Vite (react-workflows) and Vue+Vite (vue-data-lab) MFEs in the stress lab.
Each spec captures the test-only AgentLoopTarget component, seeds a deterministic task, drives the shell, runs a rule-based simulator that follows the same MCP-CLI sequence a real coding agent would, verifies the iframe DOM / axe rescan / console stream after Vite HMR, then restores the captured files.
Per-run metrics land as JSON under playgrounds/stress-test/e2e/annotask/reports/agent-loop/. Schema documented in docs/agent-loop-evals.md; CI uploads the directory as an artifact.
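
A minimal sketch of what one per-run metric record and its file path could look like. The real schema is the one in docs/agent-loop-evals.md; every field name here (taskType, mfe, passed, durationMs, startedAt), the file-naming scheme, and the helpers metricPath and serializeMetric are assumptions made for illustration only.

```typescript
// Hypothetical metric shape. The authoritative schema is docs/agent-loop-evals.md.
type TaskType = "style_update" | "a11y_fix" | "error_fix";
type Mfe = "react-workflows" | "vue-data-lab";

interface AgentLoopMetric {
  taskType: TaskType;
  mfe: Mfe;
  passed: boolean;
  durationMs: number;
  startedAt: string; // ISO 8601 timestamp
}

// Directory named in the PR description.
const REPORT_DIR = "playgrounds/stress-test/e2e/annotask/reports/agent-loop";

// Assumed naming scheme: one file per (MFE, task type, start time).
function metricPath(m: AgentLoopMetric): string {
  // Replace ":" and "." so the timestamp is safe in a file name.
  const stamp = m.startedAt.replace(/[:.]/g, "-");
  return `${REPORT_DIR}/${m.mfe}-${m.taskType}-${stamp}.json`;
}

// Pretty-printed JSON with a trailing newline, so artifacts diff cleanly.
function serializeMetric(m: AgentLoopMetric): string {
  return JSON.stringify(m, null, 2) + "\n";
}
```

Keeping path construction and serialization as pure functions would let a spec's teardown write the file while unit tests check the shape without touching disk.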

v1 caveat

The simulator's apply step is intentionally rule-based, not LLM-driven. The harness measures plumbing reliability (does the task land, do the MCP tools work, does HMR pick the fix up, do metrics persist). LLM apply quality is the follow-up ticket — see the doc for where the LLM plugs into the same harness.
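
To make "rule-based, not LLM-driven" concrete, here is a rough sketch of a capture / apply / verify / restore cycle over an in-memory file map. It is not the harness's actual code: SimTask, RULES, runSimulatedFix, and the string replacements are all hypothetical, and the real suite verifies against the iframe DOM, an axe rescan, or the console stream rather than against the patched source.

```typescript
// Hypothetical names throughout; this illustrates the idea, not the real simulator.
interface SimTask {
  type: "style_update" | "a11y_fix" | "error_fix";
  file: string;
}

type Files = Map<string, string>;
type ApplyRule = (source: string) => string;

// One fixed, deterministic rewrite per task type (no model involved).
const RULES: Record<SimTask["type"], ApplyRule> = {
  style_update: (src) => src.replace("color: red", "color: green"),
  a11y_fix: (src) =>
    src.replace('<img src="logo.png">', '<img src="logo.png" alt="Logo">'),
  error_fix: (src) => src.replace("undefinedFn()", "definedFn()"),
};

function runSimulatedFix(
  files: Files,
  task: SimTask,
  verify: (patched: string) => boolean,
): boolean {
  // Capture the original content so it can be restored afterwards.
  const captured = files.get(task.file);
  if (captured === undefined) {
    throw new Error(`unknown file: ${task.file}`);
  }
  try {
    // Apply the rule for this task type and check the result.
    const patched = RULES[task.type](captured);
    files.set(task.file, patched);
    return verify(patched);
  } finally {
    // Restore the captured content whether or not verification passed.
    files.set(task.file, captured);
  }
}
```

Under this shape, a later LLM-driven apply step would only need to replace the RULES lookup, leaving capture, verification, and restore unchanged.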

Test plan

- CI agent-loop job runs pnpm test:e2e:stress:annotask:agent-loop against the focused playwright config and uploads the metrics artifact
- Existing pnpm test:e2e:stress:annotask still picks the new specs up via its annotask/ directory filter
- Manual: pnpm build && pnpm test:e2e:stress:annotask:agent-loop locally with the stress host + react-workflows + vue-data-lab dev servers reachable
- Spot-check JSON output for one passing run and confirm the schema in docs/agent-loop-evals.md matches

Closes ANN-6.

Stands up the agent round-trip plumbing as a measurable, CI-blocking artifact for the public demo and design-partner pitch. Each task type drives a full lifecycle on react-workflows (React+Vite) and vue-data-lab (Vue+Vite):

- Capture the AgentLoopTarget component + tracer stylesheet
- Seed a deterministic task shape via the per-MFE API
- Open the host shell, exercise the task panel / a11y scan / error monitor as appropriate
- Run a rule-based simulator that follows the same MCP-CLI sequence (annotask task / update-task) a real coding agent would
- Verify the iframe DOM, axe rescan, or console stream reflects the fix after Vite HMR
- Restore the captured files and emit a per-run JSON metric

The simulator is intentionally rule-based for v1; see docs/agent-loop-evals.md for the schema, the caveats around what is and isn't measured today, and where the LLM apply step plugs in for v2.

Per-test metrics land under playgrounds/stress-test/e2e/annotask/reports/agent-loop/ and are uploaded as a CI artifact. A focused playwright config (agent-loop.config.ts) only spins up host + the two target MFEs so the new CI job stays under the broader stress-cluster cost. The existing pnpm test:e2e:stress:annotask script still picks the tests up via its directory filter.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

kurtstohrer wants to merge 1 commit into main from ann-6-agent-loop-e2e
