A DOM-driven browser agent powered by Gemini's native function calling. The agent uses a multi-turn chat with structured tool calls — no free-text JSON parsing.
Architecture: Snapshot → Interpreter → Multi-turn Chat (ReAct reasoning + function call) → Execution → Result fed back to chat
Key design choices:
- Native function calling — the LLM returns typed
FunctionCallobjects, not free-text JSON. Eliminates parsing failures. - Multi-turn conversation — the chat maintains history across steps. The LLM sees its prior reasoning and tool results.
- Chain-of-thought — the system prompt requires step-by-step reasoning (Observe → Think → Act) before every tool call.
- Few-shot examples — the system instruction includes worked examples of correct reasoning patterns.
- Human-in-the-loop — three approval modes (safe/hybrid/auto) gate risky actions.
- Two-tiered memory — the agent learns from mistakes across runs and self-optimizes over time.
- Python 3.11+
- Playwright CLI installed
google-genaiPython package (v1.66+)- A Gemini API key
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-browser-agent.txt
pip install -e .If playwright-cli is already available globally, skip this.
# Option A: global install (preferred)
# Follow your org's standard for installing the Playwright CLI
# Option B: use npx every time
npx playwright-cli open https://example.comIf you use npx, run the agent with --use-npx (or set it during setup).
Run once to create config at ~/.browser_agent/config.yaml:
browser-agent "open example.com" --safePrompts:
- API key
- model (default
gemini-1.5-flash) - default mode (
safe/hybrid/auto) - default Playwright session name (optional)
- default start URL (optional)
- whether to use
npxby default
--safe: approve every action. Best for first-time flows and logins.--hybrid: approves only risky actions (navigation, typing, submissions, destructive clicks).--auto: fully autonomous.
Examples:
browser-agent "find best padel rackets" --safe --start-url https://google.com
browser-agent "find best padel rackets" --hybrid --start-url https://google.com
browser-agent "find best padel rackets" --auto --start-url https://google.combrowser-agent "open youtube.com" --safebrowser-agent "search for best padel rackets" --safe --start-url https://google.combrowser-agent "check inbox" --session gmail --start-url https://mail.google.combrowser-agent "open gmail" --session gmail --persistent --headed --start-url https://mail.google.combrowser-agent "open gmail" --session gmail --profile ~/.browser-agent-profiles/gmail --headed --start-url https://mail.google.com- Launch in safe mode with a persistent profile:
browser-agent "login to site" --safe --persistent --headed --start-url https://example.com/login- Approve steps or manually complete login in the browser window.
- Save storage state after login:
playwright-cli state-save auth.json- Restore login next time:
playwright-cli state-load auth.json- Re-run the agent with the same session/profile.
playwright-cli state-load auth.json
browser-agent "check account" --safe --start-url https://example.combrowser-agent "open account A" --session acct-a --persistent --headed --start-url https://example.com/login
browser-agent "open account B" --session acct-b --persistent --headed --start-url https://example.com/login- Use
--safefor any login or payment flow. - Use
--persistentwith a named--sessionfor repeat logins. - Save
state-saveafter successful login. - Use
--debugwhen a flow fails to capture trace + video. - Start with smaller tasks and increase autonomy gradually.
- Run
--autoon payment, checkout, or account-modifying tasks. - Rely on storage state across unrelated sessions or profiles.
- Leave stale persistent profiles uncleaned for high‑risk apps.
- Ignore repeated action loops; stop and review logs.
browser-agent "search for best padel rackets" --debug --start-url https://google.comOutputs:
- Traces:
.playwright-cli/traces/ - Video:
runs/<run_id>/session.webm
playwright-cli tracing-start
# run actions
playwright-cli tracing-stopEach run produces:
runs/<run_id>/
snapshots/
screenshots/
actions.jsonl # Every action executed + approval status + stdout/stderr
llm_responses.jsonl # Tool calls + reasoning text from the LLM
browser_state.jsonl # URL, title, snapshot paths per step
interpreter_state.jsonl # Parsed page state per step
memory_events.jsonl # Memory access, recalls, learning, promotions
run_meta.json # Task, stop_reason, step count, runtime
The agent has a two-tiered memory system that learns from failure→recovery patterns across runs. Every memory interaction is logged to memory_events.jsonl:
| Event | What it tells you |
|---|---|
tier1_loaded |
Which universal lessons were injected into the system prompt |
error_recall |
A command failed — did memory find relevant tips? |
domain_recall |
Navigated to a new domain — any site-specific advice? |
lesson_recorded |
Post-run learning found a new failure→recovery pattern |
lesson_deduplicated |
An existing lesson was reinforced (use count bumped) |
lesson_promoted |
A lesson graduated from reactive (Tier 2) to always-on (Tier 1) |
lessons_pruned |
Stale lessons were cleaned up on startup |
Query examples:
# All memory activity for a run
cat runs/run_*/memory_events.jsonl | python -m json.tool
# Did Trigger A (error recall) fire? What matched?
grep "error_recall" runs/run_*/memory_events.jsonl
# Did Trigger B (domain recall) fire?
grep "domain_recall" runs/run_*/memory_events.jsonl
# Any promotions across all runs?
grep "lesson_promoted" runs/run_*/memory_events.jsonlThe memory store itself is persisted at ~/.browser_agent/memory.json. See docs/memory-system.md for the full architecture.
playwright-cli open https://example.com
playwright-cli snapshot
playwright-cli click e12
playwright-cli type "search query"
playwright-cli press Enter
playwright-cli screenshot
playwright-cli closeInstall it globally or run with --use-npx.
Switch to a lower-cost model (e.g., gemini-1.5-flash) or wait for quota reset.
Use --hybrid or --auto if you want fewer prompts.
- Keep sensitive tasks in
--safe. - Store login profiles in a dedicated folder per account.
- Use a dedicated browser profile for automation to avoid leaking personal sessions.
Run these once to confirm Playwright CLI and the agent are working:
playwright-cli open https://example.com
playwright-cli snapshot
playwright-cli close
browser-agent "open example.com" --safeIf playwright-cli is missing, use:
npx playwright-cli open https://example.com- Use
--safefor any login flow. - Complete MFA manually in the headed browser.
- Avoid automating OTP codes unless explicitly required by policy.
- Use
--debugto capture traces and video. - Try a fresh profile:
--persistent --profile ~/.browser-agent-profiles/<site>. - Clear stale session data:
playwright-cli -s=mysession delete-data- Sessions isolate cookies/storage by
--sessionname. - Use persistent profiles for long‑lived logins.
- Delete persistent data if it becomes corrupt or risky.
Commands:
playwright-cli list
playwright-cli close-all
playwright-cli kill-all
playwright-cli -s=mysession delete-data- Check
runs/<run_id>/run_meta.jsonfor stop reason. - Check
actions.jsonlfor execution errors. - Check
llm_responses.jsonlfor planner errors or malformed actions. - Inspect
snapshots/step_XXXX.txtto see the DOM references. - Use
--debugfor trace/video if the issue is visual or timing‑related.
- SAFE: every action requires approval.
- HYBRID: approves only risky actions (navigation, typing, storage changes, destructive clicks).
- AUTO: no approvals.
If you see repeated approvals on a low‑risk task, switch to --hybrid.
The agent enforces a strict whitelist, aligned to the skill docs. It will reject commands that:
- are not in the allowed list
- have malformed args (e.g.,
check --url ...) - do not target valid element refs when required
browser-agent "search for best padel rackets and open the first result" --safe --start-url https://google.combrowser-agent "fill the contact form with my name and email" --safe --start-url https://example.com/contactplaywright-cli close-all
playwright-cli kill-all
playwright-cli delete-data- Use
--headedduring early development for visibility. - Use
--start-urlto reduce unnecessary navigation steps. - Use a dedicated profile folder per site to isolate login state.