Skip to content

ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration)#83

Open
heskew wants to merge 8 commits into
mainfrom
ci/bump-ai-review-prompts-post-30
Open

ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration)#83
heskew wants to merge 8 commits into
mainfrom
ci/bump-ai-review-prompts-post-30

Conversation

@heskew
Copy link
Copy Markdown
Member

@heskew heskew commented May 13, 2026

Summary

Symmetric pin bump across claude-review.yml and gemini-review.yml, from 128656e4d446b4c6. The Gemini-side change is substantive (MCP pivot + debug iteration); Claude inherits no functional changes.

Branch history: this PR was originally opened to bump to the post-#30 pin (7c8aae0a). The trial run on that pin failed at the gemini-cli step with three compounding issues — MCP tools never registered, shell commands denied, maxSessionTurns exhausted. Force-pushed forward to d446b4c6 which includes the debug iteration (#32) that should surface root causes on the next run.

What gemini-review.yml picks up

Architectural pivot (#30):

  • Workflow-posts-single-shot → MCP-based agentic tool use (Docker'd github-mcp-server, agent posts inline comments + a marker'd review summary via pull_request_review_write)
  • Custom /harper-review slash command composed at runtime from our layered review scope
  • Severity tagging discipline: only 🔴 Critical and 🟠 High are posted

Debug iteration (#32):

  • gemini_debug: 'true' — streams CLI stdout/stderr inline in the workflow log; MCP startup and tool registration become visible
  • tools.core allowlist removed — temporarily opens shell surface so we observe what the agent actually wants to call (last trial got "denied by policy" without surfacing which commands)
  • maxSessionTurns: 1530 — headroom while iterating

After this trial converges, we tighten back toward production: re-add a fact-based tools.core allowlist, lower turn budget, disable debug.

What claude-review.yml picks up

Pin stays in lockstep on the SHA. No functional changes — the shared scripts continue to default through their pre-Gemini code paths. Inherits the validator hardening from #31 (Docker image references lint-required to use @sha256: digest), which is a no-op for Claude (no images referenced).

Expected first-run quirk

This PR modifies claude-review.yml, so the Claude reviewer will fail with the known App-token-401 validation gotcha. Harmless and documented. Gemini side isn't affected.

What we're watching for in the next Gemini run

  • Whether the github MCP server registers any tools (or fails to spawn the Docker container — debug output will tell us)
  • What run_shell_command arguments the agent attempts
  • Whether the agent successfully submits a review with the open shell allowlist

If MCP loads: tighten the tools.core allowlist based on observed commands, lower the turn budget, disable debug, ship.

If MCP still doesn't register after we can see the diagnostic output: investigate further or fall back to single-shot mode with the tuning settings (Option C from the earlier debug summary).

🤖 Generated with Claude Code

…teration)

Symmetric pin across `claude-review.yml` and `gemini-review.yml`,
bumped from `128656e4` → `d446b4c6` (skipping the intermediate
`7c8aae0a` since `d446b4c6` includes everything from #30/#31 PLUS
the debug-iteration changes from #32).

## What `gemini-review.yml` picks up (substantive)

The MCP-based agentic Gemini reviewer (#30) + debug iteration
settings (#32):

- Architecture: workflow-posts-single-shot → MCP-based agentic
  tool use (Docker'd `github-mcp-server`, inline comments,
  `pull_request_review_write` etc.)
- Custom `/harper-review` slash command composed from our
  layered review scope
- Severity tagging (`🔴` Critical / `🟠` High only)
- Debug iteration (this PR's new content):
  - `gemini_debug: 'true'` — CLI stdout/stderr streamed inline
    in the workflow log so MCP startup and tool registration are
    visible
  - `tools.core` allowlist REMOVED — opens shell surface so we
    can observe what the agent actually wants to call (the first
    trial of #30 hit "Tool execution denied by policy" on
    `run_shell_command` without surfacing which commands were
    attempted)
  - `maxSessionTurns: 15` → `30` — headroom while iterating

Patterns adopted from Google's official PR review example for
`run-gemini-cli`; reimplemented around our auth gate, layered
scope, and log-issue threading. See the upstream
`_gemini-review.yml` header comment for the upstream-vs-ours diff.

## What `claude-review.yml` picks up (no functional change)

Just stays in lockstep on the SHA. The Gemini-side changes don't
affect Claude: shared `find-prior-review-comment.sh` and
`log-review-to-ai-review-log.sh` continue to default through
their pre-Gemini code paths.

Also inherits the validator hardening (#31): Docker image
references in reusable workflows are now lint-required to use
`@sha256:` digest pinning. Claude's caller has none, so it's a
no-op for Claude — but documents the discipline going forward.

## This PR's history

Originally opened as a bump from `128656e4` → `7c8aae0a` (the
post-#30 pin). The initial trial-run failed at the gemini-cli
step with three compounding issues (MCP tools never registered,
shell commands denied, maxSessionTurns hit). Branch is now
amended forward to `d446b4c6` which includes the debug iteration
that should surface root causes on the next run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@heskew heskew force-pushed the ci/bump-ai-review-prompts-post-30 branch from 789b725 to 0f1ada8 Compare May 13, 2026 20:57
@heskew heskew changed the title ci: bump AI-review pins to ai-review-prompts@7c8aae0a (Gemini MCP pivot) ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration) May 13, 2026
heskew and others added 7 commits May 13, 2026 16:05
Atomic single-call submission for Gemini (`pull_request_review_write`
with `method=create, event=COMMENT`) replaces the multi-step
pending-review flow that hit submission races in the previous
iteration. Prompt also tightens scope discipline so the agent stops
reviewing source files outside the diff.

Claude has no functional change from #33; pin moves to stay
symmetric with the Gemini pin within this repo.

Re-using this PR as the comparison-test surface for one more
iteration. If the run is clean, next bump disables `gemini_debug`
and finalizes the settings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#34 fixes the post-#33 turn-exhaustion: passes PR-context env vars
(PULL_REQUEST_NUMBER, REPOSITORY, ISSUE_TITLE, ISSUE_BODY) so the
agent doesn't burn turns running `printenv` / `env | grep` / `git
remote -v` trying to figure out which PR it's reviewing. Also
widens tools.core to match upstream (`head`/`tail` added) so denial
retries stop eating the turn budget.

Claude has no functional change from #34; pin moves to stay
symmetric with the Gemini pin within this repo.

Continuing to use this PR as the comparison-test surface for one
more iteration. If clean, next bump disables `gemini_debug`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#35 pivots Gemini from our custom `/harper-review` slash command to
the upstream `/pr-code-review` from the gemini-cli-extensions/code-review
extension. The custom command was fighting the model's training:
agent reverted to shell-based PR exploration (printenv, gh pr view,
ls -R, npm list — all denied, all retried) and never reached MCP.
Upstream's slash command is the pattern the model was trained on.

Harper-specific scope (layered reviews, severity discipline,
trivial-diff short-circuit, marker requirement) is now layered in
via the ADDITIONAL_CONTEXT env var, which the upstream prompt
substitutes at render time.

Claude has no functional change from #35; pin moves to stay
symmetric with the Gemini pin within this repo.

Continuing to use this PR as the comparison-test surface. If clean,
next bump disables gemini_debug and removes the now-unused legacy
prompt template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#36 adds caller-side `if: failure()` artifact upload + job-summary
tail. The previous run-attempt failed at gemini-cli but produced no
artifacts because upstream's run-gemini-cli@v0.1.22 doesn't gate its
inner upload step with `if: always()` — composite-action failure
semantics short-circuit it.

Purpose of THIS run-attempt: capture the full post-#35 agent trace
so we can finish diagnosing the actual failure mode (the prior
attempt's gh-CLI log truncated to the first ~5 seconds out of ~38).
No functional change to the Gemini run itself.

Claude has no functional change from #36; pin moves to stay
symmetric with the Gemini pin within this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throwaway debug workflow to isolate why MCP tools never propagate to
the model's function-call surface in the production gemini-review.yml
(despite the github-mcp-server starting fine and registering session-
side). Telemetry on the prior runs proves it definitively: every tool
call has `tool_type: "native"`; zero `tool_type: "mcp"`.

Hypothesis under test
---------------------

The production caller pins `ghcr.io/github/github-mcp-server@sha256:
e3816a4...` which is v1.0.4 (released 2026-05-11). v1.0.0 (2026-04-16)
was a major-version bump that upgraded modelcontextprotocol/go-sdk to
v1.5.0 and reorganized toolsets, introducing feature-flagged tool
groups gated by `dynamicToolsets` (visible in our server logs as
`dynamicToolsets=false`). Upstream's example pins to `:v0.27.0`
(Feb 2026 era), predating that reorganization. Our bump to v1.0.x was
Harper-side, well-intended ("newer is better"), but appears to be the
regression.

This workflow tests the simplest variable: same gemini-cli, same
action, same prompt, same model — only the MCP server version differs.
If MCP propagates with `:v0.27.0`, the root cause is the v1.0.x server.

Approach
--------

Mirrors upstream's tip-of-main pr-review example as closely as
practical, with minimal Harper-specific deviations:

  - Auth: skip App-token-mint step. Default GITHUB_TOKEN only. Single-
    repo debug branch; only org members push. Backport will reintroduce
    the gate via the reusable.
  - Model: hardcoded gemini-3-flash-preview (our known-good choice).
  - Layered Harper scope: NOT applied. ADDITIONAL_CONTEXT is empty.
    Prove MCP propagates first; reintroduce scope after.
  - tools.core: [] (empty) — forces all I/O through MCP. Cleanest
    signal: if MCP is broken, the agent has nothing.
  - MCP server pin: TAG `:v0.27.0` instead of `@sha256:...` digest.
    SHA-pinning discipline waived for the spike; will digest-pin on
    backport.
  - Visibility: failure-path artifact upload + summary tail (same as
    _gemini-review.yml post-#36, with the pipefail/SIGPIPE bug fixed
    via `set -eu` instead of `set -euo pipefail` plus `|| true`
    around the offending pipe).

Iteration model
---------------

Edit this file → push → wait ~90s. No PR ceremony on ai-review-prompts.
Runs in parallel with the production caller via a distinct concurrency
group, so we can A/B compare side-by-side on the same PR.

When the working config is identified, backport to
`HarperFast/ai-review-prompts/_gemini-review.yml` in one focused PR.
Delete this file after backport.
Previous debug run (25874573129) confirmed MCP tools register
correctly under server v0.27.0 (3 tools discovered: pull_request_read,
pull_request_review_write, add_comment_to_pending_review). But the
model failed to use them — INVALID_STREAM error with 0 output tokens
across 4 API requests. Mute model.

The extension's GEMINI.md loaded into agent context lists two
top-level commands:

  /code-review  - "When a user requests that code changes be reviewed"
  /pr-review    - "When a user requests that pr to be reviewed,
                   look at user provided input '{{args}}' and
                   environment variables to see if $REPOSITORY,
                   $PULL_REQUEST_NUMBER, and $ADDITIONAL_CONTEXT
                   are set."

Notably `/pr-review` (NOT `/pr-code-review`) is what the extension
documents as the entry point. `/pr-code-review` exists as a file in
the extension's commands/ directory but isn't mentioned in GEMINI.md.
Possibility: `/pr-code-review` is internal / older / deprecated, and
`/pr-review` is the intended public command (may be a routing alias).

Cheap test. If `/pr-review` doesn't resolve, gemini-cli will fail
clearly with a "command not found" error. If it does resolve and
behaves better than `/pr-code-review`, we've found the right command.
After 3 MCP-debug iterations proving it's not happening:
  - Run 25869995910: server v1.0.4, 0 MCP tools propagated
  - Run 25874573129: server v0.27.0, 3 MCP tools registered, model
    silent (INVALID_STREAM, 0 output tokens)
  - Run 25879095909: server v0.27.0 + /pr-review slash command,
    model hallucinated a full review of fictional google/oauth#42
    using <call:...> tags as prose (0 actual tool calls)

Across all 3, the model demonstrated a strong preference for `gh`
CLI patterns even when shell was denied. It tried `gh pr view`,
`gh pr diff`, `printenv`, etc. in every run.

Pragmatic pivot: stop fighting the model.

  - Drop mcpServers entirely
  - Drop the upstream code-review extension entirely
  - No custom slash command file; inline prompt only
  - tools.core: gh, git, cat, echo, grep, head, tail
  - Agent reads diff via `gh pr diff`, posts via
    `gh pr review --comment --body`

Loses true inline comments (would require MCP). Findings reference
file:line as plain-text `**File:** path:N` in the body.

Goal of this iteration: produce A working Gemini review on PR #83
to compare against Claude. Architecture refinement comes later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant