ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration)#83
Open
heskew wants to merge 8 commits into
Open
ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration)#83heskew wants to merge 8 commits into
heskew wants to merge 8 commits into
Conversation
5 tasks
…teration) Symmetric pin across `claude-review.yml` and `gemini-review.yml`, bumped from `128656e4` → `d446b4c6` (skipping the intermediate `7c8aae0a` since `d446b4c6` includes everything from #30/#31 PLUS the debug-iteration changes from #32). ## What `gemini-review.yml` picks up (substantive) The MCP-based agentic Gemini reviewer (#30) + debug iteration settings (#32): - Architecture: workflow-posts-single-shot → MCP-based agentic tool use (Docker'd `github-mcp-server`, inline comments, `pull_request_review_write` etc.) - Custom `/harper-review` slash command composed from our layered review scope - Severity tagging (`🔴` Critical / `🟠` High only) - Debug iteration (this PR's new content): - `gemini_debug: 'true'` — CLI stdout/stderr streamed inline in the workflow log so MCP startup and tool registration are visible - `tools.core` allowlist REMOVED — opens shell surface so we can observe what the agent actually wants to call (the first trial of #30 hit "Tool execution denied by policy" on `run_shell_command` without surfacing which commands were attempted) - `maxSessionTurns: 15` → `30` — headroom while iterating Patterns adopted from Google's official PR review example for `run-gemini-cli`; reimplemented around our auth gate, layered scope, and log-issue threading. See the upstream `_gemini-review.yml` header comment for the upstream-vs-ours diff. ## What `claude-review.yml` picks up (no functional change) Just stays in lockstep on the SHA. The Gemini-side changes don't affect Claude: shared `find-prior-review-comment.sh` and `log-review-to-ai-review-log.sh` continue to default through their pre-Gemini code paths. Also inherits the validator hardening (#31): Docker image references in reusable workflows are now lint-required to use `@sha256:` digest pinning. Claude's caller has none, so it's a no-op for Claude — but documents the discipline going forward. ## This PR's history Originally opened as a bump from `128656e4` → `7c8aae0a` (the post-#30 pin). The initial trial-run failed at the gemini-cli step with three compounding issues (MCP tools never registered, shell commands denied, maxSessionTurns hit). Branch is now amended forward to `d446b4c6` which includes the debug iteration that should surface root causes on the next run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
789b725 to
0f1ada8
Compare
Atomic single-call submission for Gemini (`pull_request_review_write` with `method=create, event=COMMENT`) replaces the multi-step pending-review flow that hit submission races in the previous iteration. Prompt also tightens scope discipline so the agent stops reviewing source files outside the diff. Claude has no functional change from #33; pin moves to stay symmetric with the Gemini pin within this repo. Re-using this PR as the comparison-test surface for one more iteration. If the run is clean, next bump disables `gemini_debug` and finalizes the settings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#34 fixes the post-#33 turn-exhaustion: passes PR-context env vars (PULL_REQUEST_NUMBER, REPOSITORY, ISSUE_TITLE, ISSUE_BODY) so the agent doesn't burn turns running `printenv` / `env | grep` / `git remote -v` trying to figure out which PR it's reviewing. Also widens tools.core to match upstream (`head`/`tail` added) so denial retries stop eating the turn budget. Claude has no functional change from #34; pin moves to stay symmetric with the Gemini pin within this repo. Continuing to use this PR as the comparison-test surface for one more iteration. If clean, next bump disables `gemini_debug`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#35 pivots Gemini from our custom `/harper-review` slash command to the upstream `/pr-code-review` from the gemini-cli-extensions/code-review extension. The custom command was fighting the model's training: agent reverted to shell-based PR exploration (printenv, gh pr view, ls -R, npm list — all denied, all retried) and never reached MCP. Upstream's slash command is the pattern the model was trained on. Harper-specific scope (layered reviews, severity discipline, trivial-diff short-circuit, marker requirement) is now layered in via the ADDITIONAL_CONTEXT env var, which the upstream prompt substitutes at render time. Claude has no functional change from #35; pin moves to stay symmetric with the Gemini pin within this repo. Continuing to use this PR as the comparison-test surface. If clean, next bump disables gemini_debug and removes the now-unused legacy prompt template. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#36 adds caller-side `if: failure()` artifact upload + job-summary tail. The previous run-attempt failed at gemini-cli but produced no artifacts because upstream's run-gemini-cli@v0.1.22 doesn't gate its inner upload step with `if: always()` — composite-action failure semantics short-circuit it. Purpose of THIS run-attempt: capture the full post-#35 agent trace so we can finish diagnosing the actual failure mode (the prior attempt's gh-CLI log truncated to the first ~5 seconds out of ~38). No functional change to the Gemini run itself. Claude has no functional change from #36; pin moves to stay symmetric with the Gemini pin within this repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throwaway debug workflow to isolate why MCP tools never propagate to
the model's function-call surface in the production gemini-review.yml
(despite the github-mcp-server starting fine and registering session-
side). Telemetry on the prior runs proves it definitively: every tool
call has `tool_type: "native"`; zero `tool_type: "mcp"`.
Hypothesis under test
---------------------
The production caller pins `ghcr.io/github/github-mcp-server@sha256:
e3816a4...` which is v1.0.4 (released 2026-05-11). v1.0.0 (2026-04-16)
was a major-version bump that upgraded modelcontextprotocol/go-sdk to
v1.5.0 and reorganized toolsets, introducing feature-flagged tool
groups gated by `dynamicToolsets` (visible in our server logs as
`dynamicToolsets=false`). Upstream's example pins to `:v0.27.0`
(Feb 2026 era), predating that reorganization. Our bump to v1.0.x was
Harper-side, well-intended ("newer is better"), but appears to be the
regression.
This workflow tests the simplest variable: same gemini-cli, same
action, same prompt, same model — only the MCP server version differs.
If MCP propagates with `:v0.27.0`, the root cause is the v1.0.x server.
Approach
--------
Mirrors upstream's tip-of-main pr-review example as closely as
practical, with minimal Harper-specific deviations:
- Auth: skip App-token-mint step. Default GITHUB_TOKEN only. Single-
repo debug branch; only org members push. Backport will reintroduce
the gate via the reusable.
- Model: hardcoded gemini-3-flash-preview (our known-good choice).
- Layered Harper scope: NOT applied. ADDITIONAL_CONTEXT is empty.
Prove MCP propagates first; reintroduce scope after.
- tools.core: [] (empty) — forces all I/O through MCP. Cleanest
signal: if MCP is broken, the agent has nothing.
- MCP server pin: TAG `:v0.27.0` instead of `@sha256:...` digest.
SHA-pinning discipline waived for the spike; will digest-pin on
backport.
- Visibility: failure-path artifact upload + summary tail (same as
_gemini-review.yml post-#36, with the pipefail/SIGPIPE bug fixed
via `set -eu` instead of `set -euo pipefail` plus `|| true`
around the offending pipe).
Iteration model
---------------
Edit this file → push → wait ~90s. No PR ceremony on ai-review-prompts.
Runs in parallel with the production caller via a distinct concurrency
group, so we can A/B compare side-by-side on the same PR.
When the working config is identified, backport to
`HarperFast/ai-review-prompts/_gemini-review.yml` in one focused PR.
Delete this file after backport.
Previous debug run (25874573129) confirmed MCP tools register
correctly under server v0.27.0 (3 tools discovered: pull_request_read,
pull_request_review_write, add_comment_to_pending_review). But the
model failed to use them — INVALID_STREAM error with 0 output tokens
across 4 API requests. Mute model.
The extension's GEMINI.md loaded into agent context lists two
top-level commands:
/code-review - "When a user requests that code changes be reviewed"
/pr-review - "When a user requests that pr to be reviewed,
look at user provided input '{{args}}' and
environment variables to see if $REPOSITORY,
$PULL_REQUEST_NUMBER, and $ADDITIONAL_CONTEXT
are set."
Notably `/pr-review` (NOT `/pr-code-review`) is what the extension
documents as the entry point. `/pr-code-review` exists as a file in
the extension's commands/ directory but isn't mentioned in GEMINI.md.
Possibility: `/pr-code-review` is internal / older / deprecated, and
`/pr-review` is the intended public command (may be a routing alias).
Cheap test. If `/pr-review` doesn't resolve, gemini-cli will fail
clearly with a "command not found" error. If it does resolve and
behaves better than `/pr-code-review`, we've found the right command.
After 3 MCP-debug iterations proving it's not happening:
- Run 25869995910: server v1.0.4, 0 MCP tools propagated
- Run 25874573129: server v0.27.0, 3 MCP tools registered, model
silent (INVALID_STREAM, 0 output tokens)
- Run 25879095909: server v0.27.0 + /pr-review slash command,
model hallucinated a full review of fictional google/oauth#42
using <call:...> tags as prose (0 actual tool calls)
Across all 3, the model demonstrated a strong preference for `gh`
CLI patterns even when shell was denied. It tried `gh pr view`,
`gh pr diff`, `printenv`, etc. in every run.
Pragmatic pivot: stop fighting the model.
- Drop mcpServers entirely
- Drop the upstream code-review extension entirely
- No custom slash command file; inline prompt only
- tools.core: gh, git, cat, echo, grep, head, tail
- Agent reads diff via `gh pr diff`, posts via
`gh pr review --comment --body`
Loses true inline comments (would require MCP). Findings reference
file:line as plain-text `**File:** path:N` in the body.
Goal of this iteration: produce A working Gemini review on PR #83
to compare against Claude. Architecture refinement comes later.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Symmetric pin bump across
claude-review.ymlandgemini-review.yml, from128656e4→d446b4c6. The Gemini-side change is substantive (MCP pivot + debug iteration); Claude inherits no functional changes.Branch history: this PR was originally opened to bump to the post-#30 pin (
7c8aae0a). The trial run on that pin failed at the gemini-cli step with three compounding issues — MCP tools never registered, shell commands denied,maxSessionTurnsexhausted. Force-pushed forward tod446b4c6which includes the debug iteration (#32) that should surface root causes on the next run.What
gemini-review.ymlpicks upArchitectural pivot (#30):
github-mcp-server, agent posts inline comments + a marker'd review summary viapull_request_review_write)/harper-reviewslash command composed at runtime from our layered review scope🔴 Criticaland🟠 Highare postedDebug iteration (#32):
gemini_debug: 'true'— streams CLI stdout/stderr inline in the workflow log; MCP startup and tool registration become visibletools.coreallowlist removed — temporarily opens shell surface so we observe what the agent actually wants to call (last trial got "denied by policy" without surfacing which commands)maxSessionTurns: 15→30— headroom while iteratingAfter this trial converges, we tighten back toward production: re-add a fact-based
tools.coreallowlist, lower turn budget, disable debug.What
claude-review.ymlpicks upPin stays in lockstep on the SHA. No functional changes — the shared scripts continue to default through their pre-Gemini code paths. Inherits the validator hardening from #31 (Docker image references lint-required to use
@sha256:digest), which is a no-op for Claude (no images referenced).Expected first-run quirk
This PR modifies
claude-review.yml, so the Claude reviewer will fail with the known App-token-401 validation gotcha. Harmless and documented. Gemini side isn't affected.What we're watching for in the next Gemini run
githubMCP server registers any tools (or fails to spawn the Docker container — debug output will tell us)run_shell_commandarguments the agent attemptsIf MCP loads: tighten the
tools.coreallowlist based on observed commands, lower the turn budget, disable debug, ship.If MCP still doesn't register after we can see the diagnostic output: investigate further or fall back to single-shot mode with the tuning settings (Option C from the earlier debug summary).
🤖 Generated with Claude Code