ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration) by heskew · Pull Request #83 · HarperFast/oauth

heskew · 2026-05-13T20:00:26Z

Summary

Symmetric pin bump across claude-review.yml and gemini-review.yml, from 128656e4 → d446b4c6. The Gemini-side change is substantive (MCP pivot + debug iteration); Claude inherits no functional changes.

Branch history: this PR was originally opened to bump to the post-#30 pin (7c8aae0a). The trial run on that pin failed at the gemini-cli step with three compounding issues — MCP tools never registered, shell commands denied, maxSessionTurns exhausted. Force-pushed forward to d446b4c6 which includes the debug iteration (#32) that should surface root causes on the next run.

What `gemini-review.yml` picks up

Architectural pivot (#30):

Workflow-posts-single-shot → MCP-based agentic tool use (Docker'd github-mcp-server, agent posts inline comments + a marker'd review summary via pull_request_review_write)
Custom /harper-review slash command composed at runtime from our layered review scope
Severity tagging discipline: only 🔴 Critical and 🟠 High are posted

Debug iteration (#32):

gemini_debug: 'true' — streams CLI stdout/stderr inline in the workflow log; MCP startup and tool registration become visible
tools.core allowlist removed — temporarily opens shell surface so we observe what the agent actually wants to call (last trial got "denied by policy" without surfacing which commands)
maxSessionTurns: 15 → 30 — headroom while iterating

After this trial converges, we tighten back toward production: re-add a fact-based tools.core allowlist, lower turn budget, disable debug.

What `claude-review.yml` picks up

Pin stays in lockstep on the SHA. No functional changes — the shared scripts continue to default through their pre-Gemini code paths. Inherits the validator hardening from #31 (Docker image references lint-required to use @sha256: digest), which is a no-op for Claude (no images referenced).

Expected first-run quirk

This PR modifies claude-review.yml, so the Claude reviewer will fail with the known App-token-401 validation gotcha. Harmless and documented. Gemini side isn't affected.

What we're watching for in the next Gemini run

Whether the github MCP server registers any tools (or fails to spawn the Docker container — debug output will tell us)
What run_shell_command arguments the agent attempts
Whether the agent successfully submits a review with the open shell allowlist

If MCP loads: tighten the tools.core allowlist based on observed commands, lower the turn budget, disable debug, ship.

If MCP still doesn't register after we can see the diagnostic output: investigate further or fall back to single-shot mode with the tuning settings (Option C from the earlier debug summary).

🤖 Generated with Claude Code

…teration) Symmetric pin across `claude-review.yml` and `gemini-review.yml`, bumped from `128656e4` → `d446b4c6` (skipping the intermediate `7c8aae0a` since `d446b4c6` includes everything from #30/#31 PLUS the debug-iteration changes from #32). ## What `gemini-review.yml` picks up (substantive) The MCP-based agentic Gemini reviewer (#30) + debug iteration settings (#32): - Architecture: workflow-posts-single-shot → MCP-based agentic tool use (Docker'd `github-mcp-server`, inline comments, `pull_request_review_write` etc.) - Custom `/harper-review` slash command composed from our layered review scope - Severity tagging (`🔴` Critical / `🟠` High only) - Debug iteration (this PR's new content): - `gemini_debug: 'true'` — CLI stdout/stderr streamed inline in the workflow log so MCP startup and tool registration are visible - `tools.core` allowlist REMOVED — opens shell surface so we can observe what the agent actually wants to call (the first trial of #30 hit "Tool execution denied by policy" on `run_shell_command` without surfacing which commands were attempted) - `maxSessionTurns: 15` → `30` — headroom while iterating Patterns adopted from Google's official PR review example for `run-gemini-cli`; reimplemented around our auth gate, layered scope, and log-issue threading. See the upstream `_gemini-review.yml` header comment for the upstream-vs-ours diff. ## What `claude-review.yml` picks up (no functional change) Just stays in lockstep on the SHA. The Gemini-side changes don't affect Claude: shared `find-prior-review-comment.sh` and `log-review-to-ai-review-log.sh` continue to default through their pre-Gemini code paths. Also inherits the validator hardening (#31): Docker image references in reusable workflows are now lint-required to use `@sha256:` digest pinning. Claude's caller has none, so it's a no-op for Claude — but documents the discipline going forward. ## This PR's history Originally opened as a bump from `128656e4` → `7c8aae0a` (the post-#30 pin). The initial trial-run failed at the gemini-cli step with three compounding issues (MCP tools never registered, shell commands denied, maxSessionTurns hit). Branch is now amended forward to `d446b4c6` which includes the debug iteration that should surface root causes on the next run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Atomic single-call submission for Gemini (`pull_request_review_write` with `method=create, event=COMMENT`) replaces the multi-step pending-review flow that hit submission races in the previous iteration. Prompt also tightens scope discipline so the agent stops reviewing source files outside the diff. Claude has no functional change from #33; pin moves to stay symmetric with the Gemini pin within this repo. Re-using this PR as the comparison-test surface for one more iteration. If the run is clean, next bump disables `gemini_debug` and finalizes the settings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#34 fixes the post-#33 turn-exhaustion: passes PR-context env vars (PULL_REQUEST_NUMBER, REPOSITORY, ISSUE_TITLE, ISSUE_BODY) so the agent doesn't burn turns running `printenv` / `env | grep` / `git remote -v` trying to figure out which PR it's reviewing. Also widens tools.core to match upstream (`head`/`tail` added) so denial retries stop eating the turn budget. Claude has no functional change from #34; pin moves to stay symmetric with the Gemini pin within this repo. Continuing to use this PR as the comparison-test surface for one more iteration. If clean, next bump disables `gemini_debug`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#35 pivots Gemini from our custom `/harper-review` slash command to the upstream `/pr-code-review` from the gemini-cli-extensions/code-review extension. The custom command was fighting the model's training: agent reverted to shell-based PR exploration (printenv, gh pr view, ls -R, npm list — all denied, all retried) and never reached MCP. Upstream's slash command is the pattern the model was trained on. Harper-specific scope (layered reviews, severity discipline, trivial-diff short-circuit, marker requirement) is now layered in via the ADDITIONAL_CONTEXT env var, which the upstream prompt substitutes at render time. Claude has no functional change from #35; pin moves to stay symmetric with the Gemini pin within this repo. Continuing to use this PR as the comparison-test surface. If clean, next bump disables gemini_debug and removes the now-unused legacy prompt template. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#36 adds caller-side `if: failure()` artifact upload + job-summary tail. The previous run-attempt failed at gemini-cli but produced no artifacts because upstream's run-gemini-cli@v0.1.22 doesn't gate its inner upload step with `if: always()` — composite-action failure semantics short-circuit it. Purpose of THIS run-attempt: capture the full post-#35 agent trace so we can finish diagnosing the actual failure mode (the prior attempt's gh-CLI log truncated to the first ~5 seconds out of ~38). No functional change to the Gemini run itself. Claude has no functional change from #36; pin moves to stay symmetric with the Gemini pin within this repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Throwaway debug workflow to isolate why MCP tools never propagate to the model's function-call surface in the production gemini-review.yml (despite the github-mcp-server starting fine and registering session- side). Telemetry on the prior runs proves it definitively: every tool call has `tool_type: "native"`; zero `tool_type: "mcp"`. Hypothesis under test --------------------- The production caller pins `ghcr.io/github/github-mcp-server@sha256: e3816a4...` which is v1.0.4 (released 2026-05-11). v1.0.0 (2026-04-16) was a major-version bump that upgraded modelcontextprotocol/go-sdk to v1.5.0 and reorganized toolsets, introducing feature-flagged tool groups gated by `dynamicToolsets` (visible in our server logs as `dynamicToolsets=false`). Upstream's example pins to `:v0.27.0` (Feb 2026 era), predating that reorganization. Our bump to v1.0.x was Harper-side, well-intended ("newer is better"), but appears to be the regression. This workflow tests the simplest variable: same gemini-cli, same action, same prompt, same model — only the MCP server version differs. If MCP propagates with `:v0.27.0`, the root cause is the v1.0.x server. Approach -------- Mirrors upstream's tip-of-main pr-review example as closely as practical, with minimal Harper-specific deviations: - Auth: skip App-token-mint step. Default GITHUB_TOKEN only. Single- repo debug branch; only org members push. Backport will reintroduce the gate via the reusable. - Model: hardcoded gemini-3-flash-preview (our known-good choice). - Layered Harper scope: NOT applied. ADDITIONAL_CONTEXT is empty. Prove MCP propagates first; reintroduce scope after. - tools.core: [] (empty) — forces all I/O through MCP. Cleanest signal: if MCP is broken, the agent has nothing. - MCP server pin: TAG `:v0.27.0` instead of `@sha256:...` digest. SHA-pinning discipline waived for the spike; will digest-pin on backport. - Visibility: failure-path artifact upload + summary tail (same as _gemini-review.yml post-#36, with the pipefail/SIGPIPE bug fixed via `set -eu` instead of `set -euo pipefail` plus `|| true` around the offending pipe). Iteration model --------------- Edit this file → push → wait ~90s. No PR ceremony on ai-review-prompts. Runs in parallel with the production caller via a distinct concurrency group, so we can A/B compare side-by-side on the same PR. When the working config is identified, backport to `HarperFast/ai-review-prompts/_gemini-review.yml` in one focused PR. Delete this file after backport.

Previous debug run (25874573129) confirmed MCP tools register correctly under server v0.27.0 (3 tools discovered: pull_request_read, pull_request_review_write, add_comment_to_pending_review). But the model failed to use them — INVALID_STREAM error with 0 output tokens across 4 API requests. Mute model. The extension's GEMINI.md loaded into agent context lists two top-level commands: /code-review - "When a user requests that code changes be reviewed" /pr-review - "When a user requests that pr to be reviewed, look at user provided input '{{args}}' and environment variables to see if $REPOSITORY, $PULL_REQUEST_NUMBER, and $ADDITIONAL_CONTEXT are set." Notably `/pr-review` (NOT `/pr-code-review`) is what the extension documents as the entry point. `/pr-code-review` exists as a file in the extension's commands/ directory but isn't mentioned in GEMINI.md. Possibility: `/pr-code-review` is internal / older / deprecated, and `/pr-review` is the intended public command (may be a routing alias). Cheap test. If `/pr-review` doesn't resolve, gemini-cli will fail clearly with a "command not found" error. If it does resolve and behaves better than `/pr-code-review`, we've found the right command.

After 3 MCP-debug iterations proving it's not happening: - Run 25869995910: server v1.0.4, 0 MCP tools propagated - Run 25874573129: server v0.27.0, 3 MCP tools registered, model silent (INVALID_STREAM, 0 output tokens) - Run 25879095909: server v0.27.0 + /pr-review slash command, model hallucinated a full review of fictional google/oauth#42 using <call:...> tags as prose (0 actual tool calls) Across all 3, the model demonstrated a strong preference for `gh` CLI patterns even when shell was denied. It tried `gh pr view`, `gh pr diff`, `printenv`, etc. in every run. Pragmatic pivot: stop fighting the model. - Drop mcpServers entirely - Drop the upstream code-review extension entirely - No custom slash command file; inline prompt only - tools.core: gh, git, cat, echo, grep, head, tail - Agent reads diff via `gh pr diff`, posts via `gh pr review --comment --body` Loses true inline comments (would require MCP). Findings reference file:line as plain-text `**File:** path:N` in the body. Goal of this iteration: produce A working Gemini review on PR #83 to compare against Claude. Architecture refinement comes later.

heskew requested a review from a team as a code owner May 13, 2026 20:00

heskew mentioned this pull request May 13, 2026

fix(workflow): enable Gemini debug + drop shell allowlist + raise turn budget for MCP diagnosis HarperFast/ai-review-prompts#32

Merged

5 tasks

heskew force-pushed the ci/bump-ai-review-prompts-post-30 branch from 789b725 to 0f1ada8 Compare May 13, 2026 20:57

heskew changed the title ~~ci: bump AI-review pins to ai-review-prompts@7c8aae0a (Gemini MCP pivot)~~ ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration) May 13, 2026

heskew and others added 7 commits May 13, 2026 16:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration)#83

ci: bump AI-review pins to ai-review-prompts@d446b4c6 (Gemini MCP pivot + debug iteration)#83
heskew wants to merge 8 commits into
mainfrom
ci/bump-ai-review-prompts-post-30

heskew commented May 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

heskew commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What gemini-review.yml picks up

What claude-review.yml picks up

Expected first-run quirk

What we're watching for in the next Gemini run

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

heskew commented May 13, 2026 •

edited

Loading

What `gemini-review.yml` picks up

What `claude-review.yml` picks up