
feat(agent-session): ✨ Add run-level auto-retry for retryable provider errors #186

Merged: jorben merged 4 commits into master from feat/api-retry-support on May 16, 2026
Conversation

@HayWolf (Contributor) commented May 16, 2026

Summary

  • Add run-level auto-retry (up to 5 times, exponential back-off 2s→32s) for retryable provider HTTP errors (408/429/500/502/503/504)
  • Show i18n-aware "Server error, retry X of 5…" in the thinking placeholder during retries
  • After exhausting all retries, transition to failed state with the final error displayed

Test Plan

  • Verify retryable errors (429/500/502) trigger automatic retry with thinking placeholder update
  • Verify non-retryable errors (401/403) fail immediately without retry
  • Verify retry exhaustion (5×) shows final error banner
  • Verify user can cancel during retry delay gap
  • Verify i18n strings display correctly in zh-CN and en
  • cargo test --locked, npm run typecheck, npm run test:unit pass

🤖 Generated with TiyCode

…r errors

Automatically retry up to 5 times with exponential back-off (2s–32s)
when a thread run fails with HTTP 408/429/500/502/503/504 errors.
During retries the thinking placeholder shows an i18n-aware
"Server error, retry X of 5…" message. After exhausting all retries
the run transitions to failed with the final error displayed.
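The back-off schedule described above can be sketched as follows. This is a minimal illustration, not the PR's code: the constant names here are hypothetical, though the doubling formula matches the `PROVIDER_ERROR_RETRY_BASE_DELAY` expression quoted later in the review.

```rust
use std::time::Duration;

// Illustrative names; the PR uses PROVIDER_ERROR_RETRY_BASE_DELAY with
// the same doubling scheme: base * 2^(attempt - 1).
const MAX_ATTEMPTS: u32 = 5;
const BASE_DELAY: Duration = Duration::from_secs(2);

/// Delay before the given 1-based attempt: 2s, 4s, 8s, 16s, 32s.
fn retry_delay(attempt: u32) -> Duration {
    BASE_DELAY * (1u32 << (attempt - 1))
}

fn main() {
    for attempt in 1..=MAX_ATTEMPTS {
        println!("attempt {attempt}: {:?}", retry_delay(attempt));
    }
}
```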

Key changes:
- spawn_provider_error_retry_loop monitors event loop and retries
  via start_retry_run with a fresh run_id and clean message history
- is_retryable_provider_error matches HTTP status codes in error strings
- discard_failed_messages_for_run prevents history pollution on retry
- retry_cancellations set allows cancel during the retry delay gap
- RunRetrying excluded from should_complete_reasoning_for_event
- Frontend onRawEvent updates thinking placeholder label on run_retrying
- i18n keys: run.retrying (zh-CN / en)
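The status-code matching in `is_retryable_provider_error` can be approximated like this. It is a sketch under the assumption of plain substring matching; the PR's actual implementation may be stricter (for example, word-boundary checks so "429" inside a larger number does not match).

```rust
/// Sketch of matching retryable HTTP status codes inside a provider
/// error string. Auth errors (401/403) deliberately do not match.
fn is_retryable_provider_error(error: &str) -> bool {
    const RETRYABLE: [&str; 6] = ["408", "429", "500", "502", "503", "504"];
    RETRYABLE.iter().any(|code| error.contains(code))
}

fn main() {
    assert!(is_retryable_provider_error("HTTP 429 Too Many Requests"));
    assert!(!is_retryable_provider_error("HTTP 401 Unauthorized"));
    println!("ok");
}
```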
github-actions bot commented May 16, 2026

AI Code Review Summary

PR: #186 (feat(agent-session): ✨ Add run-level auto-retry for retryable provider errors)
Preferred language: English

Overall Assessment

Detected 3 actionable findings; prioritize CRITICAL/HIGH items before merge.

Major Findings by Severity

  • HIGH (2)
    • src-tauri/src/persistence/repo/message_repo.rs:413 - Missing integration test for discard_failed_messages_for_run
    • src-tauri/src/persistence/repo/run_repo.rs:257 - Missing integration test for find_status_by_id
  • MEDIUM (1)
    • src-tauri/src/persistence/repo/message_repo.rs:418 - Hard-coded SQL status strings lack test validation

Actionable Suggestions

  • Add Rust integration tests for discard_failed_messages_for_run and find_status_by_id.
  • Confirm npm run typecheck passes for the TSX change.
  • Verify that a missing translation key for run.retrying does not break the UI.
  • Add Rust integration tests for discard_failed_messages_for_run covering mixed-status rows and correct rows_affected.
  • Add Rust integration tests for find_status_by_id covering existing and missing run IDs.
  • Manually verify thread surface placeholder transitions during run_retrying followed by run_started.
  • If status string constants exist elsewhere, reference them in tests to guard against literal drift.
  • Replace the O(N) runs.values().any scan with a HashMap<thread_id, run_id> index to make existence checks O(1).
  • Consider batching or transactionalizing the retry bookkeeping DB calls (status check + discard failed messages) to reduce round-trips per retry attempt.
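The suggested O(1) index from the list above could look like the following. This is a hypothetical sketch, not code from the PR: the struct and field names are invented, and in the real manager the map would live alongside `active_runs` behind the same mutex and be updated in lockstep with it.

```rust
use std::collections::HashMap;

// Hypothetical thread_id -> run_id index replacing the O(N)
// runs.values().any(...) scan for "does this thread have an active run?".
#[derive(Default)]
struct ActiveRunIndex {
    by_thread: HashMap<String, String>,
}

impl ActiveRunIndex {
    /// Returns false (and inserts nothing) if the thread already has
    /// an active run; otherwise records the new run in O(1).
    fn try_insert(&mut self, thread_id: &str, run_id: &str) -> bool {
        if self.by_thread.contains_key(thread_id) {
            return false;
        }
        self.by_thread.insert(thread_id.to_string(), run_id.to_string());
        true
    }

    /// Must be called when the run ends, or the index drifts.
    fn remove(&mut self, thread_id: &str) {
        self.by_thread.remove(thread_id);
    }
}

fn main() {
    let mut idx = ActiveRunIndex::default();
    assert!(idx.try_insert("thread-1", "run-a"));
    assert!(!idx.try_insert("thread-1", "run-b"));
    idx.remove("thread-1");
    assert!(idx.try_insert("thread-1", "run-b"));
}
```

The trade-off is the bookkeeping burden: insert and remove must stay in sync with the main run map, which is why the review frames this as a suggestion rather than a required change.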

Potential Risks

  • Failed-message discard filters by status; concurrent message updates during retry could race, though SQLite serializes writes.
  • The retrying placeholder may overlap with other thinking states if not managed carefully by the showThinkingPlaceholder helper.
  • Persistence helpers may regress without CI coverage.
  • UI placeholder state may stick if event sequence assumptions are wrong.
  • Increased database load during provider outages due to per-retry bookkeeping queries.
  • Linear mutex scan may become a bottleneck if the manager handles a large number of concurrent active runs.

Test Suggestions

  • Add backend integration tests for the two new repo functions.
  • Add UI/component test or manual verification for run_retrying placeholder lifecycle.
  • Backend integration test in src-tauri/tests/agent_run.rs or a message/repo-specific module for discard_failed_messages_for_run.
  • Backend integration test for find_status_by_id in the corresponding run repo test module.
  • Frontend manual test plan: trigger provider error retry and confirm placeholder updates from run_retrying to run_started cleanly.
  • Load-test run start/retry with many concurrent threads to profile active_runs lock hold times.
  • Verify that retry cancellation signals and stale flag cleanup do not leak memory under repeated start/cancel cycles.
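The cancel-during-back-off race that several of these suggestions target can be modeled compactly. The PR itself uses an async `retry_cancellations` set with tokio; the sketch below expresses the same idea with std channels only, purely to make the expected behavior testable, and is not the PR's implementation.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Cancellable back-off delay: wait out the full delay unless a cancel
// signal arrives first. Returns true if cancelled before the delay
// elapsed (the retry loop should then stop instead of retrying).
fn sleep_or_cancel(cancel_rx: &mpsc::Receiver<()>, delay: Duration) -> bool {
    // recv_timeout returns Ok(()) only if a cancel was sent in time.
    cancel_rx.recv_timeout(delay).is_ok()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || tx.send(()).unwrap());
    // Cancel arrives almost immediately, well before the 2s delay.
    assert!(sleep_or_cancel(&rx, Duration::from_secs(2)));

    let (_tx2, rx2) = mpsc::channel::<()>();
    // No cancel sent: the short delay elapses and the loop retries.
    assert!(!sleep_or_cancel(&rx2, Duration::from_millis(10)));
}
```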

File-Level Coverage Notes

  • src-tauri/src/persistence/repo/message_repo.rs: A new run-level retry cleanup helper is added but is not covered by Rust integration tests. (Per project guidance, persistence changes must include backend integration tests.)
  • src-tauri/src/persistence/repo/run_repo.rs: A new status lookup query is added with no test validation against the database.
  • src/modules/workbench-shell/ui/runtime-thread-surface.tsx: New run_retrying event branch was added to the thread surface handler. Automated UI tests are not mandated by project guidance, but the retry→start event transition is an observable boundary condition. (Project guidance allows manual verification for UI changes; however, the ordering boundary between run_retrying and run_started should be explicitly checked.)
  • src/i18n/locales/en.ts: Safe addition of a new translation key with expected interpolation placeholders.
  • src/i18n/locales/zh-CN.ts: Safe addition of the Chinese translation for the new key. (The Chinese string uses only {{attempt}} and omits {{maxAttempts}} present in the English source. This is acceptable if the i18n engine allows unused passed variables; confirm the runtime interpolator does not throw or warn when a referenced variable is missing.)
  • src-tauri/src/core/agent_run_manager.rs: needs_attention (The retry loop correctly reuses the frontend broadcast sender and limits retries to 5, keeping per-run task overhead bounded.)
  • src-tauri/src/core/agent_run_event_handler.rs: ok (Trivial match-arm addition for RunRetrying; no performance impact.)

Inline Downgraded Items (processed but not inline)

  • None

Coverage Status

  • Target files: 7
  • Covered files: 7
  • Uncovered files: 0
  • No-patch/binary covered as file-level: 0
  • Findings with unknown confidence (N/A): 0

Uncovered list:

  • None

No-patch covered list:

  • None

Runtime/Budget

  • Rounds used: 2/4
  • Planned batches: 4
  • Executed batches: 4
  • Sub-agent runs: 8
  • Planner calls: 2
  • Reviewer calls: 13
  • Model calls: 15/64
  • Structured-output summary-only degradation: NO

github-actions bot left a comment


Automated PR review completed.

  • Findings kept: 7
  • Findings with unknown confidence: 0
  • Inline comments attempted: 6
  • Target files: 7
  • Covered files: 7
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

/// Mark all `failed` messages for the given run as `discarded` so they are
/// excluded from subsequent `list_since_last_reset` queries and do not pollute
/// the LLM history on automatic run-level retries.
pub async fn discard_failed_messages_for_run(

This comment was marked as outdated.

"retrying run after retryable provider error"
);

sleep(Duration::from_millis(delay_ms)).await;

This comment was marked as outdated.


/// Returns `true` when the error string looks like a retryable HTTP server /
/// rate-limit error. Auth errors (401, 403) are intentionally excluded.
fn is_retryable_provider_error(error: &str) -> bool {

This comment was marked as outdated.

// Discard any failed streaming messages left by the previous
// run so they don't pollute the LLM history on retry.
if let Err(e) =
message_repo::discard_failed_messages_for_run(&manager.pool, &last_run_id).await

This comment was marked as outdated.

last_run_id = new_run_id;
event_loop_handle = handle;
}
Err(e) => {

This comment was marked as outdated.

}

/// Returns the status of a specific run by ID.
pub async fn find_status_by_id(pool: &SqlitePool, id: &str) -> Result<Option<String>, AppError> {

This comment was marked as outdated.

jorben added 2 commits on May 17, 2026 at 00:26
Address PR review feedback:
- Re-check retry_cancellations after sleep() to honor cancel
  during back-off delay (#2)
- Abort retry loop when discard_failed_messages_for_run fails
  to prevent polluted LLM history (#4)
- Emit RunFailed via frontend_tx when start_retry_run fails so
  the frontend exits the retrying placeholder state (#5)
github-actions bot left a comment


Automated PR review completed.

  • Findings kept: 8
  • Findings with unknown confidence: 0
  • Inline comments attempted: 7
  • Target files: 7
  • Covered files: 7
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.


/// Returns `true` when the error string looks like a retryable HTTP server /
/// rate-limit error. Auth errors (401, 403) are intentionally excluded.
fn is_retryable_provider_error(error: &str) -> bool {

This comment was marked as outdated.

// Also check if a cancellation was requested during the gap
// between runs (i.e. the user clicked cancel while no ActiveRun
// existed). The flag is set by cancel_run_if_active.
{

This comment was marked as outdated.

!matches!(
    event,
    ThreadStreamEvent::RunStarted { .. }
        | ThreadStreamEvent::RunRetrying { .. }

This comment was marked as outdated.

else {
// No active run — but a retry loop may be sleeping between
// attempts. Signal it to stop before the next retry.
let mut cancels = self.retry_cancellations.lock().await;

This comment was marked as outdated.

}

let delay_ms =
    PROVIDER_ERROR_RETRY_BASE_DELAY.as_millis() as u64 * (1u64 << (attempt - 1));

This comment was marked as outdated.

// Check whether the run was cancelled — if so, don't retry.
// (cancellation_requested is cleared after remove_active_run,
// so we check the DB run status instead)
if let Ok(Some(status)) =

This comment was marked as outdated.


// Exhausted all retries — the last RunFailed has already been emitted
// by the event loop, so the frontend will show the error.
tracing::warn!(

This comment was marked as outdated.

github-actions bot left a comment


Automated PR review completed.

  • Findings kept: 7
  • Findings with unknown confidence: 0
  • Inline comments attempted: 7
  • Target files: 7
  • Covered files: 7
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

});
}

/// Monitors the event loop of the current run and, if it ends with a

This comment was marked as outdated.


/// Returns `true` when the error string looks like a retryable HTTP server /
/// rate-limit error. Auth errors (401, 403) are intentionally excluded.
fn is_retryable_provider_error(error: &str) -> bool {

This comment was marked as outdated.

else {
// No active run — but a retry loop may be sleeping between
// attempts. Signal it to stop before the next retry.
let mut cancels = self.retry_cancellations.lock().await;

This comment was marked as outdated.

// Insert new active run entry with the shared broadcast channel.
{
let mut runs = self.active_runs.lock().await;
if runs.values().any(|r| r.thread_id == thread_id) {

This comment was marked as outdated.

/// Mark all `failed` messages for the given run as `discarded` so they are
/// excluded from subsequent `list_since_last_reset` queries and do not pollute
/// the LLM history on automatic run-level retries.
pub async fn discard_failed_messages_for_run(

This comment was marked as outdated.


// Show retry progress in the thinking placeholder when the backend
// is automatically retrying after a retryable provider error.
if (event.type === "run_retrying") {

This comment was marked as outdated.

Ok(result.rows_affected())
}

/// Returns the status of a specific run by ID.

This comment was marked as outdated.

Clear thread_id from retry_cancellations in start_run_with_options
before inserting the new ActiveRun. Prevents a leftover cancel
signal from a previous run from erroneously aborting retries on
a completely new run for the same thread.
jorben merged commit 0a463e8 into master on May 16, 2026
4 checks passed
jorben deleted the feat/api-retry-support branch on May 16, 2026 at 16:47
github-actions bot left a comment


Automated PR review completed.

  • Findings kept: 3
  • Findings with unknown confidence: 0
  • Inline comments attempted: 3
  • Target files: 7
  • Covered files: 7
  • Uncovered files: 0
    See the summary comment for detailed analysis and coverage details.

/// Mark all `failed` messages for the given run as `discarded` so they are
/// excluded from subsequent `list_since_last_reset` queries and do not pollute
/// the LLM history on automatic run-level retries.
pub async fn discard_failed_messages_for_run(


[HIGH] Missing integration test for discard_failed_messages_for_run

The new SQL UPDATE in discard_failed_messages_for_run is not exercised in src-tauri/tests/, violating the project rule to add backend integration tests for persistence changes.

Suggestion: Add an integration test that seeds messages with mixed statuses across multiple runs, calls discard_failed_messages_for_run, and asserts only failed rows for the target run_id are updated and the correct rows_affected u64 is returned.

Risk: Silent regression in retry-scoped cleanup logic; SQL drift or typo in status literals will not be caught by CI.

Confidence: 0.95

[From SubAgent: testing]
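The expectation this finding describes can be made concrete with a pure-Rust simulation of the UPDATE's intended semantics: only `failed` rows belonging to the target run flip to `discarded`, everything else is untouched, and the returned count mirrors `rows_affected()`. This is an illustration only; the suggested integration test would run the real SQL against a seeded SQLite database.

```rust
// In-memory stand-in for rows of the `messages` table.
#[derive(Debug, PartialEq, Clone)]
struct MessageRow {
    run_id: String,
    status: String,
}

/// Simulates the UPDATE: flip `failed` -> `discarded` for one run only,
/// returning the number of rows changed (mirrors rows_affected()).
fn discard_failed_for_run(rows: &mut [MessageRow], run_id: &str) -> u64 {
    let mut affected = 0;
    for row in rows.iter_mut() {
        if row.run_id == run_id && row.status == "failed" {
            row.status = "discarded".to_string();
            affected += 1;
        }
    }
    affected
}

fn main() {
    let mut rows = vec![
        MessageRow { run_id: "r1".into(), status: "failed".into() },
        MessageRow { run_id: "r1".into(), status: "completed".into() },
        MessageRow { run_id: "r2".into(), status: "failed".into() },
    ];
    assert_eq!(discard_failed_for_run(&mut rows, "r1"), 1);
    assert_eq!(rows[0].status, "discarded"); // target run, failed: flipped
    assert_eq!(rows[1].status, "completed"); // target run, not failed: kept
    assert_eq!(rows[2].status, "failed");    // other run: kept
}
```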

}

/// Returns the status of a specific run by ID.
pub async fn find_status_by_id(pool: &SqlitePool, id: &str) -> Result<Option<String>, AppError> {


[HIGH] Missing integration test for find_status_by_id

The new query method fetch_optional on thread_runs lacks backend integration test coverage.

Suggestion: Add integration tests asserting Some(status) for an existing run ID and None for a non-existent or empty ID.

Risk: sqlx mapping or schema changes could break Option deserialization without failing CI.

Confidence: 0.95

[From SubAgent: testing]

run_id: &str,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "UPDATE messages SET status = 'discarded' \


[MEDIUM] Hard-coded SQL status strings lack test validation

The UPDATE uses literal strings 'discarded' and 'failed' that must align with application enums and schema constraints, but no tests verify this alignment.

Suggestion: Write tests that insert rows using the same Rust status constants consumed by the rest of the app, then assert the function matches exactly those values after the update.

Risk: If the canonical status string changes in one module but not here, the query will silently affect zero rows and hide retry failures.

Confidence: 0.85

[From SubAgent: testing]
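One way to guard against the literal drift this finding flags is to route both the query builder and its tests through one set of status constants. The sketch below is hypothetical: the PR hard-codes the literals directly in the SQL string, and these constant names are invented for illustration.

```rust
// Hypothetical shared status constants, consumed by both application
// code and tests so a rename in one place fails loudly everywhere.
pub const STATUS_FAILED: &str = "failed";
pub const STATUS_DISCARDED: &str = "discarded";

/// Builds the discard UPDATE from the shared constants instead of
/// embedding raw literals in the SQL string.
fn discard_failed_sql() -> String {
    format!(
        "UPDATE messages SET status = '{STATUS_DISCARDED}' \
         WHERE run_id = ?1 AND status = '{STATUS_FAILED}'"
    )
}

fn main() {
    let sql = discard_failed_sql();
    assert!(sql.contains("'discarded'"));
    assert!(sql.contains("'failed'"));
    println!("{sql}");
}
```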
