feat(agent-session): ✨ Add run-level auto-retry for retryable provider errors #186
Conversation
Automatically retry up to 5 times with exponential back-off (2s–32s) when a thread run fails with an HTTP 408/429/500/502/503/504 error. During retries the thinking placeholder shows an i18n-aware "Server error, retry X of 5…" message. After exhausting all retries the run transitions to failed with the final error displayed.

Key changes (a sketch of the back-off loop follows the list):
- spawn_provider_error_retry_loop monitors the event loop and retries via start_retry_run with a fresh run_id and clean message history
- is_retryable_provider_error matches HTTP status codes in error strings
- discard_failed_messages_for_run prevents history pollution on retry
- retry_cancellations set allows cancel during the retry delay gap
- RunRetrying excluded from should_complete_reasoning_for_event
- Frontend onRawEvent updates the thinking placeholder label on run_retrying
- i18n keys: run.retrying (zh-CN / en)
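A minimal sketch of the retry schedule described above, assuming tokio and the PR's PROVIDER_ERROR_RETRY_BASE_DELAY constant; the run_once helper and the max-attempts constant here are hypothetical stand-ins for the real spawn_provider_error_retry_loop / start_retry_run machinery:

```rust
use std::time::Duration;
use tokio::time::sleep;

const PROVIDER_ERROR_RETRY_MAX_ATTEMPTS: u32 = 5;
const PROVIDER_ERROR_RETRY_BASE_DELAY: Duration = Duration::from_secs(2);

/// Hypothetical stand-in for one retried run; the real code restarts the
/// event loop via `start_retry_run` with a fresh run_id.
async fn run_once() -> Result<(), String> {
    Err("HTTP 503 Service Unavailable".into())
}

#[tokio::main]
async fn main() {
    for attempt in 1..=PROVIDER_ERROR_RETRY_MAX_ATTEMPTS {
        // Delay doubles per attempt: 2 s, 4 s, 8 s, 16 s, 32 s.
        let delay_ms =
            PROVIDER_ERROR_RETRY_BASE_DELAY.as_millis() as u64 * (1u64 << (attempt - 1));
        sleep(Duration::from_millis(delay_ms)).await;
        match run_once().await {
            Ok(()) => return,
            Err(e) => eprintln!("retry {attempt} of 5 failed: {e}"),
        }
    }
    // All retries exhausted: the run transitions to `failed`.
}
```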
AI Code Review Summary

PR: #186 (feat(agent-session): ✨ Add run-level auto-retry for retryable provider errors)

Overall Assessment: detected 3 actionable findings; prioritize CRITICAL/HIGH before merge. The findings by severity appear inline below.
| "retrying run after retryable provider error" | ||
| ); | ||
|
|
||
| sleep(Duration::from_millis(delay_ms)).await; |
```rust
/// Returns `true` when the error string looks like a retryable HTTP server /
/// rate-limit error. Auth errors (401, 403) are intentionally excluded.
fn is_retryable_provider_error(error: &str) -> bool {
```
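The matcher body is not shown in this view; a plausible sketch, assuming the status codes from the PR description are matched as substrings of the provider error string:

```rust
/// Sketch only: substring match on the retryable status codes named in
/// the PR description (408/429/500/502/503/504).
fn is_retryable_provider_error(error: &str) -> bool {
    ["408", "429", "500", "502", "503", "504"]
        .iter()
        .any(|code| error.contains(code))
}
```

A bare substring match can false-positive on unrelated digit runs, so the real implementation may anchor on patterns such as "HTTP 503" or "status 429" instead.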
```rust
// Discard any failed streaming messages left by the previous
// run so they don't pollute the LLM history on retry.
if let Err(e) =
    message_repo::discard_failed_messages_for_run(&manager.pool, &last_run_id).await
```
```rust
        last_run_id = new_run_id;
        event_loop_handle = handle;
    }
    Err(e) => {
```
Address PR review feedback:
- Re-check retry_cancellations after sleep() to honor cancel during the back-off delay (#2)
- Abort the retry loop when discard_failed_messages_for_run fails, to prevent polluted LLM history (#4)
- Emit RunFailed via frontend_tx when start_retry_run fails so the frontend exits the retrying placeholder state (#5)
```rust
// Also check if a cancellation was requested during the gap
// between runs (i.e. the user clicked cancel while no ActiveRun
// existed). The flag is set by cancel_run_if_active.
{
```
```rust
!matches!(
    event,
    ThreadStreamEvent::RunStarted { .. }
        | ThreadStreamEvent::RunRetrying { .. }
```
```rust
else {
    // No active run — but a retry loop may be sleeping between
    // attempts. Signal it to stop before the next retry.
    let mut cancels = self.retry_cancellations.lock().await;
```
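The retry loop presumably consumes this signal on both sides of the back-off delay (per the "re-check after sleep()" follow-up commit above). A minimal sketch, assuming retry_cancellations is a tokio Mutex over a HashSet of thread IDs; the helper name is hypothetical:

```rust
use std::collections::HashSet;
use tokio::sync::Mutex;

/// Checks and clears a pending cancel signal for `thread_id`.
/// Returns `true` when a cancel was requested during the retry gap.
async fn take_retry_cancel(
    retry_cancellations: &Mutex<HashSet<String>>,
    thread_id: &str,
) -> bool {
    retry_cancellations.lock().await.remove(thread_id)
}
```

Calling this before the sleep and again after it means a cancel issued at any point in the delay window still aborts the retry.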
```rust
let delay_ms =
    PROVIDER_ERROR_RETRY_BASE_DELAY.as_millis() as u64 * (1u64 << (attempt - 1));
```
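With the 2 s base delay, the shift factor 1u64 << (attempt - 1) evaluates to 1, 2, 4, 8, 16 for attempts 1 through 5, so the sleeps are 2 s, 4 s, 8 s, 16 s and 32 s, matching the advertised 2s–32s window.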
```rust
// Check whether the run was cancelled — if so, don't retry.
// (cancellation_requested is cleared after remove_active_run,
// so we check the DB run status instead)
if let Ok(Some(status)) =
```
```rust
// Exhausted all retries — the last RunFailed has already been emitted
// by the event loop, so the frontend will show the error.
tracing::warn!(
```
```rust
/// Monitors the event loop of the current run and, if it ends with a
/// retryable provider error, retries it via `start_retry_run` with a
/// fresh run ID.
```
```rust
// Insert new active run entry with the shared broadcast channel.
{
    let mut runs = self.active_runs.lock().await;
    if runs.values().any(|r| r.thread_id == thread_id) {
```
```ts
// Show retry progress in the thinking placeholder when the backend
// is automatically retrying after a retryable provider error.
if (event.type === "run_retrying") {
```
Clear thread_id from retry_cancellations in start_run_with_options before inserting the new ActiveRun. Prevents a leftover cancel signal from a previous run from erroneously aborting retries on a completely new run for the same thread.
```rust
/// Mark all `failed` messages for the given run as `discarded` so they are
/// excluded from subsequent `list_since_last_reset` queries and do not pollute
/// the LLM history on automatic run-level retries.
pub async fn discard_failed_messages_for_run(
```
[HIGH] Missing integration test for discard_failed_messages_for_run
The new SQL UPDATE in discard_failed_messages_for_run is not exercised in src-tauri/tests/, violating the project rule to add backend integration tests for persistence changes.
Suggestion: Add an integration test that seeds messages with mixed statuses across multiple runs, calls discard_failed_messages_for_run, and asserts that only failed rows for the target run_id are updated and that the returned rows_affected count (u64) is correct.
Risk: Silent regression in retry-scoped cleanup logic; SQL drift or typo in status literals will not be caught by CI.
Confidence: 0.95
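A sketch of such a test, assuming an in-memory SQLite pool and a simplified messages schema (the column names here are assumptions based on the snippets above):

```rust
use sqlx::SqlitePool;

#[tokio::test]
async fn discard_failed_messages_only_touches_target_run() {
    let pool = SqlitePool::connect("sqlite::memory:").await.unwrap();
    // Assumed minimal schema; the real table has more columns.
    sqlx::query("CREATE TABLE messages (id TEXT PRIMARY KEY, run_id TEXT, status TEXT)")
        .execute(&pool)
        .await
        .unwrap();
    // Seed mixed statuses across two runs.
    for (id, run, status) in [
        ("m1", "run-a", "failed"),
        ("m2", "run-a", "completed"),
        ("m3", "run-b", "failed"),
    ] {
        sqlx::query("INSERT INTO messages (id, run_id, status) VALUES (?, ?, ?)")
            .bind(id)
            .bind(run)
            .bind(status)
            .execute(&pool)
            .await
            .unwrap();
    }

    let affected = message_repo::discard_failed_messages_for_run(&pool, "run-a")
        .await
        .unwrap();

    // Only run-a's failed row is rewritten; run-b is untouched.
    assert_eq!(affected, 1);
    let discarded: i64 =
        sqlx::query_scalar("SELECT COUNT(*) FROM messages WHERE status = 'discarded'")
            .fetch_one(&pool)
            .await
            .unwrap();
    assert_eq!(discarded, 1);
}
```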
```rust
/// Returns the status of a specific run by ID.
pub async fn find_status_by_id(pool: &SqlitePool, id: &str) -> Result<Option<String>, AppError> {
```
[HIGH] Missing integration test for find_status_by_id
The new fetch_optional query against thread_runs lacks backend integration test coverage.
Suggestion: Add integration tests asserting Some(status) for an existing run ID and None for a non-existent or empty ID.
Risk: sqlx mapping or schema changes could break Option deserialization without failing CI.
Confidence: 0.95
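A corresponding hedged sketch; the thread_runs columns and the run_repo module path are assumptions:

```rust
use sqlx::SqlitePool;

#[tokio::test]
async fn find_status_by_id_returns_some_then_none() {
    let pool = SqlitePool::connect("sqlite::memory:").await.unwrap();
    sqlx::query("CREATE TABLE thread_runs (id TEXT PRIMARY KEY, status TEXT)")
        .execute(&pool)
        .await
        .unwrap();
    sqlx::query("INSERT INTO thread_runs (id, status) VALUES ('run-1', 'failed')")
        .execute(&pool)
        .await
        .unwrap();

    // Existing run: Some(status); unknown ID: None.
    assert_eq!(
        run_repo::find_status_by_id(&pool, "run-1").await.unwrap(),
        Some("failed".to_string())
    );
    assert_eq!(run_repo::find_status_by_id(&pool, "missing").await.unwrap(), None);
}
```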
```rust
    run_id: &str,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "UPDATE messages SET status = 'discarded' \
```
[MEDIUM] Hard-coded SQL status strings lack test validation
The UPDATE uses literal strings 'discarded' and 'failed' that must align with application enums and schema constraints, but no tests verify this alignment.
Suggestion: Write tests that insert rows using the same Rust status constants consumed by the rest of the app, then assert the function matches exactly those values after the update.
Risk: If the canonical status string changes in one module but not here, the query will silently affect zero rows and hide retry failures.
Confidence: 0.85
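One way to address the drift risk, sketched with hypothetical constant names (the PR currently inlines the literals, and the WHERE clause here is an assumption since the snippet cuts off mid-statement):

```rust
/// Hypothetical shared constants for message statuses.
pub const STATUS_FAILED: &str = "failed";
pub const STATUS_DISCARDED: &str = "discarded";

fn discard_failed_sql() -> String {
    // Built from the same constants the writers use, so a renamed status
    // fails tests everywhere instead of silently matching zero rows.
    format!(
        "UPDATE messages SET status = '{STATUS_DISCARDED}' \
         WHERE run_id = ?1 AND status = '{STATUS_FAILED}'"
    )
}
```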
Test Plan: cargo test --locked, npm run typecheck, npm run test:unit (all pass)

🤖 Generated with TiyCode