feat(agent-session): ✨ Add run-level auto-retry for retryable provider errors #186
Conversation
Automatically retry up to 5 times with exponential back-off (2s–32s) when a thread run fails with an HTTP 408/429/500/502/503/504 error. During retries the thinking placeholder shows an i18n-aware "Server error, retry X of 5…" message. After exhausting all retries the run transitions to failed with the final error displayed.

Key changes (a sketch of the back-off loop follows the list):
- spawn_provider_error_retry_loop monitors the event loop and retries via start_retry_run with a fresh run_id and clean message history
- is_retryable_provider_error matches HTTP status codes in error strings
- discard_failed_messages_for_run prevents history pollution on retry
- retry_cancellations set allows cancel during the retry delay gap
- RunRetrying excluded from should_complete_reasoning_for_event
- Frontend onRawEvent updates the thinking placeholder label on run_retrying
- i18n keys: run.retrying (zh-CN / en)
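A minimal sketch of the retry schedule described above, assuming tokio and the PR's PROVIDER_ERROR_RETRY_BASE_DELAY constant; the run_once helper and the max-attempts constant here are hypothetical stand-ins for the real spawn_provider_error_retry_loop / start_retry_run machinery:

```rust
use std::time::Duration;
use tokio::time::sleep;

const PROVIDER_ERROR_RETRY_MAX_ATTEMPTS: u32 = 5;
const PROVIDER_ERROR_RETRY_BASE_DELAY: Duration = Duration::from_secs(2);

/// Hypothetical stand-in for one retried run; the real code restarts the
/// event loop via `start_retry_run` with a fresh run_id.
async fn run_once() -> Result<(), String> {
    Err("HTTP 503 Service Unavailable".into())
}

#[tokio::main]
async fn main() {
    for attempt in 1..=PROVIDER_ERROR_RETRY_MAX_ATTEMPTS {
        // Delay doubles per attempt: 2 s, 4 s, 8 s, 16 s, 32 s.
        let delay_ms =
            PROVIDER_ERROR_RETRY_BASE_DELAY.as_millis() as u64 * (1u64 << (attempt - 1));
        sleep(Duration::from_millis(delay_ms)).await;
        match run_once().await {
            Ok(()) => return,
            Err(e) => eprintln!("retry {attempt} of 5 failed: {e}"),
        }
    }
    // All retries exhausted: the run transitions to `failed`.
}
```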
AI Code Review Summary

PR: #186 (feat(agent-session): ✨ Add run-level auto-retry for retryable provider errors)

Overall Assessment: detected 3 actionable findings; prioritize CRITICAL/HIGH before merge. The findings by severity appear inline below.
| "retrying run after retryable provider error" | ||
| ); | ||
|
|
||
| sleep(Duration::from_millis(delay_ms)).await; |
```rust
/// Returns `true` when the error string looks like a retryable HTTP server /
/// rate-limit error. Auth errors (401, 403) are intentionally excluded.
fn is_retryable_provider_error(error: &str) -> bool {
```
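The matcher body is not shown in this view; a plausible sketch, assuming the status codes from the PR description are matched as substrings of the provider error string:

```rust
/// Sketch only: substring match on the retryable status codes named in
/// the PR description (408/429/500/502/503/504).
fn is_retryable_provider_error(error: &str) -> bool {
    ["408", "429", "500", "502", "503", "504"]
        .iter()
        .any(|code| error.contains(code))
}
```

A bare substring match can false-positive on unrelated digit runs, so the real implementation may anchor on patterns such as "HTTP 503" or "status 429" instead.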
```rust
// Discard any failed streaming messages left by the previous
// run so they don't pollute the LLM history on retry.
if let Err(e) =
    message_repo::discard_failed_messages_for_run(&manager.pool, &last_run_id).await
```
```rust
        last_run_id = new_run_id;
        event_loop_handle = handle;
    }
    Err(e) => {
```
Address PR review feedback:
- Re-check retry_cancellations after sleep() to honor cancel during the back-off delay (#2)
- Abort the retry loop when discard_failed_messages_for_run fails, to prevent polluted LLM history (#4)
- Emit RunFailed via frontend_tx when start_retry_run fails so the frontend exits the retrying placeholder state (#5)
```rust
// Also check if a cancellation was requested during the gap
// between runs (i.e. the user clicked cancel while no ActiveRun
// existed). The flag is set by cancel_run_if_active.
{
```
```rust
!matches!(
    event,
    ThreadStreamEvent::RunStarted { .. }
        | ThreadStreamEvent::RunRetrying { .. }
```
```rust
else {
    // No active run — but a retry loop may be sleeping between
    // attempts. Signal it to stop before the next retry.
    let mut cancels = self.retry_cancellations.lock().await;
```
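The retry loop presumably consumes this signal on both sides of the back-off delay (per the "re-check after sleep()" follow-up commit above). A minimal sketch, assuming retry_cancellations is a tokio Mutex over a HashSet of thread IDs; the helper name is hypothetical:

```rust
use std::collections::HashSet;
use tokio::sync::Mutex;

/// Checks and clears a pending cancel signal for `thread_id`.
/// Returns `true` when a cancel was requested during the retry gap.
async fn take_retry_cancel(
    retry_cancellations: &Mutex<HashSet<String>>,
    thread_id: &str,
) -> bool {
    retry_cancellations.lock().await.remove(thread_id)
}
```

Calling this before the sleep and again after it means a cancel issued at any point in the delay window still aborts the retry.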
```rust
let delay_ms =
    PROVIDER_ERROR_RETRY_BASE_DELAY.as_millis() as u64 * (1u64 << (attempt - 1));
```
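With the 2 s base delay, the shift factor 1u64 << (attempt - 1) evaluates to 1, 2, 4, 8, 16 for attempts 1 through 5, so the sleeps are 2 s, 4 s, 8 s, 16 s and 32 s, matching the advertised 2s–32s window.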
```rust
// Check whether the run was cancelled — if so, don't retry.
// (cancellation_requested is cleared after remove_active_run,
// so we check the DB run status instead)
if let Ok(Some(status)) =
```
```rust
// Exhausted all retries — the last RunFailed has already been emitted
// by the event loop, so the frontend will show the error.
tracing::warn!(
```
```rust
/// Monitors the event loop of the current run and, if it ends with a
/// retryable provider error, retries it via `start_retry_run` with a
/// fresh run ID.
```
```rust
// Insert new active run entry with the shared broadcast channel.
{
    let mut runs = self.active_runs.lock().await;
    if runs.values().any(|r| r.thread_id == thread_id) {
```
```ts
// Show retry progress in the thinking placeholder when the backend
// is automatically retrying after a retryable provider error.
if (event.type === "run_retrying") {
```
Clear thread_id from retry_cancellations in start_run_with_options before inserting the new ActiveRun. Prevents a leftover cancel signal from a previous run from erroneously aborting retries on a completely new run for the same thread.
```rust
/// Mark all `failed` messages for the given run as `discarded` so they are
/// excluded from subsequent `list_since_last_reset` queries and do not pollute
/// the LLM history on automatic run-level retries.
pub async fn discard_failed_messages_for_run(
```
[HIGH] Missing integration test for discard_failed_messages_for_run
The new SQL UPDATE in discard_failed_messages_for_run is not exercised in src-tauri/tests/, violating the project rule to add backend integration tests for persistence changes.
Suggestion: Add an integration test that seeds messages with mixed statuses across multiple runs, calls discard_failed_messages_for_run, and asserts that only failed rows for the target run_id are updated and that the returned rows_affected count (u64) is correct.
Risk: Silent regression in retry-scoped cleanup logic; SQL drift or typo in status literals will not be caught by CI.
Confidence: 0.95
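A sketch of such a test, assuming an in-memory SQLite pool and a simplified messages schema (the column names here are assumptions based on the snippets above):

```rust
use sqlx::SqlitePool;

#[tokio::test]
async fn discard_failed_messages_only_touches_target_run() {
    let pool = SqlitePool::connect("sqlite::memory:").await.unwrap();
    // Assumed minimal schema; the real table has more columns.
    sqlx::query("CREATE TABLE messages (id TEXT PRIMARY KEY, run_id TEXT, status TEXT)")
        .execute(&pool)
        .await
        .unwrap();
    // Seed mixed statuses across two runs.
    for (id, run, status) in [
        ("m1", "run-a", "failed"),
        ("m2", "run-a", "completed"),
        ("m3", "run-b", "failed"),
    ] {
        sqlx::query("INSERT INTO messages (id, run_id, status) VALUES (?, ?, ?)")
            .bind(id)
            .bind(run)
            .bind(status)
            .execute(&pool)
            .await
            .unwrap();
    }

    let affected = message_repo::discard_failed_messages_for_run(&pool, "run-a")
        .await
        .unwrap();

    // Only run-a's failed row is rewritten; run-b is untouched.
    assert_eq!(affected, 1);
    let discarded: i64 =
        sqlx::query_scalar("SELECT COUNT(*) FROM messages WHERE status = 'discarded'")
            .fetch_one(&pool)
            .await
            .unwrap();
    assert_eq!(discarded, 1);
}
```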
```rust
/// Returns the status of a specific run by ID.
pub async fn find_status_by_id(pool: &SqlitePool, id: &str) -> Result<Option<String>, AppError> {
```
[HIGH] Missing integration test for find_status_by_id
The new fetch_optional query against thread_runs lacks backend integration test coverage.
Suggestion: Add integration tests asserting Some(status) for an existing run ID and None for a non-existent or empty ID.
Risk: sqlx mapping or schema changes could break Option deserialization without failing CI.
Confidence: 0.95
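A corresponding hedged sketch; the thread_runs columns and the run_repo module path are assumptions:

```rust
use sqlx::SqlitePool;

#[tokio::test]
async fn find_status_by_id_returns_some_then_none() {
    let pool = SqlitePool::connect("sqlite::memory:").await.unwrap();
    sqlx::query("CREATE TABLE thread_runs (id TEXT PRIMARY KEY, status TEXT)")
        .execute(&pool)
        .await
        .unwrap();
    sqlx::query("INSERT INTO thread_runs (id, status) VALUES ('run-1', 'failed')")
        .execute(&pool)
        .await
        .unwrap();

    // Existing run: Some(status); unknown ID: None.
    assert_eq!(
        run_repo::find_status_by_id(&pool, "run-1").await.unwrap(),
        Some("failed".to_string())
    );
    assert_eq!(run_repo::find_status_by_id(&pool, "missing").await.unwrap(), None);
}
```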
```rust
    run_id: &str,
) -> Result<u64, AppError> {
    let result = sqlx::query(
        "UPDATE messages SET status = 'discarded' \
```
[MEDIUM] Hard-coded SQL status strings lack test validation
The UPDATE uses literal strings 'discarded' and 'failed' that must align with application enums and schema constraints, but no tests verify this alignment.
Suggestion: Write tests that insert rows using the same Rust status constants consumed by the rest of the app, then assert the function matches exactly those values after the update.
Risk: If the canonical status string changes in one module but not here, the query will silently affect zero rows and hide retry failures.
Confidence: 0.85
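One way to address the drift risk, sketched with hypothetical constant names (the PR currently inlines the literals, and the WHERE clause here is an assumption since the snippet cuts off mid-statement):

```rust
/// Hypothetical shared constants for message statuses.
pub const STATUS_FAILED: &str = "failed";
pub const STATUS_DISCARDED: &str = "discarded";

fn discard_failed_sql() -> String {
    // Built from the same constants the writers use, so a renamed status
    // fails tests everywhere instead of silently matching zero rows.
    format!(
        "UPDATE messages SET status = '{STATUS_DISCARDED}' \
         WHERE run_id = ?1 AND status = '{STATUS_FAILED}'"
    )
}
```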
Test Plan: cargo test --locked, npm run typecheck, npm run test:unit (all pass)

🤖 Generated with TiyCode