Add retry to gRPC calls that failed due to transient errors #714

Merged
torosent merged 39 commits into main from stevosyan/add-retry-to-complete-calls on Apr 28, 2026

@sophiatev
Contributor

Summary

What changed?

This PR adds retry logic to gRPC calls in the worker process that fail due to transient errors (i.e., StatusCode.Unavailable).

Why is this change needed?

Previously the call would simply fail, so the work item was abandoned and only picked up again after a relatively long delay. For these transient errors we want to retry almost immediately.

Copilot AI review requested due to automatic review settings April 27, 2026 16:58
@sophiatev
Contributor Author

@copilot add tests for the new retry logic in this PR

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds retry logic around worker gRPC calls when they fail with transient transport errors, with accompanying structured logging for each retry attempt.

Changes:

  • Wrap multiple gRPC client calls (abandon/complete operations) in a shared retry helper with exponential backoff + jitter.
  • Add a new warning log event to record transient gRPC retry attempts and backoff duration.
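A retry helper with exponential backoff and jitter, as described in the changes above, might look roughly like the following. This is a minimal Python sketch of the general technique, not the PR's C# ExecuteWithRetryAsync: the names `execute_with_retry`, `TransientRpcError`, `on_retry`, and `sleep`, the default delays, and the choice of "full jitter" (a uniform delay between zero and the capped backoff) are all assumptions made for illustration.

```python
import asyncio
import random


class TransientRpcError(Exception):
    """Hypothetical stand-in for an RPC failure with a transient status."""


async def execute_with_retry(action, *, max_attempts=3, base_delay=0.05,
                             max_delay=2.0, on_retry=None, sleep=asyncio.sleep):
    """Await action(); on a transient failure, back off and try again.

    Makes at most max_attempts retries after the initial call.
    """
    retries = 0
    while True:
        try:
            return await action()
        except TransientRpcError:
            if retries >= max_attempts:
                raise  # retries exhausted: surface the last error
            retries += 1
            # Exponential growth, capped, then "full jitter" over [0, cap].
            cap = min(max_delay, base_delay * 2 ** (retries - 1))
            delay = random.uniform(0.0, cap)
            if on_retry is not None:
                on_retry(retries, delay)  # e.g. emit a warning log entry
            await sleep(delay)
```

The `on_retry` callback is where a warning log recording the attempt number and backoff duration could be emitted. Injecting `sleep` lets tests skip real delays.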

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/Worker/Grpc/Logs.cs | Adds a new structured log message for retry attempts. |
| src/Worker/Grpc/GrpcDurableTaskWorker.Processor.cs | Introduces ExecuteWithRetryAsync and applies it to multiple gRPC calls. |

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 17:10
Contributor

Copilot AI commented Apr 27, 2026

@copilot add tests for the new retry logic in this PR

Added tests for ExecuteWithRetryAsync in test/Worker/Grpc.Tests/ExecuteWithRetryTests.cs (commit 61f8a21). The tests cover:

  • Success on first attempt – verifies no retry occurs when the action succeeds
  • Retry on all transient status codes (Unavailable, Unknown, DeadlineExceeded, Internal) – verifies the action is retried and eventually succeeds
  • No retry on non-transient errors (InvalidArgument, AlreadyExists, PermissionDenied) – verifies the exception propagates immediately
  • Cancellation during retry delay – verifies OperationCanceledException is thrown when the token is cancelled while waiting between retries
  • Log messages on retry – verifies TransientGrpcRetry log entries are emitted with the correct operation name and attempt count
  • Multiple consecutive retries – verifies each retry attempt is independently logged
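The split between retried and non-retried status codes that these tests exercise can be sketched as below. This is an illustrative Python model only: `StatusCode`, `RpcError`, `is_transient`, and `call_with_retry` are hypothetical stand-ins, the transient set is copied from the list in this comment (the merged code may classify differently), and backoff delays are omitted to keep the example short.

```python
from enum import Enum


class StatusCode(Enum):
    # Hypothetical subset of gRPC status codes, named as in the comment above.
    UNAVAILABLE = "Unavailable"
    UNKNOWN = "Unknown"
    DEADLINE_EXCEEDED = "DeadlineExceeded"
    INTERNAL = "Internal"
    INVALID_ARGUMENT = "InvalidArgument"
    ALREADY_EXISTS = "AlreadyExists"
    PERMISSION_DENIED = "PermissionDenied"


# Assumption: the transient set listed in the test description.
TRANSIENT = frozenset({
    StatusCode.UNAVAILABLE, StatusCode.UNKNOWN,
    StatusCode.DEADLINE_EXCEEDED, StatusCode.INTERNAL,
})


class RpcError(Exception):
    def __init__(self, status):
        super().__init__(status.value)
        self.status = status


def is_transient(error):
    return isinstance(error, RpcError) and error.status in TRANSIENT


def call_with_retry(action, max_attempts=3):
    """Retry only transient failures; everything else propagates at once."""
    retries = 0
    while True:
        try:
            return action()
        except RpcError as error:
            if not is_transient(error) or retries >= max_attempts:
                raise
            retries += 1
```

With this shape, an InvalidArgument failure reaches the caller after a single call, while an Unavailable failure is attempted again.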

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 17:32
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Sophia Tevosyan and others added 3 commits April 27, 2026 10:48
…member'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 17:49
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings April 27, 2026 20:06
@sophiatev
Contributor Author

@copilot now add integration tests for this retry functionality

Sophia Tevosyan and others added 3 commits April 27, 2026 13:50
Contributor

Copilot AI commented Apr 27, 2026

@copilot now add integration tests for this retry functionality

Added integration-level retry tests in commit 1379dd4 in RunBackgroundTaskLoggingTests.cs. These tests exercise the full RunBackgroundTask path with a mocked gRPC client:

  • Retries_Abandon_Orchestrator_On_Transient_Error_Eventually_Succeeds — verifies that a transient Unavailable error on the first abandon call is retried, eventually succeeds, and a TransientGrpcRetry warning is logged.
  • Retries_Abandon_Activity_On_Transient_Error_Eventually_Succeeds — same scenario for activity work items.
  • Retries_Abandon_Orchestrator_Until_MaxAttempts_Then_Fails — verifies that after all retry attempts are exhausted, the abandon exception is surfaced, the Abandoned log is not emitted, and maxAttempts retry warnings were logged.

The commit also fixes the unit test ExecuteWithRetryAsync_TransientErrorExceedsMaxAttempts_ThrowsLastRpcException, which was expecting the wrong call count: the loop makes maxAttempts + 1 total calls (maxAttempts retried calls plus one final call whose exception propagates). It also adds fast-option support to CreateProcessor to avoid long delays in that test.
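The call-count arithmetic is easy to check against a mocked client whose abandon call always fails. The Python sketch below is only a model of that scenario: `abandon_with_retry`, `UnavailableError`, and the `log` list are hypothetical names, and the real tests mock the C# gRPC client rather than a Python `Mock`.

```python
from unittest.mock import Mock


class UnavailableError(Exception):
    """Hypothetical stand-in for an RPC failure with status Unavailable."""


def abandon_with_retry(client, max_attempts, log):
    retries = 0
    while True:
        try:
            return client.abandon()
        except UnavailableError:
            if retries >= max_attempts:
                raise  # the final call's exception propagates
            retries += 1
            log.append(f"transient retry {retries}")  # one warning per retry


client = Mock()
client.abandon.side_effect = UnavailableError("unavailable")  # always fails
log = []
try:
    abandon_with_retry(client, max_attempts=3, log=log)
except UnavailableError:
    pass

# With max_attempts = 3: 4 total calls, 3 retry warnings.
print(client.abandon.call_count, len(log))  # prints "4 3"
```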

@torosent torosent merged commit 21c740a into main Apr 28, 2026
8 checks passed
@torosent torosent deleted the stevosyan/add-retry-to-complete-calls branch April 28, 2026 01:11