[ES-1892645] Retry transient S3 errors in CloudFetch downloads #355
Open
jayantsing-db wants to merge 3 commits into main from
Conversation
fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain.

Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL.

Co-authored-by: Isaac
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
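The mechanics described above fit in a short Go sketch. All names here (Config, retryable, wait, fetchOnce) are illustrative stand-ins, not the driver's actual types; it shows the transient-status classification, equal-jitter backoff with Retry-After taking precedence, and the expiry re-check after each sleep:

```go
package cloudfetch

import (
	"fmt"
	"math/rand"
	"net/http"
	"strconv"
	"time"
)

// Config stands in for the existing retry knobs named in the description;
// the struct itself is illustrative, not the driver's real type.
type Config struct {
	RetryMax     int           // retries allowed after the initial attempt
	RetryWaitMin time.Duration // base of the exponential window
	RetryWaitMax time.Duration // cap on any single wait
}

// retryable reports whether a status code is treated as transient.
func retryable(code int) bool {
	switch code {
	case 408, 429, 500, 502, 503, 504:
		return true
	}
	return false
}

// wait computes exponential backoff with equal jitter: half the window is
// fixed, the other half uniformly random. A parseable integer Retry-After
// header takes precedence, clamped to RetryWaitMax.
func wait(cfg Config, attempt int, resp *http.Response) time.Duration {
	if resp != nil {
		if secs, err := strconv.Atoi(resp.Header.Get("Retry-After")); err == nil && secs > 0 {
			if d := time.Duration(secs) * time.Second; d <= cfg.RetryWaitMax {
				return d
			}
			return cfg.RetryWaitMax
		}
	}
	d := cfg.RetryWaitMin << uint(attempt) // RetryWaitMin * 2^attempt
	if d <= 0 || d > cfg.RetryWaitMax {    // d <= 0 guards shift overflow
		d = cfg.RetryWaitMax
	}
	return d/2 + time.Duration(rand.Int63n(int64(d/2)+1))
}

// fetchOnce retries transient failures and returns the still-open response on
// 200 OK; at this stage the caller streams the body (the follow-up commit
// moves the read inside the loop).
func fetchOnce(client *http.Client, url string, expiry time.Time, cfg Config) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= cfg.RetryMax; attempt++ {
		resp, err := client.Get(url)
		switch {
		case err != nil:
			lastErr = err // connection-level error: transient
		case resp.StatusCode == http.StatusOK:
			return resp, nil
		case !retryable(resp.StatusCode):
			resp.Body.Close()
			return nil, fmt.Errorf("cloudfetch: non-retryable status %d", resp.StatusCode)
		default:
			resp.Body.Close() // headers stay readable for Retry-After below
			lastErr = fmt.Errorf("cloudfetch: status %d", resp.StatusCode)
		}
		time.Sleep(wait(cfg, attempt, resp)) // resp is nil on connection errors
		// Re-check link expiry after each backoff so a retry never outlives
		// the presigned URL.
		if time.Now().After(expiry) {
			return nil, fmt.Errorf("cloudfetch: presigned URL expired during retries: %w", lastErr)
		}
	}
	return nil, fmt.Errorf("cloudfetch: retries exhausted: %w", lastErr)
}
```

For example, with RetryWaitMin = 1s the window grows 1s, 2s, 4s, ... until it hits RetryWaitMax; equal jitter guarantees at least half of each window is waited while randomizing the other half, which spreads thousands of concurrent retries apart instead of letting them stampede S3 in lockstep.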
Two follow-ups from review feedback:

1. fetchBatchBytes returned res.Body on 200 OK and let the caller's io.ReadAll happen outside the retry loop, so a TCP RST or truncated body during streaming surfaced as a hard failure. With multi-MB S3 objects this is the dominant transient mode at scale and exactly the failure we want covered. Buffer the body inside the retry loop and treat read errors as transient, as sketched below. Decompression stays outside the loop: malformed LZ4 is data corruption, not transient.
2. The expiry-between-retries test used ExpiryTime = floor(now), which could already be expired before iter 0's check if construction crossed a Unix-second boundary (attempts=0 instead of 1). Add 1s of guaranteed headroom and bump RetryWaitMin to 4s so the post-backoff sleep deterministically pushes floor(now) past expiry on iter 1.

Co-authored-by: Isaac
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
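Under the same assumed names as the sketch above (reusing Config, retryable, and wait, plus an "io" import), follow-up 1 amounts to moving the body read inside the loop:

```go
// fetchBody replaces fetchOnce's 200 OK path: the body is fully buffered
// inside the retry loop, so a connection reset or truncated stream
// mid-download is retried like any other transient failure.
func fetchBody(client *http.Client, url string, expiry time.Time, cfg Config) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt <= cfg.RetryMax; attempt++ {
		resp, err := client.Get(url)
		switch {
		case err != nil:
			lastErr = err // connection-level error: transient
		case resp.StatusCode != http.StatusOK:
			resp.Body.Close()
			if !retryable(resp.StatusCode) {
				return nil, fmt.Errorf("cloudfetch: non-retryable status %d", resp.StatusCode)
			}
			lastErr = fmt.Errorf("cloudfetch: status %d", resp.StatusCode)
		default:
			// Read the whole object here, inside the loop, so streaming
			// failures retry instead of surfacing as hard errors.
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil {
				return body, nil // caller decompresses outside the loop
			}
			lastErr = readErr // TCP RST / truncated body: transient
		}
		time.Sleep(wait(cfg, attempt, resp))
		if time.Now().After(expiry) {
			return nil, fmt.Errorf("cloudfetch: presigned URL expired during retries: %w", lastErr)
		}
	}
	return nil, fmt.Errorf("cloudfetch: retries exhausted: %w", lastErr)
}
```

Keeping decompression outside the loop means a malformed LZ4 frame fails once, loudly, rather than burning the retry budget on bytes that will never decompress.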
Force-pushed 5d1790d to 4b72cac