Skip to content

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355

Open
jayantsing-db wants to merge 3 commits intomainfrom
jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors
Open

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355
jayantsing-db wants to merge 3 commits intomainfrom
jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors

Conversation

@jayantsing-db
Copy link
Copy Markdown

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain. Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL.

Co-authored-by: Isaac

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from
S3 cancelled the entire query. With thousands of concurrent GETs on
large result sets, even a sub-percent per-request failure rate makes
at least one failure near-certain. Adds exponential backoff with
equal jitter for 408/429/500/502/503/504 plus connection errors,
honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config
and parseable integer Retry-After response headers. Link expiry is
re-checked after each backoff so retries don't outlive the presigned
URL.

Co-authored-by: Isaac
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
Two follow-ups from review feedback:

1. fetchBatchBytes returned res.Body on 200 OK and let the caller's
   io.ReadAll happen outside the retry loop, so a TCP RST or truncated
   body during streaming surfaced as a hard failure. With multi-MB S3
   objects this is the dominant transient mode at scale and exactly
   the failure we want covered. Buffer the body inside the retry loop
   and treat read errors as transient. Decompression stays outside
   the loop — malformed LZ4 is data corruption, not transient.

2. The expiry-between-retries test used ExpiryTime = floor(now), which
   could already be expired before iter 0's check if construction
   crossed a Unix-second boundary (attempts=0 instead of 1). Add 1s of
   guaranteed headroom and bump RetryWaitMin to 4s so the post-backoff
   sleep deterministically pushes floor(now) past expiry on iter 1.

Co-authored-by: Isaac
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
@jayantsing-db jayantsing-db force-pushed the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch from 5d1790d to 4b72cac Compare May 7, 2026 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant