[ES-1892645] Retry transient S3 errors in CloudFetch downloads by jayantsing-db · Pull Request #355 · databricks/databricks-sql-go

jayantsing-db · 2026-05-07T19:48:16Z

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain. Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL.

Co-authored-by: Isaac

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain. Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL. Co-authored-by: Isaac Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

Two follow-ups from review feedback: 1. fetchBatchBytes returned res.Body on 200 OK and let the caller's io.ReadAll happen outside the retry loop, so a TCP RST or truncated body during streaming surfaced as a hard failure. With multi-MB S3 objects this is the dominant transient mode at scale and exactly the failure we want covered. Buffer the body inside the retry loop and treat read errors as transient. Decompression stays outside the loop — malformed LZ4 is data corruption, not transient. 2. The expiry-between-retries test used ExpiryTime = floor(now), which could already be expired before iter 0's check if construction crossed a Unix-second boundary (attempts=0 instead of 1). Add 1s of guaranteed headroom and bump RetryWaitMin to 4s so the post-backoff sleep deterministically pushes floor(now) past expiry on iter 1. Co-authored-by: Isaac Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

jayantsing-db added 3 commits May 7, 2026 20:24

Improvements

4b72cac

Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

jayantsing-db force-pushed the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch from 5d1790d to 4b72cac Compare May 7, 2026 20:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355
jayantsing-db wants to merge 3 commits intomainfrom
jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors

jayantsing-db commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jayantsing-db commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant