Skip to content

manage: retry HTTP fetches for image checksum and marker URLs#2216

Open
ideaship wants to merge 1 commit intomainfrom
checksum-fetch-retry
Open

manage: retry HTTP fetches for image checksum and marker URLs#2216
ideaship wants to merge 1 commit intomainfrom
checksum-fetch-retry

Conversation

@ideaship
Copy link
Copy Markdown
Contributor

@ideaship ideaship commented Apr 27, 2026

Problem

Four osism manage image … commands (octavia, clusterapi,
clusterapi-gardener, gardenlinux) fetch a small text file — a
marker like last-2024.1 or a .CHECKSUM / .sha256 file — from
nbg1.your-objectstorage.com (Hetzner / Ceph RadosGW) using
requests.get() with no error handling and no retry. The body is
parsed directly into a checksum or version string.

When the bucket transiently returns an XML S3 error document instead
of the expected text, the code parses <?xml as the checksum and
openstack-image-manager rejects it with
'sha256:<?xml' is not a valid checksum. This was the dominant CI
failure mode in the analysis window 2025-12-14 – 2026-04-27.

Evidence

Analysis of 295 .CHECKSUM fetch events logged across all testbed
deploy / upgrade / update builds gives the following picture. Timing
is measured from log timestamps (1 s resolution) between the
checksum_url: log line and the immediately following checksum:
result line.

84 XML failures (28.5 % of all fetches):

Response time Count % of failures
0 s (sub-second) 66 79 %
1 s 12 14 %
2 s 1 1 %
8 s 1 1 %
36 s 1 1 %
41 s 1 1 %
59 s 1 1 %
60 s 1 1 %

94 % of failures returned in ≤ 2 s (fast canned RGW XML response).
The remaining 6 % are genuine slow HTTP 503 responses — the server
connected and responded, just slowly. Zero failures were
connection-level errors (ReadError / ConnectTimeout / hang); every
failure returned an HTTP response.

211 successful fetches (71.5 % of all fetches):

stat value
p50 0 s
p90 1 s
p95 1 s
p99 9 s
max 53 s

Affected jobs and impact:

All 96 builds that logged an XML checksum body resulted in
FAILURE. Every failure terminated at the Bootstrap services
Ansible task with no recovery — the bootstrap script runs with
set -e, so the osism manage image octavia command exiting
non-zero stops the script immediately. For deploy-with-tempest
jobs this aborted the deployment before any OpenStack service was
functional; no Tempest testing ran. There was no retry at the
testbed level for this failure mode.

The six affected job variants, all in the periodic-midnight
pipeline for osism/testbed:

Job Failures
testbed-deploy-current-in-a-nutshell-with-tempest-ubuntu-24.04 17
testbed-update-stable-current-ubuntu-24.04 17
testbed-upgrade-stable-next-ubuntu-24.04 17
testbed-deploy-next-in-a-nutshell-with-tempest-ubuntu-24.04 16
testbed-deploy-stable-in-a-nutshell-with-tempest-ubuntu-24.04 15
testbed-upgrade-stable-rc-ubuntu-24.04 14

Timing pattern:

Failures first appeared as isolated incidents (2026-01-17, then
mid-February). They then became a continuous nightly outage from
2026-02-25 through 2026-03-12 (16 consecutive days), with all
six job variants failing every night — consistent with the RGW
reliably returning XML errors for these fetches during that
window. A single retry after 2 s would have cleared 94 % of the
XML failures (those that returned in ≤ 2 s) and allowed those
builds to proceed.

Solution

Add osism/utils/http.py with a single public function fetch_text
that wraps requests.get with:

  • Retry on the standard retryable status set {408, 429} ∪ 5xx
  • Retry on non-HTTPError RequestException (connection / DNS / TLS)
  • Retry when an optional validate callback rejects the body
    (belt-and-suspenders against HTTP 200 with unexpected content)
  • Immediate HTTPError raise on non-retryable 4xx (404, 403, …)
  • Structured INFO log lines per attempt (fetch_text url=… attempt=N/M …)
    so retry budget consumption is observable in Zuul output

Default retry schedule: 3 retries with 2 s / 4 s / 8 s sleeps (14 s
total sleep budget).

Wire all seven requests.get call sites in osism/commands/manage.py
to fetch_text with per-command validators:

  • _validate_marker — generic YYYY-MM-DD <filename>.qcow2 contract;
    rejects XML bodies, accepts any .qcow2 name regardless of prefix
  • _is_sha256 — requires a 64-char lowercase hex first token; matches
    sha256sum(1) output format

On the timeout question

The distributions of slow failures (8–60 s) and slow successes
(9–53 s) overlap — a 41 s duration appears as both a failure and a
success in the data. There is no timeout value that cleanly separates
the two populations. A timeout at 40 s would cut off the 41 s
success; a timeout safe above all observed successes (> 53 s) would
miss all but one of the slow failures.

The no-timeout choice is deliberate and backed by the updated latency
data. The branch still leaves per-attempt duration unbounded and may
multiply slow 5xx responses, but adding a timeout here would be more
likely to introduce false positives than fix the observed failure
mode.

Testing

34 unit tests across three new test files:

  • tests/unit/utils/test_http.py — 15 tests covering the retry
    helper (happy path, retryable statuses, RequestException,
    validate callback, mixed-failure precedence, logging contract,
    fail-fast on non-retryable 4xx)
  • tests/unit/commands/test_manage_validators.py — 15 tests for
    _validate_marker (M1–M9) and _is_sha256 (S1–S6)
  • tests/unit/commands/test_manage_wiring.py — 4 tests (W1–W4)
    asserting validator-to-call-site mapping for all four commands

flake8, black, and all unit tests pass locally.

@ideaship ideaship force-pushed the checksum-fetch-retry branch from c4492a5 to febe778 Compare April 27, 2026 12:49
@ideaship ideaship marked this pull request as ready for review April 27, 2026 13:17
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="osism/commands/manage.py" line_range="110-111" />
<code_context>
-            response = requests.get(url)
-            splitted = response.text.strip().split(" ")
+            marker_url = urljoin(base_url, f"last-{kubernetes_release}")
+            marker_body = fetch_text(marker_url, validate=_validate_marker)
+            splitted = marker_body.strip().split(" ")

             logger.info(f"date: {splitted[0]}")
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Use default `split()` for robustness against variable whitespace

In helpers we already use `body.strip().split()` for this parsing, but here we explicitly call `split(" ")`. If the marker ever includes multiple spaces or tabs, `split(" ")` can introduce empty elements and mis-parse fields, whereas `split()` collapses all whitespace and also handles tabs/newlines. Using `marker_body.strip().split()` here would keep behavior consistent and more robust without changing normal-case semantics.

```suggestion
            marker_body = fetch_text(marker_url, validate=_validate_marker)
            splitted = marker_body.strip().split()
```
</issue_to_address>

### Comment 2
<location path="osism/commands/manage.py" line_range="125-126" />
<code_context>
             logger.info(f"checksum_url: {url}.CHECKSUM")
-            response_checksum = requests.get(f"{url}.CHECKSUM")
-            splitted_checksum = response_checksum.text.strip().split(" ")
+            checksum_body = fetch_text(f"{url}.CHECKSUM", validate=_is_sha256)
+            splitted_checksum = checksum_body.strip().split(" ")
             logger.info(f"checksum: {splitted_checksum[0]}")

</code_context>
<issue_to_address>
**suggestion:** Align checksum parsing with whitespace-tolerant splitting

Using `checksum_body.strip().split(" ")` will yield empty tokens when there are multiple or trailing spaces, unlike the marker parsing which uses `split()` without arguments. Please switch to `split()` here as well so checksum parsing is whitespace-tolerant and consistent with `_is_sha256` and the rest of the codebase.

```suggestion
            checksum_body = fetch_text(f"{url}.CHECKSUM", validate=_is_sha256)
            splitted_checksum = checksum_body.strip().split()
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread osism/commands/manage.py Outdated
Comment thread osism/commands/manage.py Outdated
@ideaship ideaship force-pushed the checksum-fetch-retry branch from febe778 to 7977bf2 Compare April 27, 2026 13:36
Four `osism manage image` commands (octavia, clusterapi,
clusterapi-gardener, gardenlinux) fetch marker and checksum
files from nbg1.your-objectstorage.com using bare
requests.get() with no error handling and no retry. When the
Ceph RGW backend transiently returns an XML S3 error document,
the code parses <?xml as the checksum and
openstack-image-manager rejects it with
'sha256:<?xml' is not a valid checksum.

Analysis of 295 .CHECKSUM fetch events logged across testbed
builds in the window 2025-12-14 – 2026-04-27 shows 84 XML
failures (28.5 % of fetches). 94 % of those failures returned
in ≤ 2 s (fast canned RGW 503 response); the remaining 6 %
returned in 8–60 s. Zero failures were connection-level errors;
every failure returned an HTTP response. Successful fetches
span 0–53 s (p99 = 9 s).

New module osism/utils/http.py exports fetch_text, which wraps
requests.get with:

- Retry on {408, 429} ∪ 5xx (covers the observed RGW 503)
- Retry on non-HTTPError RequestException (connection / DNS / TLS)
- Retry when an optional validate callback rejects the body
  (guards against HTTP 200 with unexpected content)
- Immediate HTTPError propagation on non-retryable 4xx (404, 403)
- Structured INFO log lines per attempt for observability in Zuul

Default schedule: 3 retries, 2 s / 4 s / 8 s sleeps (14 s budget).

Two validators are added to manage.py:

- _validate_marker: generic YYYY-MM-DD <name>.qcow2 contract;
  rejects XML bodies without hard-coding any image-name prefix,
  so production deployments with unfamiliar names pass through
  to downstream validation rather than burning the retry budget.
- _is_sha256: requires a 64-char lowercase hex first token,
  matching sha256sum(1) output; accepting uppercase would mask a
  downstream mismatch rather than surface it.

All seven requests.get call sites in manage.py are replaced:
  clusterapi:         marker + .CHECKSUM  (take_action lines 110, 125)
  clusterapi-gardener: marker + .CHECKSUM (take_action lines 229, 245)
  gardenlinux:        .sha256             (take_action line 354)
  octavia:            marker + .CHECKSUM  (take_action lines 440, 451)

The checksum_url_status log line added to octavia in ce844a0 is
removed; fetch_text emits the status code on every attempt.

No per-attempt timeout is added. The distributions of slow
failures (8–60 s) and slow successes (9–53 s) overlap — a 41 s
duration appears as both a failure and a success in the data.
No timeout value cleanly separates the two populations without
introducing false positives on legitimate slow responses.

34 unit tests across three new files cover the retry helper
(test_http.py, 15 tests), the validators
(test_manage_validators.py, 15 tests), and the call-site wiring
(test_manage_wiring.py, 4 tests).

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship force-pushed the checksum-fetch-retry branch from 7977bf2 to ecd8c73 Compare April 27, 2026 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant