manage: retry HTTP fetches for image checksum and marker URLs by ideaship · Pull Request #2216 · osism/python-osism

ideaship · 2026-04-27T12:36:59Z

Problem

Four osism manage image … commands (octavia, clusterapi,
clusterapi-gardener, gardenlinux) fetch a small text file — a
marker like last-2024.1 or a .CHECKSUM / .sha256 file — from
nbg1.your-objectstorage.com (Hetzner / Ceph RadosGW) using
requests.get() with no error handling and no retry. The body is
parsed directly into a checksum or version string.

When the bucket transiently returns an XML S3 error document instead
of the expected text, the code parses <?xml as the checksum and
openstack-image-manager rejects it with
'sha256:<?xml' is not a valid checksum. This was the dominant CI
failure mode in the analysis window 2025-12-14 – 2026-04-27.

Evidence

Analysis of 295 .CHECKSUM fetch events logged across all testbed
deploy / upgrade / update builds gives the following picture. Timing
is measured from log timestamps (1 s resolution) between the
checksum_url: log line and the immediately following checksum:
result line.

84 XML failures (28.5 % of all fetches):

Response time	Count	% of failures
0 s (sub-second)	66	79 %
1 s	12	14 %
2 s	1	1 %
8 s	1	1 %
36 s	1	1 %
41 s	1	1 %
59 s	1	1 %
60 s	1	1 %

94 % of failures returned in ≤ 2 s (fast canned RGW XML response).
The remaining 6 % are genuine slow HTTP 503 responses — the server
connected and responded, just slowly. Zero failures were
connection-level errors (ReadError / ConnectTimeout / hang); every
failure returned an HTTP response.

211 successful fetches (71.5 % of all fetches):

stat	value
p50	0 s
p90	1 s
p95	1 s
p99	9 s
max	53 s

Affected jobs and impact:

All 96 builds that logged an XML checksum body resulted in
FAILURE. Every failure terminated at the Bootstrap services
Ansible task with no recovery — the bootstrap script runs with
set -e, so the osism manage image octavia command exiting
non-zero stops the script immediately. For deploy-with-tempest
jobs this aborted the deployment before any OpenStack service was
functional; no Tempest testing ran. There was no retry at the
testbed level for this failure mode.

The six affected job variants, all in the periodic-midnight
pipeline for osism/testbed:

Job	Failures
`testbed-deploy-current-in-a-nutshell-with-tempest-ubuntu-24.04`	17
`testbed-update-stable-current-ubuntu-24.04`	17
`testbed-upgrade-stable-next-ubuntu-24.04`	17
`testbed-deploy-next-in-a-nutshell-with-tempest-ubuntu-24.04`	16
`testbed-deploy-stable-in-a-nutshell-with-tempest-ubuntu-24.04`	15
`testbed-upgrade-stable-rc-ubuntu-24.04`	14

Timing pattern:

Failures first appeared as isolated incidents (2026-01-17, then
mid-February). They then became a continuous nightly outage from
2026-02-25 through 2026-03-12 (16 consecutive days), with all
six job variants failing every night — consistent with the RGW
reliably returning XML errors for these fetches during that
window. A single retry after 2 s would have cleared 94 % of the
XML failures (those that returned in ≤ 2 s) and allowed those
builds to proceed.

Solution

Add osism/utils/http.py with a single public function fetch_text
that wraps requests.get with:

Retry on the standard retryable status set {408, 429} ∪ 5xx
Retry on non-HTTPError RequestException (connection / DNS / TLS)
Retry when an optional validate callback rejects the body
(belt-and-suspenders against HTTP 200 with unexpected content)
Immediate HTTPError raise on non-retryable 4xx (404, 403, …)
Structured INFO log lines per attempt (fetch_text url=… attempt=N/M …)
so retry budget consumption is observable in Zuul output

Default retry schedule: 3 retries with 2 s / 4 s / 8 s sleeps (14 s
total sleep budget).

Wire all seven requests.get call sites in osism/commands/manage.py
to fetch_text with per-command validators:

_validate_marker — generic YYYY-MM-DD <filename>.qcow2 contract;
rejects XML bodies, accepts any .qcow2 name regardless of prefix
_is_sha256 — requires a 64-char lowercase hex first token; matches
sha256sum(1) output format

On the timeout question

The distributions of slow failures (8–60 s) and slow successes
(9–53 s) overlap — a 41 s duration appears as both a failure and a
success in the data. There is no timeout value that cleanly separates
the two populations. A timeout at 40 s would cut off the 41 s
success; a timeout safe above all observed successes (> 53 s) would
miss all but one of the slow failures.

The no-timeout choice is deliberate and backed by the updated latency
data. The branch still leaves per-attempt duration unbounded and may
multiply slow 5xx responses, but adding a timeout here would be more
likely to introduce false positives than fix the observed failure
mode.

Testing

34 unit tests across three new test files:

tests/unit/utils/test_http.py — 15 tests covering the retry
helper (happy path, retryable statuses, RequestException,
validate callback, mixed-failure precedence, logging contract,
fail-fast on non-retryable 4xx)
tests/unit/commands/test_manage_validators.py — 15 tests for
_validate_marker (M1–M9) and _is_sha256 (S1–S6)
tests/unit/commands/test_manage_wiring.py — 4 tests (W1–W4)
asserting validator-to-call-site mapping for all four commands

flake8, black, and all unit tests pass locally.

sourcery-ai

Hey - I've found 2 issues

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="osism/commands/manage.py" line_range="110-111" />
<code_context>
-            response = requests.get(url)
-            splitted = response.text.strip().split(" ")
+            marker_url = urljoin(base_url, f"last-{kubernetes_release}")
+            marker_body = fetch_text(marker_url, validate=_validate_marker)
+            splitted = marker_body.strip().split(" ")

             logger.info(f"date: {splitted[0]}")
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Use default `split()` for robustness against variable whitespace

In helpers we already use `body.strip().split()` for this parsing, but here we explicitly call `split(" ")`. If the marker ever includes multiple spaces or tabs, `split(" ")` can introduce empty elements and mis-parse fields, whereas `split()` collapses all whitespace and also handles tabs/newlines. Using `marker_body.strip().split()` here would keep behavior consistent and more robust without changing normal-case semantics.

```suggestion
            marker_body = fetch_text(marker_url, validate=_validate_marker)
            splitted = marker_body.strip().split()
```
</issue_to_address>

### Comment 2
<location path="osism/commands/manage.py" line_range="125-126" />
<code_context>
             logger.info(f"checksum_url: {url}.CHECKSUM")
-            response_checksum = requests.get(f"{url}.CHECKSUM")
-            splitted_checksum = response_checksum.text.strip().split(" ")
+            checksum_body = fetch_text(f"{url}.CHECKSUM", validate=_is_sha256)
+            splitted_checksum = checksum_body.strip().split(" ")
             logger.info(f"checksum: {splitted_checksum[0]}")

</code_context>
<issue_to_address>
**suggestion:** Align checksum parsing with whitespace-tolerant splitting

Using `checksum_body.strip().split(" ")` will yield empty tokens when there are multiple or trailing spaces, unlike the marker parsing which uses `split()` without arguments. Please switch to `split()` here as well so checksum parsing is whitespace-tolerant and consistent with `_is_sha256` and the rest of the codebase.

```suggestion
            checksum_body = fetch_text(f"{url}.CHECKSUM", validate=_is_sha256)
            splitted_checksum = checksum_body.strip().split()
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

Four `osism manage image` commands (octavia, clusterapi, clusterapi-gardener, gardenlinux) fetch marker and checksum files from nbg1.your-objectstorage.com using bare requests.get() with no error handling and no retry. When the Ceph RGW backend transiently returns an XML S3 error document, the code parses <?xml as the checksum and openstack-image-manager rejects it with 'sha256:<?xml' is not a valid checksum. Analysis of 295 .CHECKSUM fetch events logged across testbed builds in the window 2025-12-14 – 2026-04-27 shows 84 XML failures (28.5 % of fetches). 94 % of those failures returned in ≤ 2 s (fast canned RGW 503 response); the remaining 6 % returned in 8–60 s. Zero failures were connection-level errors; every failure returned an HTTP response. Successful fetches span 0–53 s (p99 = 9 s). New module osism/utils/http.py exports fetch_text, which wraps requests.get with: - Retry on {408, 429} ∪ 5xx (covers the observed RGW 503) - Retry on non-HTTPError RequestException (connection / DNS / TLS) - Retry when an optional validate callback rejects the body (guards against HTTP 200 with unexpected content) - Immediate HTTPError propagation on non-retryable 4xx (404, 403) - Structured INFO log lines per attempt for observability in Zuul Default schedule: 3 retries, 2 s / 4 s / 8 s sleeps (14 s budget). Two validators are added to manage.py: - _validate_marker: generic YYYY-MM-DD <name>.qcow2 contract; rejects XML bodies without hard-coding any image-name prefix, so production deployments with unfamiliar names pass through to downstream validation rather than burning the retry budget. - _is_sha256: requires a 64-char lowercase hex first token, matching sha256sum(1) output; accepting uppercase would mask a downstream mismatch rather than surface it. All seven requests.get call sites in manage.py are replaced: clusterapi: marker + .CHECKSUM (take_action lines 110, 125) clusterapi-gardener: marker + .CHECKSUM (take_action lines 229, 245) gardenlinux: .sha256 (take_action line 354) octavia: marker + .CHECKSUM (take_action lines 440, 451) The checksum_url_status log line added to octavia in ce844a0 is removed; fetch_text emits the status code on every attempt. No per-attempt timeout is added. The distributions of slow failures (8–60 s) and slow successes (9–53 s) overlap — a 41 s duration appears as both a failure and a success in the data. No timeout value cleanly separates the two populations without introducing false positives on legitimate slow responses. 34 unit tests across three new files cover the retry helper (test_http.py, 15 tests), the validators (test_manage_validators.py, 15 tests), and the call-site wiring (test_manage_wiring.py, 4 tests). AI-assisted: Claude Code Signed-off-by: Roger Luethi <luethi@osism.tech>

ideaship force-pushed the checksum-fetch-retry branch from c4492a5 to febe778 Compare April 27, 2026 12:49

ideaship marked this pull request as ready for review April 27, 2026 13:17

sourcery-ai Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread osism/commands/manage.py Outdated

Comment thread osism/commands/manage.py Outdated

ideaship force-pushed the checksum-fetch-retry branch from febe778 to 7977bf2 Compare April 27, 2026 13:36

ideaship force-pushed the checksum-fetch-retry branch from 7977bf2 to ecd8c73 Compare April 27, 2026 14:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

manage: retry HTTP fetches for image checksum and marker URLs#2216

manage: retry HTTP fetches for image checksum and marker URLs#2216
ideaship wants to merge 1 commit intomainfrom
checksum-fetch-retry

ideaship commented Apr 27, 2026 •

edited

Loading

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ideaship commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Evidence

Solution

On the timeout question

Testing

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ideaship commented Apr 27, 2026 •

edited

Loading