diff --git a/docs/en/solutions/Snapshotting_Prometheus_TSDB_blocks_for_offline_review.md b/docs/en/solutions/Snapshotting_Prometheus_TSDB_blocks_for_offline_review.md new file mode 100644 index 00000000..d620d675 --- /dev/null +++ b/docs/en/solutions/Snapshotting_Prometheus_TSDB_blocks_for_offline_review.md @@ -0,0 +1,185 @@ +--- +kind: + - How To +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- + +# Snapshotting Prometheus TSDB blocks for offline review +## Issue + +When a metric anomaly is reported well after the fact, or when the cluster team and the diagnosing engineer are on different sides of an air-gap, querying live Prometheus is not an option. The cluster operator needs to lift a slice of the time-series database off the running Prometheus pod, ship the resulting tarball, and let the analyst replay queries against the same data on a workstation that has no access to the source cluster. + +The native Prometheus TSDB layout (`block ULID/`, `chunks/`, `index`, `meta.json`, `tombstones`, plus the in-memory `chunks_head` and `wal/` for recent samples) makes this practical: each completed block is self-contained and can be lifted with `kubectl cp`. The recipe below produces an extractable tarball from any namespace running the Prometheus operator (in-cluster monitoring, user-workload monitoring, or a standalone Prometheus instance). + +## Resolution + +### Identify the Prometheus pod and namespace + +Variable shorthand for the rest of the procedure: + +```bash +# On ACP the platform Prometheus runs in cpaas-system, packaged as the +# kube-prometheus chart. Pod names follow the StatefulSet replica naming +# convention -; verify the actual pod name with +# `kubectl -n cpaas-system get statefulset` before substituting. +NS=cpaas-system # namespace running the Prometheus pod +POD=prometheus-kube-prometheus-0-0 # the StatefulSet replica being snapshotted +CONTAINER=prometheus # container name inside that pod +``` + +`kubectl -n $NS get statefulset` lists the available Prometheus instances. Snapshotting the `replica-0` pod is enough — replicas hold identical data when the StatefulSet is healthy. + +### Option A — full snapshot via tar + +If the Prometheus container ships `tar`, the simplest path is to stream the entire `/prometheus` directory through `kubectl exec`. Compress in flight to keep the on-cluster footprint small: + +```bash +ARTIFACT_DIR=$PWD/metrics +mkdir -p "$ARTIFACT_DIR" +kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + tar cz -C /prometheus . > "$ARTIFACT_DIR/prometheus.tar.gz" +``` + +Capture the active scrape-target metadata at the same time — it is required to re-create the same label set when replaying offline: + +```bash +kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + curl -sG http://localhost:9090/api/v1/targets/metadata \ + --data-urlencode 'match_target={instance!=""}' \ + > "$ARTIFACT_DIR/prometheus-target-metadata.json" +``` + +This is fast but fragile: the WAL and `chunks_head` files mutate while `tar` is still reading them. The resulting archive sometimes fails to extract on the analyst's workstation. Use Option B for any case where the snapshot has to be reliable. + +### Option B — block-by-block snapshot + +A robust snapshot copies completed blocks one at a time, then captures the WAL and `chunks_head` separately. Completed blocks are immutable, so the copy is consistent. + +List the available blocks and their time windows: + +```bash +kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + promtool tsdb list -r /prometheus +``` + +Output resembles: + +```text +BLOCK ULID MIN TIME MAX TIME DURATION NUM SAMPLES NUM CHUNKS NUM SERIES SIZE +01GGQV2KWQ7DX0RAHWZZFPCNTM 2022-10-31 17:17:39 +0000 UTC 2022-10-31 18:00:00 +0000 UTC 42m20.389s 15306022 215664 215457 46MiB +01GGRRZ7B4VMK4PGENCKC642FK 2022-10-31 18:00:00 +0000 UTC 2022-11-01 00:00:00 +0000 UTC 5h59m59.811s 133853179 1122268 195484 162MiB +... +``` + +Pick the ULIDs covering the incident window and pull each block's directory. Some Prometheus images do not bundle `tar` — fall back to `cat` per file in that case: + +```bash +DEST=./prometheus-snapshot +mkdir -p "$DEST" +BLOCKS="01GGQV2KWQ7DX0RAHWZZFPCNTM 01GGRRZ7B4VMK4PGENCKC642FK" + +for ulid in $BLOCKS; do + mkdir -p "$DEST/$ulid/chunks" + for f in index meta.json tombstones; do + kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + cat "/prometheus/$ulid/$f" > "$DEST/$ulid/$f" + done + # Each block has one or more chunk files numbered sequentially. + for cf in $(kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + ls "/prometheus/$ulid/chunks"); do + kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + cat "/prometheus/$ulid/chunks/$cf" > "$DEST/$ulid/chunks/$cf" + done +done + +# WAL + chunks_head + queries.active are needed so recent samples +# (the unsealed last 2-3 hours) survive into the offline replay. +kubectl -n "$NS" cp "$POD:chunks_head" "$DEST/chunks_head" -c "$CONTAINER" +kubectl -n "$NS" cp "$POD:wal" "$DEST/wal" -c "$CONTAINER" +kubectl -n "$NS" cp "$POD:queries.active" "$DEST/queries.active" -c "$CONTAINER" || true +``` + +A "file changed as we read it" warning on `wal/` is harmless — the active segment is being appended to by the live process. The block files (which are sealed) have no such concern. + +Compress and verify: + +```bash +tar -C "$DEST" -czf prometheus-db.tar.gz . +tar -tzf prometheus-db.tar.gz | head +``` + +The expected layout is: + +```text +./01GGQV2KWQ7DX0RAHWZZFPCNTM/index +./01GGQV2KWQ7DX0RAHWZZFPCNTM/meta.json +./01GGQV2KWQ7DX0RAHWZZFPCNTM/tombstones +./01GGQV2KWQ7DX0RAHWZZFPCNTM/chunks/000001 +./01GGRRZ7B4VMK4PGENCKC642FK/... +./chunks_head/... +./wal/00000000 +./queries.active +``` + +Missing `index`, `meta.json`, `tombstones` for any block makes that block unparseable; re-fetch them explicitly. + +### Replay the snapshot offline + +On the analyst's workstation, expand the tarball into a host directory and run a Prometheus container against it. Match the Prometheus version to the one that produced the data — a major-version mismatch can refuse to load older block formats: + +```bash +DATA=$HOME/Downloads/prometheus-db +mkdir -p "$DATA" +tar -C "$DATA" -xzf prometheus-db.tar.gz + +PROM_IMAGE=quay.io/prometheus/prometheus:v2.51.0 # match the source instance + +docker run --rm -it \ + -u "$(id -u):$(id -g)" \ + -p 9090:9090 \ + -v "$DATA":/data:Z \ + "$PROM_IMAGE" \ + --storage.tsdb.path=/data \ + --storage.tsdb.retention.time=999d \ + --config.file=/dev/null +``` + +Browse to `http://localhost:9090/graph` and run a coarse query first — `sum(kube_node_status_condition{condition="Ready",status="true"}==1)` over the full window confirms data is loaded. Highlighting the populated time window in the graph view restricts subsequent queries to the snapshotted range. + +### Cleanup + +The snapshot tarball can be large (hundreds of MiB per day). Remove it from the cluster after upload: + +```bash +rm -rf ./prometheus-snapshot prometheus-db.tar.gz +``` + +If the snapshot was taken inside a debug pod or a temporary workstation, also wipe the local copies once the analysis is complete. + +## Diagnostic Steps + +If `promtool tsdb list -r` reports no blocks, the Prometheus pod has not yet sealed a block — typical for a freshly restarted instance. Wait at least two hours after the restart, or capture the live WAL with Option A and accept the consistency risk. + +If `kubectl cp` fails with `tar: not found`, the destination Prometheus image has no `tar`. Use the per-file `cat` form shown in Option B; it is slower but works against any image. + +If the offline replay shows no metrics for the period of interest, confirm the snapshot covers the right blocks: + +```bash +kubectl -n "$NS" exec "$POD" -c "$CONTAINER" -- \ + promtool tsdb list -r /prometheus | awk 'NR==1 || /2026-04-21/' +``` + +A block whose `MAX TIME` predates the incident is the wrong one. Re-pull with the correct ULID list. + +If the offline Prometheus container fails to start with `mmap of <…>: cannot allocate memory`, the host is running out of address space — common on small workstations holding multi-day snapshots. Either increase `vm.max_map_count` (`sysctl -w vm.max_map_count=262144`) or replay smaller block subsets. + +For the in-cluster monitoring stack that runs through the platform's monitoring operator, the namespace and pod names depend on the deployment. Discover them dynamically: + +```bash +kubectl get pods -A -l app.kubernetes.io/name=prometheus +``` + +The same procedure applies for any Prometheus instance — user-workload monitoring, a stand-alone Prometheus operator instance, or a per-team monitoring stack.