[WIP] RFC-0054: HUD Integration for Out-of-Tree CI Results #96
jewelkm89 wants to merge 3 commits into pytorch:master
Conversation
Defines the HUD-side ingestion and display layer for OOT CI results, building on RFC-0050 (Cross-Repository CI Relay). Covers write path, storage schemas, DB protection, security, and three frontend views. Reference implementation: subinz1/test-infra#1
Hi @jewelkm89! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
A few quick high-level comments based on the PR description (I really appreciate the diagram, btw!):
- I see some payload validation happening in HUD. Why not have that in the Relay Server instead? In HUD, the auth check should likely be limited to authenticating the relay server's header value, plus maybe a payload cap to ensure we don't overburden the database. Everything else will probably find a better home in the Relay Server, where the logic can be centralized. And the payload cap is likely worth adding to the relay server as well (HUD's check is more of a sanity check as the DB guardian).
- Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in-progress and completed jobs, and that would offer more value as well. Current practice for in-tree CI is to update the job status in ClickHouse (in progress → completed), and we should do the same for OOT. That's why we're storing the data in DynamoDB; otherwise we could have stored it in S3.
Security note: we should give the relay tool its own unique header to send to HUD, and only that header should be allowed to make these relay API calls. The `X-Hud-Internal-Bot` key is more widely used, and so far it has only been used for read-only operations. Let's not give it read-write access, since that key is at higher risk of leaking.
ZainRizvi left a comment
Nice work — the architecture diagrams are really clear and the RFC builds on RFC-0050 well. A few things to sort out, mostly around aligning with the L2 PR's actual wire format and a couple of bugs in the extraction logic.
```typescript
const pr = cb.payload?.pull_request;

const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
const now = new Date().toISOString();
```
There are a couple of bugs here:
- `started_at` is set to `now()` on every callback, including `completed`. That's the time HUD received the POST, not when the job started. On a completed callback, `started_at ≈ completed_at`, so the ClickHouse `duration_seconds` alias will always be ≈ 0. Neither `started_at` nor `completed_at` is available in the `github` context, so the action should capture its own wall-clock time and include it in the payload — `in_progress` sends `started_at`, `completed` sends `completed_at`. Then HUD stores what the action reports instead of using `now()`.
- `queue_time` is set from `trusted.ci_metrics?.queue_time`, which is `null` on completed callbacks (the relay only sends it on `in_progress`). This needs to skip setting `queue_time` when it's null rather than writing null over a valid value. The same applies to `execution_time` on `in_progress` callbacks.
Also, the DynamoDB write needs to use UpdateItem with SET expressions instead of PutItem. PutItem does a full item replace, so even with the above fixes, the completed callback would clobber queue_time and started_at from the in_progress record. UpdateItem lets each callback write only its own fields.
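To illustrate the suggestion, here is a minimal sketch of building UpdateItem parameters that SET only the fields a given callback actually carries, so a later `completed` write cannot clobber values from the earlier `in_progress` write. Field names follow the review discussion; the helper name and shapes are illustrative, not the actual HUD code.

```typescript
// Sketch: build UpdateItem params that skip null/undefined fields, so each
// callback writes only its own attributes (hypothetical helper, not HUD code).
type CallbackFields = Record<string, string | number | null | undefined>;

function buildUpdateParams(dynamoKey: string, fields: CallbackFields) {
  const names: Record<string, string> = {};
  const values: Record<string, string | number> = {};
  const sets: string[] = [];
  for (const [key, value] of Object.entries(fields)) {
    if (value === null || value === undefined) continue; // never write null over a valid value
    names[`#${key}`] = key; // attribute-name placeholders avoid reserved words like "status"
    values[`:${key}`] = value;
    sets.push(`#${key} = :${key}`);
  }
  if (sets.length === 0) throw new Error("no non-null fields to write");
  return {
    Key: { dynamoKey },
    UpdateExpression: `SET ${sets.join(", ")}`,
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}
```

A `completed` callback with `queue_time: null` then produces an expression that touches `execution_time` but leaves the stored `queue_time` intact.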
All three fixed:
- started_at/completed_at now use downstream-reported timestamps (wf.started_at, wf.completed_at) instead of new Date().toISOString(). Note: these fields need to be added to the L2 PR's callback action (action.yml), which currently doesn't send wall-clock timestamps.
- queue_time/execution_time are only set when the relay provides a non-null value. in_progress sets queue_time; completed sets execution_time. Neither overwrites the other.
- Changed from PutItem to UpdateItem with SET expressions throughout — Hop 3, data flow table, code samples, and implementation plan.
```typescript
const wf = cb.workflow;
const pr = cb.payload?.pull_request;

const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
```
This key needs to include the job name and attempt number. The callback action runs inside a specific job, and if a downstream repo uses it in multiple jobs within the same workflow (build, test-cpu, test-gpu), all those jobs share the same `github.workflow` name and would write to the same `dynamoKey` — last writer wins, and the rest are silently lost. And without the attempt number, job retries would overwrite the original run.

Let's add `github.job` and `github.run_attempt` to the key: `{repo}/{delivery_id}/{workflow_name}/{job_name}/{attempt}`. That gives each job its own row, preserves retry history, and makes the `/oot/[org]/[repo]` matrix view ("columns = OOT jobs") actually useful.
Thank you for the thorough review @ZainRizvi. Changed from `{repo}/{delivery_id}/{workflow_name}` to `{repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}`. Added `job_name` and `run_attempt` to both the DynamoDB and ClickHouse schemas. The L2 callback action needs to be updated to include `github.job` and `github.run_attempt` in the payload.
> ### Sample Payloads
>
> #### In-Progress Callback
These samples don't match the actual wire format. They show a flat structure with top-level `status`, `head_sha`, `pr_number`, etc. But the L2 PR's `action.yml` sends a nested structure — `delivery_id` and `payload` from the dispatch envelope, with a `workflow` sibling containing `status`/`conclusion`/`name`/`url`. Your own TypeScript code (`RelayPayload`, `extractDynamoRecord`) parses the nested format correctly, so the code is right — but a downstream developer implementing from these samples would produce payloads the system can't parse. These need to match the actual `{delivery_id, payload, workflow}` structure from the action.
Replaced all flat-structure samples with the actual nested wire format, shown at two stages:

- Stage 1 (downstream → relay): `{event_type, delivery_id, payload, workflow}` — matching the callback action's output from `action.yml`
- Stage 2 (relay → HUD): `{trusted: {verified_repo, ci_metrics}, untrusted: {callback_payload: ...}}` — matching `forward_to_hud` in `hud.py`

Also added a note about fields (`job_name`, `run_attempt`, `started_at`, `completed_at`) that need to be added to the L2 action.
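For intuition, the two-stage envelope described above can be sketched as TypeScript types with a sample value. These shapes follow the field names mentioned in this thread; the nested detail (e.g. `pull_request.head.sha`) and the sample repo name are illustrative assumptions, not a published schema.

```typescript
// Stage 1: downstream callback action → relay (shape per the review thread).
interface CallbackPayload {
  event_type: string;
  delivery_id: string;
  payload: { pull_request?: { number: number; head: { sha: string } } };
  workflow: {
    name: string;
    status: "in_progress" | "completed";
    conclusion: string | null;
    url: string;
  };
}

// Stage 2: relay → HUD, after the relay's OIDC/allowlist checks.
interface RelayToHud {
  trusted: {
    verified_repo: string;
    ci_metrics: { queue_time?: number | null; execution_time?: number | null };
  };
  untrusted: { callback_payload: CallbackPayload };
}

// Illustrative in_progress sample (hypothetical repo and IDs).
const sample: RelayToHud = {
  trusted: {
    verified_repo: "example-org/example-backend",
    ci_metrics: { queue_time: 42, execution_time: null },
  },
  untrusted: {
    callback_payload: {
      event_type: "oot-ci-result",
      delivery_id: "d-123",
      payload: { pull_request: { number: 7, head: { sha: "abc123" } } },
      workflow: {
        name: "test",
        status: "in_progress",
        conclusion: null,
        url: "https://github.com/example-org/example-backend/actions/runs/1",
      },
    },
  },
};
```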
> This hop requires **no new code**. The existing infrastructure handles it:
>
> 1. **DynamoDB Streams** (enabled on the table with `NEW_AND_OLD_IMAGES`) captures every write
> 2. **`clickhouse-replicator-dynamo`** receives stream events and inserts `completed` records into ClickHouse (`in_progress` records are filtered out — they serve only as mutable state in DynamoDB)
`in_progress` records need to be replicated to ClickHouse too, not just `completed`. Current practice for in-tree CI is to update job status in ClickHouse (`in_progress` → `completed`), and OOT should follow the same pattern. SharedReplacingMergeTree already supports upserts. Without this, the `/pr/N` page wouldn't be able to show running jobs.
Fixed — both `in_progress` and `completed` records now replicate to ClickHouse. The replicator forwards all DynamoDB stream events without filtering. SharedReplacingMergeTree handles the upsert — `completed` replaces `in_progress` for the same `dynamoKey` after merges.
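The ReplacingMergeTree behavior relied on here can be modeled as "after merges, keep only the highest-version row per sorting key". A rough sketch of that semantics (the types and version field are illustrative, not the actual replicator or table definition):

```typescript
// Rough model of ReplacingMergeTree dedup after merges: rows sharing a
// sorting key (dynamoKey) collapse to the one with the highest version
// (e.g. an updated-at timestamp). Illustrative only.
interface OotRow {
  dynamoKey: string;
  status: "in_progress" | "completed";
  version: number;
}

function afterMerge(rows: OotRow[]): OotRow[] {
  const latest = new Map<string, OotRow>();
  for (const row of rows) {
    const seen = latest.get(row.dynamoKey);
    if (!seen || row.version > seen.version) latest.set(row.dynamoKey, row);
  }
  return [...latest.values()];
}
```

So an `in_progress` row followed by a `completed` row for the same key eventually yields a single `completed` row, which is what lets the `/pr/N` page show a job transition from running to done. (Note that in ClickHouse the collapse happens at unscheduled merge time, so queries typically need `FINAL` or explicit dedup to see this view eagerly.)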
> | 2 | Schema validation + payload caps (2MB max body) | `400 Bad Request` |
> | 3 | Write to DynamoDB | `502 Bad Gateway` |
>
> The HUD API endpoint (`torchci/pages/api/oot/results.ts`) follows the existing `webhookToDynamo` pattern:
Schema validation and field extraction should live in the Relay Server, not HUD. The relay already does OIDC, allowlist checks, and rate limiting — adding schema validation there keeps HUD simple (authenticate the relay, write what it's told) and avoids duplicating validation if other consumers of relay data appear later. HUD should still enforce a payload size cap as a safety net.
Moved — schema validation now lives in the relay (Hop 1, step 5). The HUD API endpoint is simplified to: validate the `X-OOT-Relay-Token` header, enforce the 2MB payload cap, extract/flatten the record, and do an UpdateItem write. HUD trusts that the relay has already validated the structure.
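The simplified endpoint logic described above can be sketched as follows. This is a hypothetical standalone function, not the real `torchci/pages/api/oot/results.ts` handler; the status-code mapping (403 for auth failure) and lowercase header key are assumptions, while the 2MB → 400 and DynamoDB → 502 mappings come from the data flow table.

```typescript
// Sketch: HUD as a thin "DB guardian" — auth, size cap, parse, write.
const MAX_BODY_BYTES = 2 * 1024 * 1024; // 2MB cap from the data flow table

function handleOotResult(
  headers: Record<string, string | undefined>,
  rawBody: string,
  expectedToken: string
): number {
  // 1. Auth: only the relay's dedicated write-path token is accepted
  //    (header keys assumed lowercased, as Node/Next.js normalize them).
  if (headers["x-oot-relay-token"] !== expectedToken) return 403;
  // 2. Payload cap — a sanity check; schema validation lives in the relay.
  if (new TextEncoder().encode(rawBody).length > MAX_BODY_BYTES) return 400;
  // 3. Parse; the real handler would then flatten and UpdateItem into
  //    DynamoDB, mapping a write failure to 502 per the data flow table.
  try {
    JSON.parse(rawBody);
  } catch {
    return 400;
  }
  return 200;
}
```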
> | Hop | From → To | Auth Mechanism | Credential Scope |
> |-----|-----------|----------------|------------------|
> | 1 | Downstream → Result Handler | OIDC token (GHA-issued, 5 min TTL) | No secrets needed by downstream |
> | 2 | Result Handler → HUD API | `X-Hud-Internal-Bot` header | Same pattern as DrCI / trymerge |
Let's use a dedicated auth header for the relay instead of `X-Hud-Internal-Bot`. That header is shared across multiple systems (DrCI, trymerge) and has only been used for read-only operations. Since this is a write path, giving the relay its own key limits the blast radius if any single one leaks.
Definitely @ZainRizvi — we have changed to `X-OOT-Relay-Token`, a dedicated header scoped only to the OOT relay write path. Added an explanation in the RFC that this isolates the write path, so key rotation or revocation doesn't affect other internal HUD consumers (DrCI, trymerge).
- Auth: `X-Hud-Internal-Bot` → dedicated `X-OOT-Relay-Token` header
- Validation: moved schema validation from HUD to the relay (Hop 1)
- Replication: both `in_progress` and `completed` records go to ClickHouse
- Timestamps: use downstream-reported `started_at`/`completed_at`
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to `{repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}`
- Schema: added `job_name`, `run_attempt` columns
- Rate limit: 10 → 20 req/min (matches the L2 PR default)
- Error handling: updated to match actual retry + raise behavior
- Sample payloads: rewritten to match the actual nested `{trusted, untrusted}` wire format
Hello @ZainRizvi, agreed — moved schema validation to the relay (Hop 1, step 5). The relay now validates the presence of `delivery_id`, `workflow.status`, and `conclusion` before forwarding. HUD is simplified to: auth (`X-OOT-Relay-Token`) + payload cap (2MB) + UpdateItem write. This aligns with the L2 PR's `result_handler.py`, which already validates these fields at the relay level.
Fixed — removed all "completed only" language. Both `in_progress` and `completed` records now replicate to ClickHouse. SharedReplacingMergeTree handles the deduplication — when the `completed` record arrives for the same `dynamoKey`, it replaces the `in_progress` row after merges. This matches in-tree CI behavior and allows the `/pr/N` page to show running jobs.
Changed everywhere — diagrams, data flow table, authentication flow table, code samples, and implementation plan. The relay now uses a dedicated `X-OOT-Relay-Token` header instead of the shared `X-Hud-Internal-Bot`. This isolates the OOT write path so rotating/revoking the token doesn't affect DrCI/trymerge.
Address @ZainRizvi's review on pytorch/rfcs#96:

- Auth: `X-Hud-Internal-Bot` → dedicated `X-OOT-Relay-Token` header
- Validation: removed schema validation from HUD (moved to the relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to `{repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}`
- Timestamps: use downstream-reported `started_at`/`completed_at` instead of `now()`
- Timing metrics: only set `queue_time`/`execution_time` when non-null
- ClickHouse schema: added `job_name`, `run_attempt` columns
- Queries: select `job_name`, `run_attempt` as proper columns
- Frontend: updated interfaces to include new fields
Summary
This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).
Data Flow
```mermaid
flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end
    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end
    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end
    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload validation\n• Payload caps (2MB)"]
    end
    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(completed only)")]
    end
    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end
    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-Hud-Internal-Bot\n{trusted, untrusted}"| API
    API -->|"PutItem"| DDB
    DDB --> STR --> REP -->|"completed only"| CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE
```

Key points:
- The `completed` callback payload flows through the Result Handler → HUD API → DynamoDB → ClickHouse
- HUD reads the `artifact_url` from ClickHouse and renders it as an external link — no direct connection between HUD and downstream storage
- Only `completed` records are replicated to ClickHouse; `in_progress` stays in DynamoDB for mutable state tracking

What this RFC covers
- `in_progress` callbacks → DynamoDB only (mutable state tracking)
- `completed` callbacks → DynamoDB → replicated to ClickHouse (dashboard queries)
- `/oot` — Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
- `/oot/[org]/[repo]` — Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
- `/pr/[number]` — Collapsible "Out-of-Tree Backends" section in existing PR pages

Reference implementation
A reference implementation is available at subinz1/test-infra#1, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.
Status
This is a WIP draft. Feedback welcome.