Add OHTTP-style anonymous inference endpoint by adambalogh · Pull Request #69 · OpenGradient/tee-gateway

adambalogh · 2026-05-13T01:41:16Z

Anonymous inference over Oblivious HTTP: clients submit chat completions through a relay that pays for them. The HPKE X25519 keypair is generated alongside the RSA signing key and bound to the same nitriding registration digest, so the Nitro attestation document commits to both.

/v1/ohttp is a thin wrapper around /v1/chat/completions — it HPKE-decrypts the inner request, re-issues it as an in-process WSGI sub-request against the chat endpoint, and HPKE-encrypts the response. All x402 payment, pricing, settlement and TEE response signing reuse the public chat code paths; no duplicated routing or pricing logic.
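The in-process sub-request idea can be sketched with the standard library alone. This is a minimal illustration, not the gateway's code: `wsgi_subrequest` and `demo_chat_app` are hypothetical names, and the stand-in app only mimics "pay or get a 402".

```python
import io
from wsgiref.util import setup_testing_defaults


def wsgi_subrequest(app, path, body, headers):
    """Call a WSGI app in-process and capture (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the mandatory WSGI keys
    environ.update({
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_TYPE": "application/json",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    })
    # Custom request headers become HTTP_* keys, e.g. X-Payment -> HTTP_X_PAYMENT.
    for name, value in headers.items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value

    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = int(status.split(" ", 1)[0])
        captured["headers"] = dict(response_headers)

    result = app(environ, start_response)
    try:
        payload = b"".join(result)
    finally:
        # WSGI requires calling close() when the iterable provides it;
        # middleware may hook cleanup work onto that call.
        if hasattr(result, "close"):
            result.close()
    return captured["status"], captured["headers"], payload


def demo_chat_app(environ, start_response):
    """Hypothetical stand-in for the chat endpoint: echoes the body if paid."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    request_body = environ["wsgi.input"].read(size)
    paid = "HTTP_X_PAYMENT" in environ
    status = "200 OK" if paid else "402 Payment Required"
    start_response(status, [("Content-Type", "application/json")])
    return [request_body if paid else b'{"error": "payment required"}']
```

Because the sub-request goes through the same WSGI callable as a public call, whatever middleware wraps that callable sees it like any other request, which is the property the description relies on.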

Implements RFC 9458 OHTTP for single-shot responses and draft-ietf-ohai-chunked-ohttp-08 for streaming. Fixed HPKE ciphersuite: DHKEM(X25519,HKDF-SHA256) / HKDF-SHA256 / ChaCha20-Poly1305.

Endpoints added

| Endpoint | Method | Purpose |
| --- | --- | --- |
| `/v1/ohttp` | POST | Anonymous chat completion (OHTTP-encapsulated, relay-paid). Body is raw HPKE ciphertext, not JSON. |
| `/v1/ohttp/config` | GET | HPKE key configuration (RFC 9458 key-config blob) for client discovery. |

Both are mounted via add_url_rule rather than the OpenAPI spec because the request body is raw binary and connexion's JSON validation would reject it.
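For orientation, here is a sketch of what an RFC 9458 key-config blob looks like for the fixed suite used in this PR. The layout (key ID, KEM ID, public key, symmetric algorithm list) follows the RFC and the numeric IDs come from the HPKE registry; the function names are hypothetical, and whether the endpoint returns the bare blob or wraps it in another format is not stated here.

```python
import struct

# HPKE registry identifiers (RFC 9180) for the suite named in this PR.
KEM_X25519_HKDF_SHA256 = 0x0020
KDF_HKDF_SHA256 = 0x0001
AEAD_CHACHA20_POLY1305 = 0x0003


def encode_key_config(key_id: int, public_key: bytes) -> bytes:
    """Serialize one key config: key_id | kem_id | pk | suites_len | suites."""
    if len(public_key) != 32:
        raise ValueError("X25519 public key must be 32 bytes")
    suites = struct.pack("!HH", KDF_HKDF_SHA256, AEAD_CHACHA20_POLY1305)
    return (
        struct.pack("!BH", key_id, KEM_X25519_HKDF_SHA256)
        + public_key
        + struct.pack("!H", len(suites))
        + suites
    )


def decode_key_config(blob: bytes):
    """Parse a key config that carries a 32-byte (X25519) public key."""
    if len(blob) < 1 + 2 + 32 + 2 + 4:
        raise ValueError("key config too short")
    key_id, kem_id = struct.unpack_from("!BH", blob, 0)
    public_key = blob[3:35]
    (suites_len,) = struct.unpack_from("!H", blob, 35)
    if suites_len % 4 or 37 + suites_len > len(blob):
        raise ValueError("bad symmetric algorithm list")
    suites = [
        struct.unpack_from("!HH", blob, 37 + offset)
        for offset in range(0, suites_len, 4)
    ]
    return key_id, kem_id, public_key, suites
```

A client would compare the decoded public key with the one covered by the attestation before encapsulating anything to it.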

Flow

1. Client fetches /v1/ohttp/config (HPKE pubkey, key_id, suite IDs) and verifies it against the Nitro attestation.
2. Client → Relay: HPKE-encapsulates a standard /v1/chat/completions JSON body (no envelope, no payment material) and POSTs the ciphertext to the relay.
3. Relay → Enclave: forwards the ciphertext as the body of POST /v1/ohttp and attaches its own X-Payment: <x402 payload> header.
4. Enclave: decrypts inner body → re-issues as a WSGI sub-request to /v1/chat/completions with the relay's X-Payment header → x402 verifies and settles → LLM call runs → TEE signs the response body.
5. Enclave → Relay: response mode dispatched by the inner stream flag (see table below).
6. Relay → Client: passes the sealed body through. Client decrypts and verifies the TEE signature embedded in the response body.

Response modes

| Mode | Outer content-type | Body |
| --- | --- | --- |
| `stream=false` | `message/ohttp-res` | Single-shot sealed body (RFC 9458 §4.5) |
| `stream=true` | `message/ohttp-chunked-res` | `response_nonce \|\| (varint(len) \|\| sealed_ct)+ \|\| varint(0) \|\| sealed_final_ct`; one OHTTP chunk per SSE event, AAD=`b"final"` on the last chunk so truncation is detectable (chunked-ohttp draft §3) |
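The chunked framing in the table can be illustrated without any cryptography. The sketch below covers only the QUIC varint length prefixes (RFC 9000 §16) and the frame walk; the chunks are treated as already-sealed opaque bytes, and all function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """QUIC variable-length integer: top two bits select 1/2/4/8 bytes."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    if value < 1 << 6:
        return value.to_bytes(1, "big")
    if value < 1 << 14:
        return (value | 0x4000).to_bytes(2, "big")
    if value < 1 << 30:
        return (value | 0x8000_0000).to_bytes(4, "big")
    if value < 1 << 62:
        return (value | 0xC000_0000_0000_0000).to_bytes(8, "big")
    raise ValueError("varint out of range")


def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, next_position)."""
    if pos >= len(buf):
        raise ValueError("truncated varint")
    length = 1 << (buf[pos] >> 6)  # 1, 2, 4 or 8 bytes
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    raw = int.from_bytes(buf[pos:pos + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), pos + length


def frame_chunks(response_nonce: bytes, sealed_chunks, sealed_final: bytes) -> bytes:
    """response_nonce || (varint(len) || ct)+ || varint(0) || final_ct."""
    out = bytearray(response_nonce)
    for ct in sealed_chunks:
        if not ct:
            raise ValueError("non-final chunks must be non-empty")
        out += encode_varint(len(ct)) + ct
    out += encode_varint(0) + sealed_final
    return bytes(out)


def split_frames(body: bytes, nonce_len: int):
    """Walk the frames; the final chunk is everything after the zero marker."""
    nonce, pos, chunks = body[:nonce_len], nonce_len, []
    while True:
        length, pos = decode_varint(body, pos)
        if length == 0:
            return nonce, chunks, body[pos:]
        if pos + length > len(body):
            raise ValueError("truncated chunk")
        chunks.append(body[pos:pos + length])
        pos += length
```

The framing alone only catches a stream cut before the zero-length marker; a shortened final chunk is caught by the AEAD check under AAD=`b"final"`, which this sketch does not perform.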

On non-2xx (e.g. 402 payment required) the body is forwarded plaintext so the relay can read x402 payment requirements and retry — those bodies never contain prompts or completions.

Billing

Both modes settle the actual cost via x402 against the relay's X-Payment (upto scheme); the gateway is the source of truth for the amount.

stream=false: the outer response exposes the settled-cost headers X-Inference-Cost-OPG (smallest units, the integer x402 actually charged), X-Inference-Cost-USD, and X-Inference-Price-OPG-USD — for the relay's own bookkeeping. Model name and token counts are deliberately NOT surfaced as outer headers: they would fingerprint the inner request and have no billing role. The sealed body still carries the full usage block for the client.
stream=true: no billing detail in outer headers (they ship before any body chunk, so cost isn't known at header-write time) and the sealed chunks are opaque to the relay. The relay reads the actual settled amount from x402 — by querying the facilitator with its X-Upto-Session, or via X-Payment-Response on its next call. The client still sees cost and per-token detail in the final SSE event inside the decrypted stream (the opengradient block written by the chat controller).
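As a rough illustration of the header policy above, the following sketch filters inner response headers down to the three settled-cost headers for single-shot responses and exposes none of them when streaming. The function name and the dict-based header representation are assumptions; x402 settlement and TEE signature headers are outside its scope.

```python
# Header names taken from the description above.
SETTLED_COST_HEADERS = (
    "X-Inference-Cost-OPG",
    "X-Inference-Cost-USD",
    "X-Inference-Price-OPG-USD",
)


def outer_billing_headers(inner_headers: dict, streaming: bool) -> dict:
    """Pick the billing headers that may appear on the outer response."""
    if streaming:
        # Headers are written before any chunk, so the cost is not known yet.
        return {}
    allowed = {name.lower() for name in SETTLED_COST_HEADERS}
    # Anything not on the allowlist (model name, token counts) is dropped.
    return {
        name: value
        for name, value in inner_headers.items()
        if name.lower() in allowed
    }
```

An allowlist fails closed: a header added to the chat endpoint later stays hidden from the relay until someone opts it in.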

Trust split

Relay terminates the client's TCP/TLS connection, so it does see the client's IP at the network layer — that's unavoidable. What the relay does NOT see is content: only the OHTTP-encapsulated ciphertext, its own wallet's x-payment material, and (single-shot only) the settled-cost outer headers it needs to bill its own customer.
Enclave sees plaintext prompts and completions (necessarily — it runs the LLM call), but at the network layer only sees the relay's IP, never the client's. This is the actual unlinkability property: the enclave cannot tie a plaintext request to a specific end user.
Client decrypts and verifies the TEE signature inside the response body against the attested public key.

Unlinkability between a client identity and a plaintext request holds unless the relay and the enclave collude (the relay would have to share its client-IP log alongside the enclave's plaintext log). Streaming additionally leaks per-chunk timing and length; clients who can't accept that signal should use stream=false.

Implements RFC 9458 Oblivious HTTP encapsulation so clients can submit chat completions through an independent relay without exposing their IP to the enclave or their prompt to the relay. The HPKE X25519 keypair is generated alongside the existing RSA signing key and bound to the same nitriding registration digest, so the Nitro attestation document commits to both.

- tee_gateway/ohttp.py: HPKE wrap/unwrap helpers (DHKEM(X25519)/HKDF-SHA256/ChaCha20-Poly1305). Response keying derived per-context per RFC 9458 §4.2.
- tee_gateway/tee_manager.py: HPKE keypair, key-config blob, attestation document now includes the HPKE public key.
- tee_gateway/controllers/ohttp_controller.py: /v1/ohttp dispatches the decrypted request to the existing chat handler, scrubs identifying fields before forwarding upstream, refuses stream=true.
- /v1/ohttp/config exposes the HPKE key config for client discovery.
- Test coverage: round-trip, wrong-suite, truncated input, tampered ciphertext.

Known limitation: payment gating is not yet wired for this endpoint; a blind-token layer will follow in a separate change.

https://claude.ai/code/session_01WyddtSz2rtiP61LtVJbsJy

* OHTTP: derive HPKE from TEE RSA key + gate /v1/ohttp behind x402

  * Replace the random os.urandom() seed for the HPKE keypair with an HKDF derivation from the RSA TEE private key (PKCS8 DER) salted with the RSA public DER. The HPKE keypair is now a deterministic function of the attested RSA key: anything that attests the RSA signing key implicitly covers the X25519 OHTTP key, with no separate randomness source to attest. Domain-separated info "og-tee-hpke-x25519-v1" pins the derivation to this use.
  * ohttp.generate_keypair() -> ohttp.derive_keypair(seed), with explicit >=32-byte seed validation. Tests cover deterministic output for the same seed and rejection of short seeds.
  * Add /v1/ohttp to the x402 payment middleware routes with the same CHAT_COMPLETIONS_OPG_SESSION_MAX_SPEND cap and upto scheme used by /v1/chat/completions. Anonymous inference is now metered identically to the public chat endpoint.
  * Bridge the encrypted request/response back to the token-based cost calculator via a thread-local set in the OHTTP controller. The calculator detects path=/v1/ohttp and uses the stashed plaintext inner request/response instead of the (unparseable) ciphertext bytes the middleware would otherwise see.
  * Fix the response-export length to max(Nn, Nk) per RFC 9458 §4.5; the prior _NK was equal here for ChaCha20-Poly1305 but would silently break under a different AEAD.

* Refactor /v1/ohttp as a thin WSGI wrapper around /v1/chat/completions

  Replace the parallel routing/pricing logic with an in-process WSGI sub-request: the OHTTP handler decrypts, dispatches the inner request as a POST /v1/chat/completions through the app's own wsgi_app, captures the status/headers/body, then encrypts and returns. Everything that already existed for the public chat endpoint (x402 payment verification, the pre-inference pricing gate, LangChain routing, post-inference cost settlement, TEE response signing) runs unchanged for OHTTP requests.

  * /v1/ohttp is no longer in the x402 RouteConfig table. Gating happens naturally when the sub-request hits /v1/chat/completions; the payment header travels inside the sealed envelope as `x-payment` so the relay never sees it.
  * The thread-local side channel and the OHTTP-specific branch in _session_cost_calculator are removed; there is now only one cost calculator path for the whole gateway.
  * Inner request envelope: `{"x-payment": "...", "body": {...}}`. Inner response envelope: `{"status": int, "headers": {...}, "body": ...}`, forwarding only x402/TEE settlement headers back to the client.
  * Pre-decap errors stay plaintext; post-decap errors are sealed so the relay can't distinguish failure modes by response shape.

* Revert HPKE key derivation; keep random HPKE keypair independent of RSA

  Reverts deriving the OHTTP X25519 keypair from the RSA TEE private key. The HPKE keypair is now freshly random per enclave boot (os.urandom(32) fed to pyhpke's DeriveKeyPair). The attestation binding still works because nitriding's transcript covers both public keys, but the two private keys no longer share a derivation surface: a compromise of one cannot be used to recover the other.

  * ohttp.generate_keypair() restored; ohttp.derive_keypair() removed.
  * tee_manager.TEEKeyManager no longer pulls HKDF; HPKE keypair is generated independently right after the RSA keypair.
  * Test for deterministic derivation replaced with an independence test that asserts two generate_keypair() calls return different pubkeys.

---------

Co-authored-by: Claude <noreply@anthropic.com>

Switches /v1/ohttp to the relay-pays model. The client encrypts only a chat-completion request (no payment material), and a relay between the client and the enclave supplies the x402 payment as a standard outer-request header. The enclave reads x-payment from the outer request, attaches it to the in-process sub-request to /v1/chat/completions, and lets the existing x402 middleware verify and settle exactly as it would for a public call.

* Inner plaintext is now bare chat-completion JSON; the {x-payment, body} envelope is gone since payment travels outside the seal.
* On 2xx the response body is still HPKE-sealed (it contains user prompts/completions), but the outer response surfaces token usage as headers so the relay can bill: X-Usage-Prompt-Tokens, X-Usage-Completion-Tokens, X-Usage-Total-Tokens, X-Usage-Model. x402 settlement and TEE signature headers are also forwarded.
* On non-2xx (402 payment required, validation errors) the body is forwarded as plaintext so the relay can read x402 payment requirements, retry with a larger payment, or surface errors. These bodies never contain user prompts/completions.
* Privacy: relay sees ciphertext + usage + settlement + relay-side wallet; never sees prompts, completions, or the client's IP. Unlinkability holds unless relay and enclave collude.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Adds streaming support per draft-ietf-ohai-chunked-ohttp-08. When the inner chat-completion request has stream=true, /v1/ohttp pipes the sub-request's SSE events through a chunked OHTTP encrypter and yields them as they arrive, instead of buffering. Non-streaming requests continue to use the existing single-shot RFC 9458 §4.5 path.

ohttp.py:

* QUIC varint encode/decode helpers (RFC 9000 §16).
* New _LABEL_CHUNKED_RESPONSE = "message/bhttp chunked response" and a second secret export at decap time; DecapsulatedRequest now carries response_key + response_key_chunked so the controller can decide which mode to use AFTER inspecting the decrypted body.
* ChunkedResponseEncrypter: response_nonce header, varint(len)||ct per chunk (AAD=""), zero-prefix final chunk (AAD=b"final") so truncation is detectable, per-chunk nonce = aead_nonce XOR encode_be(counter).
* Extracted _derive_response_keys() shared between single-shot and chunked paths (HKDF-Extract on enc||response_nonce, then Expand twice for "key" and "nonce").

ohttp_controller.py:

* Drop the stream=true rejection. Pass stream through to the inner sub-request and detect text/event-stream in the captured headers.
* _wsgi_subrequest now returns the raw iterator instead of draining, so the streaming path can pipe chunks through Flask without buffering. close() still invoked downstream to trigger x402 settlement.
* _build_streaming_response: look-ahead-by-one over the inner SSE iterator so the last event is sealed with AAD=b"final"; content-type message/ohttp-chunked-res; x402/TEE settlement headers forwarded.

Usage stats stay inside the encrypted stream (final SSE event); the relay bills via X-Upto-Session as usual.

Tests: varint round-trip across all 4 length classes, chunked response round-trip with a hand-rolled client-side decrypter that walks the varint frames and verifies AAD=b"final", double-finalize rejection. 96 unit tests total now passing.
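The key schedule described in this commit (HKDF-Extract on enc||response_nonce, two Expands, then a counter XORed into the nonce per chunk) can be sketched with `hmac` and `hashlib` only. This is an illustration under stated assumptions: the exported HPKE `secret` is taken as an input, the function names are hypothetical, and the lengths are those of ChaCha20-Poly1305.

```python
import hashlib
import hmac

NK = 32  # ChaCha20-Poly1305 key length in bytes
NN = 12  # ChaCha20-Poly1305 nonce length in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract with SHA-256 (RFC 5869)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand with SHA-256 (RFC 5869)."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_response_keys(secret: bytes, enc: bytes, response_nonce: bytes):
    """Return (aead_key, aead_nonce) from the exported secret."""
    prk = hkdf_extract(enc + response_nonce, secret)
    return hkdf_expand(prk, b"key", NK), hkdf_expand(prk, b"nonce", NN)


def chunk_nonce(aead_nonce: bytes, counter: int) -> bytes:
    """Per-chunk nonce: aead_nonce XOR big-endian counter of the same width."""
    counter_bytes = counter.to_bytes(len(aead_nonce), "big")
    return bytes(a ^ b for a, b in zip(aead_nonce, counter_bytes))
```

Each chunk gets a distinct nonce under the same key, so chunks cannot be reordered or replayed within a stream without failing authentication.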

Adds the two OHTTP endpoints to the API table and a concise section covering the relay-pays flow, the single-shot vs chunked response modes, billing channel for each mode, and the relay/enclave/client trust split. Refs RFC 9458 and draft-ietf-ohai-chunked-ohttp-08.

Mirrors scripts/test_bytedance.py but exercises /v1/ohttp end-to-end: fetches /v1/ohttp/config, cross-checks the HPKE pubkey against the /signing-key attestation document, HPKE-encapsulates a chat request, POSTs to /v1/ohttp, and decrypts the response. Supports both single-shot and chunked OHTTP (--stream); the chunked path decrypts the varint-framed sealed stream incrementally so you can see SSE events arrive in real time. Includes a hand-rolled QUIC varint reader so the script stays usable as a standalone client SDK reference. Usage examples in the module docstring.

The OpenAPI spec declares a global ApiKeyAuth requirement; connexion enforces it on /v1/chat/completions before any handler runs and returns 401 "No authorization token provided" when missing. Our WSGI sub-request from /v1/ohttp arrived without an Authorization header, so OHTTP requests bounced with 401 before reaching the chat backend. security_controller.info_from_ApiKeyAuth is an intentional passthrough (x402 is the real access control) so any token value satisfies the schema check. Forward the outer Authorization header to the sub-request when the relay supplied one, else inject a placeholder bearer token.

Don't forward the outer Authorization header to the chat sub-request — anything the relay attached there (API keys, JWT subjects, bearer tokens, ...) could re-identify the client and defeat unlinkability. A constant "Bearer ohttp" placeholder satisfies connexion's ApiKeyAuth schema check (security_controller is a passthrough; x402 is the real access control) and keeps every OHTTP request indistinguishable at this layer.
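A minimal sketch of that header policy, assuming headers are handled as plain dicts: the sub-request always carries the constant placeholder, never the relay's own Authorization value, and only the payment header is copied across. `subrequest_headers` is a hypothetical helper name.

```python
# Constant placeholder named in the commit message above.
PLACEHOLDER_AUTH = "Bearer ohttp"


def subrequest_headers(outer_headers: dict) -> dict:
    """Build headers for the inner chat sub-request from the outer request."""
    headers = {
        "Authorization": PLACEHOLDER_AUTH,  # identical for every OHTTP request
        "Content-Type": "application/json",
    }
    for name, value in outer_headers.items():
        # Only the relay's payment header crosses over; everything else,
        # including any outer Authorization, is dropped.
        if name.lower() == "x-payment":
            headers["X-Payment"] = value
    return headers
```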

Set TEE_GATEWAY_DEV_SKIP_X402 to "1" before /v1/keys is POSTed and the gateway will skip attaching the x402 payment middleware. This lets developers smoke-test /v1/chat/completions and /v1/ohttp locally without a reachable facilitator URL; without it, the middleware's first-request initialize() fails on facilitator DNS lookups. The gateway logs a WARNING when the flag is active, and it is explicitly NOT for production use.

Prints the request line, headers, the inner plaintext (clearly labeled as never-on-the-wire), then a breakdown of the encapsulated body: the 7-byte OHTTP header, the 32-byte ephemeral X25519 enc, and an xxd-style hex dump of the AEAD ciphertext. Makes it visually obvious that the relay only sees opaque sealed bytes — no prompt content, no model name, no API key, nothing.
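The breakdown this commit prints can be reproduced with a few lines of parsing. The sketch below assumes the RFC 9458 request layout for this suite (7-byte header, 32-byte X25519 `enc`, then AEAD ciphertext with at least a 16-byte Poly1305 tag); the type and function names are hypothetical.

```python
import struct
from typing import NamedTuple

HEADER_LEN = 7   # key_id (1) + kem_id (2) + kdf_id (2) + aead_id (2)
ENC_LEN = 32     # X25519 encapsulated key
TAG_LEN = 16     # Poly1305 tag: the shortest possible ciphertext


class EncapsulatedRequest(NamedTuple):
    key_id: int
    kem_id: int
    kdf_id: int
    aead_id: int
    enc: bytes
    ciphertext: bytes


def split_encapsulated_request(data: bytes) -> EncapsulatedRequest:
    """Split the wire bytes into header fields, enc, and sealed ciphertext."""
    if len(data) < HEADER_LEN + ENC_LEN + TAG_LEN:
        raise ValueError("encapsulated request too short")
    key_id, kem_id, kdf_id, aead_id = struct.unpack_from("!BHHH", data, 0)
    enc = data[HEADER_LEN:HEADER_LEN + ENC_LEN]
    ciphertext = data[HEADER_LEN + ENC_LEN:]
    return EncapsulatedRequest(key_id, kem_id, kdf_id, aead_id, enc, ciphertext)
```

Nothing in these fields depends on the prompt; the only content-related signal a relay gets from the request is its total length.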

This reverts commit 58908aa.

The v2 attestation transcript labels both the RSA SPKI and the X25519 HPKE pubkey, but the previous (self.hpke_public_key_raw or b"") fallback would silently produce a "v2"-labeled digest that actually only covers RSA whenever hpke_public_key_raw was None or empty. A verifier trusting the label would then accept an enclave whose HPKE key was never bound to attestation. Add an explicit length check (must be exactly 32 bytes) outside the broad try/except, so a real misconfiguration raises clearly instead of being masked as the "Could not register with nitriding (may not be in TEE)" warning. Today _generate_keys() always sets both keys so this is a defense-in-depth guard against future partial-init regressions.
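The guard can be illustrated as follows. This is only a sketch: the label strings, the length-prefixed encoding, and the function name are invented for the example and do not describe the actual nitriding transcript format. The point is the explicit length check that runs before any hashing.

```python
import hashlib


def v2_registration_digest(rsa_spki_der: bytes, hpke_public_key_raw) -> bytes:
    """Hash both public keys under labels; refuse to run without the HPKE key."""
    # Fail loudly instead of silently hashing an empty value under a v2 label.
    if hpke_public_key_raw is None or len(hpke_public_key_raw) != 32:
        raise ValueError("HPKE public key must be exactly 32 bytes")
    digest = hashlib.sha256()
    # Hypothetical labels and framing, chosen only for this illustration.
    for label, value in (
        (b"rsa-spki", rsa_spki_der),
        (b"hpke-x25519", hpke_public_key_raw),
    ):
        digest.update(len(label).to_bytes(2, "big") + label)
        digest.update(len(value).to_bytes(4, "big") + value)
    return digest.digest()
```

Length-prefixing each field keeps the two inputs from being confused with one another, so a verifier recomputing the digest commits to exactly one (RSA, HPKE) pair.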

decapsulate_request's docstring promised ValueError on malformed input, but recipient.open() raises pyhpke / cryptography exception types on AEAD tag failure, bad ephemeral keys, etc., so the contract was a lie. The error strings from those libraries can encode oracle information about which specific check failed (tag verification vs. length vs. KDF), which would turn the function into a padding-oracle-style side channel if any caller logged with exc_info=True. * Wrap the crypto path (create_recipient_context + open) and re-raise as ValueError("HPKE decapsulation failed") with `from None` so the underlying exception chain is suppressed entirely. Don't wrap the HKDF exports — those are deterministic and can't fail on valid input. * Bump the minimum input length to 7 + 32 + 16 so truncated inputs hit our own "too short" ValueError instead of whatever pyhpke would raise. * Tighten test_rejects_tampered_ciphertext from pytest.raises(Exception) to pytest.raises(ValueError, match="HPKE decapsulation failed") so the contract is enforced by tests, not just documented.
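The wrapping pattern is small enough to show in isolation. In this sketch `open_fn` stands in for the library call that performs the HPKE open (a hypothetical parameter, not pyhpke's API), and `LibraryTagError` is an invented exception used only to simulate a failure.

```python
class LibraryTagError(Exception):
    """Invented stand-in for a crypto library's own exception type."""


def open_sealed(open_fn, enc: bytes, ciphertext: bytes) -> bytes:
    """Run the crypto step and collapse every failure into one ValueError."""
    try:
        return open_fn(enc, ciphertext)
    except Exception:
        # `from None` clears __cause__ and sets __suppress_context__, so the
        # library's message is not rendered in tracebacks or exc_info logs.
        raise ValueError("HPKE decapsulation failed") from None


def failing_open(enc: bytes, ciphertext: bytes) -> bytes:
    raise LibraryTagError("tag mismatch at stage 2")
```

Callers and tests then match on a single exception type and message, regardless of which internal check failed.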

The previous wording said the relay "never sees the client's IP", which is wrong — in the relay-pays model the client connects directly to the relay, so the relay necessarily sees the client's IP at the network layer. The actual privacy property is that the ENCLAVE never sees the client's IP (it only sees the relay's), and the relay sees only the encapsulated ciphertext (plus billing metadata it needs), not the prompt or completion. Reword to spell out network position vs. compute position for each party and the precise unlinkability claim (and the collusion caveat).

Mirror the docstring correction in ohttp_controller.py: the relay does see the client's IP at the network layer (it terminates the TCP/TLS connection). What it doesn't see is request/response content. The unlinkability claim is that the ENCLAVE never sees the client's IP and therefore can't tie a plaintext request to a specific end user.

The request media type was defined for symmetry with the response constants but never read. Decapsulation itself is the security gate; the unauthenticated Content-Type header gives us nothing to enforce. The response constants (OHTTP_RESPONSE_MEDIA_TYPE, OHTTP_CHUNKED_RESPONSE_MEDIA_TYPE) are still in use.

Copilot

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

+    try:
+        tee = get_tee_keys()
+        return tee.get_hpke_config(), 200
+    except Exception as exc:
+        logger.error("HPKE config error: %s", exc, exc_info=True)
+        return {"error": "Failed to retrieve HPKE config"}, 500


+        from tee_gateway import __main__ as gateway_main
+
+        with self.assertLogs(gateway_main.logger, level="CRITICAL") as cm:
+            with self.assertRaises(Exception):
+                gateway_main._session_cost_calculator(
+                    {"response_json": {"id": "chatcmpl-x"}}
+                )


claude and others added 3 commits May 13, 2026 01:32

Update test_ohttp.py

5dcbdc8

lint

a8c5c89

adambalogh marked this pull request as ready for review May 15, 2026 22:02

adambalogh requested a review from Copilot May 15, 2026 22:37

Copilot started reviewing on behalf of adambalogh May 15, 2026 22:37 View session

claude and others added 2 commits May 15, 2026 23:15

Potential fix for pull request finding

9d55a8e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

adambalogh requested a review from Copilot May 15, 2026 23:50

Copilot started reviewing on behalf of adambalogh May 15, 2026 23:51 View session

adambalogh requested a review from Copilot May 16, 2026 13:05

Copilot started reviewing on behalf of adambalogh May 16, 2026 13:06 View session

claude added 11 commits May 16, 2026 13:18

Revert "Add TEE_GATEWAY_DEV_SKIP_X402 dev escape hatch"

366d2f5

This reverts commit 58908aa.

adambalogh requested a review from Copilot May 16, 2026 14:43

Copilot started reviewing on behalf of adambalogh May 16, 2026 14:44 View session

Copilot started reviewing on behalf of adambalogh May 16, 2026 15:51 View session

claude and others added 2 commits May 16, 2026 16:03

pricing

fe780d0

adambalogh requested a review from Copilot May 16, 2026 16:52

Copilot started reviewing on behalf of adambalogh May 16, 2026 16:52 View session

balogh.adam@icloud.com added 2 commits May 16, 2026 12:56

size limit

678ab00

lint

c2be6e1

adambalogh and others added 5 commits May 16, 2026 12:59

Potential fix for pull request finding

c81fe6e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

46ffb56

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

usage

f211297

todo

399c0bb

simplify pricing

7cf86da

adambalogh requested a review from Copilot May 16, 2026 17:49

Copilot started reviewing on behalf of adambalogh May 16, 2026 17:49 View session

balogh.adam@icloud.com and others added 5 commits May 16, 2026 13:54

cost

2130650

Potential fix for pull request finding

5f7942e

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

controller test

625354c

Merge branch 'claude/anonymous-inference-privacy-SgzWN' of github.com…

16d9fcb

…:OpenGradient/tee-gateway into claude/anonymous-inference-privacy-SgzWN

lint

dd19905

adambalogh requested a review from Copilot May 16, 2026 18:03

Copilot started reviewing on behalf of adambalogh May 16, 2026 18:04 View session

adambalogh and others added 2 commits May 16, 2026 14:09

Potential fix for pull request finding

61471cd

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

d44ab8a

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

adambalogh requested a review from Copilot May 17, 2026 00:21

Copilot started reviewing on behalf of adambalogh May 17, 2026 00:22 View session

Copilot AI reviewed May 17, 2026

adambalogh wants to merge 38 commits into main from claude/anonymous-inference-privacy-SgzWN


3 participants
