Add OHTTP-style anonymous inference endpoint #69
Open
adambalogh wants to merge 38 commits into
Implements RFC 9458 Oblivious HTTP encapsulation so clients can submit chat completions through an independent relay without exposing their IP to the enclave or their prompt to the relay. The HPKE X25519 keypair is generated alongside the existing RSA signing key and bound to the same nitriding registration digest, so the Nitro attestation document commits to both.
- tee_gateway/ohttp.py: HPKE wrap/unwrap helpers (DHKEM(X25519)/HKDF-SHA256/ChaCha20-Poly1305). Response keying derived per-context per RFC 9458 §4.2.
- tee_gateway/tee_manager.py: HPKE keypair, key-config blob, attestation document now includes the HPKE public key.
- tee_gateway/controllers/ohttp_controller.py: /v1/ohttp dispatches the decrypted request to the existing chat handler, scrubs identifying fields before forwarding upstream, refuses stream=true.
- /v1/ohttp/config exposes the HPKE key config for client discovery.
- Test coverage: round-trip, wrong-suite, truncated input, tampered ciphertext.

Known limitation: payment gating is not yet wired for this endpoint; a blind-token layer will follow in a separate change.
https://claude.ai/code/session_01WyddtSz2rtiP61LtVJbsJy
* OHTTP: derive HPKE from TEE RSA key + gate /v1/ohttp behind x402
* Replace the random os.urandom() seed for the HPKE keypair with an HKDF
derivation from the RSA TEE private key (PKCS8 DER) salted with the RSA
public DER. The HPKE keypair is now a deterministic function of the
attested RSA key — anything that attests the RSA signing key implicitly
covers the X25519 OHTTP key, with no separate randomness source to
attest. Domain-separated info "og-tee-hpke-x25519-v1" pins the
derivation to this use.
* ohttp.generate_keypair() -> ohttp.derive_keypair(seed), with explicit
>=32-byte seed validation. Tests cover deterministic output for the
same seed and rejection of short seeds.
* Add /v1/ohttp to the x402 payment middleware routes with the same
CHAT_COMPLETIONS_OPG_SESSION_MAX_SPEND cap and upto scheme used by
/v1/chat/completions. Anonymous inference is now metered identically
to the public chat endpoint.
* Bridge the encrypted request/response back to the token-based cost
calculator via a thread-local set in the OHTTP controller. The
calculator detects path=/v1/ohttp and uses the stashed plaintext
inner request/response instead of the (unparseable) ciphertext bytes
the middleware would otherwise see.
* Fix the response-export length to max(Nn, Nk) per RFC 9458 §4.5; the
prior _NK was equal here for ChaCha20-Poly1305 but would silently
break under a different AEAD.
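The HKDF derivation described in the first bullet of this commit can be sketched with the standard library alone. This is a minimal illustration, not the gateway's code: `hkdf_sha256` and `derive_hpke_seed` are hypothetical names, the byte strings passed in stand for the real PKCS8 and public-key DER encodings, and only the info label comes from the commit message. A later commit in this PR reverts to a random keypair.

```python
import hashlib
import hmac

# Domain-separation label quoted in the commit message.
HPKE_SEED_INFO = b"og-tee-hpke-x25519-v1"


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: Extract, then Expand to `length` bytes."""
    if not 0 < length <= 255 * 32:
        raise ValueError("length out of range for HKDF-SHA256")
    # RFC 5869: a missing salt is treated as HashLen zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_hpke_seed(rsa_private_der: bytes, rsa_public_der: bytes) -> bytes:
    """Hypothetical helper: a 32-byte seed that is a pure function of the RSA key.

    The private DER is the input keying material and the public DER is the salt,
    as the commit message describes.
    """
    return hkdf_sha256(rsa_private_der, rsa_public_der, HPKE_SEED_INFO, 32)
```

The same inputs always give the same seed, which is what lets an attestation of the RSA key cover the derived key. It is also why the two private keys share a derivation surface.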
* Refactor /v1/ohttp as a thin WSGI wrapper around /v1/chat/completions
Replace the parallel routing/pricing logic with an in-process WSGI sub-
request: the OHTTP handler decrypts, dispatches the inner request as a
POST /v1/chat/completions through the app's own wsgi_app, captures the
status/headers/body, then encrypts and returns. Everything that already
existed for the public chat endpoint — x402 payment verification, the
pre-inference pricing gate, LangChain routing, post-inference cost
settlement, TEE response signing — runs unchanged for OHTTP requests.
* /v1/ohttp is no longer in the x402 RouteConfig table. Gating happens
naturally when the sub-request hits /v1/chat/completions; the payment
header travels inside the sealed envelope as `x-payment` so the relay
never sees it.
* The thread-local side channel and the OHTTP-specific branch in
_session_cost_calculator are removed — there is now only one cost
calculator path for the whole gateway.
* Inner request envelope: `{"x-payment": "...", "body": {...}}`. Inner
response envelope: `{"status": int, "headers": {...}, "body": ...}`,
forwarding only x402/TEE settlement headers back to the client.
* Pre-decap errors stay plaintext; post-decap errors are sealed so the
relay can't distinguish failure modes by response shape.
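The in-process sub-request pattern from this commit can be sketched as below. It is an illustration under assumptions: `wsgi_subrequest` and `demo_app` are hypothetical names, and `demo_app` stands in for the gateway's real `wsgi_app` with its x402 middleware.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def wsgi_subrequest(wsgi_app, path: str, body: bytes, headers: dict):
    """Call `wsgi_app` in-process with a POST and capture (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ.update({
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_TYPE": "application/json",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    })
    for name, value in headers.items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value

    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = int(status.split(" ", 1)[0])
        captured["headers"] = dict(response_headers)
        return lambda data: None  # legacy write() callable required by WSGI

    result = wsgi_app(environ, start_response)
    try:
        payload = b"".join(result)
    finally:
        # close() gives middleware a chance to run post-response hooks.
        if hasattr(result, "close"):
            result.close()
    return captured["status"], captured["headers"], payload


def demo_app(environ, start_response):
    """Stand-in app: echoes the path, the model, and whether a payment header arrived."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    request = json.loads(environ["wsgi.input"].read(size))
    reply = {
        "path": environ["PATH_INFO"],
        "model": request["model"],
        "paid": "HTTP_X_PAYMENT" in environ,
    }
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(reply).encode()]
```

Because the call goes through the full WSGI callable, every middleware layer wrapped around the app sees the sub-request exactly as it would see an external one.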
* Revert HPKE key derivation; keep random HPKE keypair independent of RSA
Reverts deriving the OHTTP X25519 keypair from the RSA TEE private key.
The HPKE keypair is now freshly random per enclave boot (os.urandom(32)
fed to pyhpke's DeriveKeyPair). The attestation binding still works
because nitriding's transcript covers both public keys, but the two
private keys no longer share a derivation surface: a compromise of one
cannot be used to recover the other.
* ohttp.generate_keypair() restored; ohttp.derive_keypair() removed.
* tee_manager.TEEKeyManager no longer pulls HKDF; HPKE keypair is
generated independently right after the RSA keypair.
* Test for deterministic derivation replaced with an independence test
that asserts two generate_keypair() calls return different pubkeys.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Switches /v1/ohttp to the relay-pays model. The client encrypts only
a chat-completion request — no payment material — and a relay between
the client and the enclave supplies the x402 payment as a standard
outer-request header. The enclave reads x-payment from the outer
request, attaches it to the in-process sub-request to
/v1/chat/completions, and lets the existing x402 middleware verify
and settle exactly as it would for a public call.
* Inner plaintext is now bare chat-completion JSON; the {x-payment,
body} envelope is gone since payment travels outside the seal.
* On 2xx the response body is still HPKE-sealed (it contains user
prompts/completions), but the outer response surfaces token usage
as headers so the relay can bill: X-Usage-Prompt-Tokens,
X-Usage-Completion-Tokens, X-Usage-Total-Tokens, X-Usage-Model.
x402 settlement and TEE signature headers are also forwarded.
* On non-2xx (402 payment required, validation errors) the body is
forwarded as plaintext so the relay can read x402 payment
requirements, retry with a larger payment, or surface errors.
These bodies never contain user prompts/completions.
* Privacy: relay sees ciphertext + usage + settlement + relay-side
wallet; never sees prompts, completions, or the client's IP.
Unlinkability holds unless relay and enclave collude.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Adds streaming support per draft-ietf-ohai-chunked-ohttp-08. When the inner chat-completion request has stream=true, /v1/ohttp pipes the sub-request's SSE events through a chunked OHTTP encrypter and yields them as they arrive, instead of buffering. Non-streaming requests continue to use the existing single-shot RFC 9458 §4.5 path.
ohttp.py:
* QUIC varint encode/decode helpers (RFC 9000 §16).
* New _LABEL_CHUNKED_RESPONSE = "message/bhttp chunked response" and a second secret export at decap time; DecapsulatedRequest now carries response_key + response_key_chunked so the controller can decide which mode to use AFTER inspecting the decrypted body.
* ChunkedResponseEncrypter: response_nonce header, varint(len)||ct per chunk (AAD=""), zero-prefix final chunk (AAD=b"final") so truncation is detectable, per-chunk nonce = aead_nonce XOR encode_be(counter).
* Extracted _derive_response_keys() shared between single-shot and chunked paths (HKDF-Extract on enc||response_nonce, then Expand twice for "key" and "nonce").
ohttp_controller.py:
* Drop the stream=true rejection. Pass stream through to the inner sub-request and detect text/event-stream in the captured headers.
* _wsgi_subrequest now returns the raw iterator instead of draining, so the streaming path can pipe chunks through Flask without buffering. close() still invoked downstream to trigger x402 settlement.
* _build_streaming_response: look-ahead-by-one over the inner SSE iterator so the last event is sealed with AAD=b"final"; content-type message/ohttp-chunked-res; x402/TEE settlement headers forwarded.
Usage stats stay inside the encrypted stream (final SSE event); the relay bills via X-Upto-Session as usual.
Tests: varint round-trip across all 4 length classes, chunked response round-trip with a hand-rolled client-side decrypter that walks the varint frames and verifies AAD=b"final", double-finalize rejection. 96 unit tests total now passing.
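The two framing primitives named above, the QUIC varint and the per-chunk nonce, are small enough to sketch directly. The function names here are illustrative and may not match the helpers in ohttp.py.

```python
def encode_varint(value: int) -> bytes:
    """QUIC variable-length integer (RFC 9000 section 16)."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    if value < 1 << 6:
        return value.to_bytes(1, "big")
    if value < 1 << 14:
        return (value | 0x4000).to_bytes(2, "big")
    if value < 1 << 30:
        return (value | 0x8000_0000).to_bytes(4, "big")
    if value < 1 << 62:
        return (value | 0xC000_0000_0000_0000).to_bytes(8, "big")
    raise ValueError("varint exceeds 62 bits")


def decode_varint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Return (value, next_offset); raise ValueError on truncated input."""
    if offset >= len(buf):
        raise ValueError("truncated varint")
    # The top two bits of the first byte select a length of 1, 2, 4 or 8 bytes.
    length = 1 << (buf[offset] >> 6)
    if offset + length > len(buf):
        raise ValueError("truncated varint")
    value = buf[offset] & 0x3F
    for byte in buf[offset + 1 : offset + length]:
        value = (value << 8) | byte
    return value, offset + length


def chunk_nonce(aead_nonce: bytes, counter: int) -> bytes:
    """Per-chunk nonce: aead_nonce XOR the counter as a big-endian integer of equal length."""
    counter_bytes = counter.to_bytes(len(aead_nonce), "big")
    return bytes(a ^ b for a, b in zip(aead_nonce, counter_bytes))
```

For example, `encode_varint(15293)` yields the two bytes `7bbd`, matching the sample encoding in RFC 9000, and `chunk_nonce(nonce, 0)` returns the nonce unchanged.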
Adds the two OHTTP endpoints to the API table and a concise section covering the relay-pays flow, the single-shot vs chunked response modes, billing channel for each mode, and the relay/enclave/client trust split. Refs RFC 9458 and draft-ietf-ohai-chunked-ohttp-08.
Mirrors scripts/test_bytedance.py but exercises /v1/ohttp end-to-end: fetches /v1/ohttp/config, cross-checks the HPKE pubkey against the /signing-key attestation document, HPKE-encapsulates a chat request, POSTs to /v1/ohttp, and decrypts the response. Supports both single-shot and chunked OHTTP (--stream); the chunked path decrypts the varint-framed sealed stream incrementally so you can see SSE events arrive in real time. Includes a hand-rolled QUIC varint reader so the script stays usable as a standalone client SDK reference. Usage examples in the module docstring.
The OpenAPI spec declares a global ApiKeyAuth requirement; connexion enforces it on /v1/chat/completions before any handler runs and returns 401 "No authorization token provided" when missing. Our WSGI sub-request from /v1/ohttp arrived without an Authorization header, so OHTTP requests bounced with 401 before reaching the chat backend. security_controller.info_from_ApiKeyAuth is an intentional passthrough (x402 is the real access control) so any token value satisfies the schema check. Forward the outer Authorization header to the sub-request when the relay supplied one, else inject a placeholder bearer token.
Don't forward the outer Authorization header to the chat sub-request — anything the relay attached there (API keys, JWT subjects, bearer tokens, ...) could re-identify the client and defeat unlinkability. A constant "Bearer ohttp" placeholder satisfies connexion's ApiKeyAuth schema check (security_controller is a passthrough; x402 is the real access control) and keeps every OHTTP request indistinguishable at this layer.
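A minimal sketch of that header policy, assuming a hypothetical helper `build_subrequest_headers`; only the "Bearer ohttp" constant and the relay-supplied payment header come from the commit messages in this PR.

```python
PLACEHOLDER_AUTH = "Bearer ohttp"  # constant placeholder named in the commit message


def build_subrequest_headers(outer_headers: dict) -> dict:
    """Hypothetical helper: headers for the inner chat sub-request.

    The outer Authorization value is never copied. Every OHTTP sub-request
    carries the same placeholder so requests are indistinguishable here.
    """
    inner = {
        "Content-Type": "application/json",
        "Authorization": PLACEHOLDER_AUTH,
    }
    # The payment header is the relay's own material, so it is passed through.
    for name, value in outer_headers.items():
        if name.lower() == "x-payment":
            inner["X-Payment"] = value
    return inner
```

Building the inner header set from an allowlist, instead of deleting known-bad headers from a copy, means a new identifying header added by a relay is dropped by default.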
Set the env var to "1" before /v1/keys is POSTed and the gateway will skip attaching the x402 payment middleware. Lets developers smoke-test /v1/chat/completions and /v1/ohttp locally without a reachable facilitator URL — without it, the middleware's first-request initialize() blows up on facilitator DNS lookups. Logs a WARNING when active and is explicitly NOT for production use.
Prints the request line, headers, the inner plaintext (clearly labeled as never-on-the-wire), then a breakdown of the encapsulated body: the 7-byte OHTTP header, the 32-byte ephemeral X25519 enc, and an xxd-style hex dump of the AEAD ciphertext. Makes it visually obvious that the relay only sees opaque sealed bytes — no prompt content, no model name, no API key, nothing.
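The breakdown can be reproduced with a few lines of parsing. This sketch assumes the RFC 9458 header layout (1-byte key ID followed by three 2-byte suite identifiers, 7 bytes in total) and uses illustrative names. Its dump format is simpler than real xxd output.

```python
import struct
from typing import NamedTuple

HEADER_LEN = 7   # key_id (1) + kem_id (2) + kdf_id (2) + aead_id (2)
ENC_LEN = 32     # ephemeral X25519 public key


class EncapsulatedRequest(NamedTuple):
    key_id: int
    kem_id: int
    kdf_id: int
    aead_id: int
    enc: bytes
    ciphertext: bytes


def split_encapsulated_request(blob: bytes) -> EncapsulatedRequest:
    """Split an encapsulated request into header fields, enc, and AEAD ciphertext."""
    if len(blob) < HEADER_LEN + ENC_LEN:
        raise ValueError("encapsulated request too short")
    key_id, kem_id, kdf_id, aead_id = struct.unpack_from("!BHHH", blob, 0)
    enc = blob[HEADER_LEN:HEADER_LEN + ENC_LEN]
    ciphertext = blob[HEADER_LEN + ENC_LEN:]
    return EncapsulatedRequest(key_id, kem_id, kdf_id, aead_id, enc, ciphertext)


def hexdump(data: bytes, width: int = 16) -> str:
    """Offset-prefixed hex rows, one row per `width` bytes."""
    rows = []
    for offset in range(0, len(data), width):
        rows.append(f"{offset:08x}: {data[offset:offset + width].hex(' ')}")
    return "\n".join(rows)
```

Everything after the first 39 bytes is AEAD output, so nothing in the dump is readable without the enclave's private key.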
This reverts commit 58908aa.
The v2 attestation transcript labels both the RSA SPKI and the X25519 HPKE pubkey, but the previous (self.hpke_public_key_raw or b"") fallback would silently produce a "v2"-labeled digest that actually only covers RSA whenever hpke_public_key_raw was None or empty. A verifier trusting the label would then accept an enclave whose HPKE key was never bound to attestation. Add an explicit length check (must be exactly 32 bytes) outside the broad try/except, so a real misconfiguration raises clearly instead of being masked as the "Could not register with nitriding (may not be in TEE)" warning. Today _generate_keys() always sets both keys so this is a defense-in-depth guard against future partial-init regressions.
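The guard can be sketched as follows. The transcript encoding here is an assumption made for illustration (length-prefixed labels hashed with SHA-256); the real v2 transcript format, its labels, and the function name are not shown in this PR text.

```python
import hashlib

HPKE_PUBLIC_KEY_LEN = 32  # raw X25519 public key


def registration_digest_v2(rsa_spki_der: bytes, hpke_public_key_raw) -> bytes:
    """Hypothetical v2 digest over both public keys.

    Raises instead of falling back to b"" so a missing HPKE key can never
    produce a digest that is labeled v2 but only covers the RSA key.
    """
    if (
        not isinstance(hpke_public_key_raw, (bytes, bytearray))
        or len(hpke_public_key_raw) != HPKE_PUBLIC_KEY_LEN
    ):
        raise ValueError("HPKE public key must be exactly 32 bytes")

    digest = hashlib.sha256()
    # Assumed encoding: each field is a length-prefixed label plus a length-prefixed value.
    fields = ((b"rsa-spki", rsa_spki_der), (b"hpke-x25519", bytes(hpke_public_key_raw)))
    for label, value in fields:
        digest.update(len(label).to_bytes(2, "big") + label)
        digest.update(len(value).to_bytes(4, "big") + value)
    return digest.digest()
```

Calling this check before entering any broad try/except is what keeps a misconfiguration from being reported as the "may not be in TEE" warning.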
decapsulate_request's docstring promised ValueError on malformed input,
but recipient.open() raises pyhpke / cryptography exception types on
AEAD tag failure, bad ephemeral keys, etc., so the contract was a lie.
The error strings from those libraries can encode oracle information
about which specific check failed (tag verification vs. length vs. KDF),
which would turn the function into a padding-oracle-style side channel
if any caller logged with exc_info=True.
* Wrap the crypto path (create_recipient_context + open) and re-raise
as ValueError("HPKE decapsulation failed") with `from None` so the
underlying exception chain is suppressed entirely. Don't wrap the
HKDF exports — those are deterministic and can't fail on valid input.
* Bump the minimum input length to 7 + 32 + 16 so truncated inputs hit
our own "too short" ValueError instead of whatever pyhpke would raise.
* Tighten test_rejects_tampered_ciphertext from pytest.raises(Exception)
to pytest.raises(ValueError, match="HPKE decapsulation failed") so the
contract is enforced by tests, not just documented.
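The error-handling contract above can be illustrated without the real crypto backend. In this sketch `_backend_open` and `_BackendError` are hypothetical stand-ins for the pyhpke / cryptography calls and exception types; only the length bound and the error message come from the commit message.

```python
MIN_ENCAPSULATED_LEN = 7 + 32 + 16  # OHTTP header + X25519 enc + AEAD tag


class _BackendError(Exception):
    """Stand-in for the library-specific exception types."""


def _backend_open(blob: bytes) -> bytes:
    """Hypothetical stand-in for create_recipient_context + open."""
    if blob[-1] != 0:
        # A detailed message like this is what must not reach callers or logs.
        raise _BackendError("tag verification failed at byte 54")
    return b"plaintext"


def decapsulate_sketch(blob: bytes) -> bytes:
    """Raise ValueError, and only ValueError, on any malformed input."""
    if len(blob) < MIN_ENCAPSULATED_LEN:
        raise ValueError("encapsulated request too short")
    try:
        return _backend_open(blob)
    except Exception:
        # `from None` suppresses the chained exception so the library's
        # message does not show up in tracebacks logged with exc_info=True.
        raise ValueError("HPKE decapsulation failed") from None
```

Every crypto failure collapses to one message, so a caller or log reader cannot tell which check failed.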
The previous wording said the relay "never sees the client's IP", which is wrong — in the relay-pays model the client connects directly to the relay, so the relay necessarily sees the client's IP at the network layer. The actual privacy property is that the ENCLAVE never sees the client's IP (it only sees the relay's), and the relay sees only the encapsulated ciphertext (plus billing metadata it needs), not the prompt or completion. Reword to spell out network position vs. compute position for each party and the precise unlinkability claim (and the collusion caveat).
Mirror the docstring correction in ohttp_controller.py: the relay does see the client's IP at the network layer (it terminates the TCP/TLS connection). What it doesn't see is request/response content. The unlinkability claim is that the ENCLAVE never sees the client's IP and therefore can't tie a plaintext request to a specific end user.
The request media type was defined for symmetry with the response constants but never read. Decapsulation itself is the security gate; the unauthenticated Content-Type header gives us nothing to enforce. The response constants (OHTTP_RESPONSE_MEDIA_TYPE, OHTTP_CHUNKED_RESPONSE_MEDIA_TYPE) are still in use.
added 2 commits
May 16, 2026 12:56
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…:OpenGradient/tee-gateway into claude/anonymous-inference-privacy-SgzWN
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment on lines +400 to +405

    try:
        tee = get_tee_keys()
        return tee.get_hpke_config(), 200
    except Exception as exc:
        logger.error("HPKE config error: %s", exc, exc_info=True)
        return {"error": "Failed to retrieve HPKE config"}, 500
Comment on lines +214 to +220

    from tee_gateway import __main__ as gateway_main

    with self.assertLogs(gateway_main.logger, level="CRITICAL") as cm:
        with self.assertRaises(Exception):
            gateway_main._session_cost_calculator(
                {"response_json": {"id": "chatcmpl-x"}}
            )
Anonymous inference over Oblivious HTTP: clients submit chat completions through a relay that pays for them. The HPKE X25519 keypair is generated alongside the RSA signing key and bound to the same nitriding registration digest, so the Nitro attestation document commits to both.
`/v1/ohttp` is a thin wrapper around `/v1/chat/completions`: it HPKE-decrypts the inner request, re-issues it as an in-process WSGI sub-request against the chat endpoint, and HPKE-encrypts the response. All x402 payment, pricing, settlement and TEE response signing reuse the public chat code paths; no duplicated routing or pricing logic.

Implements RFC 9458 OHTTP for single-shot responses and draft-ietf-ohai-chunked-ohttp-08 for streaming. Fixed HPKE ciphersuite: DHKEM(X25519,HKDF-SHA256) / HKDF-SHA256 / ChaCha20-Poly1305.
Endpoints added
- `/v1/ohttp`
- `/v1/ohttp/config`

Both are mounted via `add_url_rule` rather than the OpenAPI spec because the request body is raw binary and connexion's JSON validation would reject it.

Flow
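1. The client fetches `/v1/ohttp/config` (HPKE pubkey, key_id, suite IDs) and verifies it against the Nitro attestation.
2. The client HPKE-encrypts a `/v1/chat/completions` JSON body (no envelope, no payment material) and POSTs the ciphertext to the relay.
3. The relay forwards the ciphertext to `POST /v1/ohttp` and attaches its own `X-Payment: <x402 payload>` header.
4. The enclave decrypts the request and dispatches an in-process sub-request to `/v1/chat/completions` with the relay's `X-Payment` header → x402 verifies and settles → LLM call runs → TEE signs the response body.
5. The enclave seals the response according to the inner request's `stream` flag (see Response modes below).

Response modes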
- `stream=false`: content type `message/ohttp-res`.
- `stream=true`: content type `message/ohttp-chunked-res`, framed as `response_nonce || (varint(len) || sealed_ct)+ || varint(0) || sealed_final_ct`. One OHTTP chunk per SSE event, with AAD=`b"final"` on the last chunk so truncation is detectable (chunked-ohttp draft §3).

On non-2xx (e.g. 402 payment required) the body is forwarded plaintext so the relay can read x402 payment requirements and retry; those bodies never contain prompts or completions.
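The chunked framing listed under Response modes can be walked on the client side roughly as follows. This is a sketch under assumptions: it only splits frames and does no decryption, the function names are illustrative, and the 32-byte default nonce length assumes a response nonce of max(Nn, Nk) bytes for ChaCha20-Poly1305.

```python
def _read_varint(buf: bytes, offset: int) -> tuple[int, int]:
    """Decode one QUIC varint (RFC 9000 section 16); return (value, next_offset)."""
    if offset >= len(buf):
        raise ValueError("stream truncated before final chunk")
    length = 1 << (buf[offset] >> 6)
    if offset + length > len(buf):
        raise ValueError("truncated varint")
    value = buf[offset] & 0x3F
    for byte in buf[offset + 1 : offset + length]:
        value = (value << 8) | byte
    return value, offset + length


def split_chunked_response(body: bytes, nonce_len: int = 32):
    """Return (response_nonce, non_final_chunks, final_chunk), all still sealed.

    A zero-length varint marks the final chunk, which runs to the end of the
    body. A body that ends without that marker is reported as truncated.
    """
    if len(body) < nonce_len:
        raise ValueError("missing response nonce")
    response_nonce, offset = body[:nonce_len], nonce_len
    chunks = []
    while True:
        length, offset = _read_varint(body, offset)
        if length == 0:
            return response_nonce, chunks, body[offset:]
        if offset + length > len(body):
            raise ValueError("truncated chunk")
        chunks.append(body[offset:offset + length])
        offset += length
```

A real client would open each non-final chunk with empty AAD and the final one with AAD `b"final"`, so a relay that cuts the stream short cannot pass off an earlier chunk as the last.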
Billing
Both modes settle the actual cost via x402 against the relay's `X-Payment` (`upto` scheme); the gateway is the source of truth for the amount.
- `stream=false`: the outer response exposes the settled-cost headers `X-Inference-Cost-OPG` (smallest units, the integer x402 actually charged), `X-Inference-Cost-USD`, and `X-Inference-Price-OPG-USD` for the relay's own bookkeeping. Model name and token counts are deliberately NOT surfaced as outer headers: they would fingerprint the inner request and have no billing role. The sealed body still carries the full `usage` block for the client.
- `stream=true`: no billing detail in outer headers (they ship before any body chunk, so cost isn't known at header-write time) and the sealed chunks are opaque to the relay. The relay reads the actual settled amount from x402, either by querying the facilitator with its `X-Upto-Session` or via `X-Payment-Response` on its next call. The client still sees cost and per-token detail in the final SSE event inside the decrypted stream (the `opengradient` block written by the chat controller).

Trust split
- … `x-payment` material, and (single-shot only) the settled-cost outer headers it needs to bill its own customer.

Unlinkability between a client identity and a plaintext request holds unless the relay and the enclave collude (the relay would have to share its client-IP log alongside the enclave's plaintext log). Streaming additionally leaks per-chunk timing and length; clients who can't accept that signal should use `stream=false`.