Add OHTTP-style anonymous inference endpoint #69

Open: adambalogh wants to merge 38 commits into main from claude/anonymous-inference-privacy-SgzWN

adambalogh commented May 13, 2026

Anonymous inference over Oblivious HTTP: clients submit chat completions through a relay that pays for them. The HPKE X25519 keypair is generated alongside the RSA signing key and bound to the same nitriding registration digest, so the Nitro attestation document commits to both.

/v1/ohttp is a thin wrapper around /v1/chat/completions — it HPKE-decrypts the inner request, re-issues it as an in-process WSGI sub-request against the chat endpoint, and HPKE-encrypts the response. All x402 payment, pricing, settlement and TEE response signing reuse the public chat code paths; no duplicated routing or pricing logic.

Implements RFC 9458 OHTTP for single-shot responses and draft-ietf-ohai-chunked-ohttp-08 for streaming. Fixed HPKE ciphersuite: DHKEM(X25519,HKDF-SHA256) / HKDF-SHA256 / ChaCha20-Poly1305.

Endpoints added

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /v1/ohttp | POST | Anonymous chat completion (OHTTP-encapsulated, relay-paid). Body is raw HPKE ciphertext, not JSON. |
| /v1/ohttp/config | GET | HPKE key configuration (RFC 9458 key-config blob) for client discovery. |

Both are mounted via add_url_rule rather than the OpenAPI spec because the request body is raw binary and connexion's JSON validation would reject it.

Flow

  1. Client fetches /v1/ohttp/config (HPKE pubkey, key_id, suite IDs) and verifies it against the Nitro attestation.
  2. Client → Relay: HPKE-encapsulates a standard /v1/chat/completions JSON body (no envelope, no payment material) and POSTs the ciphertext to the relay.
  3. Relay → Enclave: forwards the ciphertext as the body of POST /v1/ohttp and attaches its own X-Payment: <x402 payload> header.
  4. Enclave: decrypts inner body → re-issues as a WSGI sub-request to /v1/chat/completions with the relay's X-Payment header → x402 verifies and settles → LLM call runs → TEE signs the response body.
  5. Enclave → Relay: response mode dispatched by the inner stream flag (see table below).
  6. Relay → Client: passes the sealed body through. Client decrypts and verifies the TEE signature embedded in the response body.

Response modes

| Mode | Outer content-type | Body |
| --- | --- | --- |
| stream=false | message/ohttp-res | Single-shot sealed body (RFC 9458 §4.5) |
| stream=true | message/ohttp-chunked-res | `response_nonce \|\| (varint(len) \|\| sealed_ct)+ \|\| varint(0) \|\| sealed_final_ct`; one OHTTP chunk per SSE event, with AAD=b"final" on the last chunk so truncation is detectable (chunked-ohttp draft §3) |
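To make the chunked framing concrete, here is a minimal sketch of how a client might split a message/ohttp-chunked-res body into frames before decrypting. It assumes a 32-byte response_nonce (max(Nn, Nk) for ChaCha20-Poly1305) and QUIC varints per RFC 9000 §16. The function names are hypothetical and not taken from this PR, and no AEAD opening is performed here: truncation is only cryptographically detected when the final chunk is opened with AAD=b"final".

```python
NONCE_LEN = 32  # assumed: max(Nn, Nk) for ChaCha20-Poly1305


def encode_varint(value: int) -> bytes:
    """Encode an integer as a QUIC varint (RFC 9000 section 16)."""
    for bits, prefix in ((6, 0), (14, 1), (30, 2), (62, 3)):
        if value < (1 << bits):
            size = 1 << prefix  # 1, 2, 4 or 8 bytes
            return (value | (prefix << (size * 8 - 2))).to_bytes(size, "big")
    raise ValueError("value too large for varint")


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one QUIC varint; return (value, next_position)."""
    if pos >= len(buf):
        raise ValueError("truncated varint")
    first = buf[pos]
    length = 1 << (first >> 6)  # top two bits select the encoded length
    if pos + length > len(buf):
        raise ValueError("truncated varint")
    value = first & 0x3F
    for byte in buf[pos + 1 : pos + length]:
        value = (value << 8) | byte
    return value, pos + length


def split_chunked_response(body: bytes):
    """Return (response_nonce, non_final_chunks, final_chunk), still sealed."""
    if len(body) < NONCE_LEN:
        raise ValueError("body shorter than response_nonce")
    nonce, pos = body[:NONCE_LEN], NONCE_LEN
    chunks = []
    while True:
        length, pos = decode_varint(body, pos)
        if length == 0:
            # A zero length prefix marks the final chunk, which runs to the end.
            return nonce, chunks, body[pos:]
        if pos + length > len(body):
            raise ValueError("truncated chunk")
        chunks.append(body[pos : pos + length])
        pos += length
```

A streaming client would apply the same walk incrementally over the socket rather than over a fully buffered body.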

On non-2xx (e.g. 402 payment required) the body is forwarded as plaintext so the relay can read the x402 payment requirements and retry; those bodies never contain prompts or completions.

Billing

Both modes settle the actual cost via x402 against the relay's X-Payment (upto scheme); the gateway is the source of truth for the amount.

  • stream=false: the outer response exposes the settled-cost headers X-Inference-Cost-OPG (smallest units, the integer x402 actually charged), X-Inference-Cost-USD, and X-Inference-Price-OPG-USD — for the relay's own bookkeeping. Model name and token counts are deliberately NOT surfaced as outer headers: they would fingerprint the inner request and have no billing role. The sealed body still carries the full usage block for the client.
  • stream=true: no billing detail in outer headers (they ship before any body chunk, so cost isn't known at header-write time) and the sealed chunks are opaque to the relay. The relay reads the actual settled amount from x402 — by querying the facilitator with its X-Upto-Session, or via X-Payment-Response on its next call. The client still sees cost and per-token detail in the final SSE event inside the decrypted stream (the opengradient block written by the chat controller).
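As a rough illustration of the relay-side bookkeeping for the single-shot mode, the sketch below reads the three settled-cost headers named above. The function name is hypothetical, and the header value formats (an integer for the OPG amount, decimal strings for the USD fields) are assumptions that the PR text does not spell out.

```python
from decimal import Decimal


def read_settled_cost(status: int, headers: dict[str, str]):
    """Return settled-cost details for a single-shot 2xx response, else None."""
    lower = {k.lower(): v for k, v in headers.items()}
    if not 200 <= status < 300:
        return None  # non-2xx bodies are plaintext x402 errors, no cost headers
    content_type = lower.get("content-type", "").split(";")[0].strip()
    if content_type != "message/ohttp-res":
        return None  # streaming: cost is not known when headers are written
    try:
        return {
            # Smallest OPG units, the integer x402 actually charged.
            "cost_opg_units": int(lower["x-inference-cost-opg"]),
            # Assumed to be decimal strings; Decimal avoids float rounding.
            "cost_usd": Decimal(lower["x-inference-cost-usd"]),
            "price_opg_usd": Decimal(lower["x-inference-price-opg-usd"]),
        }
    except KeyError:
        return None
```

For stream=true the function returns None by design, and the relay would fall back to the x402 channels described above.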

Trust split

  • Relay terminates the client's TCP/TLS connection, so it does see the client's IP at the network layer; that is unavoidable. What the relay does NOT see is content: only the OHTTP-encapsulated ciphertext, its own wallet's X-Payment material, and (single-shot only) the settled-cost outer headers it needs to bill its own customer.
  • Enclave sees plaintext prompts and completions (necessarily — it runs the LLM call), but at the network layer only sees the relay's IP, never the client's. This is the actual unlinkability property: the enclave cannot tie a plaintext request to a specific end user.
  • Client decrypts and verifies the TEE signature inside the response body against the attested public key.

Unlinkability between a client identity and a plaintext request holds unless the relay and the enclave collude (the relay would have to share its client-IP log alongside the enclave's plaintext log). Streaming additionally leaks per-chunk timing and length; clients who can't accept that signal should use stream=false.

claude and others added 3 commits May 13, 2026 01:32
Implements RFC 9458 Oblivious HTTP encapsulation so clients can submit chat
completions through an independent relay without exposing their IP to the
enclave or their prompt to the relay. The HPKE X25519 keypair is generated
alongside the existing RSA signing key and bound to the same nitriding
registration digest, so the Nitro attestation document commits to both.

- tee_gateway/ohttp.py: HPKE wrap/unwrap helpers (DHKEM(X25519)/HKDF-SHA256/
  ChaCha20-Poly1305). Response keying derived per-context per RFC 9458 §4.2.
- tee_gateway/tee_manager.py: HPKE keypair, key-config blob, attestation
  document now includes the HPKE public key.
- tee_gateway/controllers/ohttp_controller.py: /v1/ohttp dispatches the
  decrypted request to the existing chat handler, scrubs identifying fields
  before forwarding upstream, refuses stream=true.
- /v1/ohttp/config exposes the HPKE key config for client discovery.
- Test coverage: round-trip, wrong-suite, truncated input, tampered ciphertext.

Known limitation: payment gating is not yet wired for this endpoint; a
blind-token layer will follow in a separate change.

https://claude.ai/code/session_01WyddtSz2rtiP61LtVJbsJy
adambalogh marked this pull request as ready for review May 15, 2026 22:02
* OHTTP: derive HPKE from TEE RSA key + gate /v1/ohttp behind x402

* Replace the random os.urandom() seed for the HPKE keypair with an HKDF
  derivation from the RSA TEE private key (PKCS8 DER) salted with the RSA
  public DER. The HPKE keypair is now a deterministic function of the
  attested RSA key — anything that attests the RSA signing key implicitly
  covers the X25519 OHTTP key, with no separate randomness source to
  attest. Domain-separated info "og-tee-hpke-x25519-v1" pins the
  derivation to this use.
* ohttp.generate_keypair() -> ohttp.derive_keypair(seed), with explicit
  >=32-byte seed validation. Tests cover deterministic output for the
  same seed and rejection of short seeds.
* Add /v1/ohttp to the x402 payment middleware routes with the same
  CHAT_COMPLETIONS_OPG_SESSION_MAX_SPEND cap and upto scheme used by
  /v1/chat/completions. Anonymous inference is now metered identically
  to the public chat endpoint.
* Bridge the encrypted request/response back to the token-based cost
  calculator via a thread-local set in the OHTTP controller. The
  calculator detects path=/v1/ohttp and uses the stashed plaintext
  inner request/response instead of the (unparseable) ciphertext bytes
  the middleware would otherwise see.
* Fix the response-export length to max(Nn, Nk) per RFC 9458 §4.5; the
  prior _NK was equal here for ChaCha20-Poly1305 but would silently
  break under a different AEAD.

* Refactor /v1/ohttp as a thin WSGI wrapper around /v1/chat/completions

Replace the parallel routing/pricing logic with an in-process WSGI sub-
request: the OHTTP handler decrypts, dispatches the inner request as a
POST /v1/chat/completions through the app's own wsgi_app, captures the
status/headers/body, then encrypts and returns. Everything that already
existed for the public chat endpoint — x402 payment verification, the
pre-inference pricing gate, LangChain routing, post-inference cost
settlement, TEE response signing — runs unchanged for OHTTP requests.
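The in-process sub-request pattern can be sketched with the standard library alone. This is not the PR's _wsgi_subrequest: wsgi_subrequest and demo_chat_app below are hypothetical stand-ins, and the sketch assumes the wrapped app calls start_response before returning its body iterator (as Flask's wsgi_app does).

```python
import io
import json


def wsgi_subrequest(app, path: str, body: bytes, headers: dict[str, str]):
    """POST `body` to a WSGI app in-process; return (status, headers, body_iter)."""
    environ = {
        "REQUEST_METHOD": "POST",
        "SCRIPT_NAME": "",
        "PATH_INFO": path,
        "QUERY_STRING": "",
        "SERVER_NAME": "localhost",
        "SERVER_PORT": "80",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "CONTENT_TYPE": "application/json",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "http",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": io.StringIO(),
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }
    # Request headers are exposed to WSGI apps as HTTP_* environ keys.
    for name, value in headers.items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value

    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = int(status.split(" ", 1)[0])
        captured["headers"] = dict(response_headers)
        return lambda data: None  # legacy write() callable, unused here

    body_iter = app(environ, start_response)
    return captured["status"], captured["headers"], body_iter


def demo_chat_app(environ, start_response):
    """Toy stand-in for the chat endpoint: requires an X-Payment header."""
    request = json.loads(environ["wsgi.input"].read())
    paid = "HTTP_X_PAYMENT" in environ
    status = "200 OK" if paid else "402 Payment Required"
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps({"echo": request["messages"]}).encode()]
```

Because the call goes through the app's normal WSGI entry point, any middleware wrapped around it runs exactly as it would for an external request, which is the property this commit relies on.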

* /v1/ohttp is no longer in the x402 RouteConfig table. Gating happens
  naturally when the sub-request hits /v1/chat/completions; the payment
  header travels inside the sealed envelope as `x-payment` so the relay
  never sees it.
* The thread-local side channel and the OHTTP-specific branch in
  _session_cost_calculator are removed — there is now only one cost
  calculator path for the whole gateway.
* Inner request envelope: `{"x-payment": "...", "body": {...}}`. Inner
  response envelope: `{"status": int, "headers": {...}, "body": ...}`,
  forwarding only x402/TEE settlement headers back to the client.
* Pre-decap errors stay plaintext; post-decap errors are sealed so the
  relay can't distinguish failure modes by response shape.

* Revert HPKE key derivation; keep random HPKE keypair independent of RSA

Reverts deriving the OHTTP X25519 keypair from the RSA TEE private key.
The HPKE keypair is now freshly random per enclave boot (os.urandom(32)
fed to pyhpke's DeriveKeyPair). The attestation binding still works
because nitriding's transcript covers both public keys, but the two
private keys no longer share a derivation surface: a compromise of one
cannot be used to recover the other.

* ohttp.generate_keypair() restored; ohttp.derive_keypair() removed.
* tee_manager.TEEKeyManager no longer pulls HKDF; HPKE keypair is
  generated independently right after the RSA keypair.
* Test for deterministic derivation replaced with an independence test
  that asserts two generate_keypair() calls return different pubkeys.

---------

Co-authored-by: Claude <noreply@anthropic.com>


claude and others added 2 commits May 15, 2026 23:15
Switches /v1/ohttp to the relay-pays model. The client encrypts only
a chat-completion request — no payment material — and a relay between
the client and the enclave supplies the x402 payment as a standard
outer-request header. The enclave reads x-payment from the outer
request, attaches it to the in-process sub-request to
/v1/chat/completions, and lets the existing x402 middleware verify
and settle exactly as it would for a public call.

* Inner plaintext is now bare chat-completion JSON; the {x-payment,
  body} envelope is gone since payment travels outside the seal.
* On 2xx the response body is still HPKE-sealed (it contains user
  prompts/completions), but the outer response surfaces token usage
  as headers so the relay can bill: X-Usage-Prompt-Tokens,
  X-Usage-Completion-Tokens, X-Usage-Total-Tokens, X-Usage-Model.
  x402 settlement and TEE signature headers are also forwarded.
* On non-2xx (402 payment required, validation errors) the body is
  forwarded as plaintext so the relay can read x402 payment
  requirements, retry with a larger payment, or surface errors.
  These bodies never contain user prompts/completions.
* Privacy: relay sees ciphertext + usage + settlement + relay-side
  wallet; never sees prompts, completions, or the client's IP.
  Unlinkability holds unless relay and enclave collude.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>


Adds streaming support per draft-ietf-ohai-chunked-ohttp-08. When the
inner chat-completion request has stream=true, /v1/ohttp pipes the
sub-request's SSE events through a chunked OHTTP encrypter and yields
them as they arrive, instead of buffering. Non-streaming requests
continue to use the existing single-shot RFC 9458 §4.5 path.

ohttp.py:
* QUIC varint encode/decode helpers (RFC 9000 §16).
* New _LABEL_CHUNKED_RESPONSE = "message/bhttp chunked response" and a
  second secret export at decap time; DecapsulatedRequest now carries
  response_key + response_key_chunked so the controller can decide
  which mode to use AFTER inspecting the decrypted body.
* ChunkedResponseEncrypter: response_nonce header, varint(len)||ct per
  chunk (AAD=""), zero-prefix final chunk (AAD=b"final") so truncation
  is detectable, per-chunk nonce = aead_nonce XOR encode_be(counter).
* Extracted _derive_response_keys() shared between single-shot and
  chunked paths (HKDF-Extract on enc||response_nonce, then Expand twice
  for "key" and "nonce").
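The key schedule described in this list can be sketched with hmac and hashlib only. The helper names are hypothetical (they are not the PR's _derive_response_keys), the exported HPKE secret is taken as an input rather than produced by a real HPKE context, and Nk=32 / Nn=12 are the ChaCha20-Poly1305 sizes.

```python
import hashlib
import hmac

NK, NN = 32, 12  # ChaCha20-Poly1305 key and nonce lengths


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract with SHA-256 (RFC 5869)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand with SHA-256 (RFC 5869)."""
    out, block, counter = b"", b"", 1
    while len(out) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]


def derive_response_keys(secret: bytes, enc: bytes, response_nonce: bytes):
    """Extract on enc || response_nonce, then Expand for "key" and "nonce"."""
    prk = hkdf_extract(enc + response_nonce, secret)
    return hkdf_expand(prk, b"key", NK), hkdf_expand(prk, b"nonce", NN)


def chunk_nonce(aead_nonce: bytes, counter: int) -> bytes:
    """Per-chunk nonce: aead_nonce XOR big-endian counter padded to Nn bytes."""
    counter_bytes = counter.to_bytes(len(aead_nonce), "big")
    return bytes(a ^ b for a, b in zip(aead_nonce, counter_bytes))
```

Since the counter is XORed into the base nonce, chunk 0 uses aead_nonce unchanged and every later chunk gets a distinct nonce under the same key.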

ohttp_controller.py:
* Drop the stream=true rejection. Pass stream through to the inner
  sub-request and detect text/event-stream in the captured headers.
* _wsgi_subrequest now returns the raw iterator instead of draining,
  so the streaming path can pipe chunks through Flask without
  buffering. close() still invoked downstream to trigger x402
  settlement.
* _build_streaming_response: look-ahead-by-one over the inner SSE
  iterator so the last event is sealed with AAD=b"final"; content-type
  message/ohttp-chunked-res; x402/TEE settlement headers forwarded.
  Usage stats stay inside the encrypted stream (final SSE event); the
  relay bills via X-Upto-Session as usual.
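The look-ahead-by-one idea can be shown in isolation. The generator below is a hypothetical helper, not the PR's _build_streaming_response: it holds back one event so that the caller learns which event is last before sealing it.

```python
from typing import Iterable, Iterator, Tuple


def mark_last(events: Iterable[bytes]) -> Iterator[Tuple[bytes, bool]]:
    """Yield (event, is_final) pairs, flagging only the last event as final."""
    it = iter(events)
    try:
        pending = next(it)
    except StopIteration:
        return  # empty stream: the caller must still emit a final chunk itself
    for event in it:
        # A newer event exists, so the held-back one cannot be the last.
        yield pending, False
        pending = event
    yield pending, True
```

A caller would seal each event with empty AAD while is_final is False and with AAD=b"final" once it is True. The cost is one event of added latency.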

Tests: varint round-trip across all 4 length classes, chunked
response round-trip with a hand-rolled client-side decrypter that
walks the varint frames and verifies AAD=b"final", double-finalize
rejection. 96 unit tests total now passing.


claude added 11 commits May 16, 2026 13:18
Adds the two OHTTP endpoints to the API table and a concise section
covering the relay-pays flow, the single-shot vs chunked response
modes, billing channel for each mode, and the relay/enclave/client
trust split. Refs RFC 9458 and draft-ietf-ohai-chunked-ohttp-08.

Mirrors scripts/test_bytedance.py but exercises /v1/ohttp end-to-end:
fetches /v1/ohttp/config, cross-checks the HPKE pubkey against the
/signing-key attestation document, HPKE-encapsulates a chat request,
POSTs to /v1/ohttp, and decrypts the response. Supports both single-
shot and chunked OHTTP (--stream); the chunked path decrypts the
varint-framed sealed stream incrementally so you can see SSE events
arrive in real time. Includes a hand-rolled QUIC varint reader so the
script stays usable as a standalone client SDK reference.

Usage examples in the module docstring.

The OpenAPI spec declares a global ApiKeyAuth requirement; connexion
enforces it on /v1/chat/completions before any handler runs and
returns 401 "No authorization token provided" when missing. Our
WSGI sub-request from /v1/ohttp arrived without an Authorization
header, so OHTTP requests bounced with 401 before reaching the
chat backend.

security_controller.info_from_ApiKeyAuth is an intentional
passthrough (x402 is the real access control) so any token value
satisfies the schema check. Forward the outer Authorization header
to the sub-request when the relay supplied one, else inject a
placeholder bearer token.

Don't forward the outer Authorization header to the chat sub-request:
anything the relay attached there (API keys, JWT subjects, bearer
tokens, ...) could re-identify the client and defeat unlinkability.
A constant "Bearer ohttp" placeholder satisfies connexion's
ApiKeyAuth schema check (security_controller is a passthrough; x402
is the real access control) and keeps every OHTTP request
indistinguishable at this layer.
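A minimal sketch of that header policy, assuming the only outer header that needs to reach the chat sub-request is X-Payment. The function name and the allow-list approach are illustrative and not taken from the PR.

```python
PLACEHOLDER_AUTH = "Bearer ohttp"  # constant, identical for every OHTTP request


def build_subrequest_headers(outer_headers: dict[str, str]) -> dict[str, str]:
    """Build sub-request headers from an allow-list, never from a deny-list."""
    lower = {k.lower(): v for k, v in outer_headers.items()}
    headers = {
        # The outer Authorization value is deliberately ignored.
        "Authorization": PLACEHOLDER_AUTH,
        "Content-Type": "application/json",
    }
    if "x-payment" in lower:
        headers["X-Payment"] = lower["x-payment"]
    return headers
```

Starting from an empty dict and copying only known-safe headers means a new identifying header added by a relay cannot leak through by default.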
Set the env var to "1" before /v1/keys is POSTed and the gateway will
skip attaching the x402 payment middleware. Lets developers smoke-test
/v1/chat/completions and /v1/ohttp locally without a reachable
facilitator URL — without it, the middleware's first-request
initialize() blows up on facilitator DNS lookups.

Logs a WARNING when active and is explicitly NOT for production use.

Prints the request line, headers, the inner plaintext (clearly labeled
as never-on-the-wire), then a breakdown of the encapsulated body:
the 7-byte OHTTP header, the 32-byte ephemeral X25519 enc, and an
xxd-style hex dump of the AEAD ciphertext. Makes it visually obvious
that the relay only sees opaque sealed bytes — no prompt content, no
model name, no API key, nothing.
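The byte layout described here can be picked apart with struct. This is a sketch, not the script's code: the function name is hypothetical, while the suite identifiers are the RFC 9180 values for DHKEM(X25519, HKDF-SHA256), HKDF-SHA256 and ChaCha20-Poly1305.

```python
import struct

# RFC 9180 identifiers for the fixed suite used by this endpoint.
EXPECTED_SUITE = (0x0020, 0x0001, 0x0003)  # KEM, KDF, AEAD
HEADER_LEN, ENC_LEN = 7, 32


def split_encapsulated_request(blob: bytes) -> dict:
    """Split an OHTTP request into header fields, enc, and AEAD ciphertext."""
    if len(blob) < HEADER_LEN + ENC_LEN:
        raise ValueError("encapsulated request too short")
    # 1-byte key_id followed by three big-endian 16-bit suite identifiers.
    key_id, kem_id, kdf_id, aead_id = struct.unpack("!BHHH", blob[:HEADER_LEN])
    if (kem_id, kdf_id, aead_id) != EXPECTED_SUITE:
        raise ValueError("unsupported HPKE suite")
    return {
        "key_id": key_id,
        "suite": (kem_id, kdf_id, aead_id),
        "enc": blob[HEADER_LEN : HEADER_LEN + ENC_LEN],  # ephemeral X25519 key
        "ciphertext": blob[HEADER_LEN + ENC_LEN :],      # sealed inner request
    }
```

Everything after the 39th byte is opaque without the enclave's HPKE private key, which is the point the hex dump is meant to make visible.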
The v2 attestation transcript labels both the RSA SPKI and the X25519
HPKE pubkey, but the previous (self.hpke_public_key_raw or b"") fallback
would silently produce a "v2"-labeled digest that actually only covers
RSA whenever hpke_public_key_raw was None or empty. A verifier trusting
the label would then accept an enclave whose HPKE key was never bound
to attestation.

Add an explicit length check (must be exactly 32 bytes) outside the
broad try/except, so a real misconfiguration raises clearly instead of
being masked as the "Could not register with nitriding (may not be in
TEE)" warning. Today _generate_keys() always sets both keys so this is
a defense-in-depth guard against future partial-init regressions.
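The guard can be illustrated as follows. The transcript encoding in this sketch (labels, length prefixes, SHA-256) is entirely hypothetical, since the PR text does not show the real v2 format. The only part meant to mirror the commit is the strict 32-byte check performed before any hashing, with no empty-bytes fallback.

```python
import hashlib

X25519_PUBLIC_KEY_LEN = 32


def registration_digest(rsa_spki_der: bytes, hpke_public_key_raw) -> bytes:
    """Hash both public keys into one digest; refuse a missing HPKE key."""
    if (
        not isinstance(hpke_public_key_raw, (bytes, bytearray))
        or len(hpke_public_key_raw) != X25519_PUBLIC_KEY_LEN
    ):
        # Fail loudly instead of hashing b"" under a "v2" label.
        raise ValueError("HPKE public key must be exactly 32 bytes")
    digest = hashlib.sha256()
    # Hypothetical labelled, length-prefixed encoding of each key.
    fields = ((b"rsa-spki", rsa_spki_der), (b"hpke-x25519", bytes(hpke_public_key_raw)))
    for label, value in fields:
        digest.update(len(label).to_bytes(2, "big") + label)
        digest.update(len(value).to_bytes(4, "big") + value)
    return digest.digest()
```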
decapsulate_request's docstring promised ValueError on malformed input,
but recipient.open() raises pyhpke / cryptography exception types on
AEAD tag failure, bad ephemeral keys, etc., so the contract was a lie.
The error strings from those libraries can encode oracle information
about which specific check failed (tag verification vs. length vs. KDF),
which would turn the function into a padding-oracle-style side channel
if any caller logged with exc_info=True.

* Wrap the crypto path (create_recipient_context + open) and re-raise
  as ValueError("HPKE decapsulation failed") with `from None` so the
  underlying exception chain is suppressed entirely. Don't wrap the
  HKDF exports — those are deterministic and can't fail on valid input.
* Bump the minimum input length to 7 + 32 + 16 so truncated inputs hit
  our own "too short" ValueError instead of whatever pyhpke would raise.
* Tighten test_rejects_tampered_ciphertext from pytest.raises(Exception)
  to pytest.raises(ValueError, match="HPKE decapsulation failed") so the
  contract is enforced by tests, not just documented.
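The error-handling contract in these bullets can be sketched without any HPKE library. Here open_fn is a hypothetical stand-in for the real recipient-context open call, and the function name is illustrative. The length check, the fixed message, and the `from None` suppression follow the commit text.

```python
MIN_LEN = 7 + 32 + 16  # OHTTP header + X25519 enc + AEAD tag


def decapsulate(blob: bytes, open_fn):
    """Open an encapsulated request, exposing only one uniform error."""
    if len(blob) < MIN_LEN:
        raise ValueError("encapsulated request too short")
    enc, ciphertext = blob[7:39], blob[39:]
    try:
        return open_fn(enc, ciphertext)
    except Exception:
        # `from None` drops the library exception from the chain, so logs
        # with exc_info=True cannot reveal which internal check failed.
        raise ValueError("HPKE decapsulation failed") from None
```

With `from None`, the raised ValueError has no `__cause__` and its implicit context is suppressed in tracebacks, which is what the tightened test asserts at the message level.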
The previous wording said the relay "never sees the client's IP", which
is wrong — in the relay-pays model the client connects directly to the
relay, so the relay necessarily sees the client's IP at the network
layer. The actual privacy property is that the ENCLAVE never sees the
client's IP (it only sees the relay's), and the relay sees only the
encapsulated ciphertext (plus billing metadata it needs), not the
prompt or completion.

Reword to spell out network position vs. compute position for each
party and the precise unlinkability claim (and the collusion caveat).

Mirror the docstring correction in ohttp_controller.py: the relay does
see the client's IP at the network layer (it terminates the TCP/TLS
connection). What it doesn't see is request/response content. The
unlinkability claim is that the ENCLAVE never sees the client's IP and
therefore can't tie a plaintext request to a specific end user.


claude and others added 2 commits May 16, 2026 16:03
The request media type was defined for symmetry with the response
constants but never read. Decapsulation itself is the security gate;
the unauthenticated Content-Type header gives us nothing to enforce.
The response constants (OHTTP_RESPONSE_MEDIA_TYPE,
OHTTP_CHUNKED_RESPONSE_MEDIA_TYPE) are still in use.
balogh.adam@icloud.com added 2 commits May 16, 2026 12:56


adambalogh and others added 5 commits May 16, 2026 12:59
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>


balogh.adam@icloud.com and others added 5 commits May 16, 2026 13:54
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…:OpenGradient/tee-gateway into claude/anonymous-inference-privacy-SgzWN


adambalogh and others added 2 commits May 16, 2026 14:09
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Comment on lines +400 to +405
try:
    tee = get_tee_keys()
    return tee.get_hpke_config(), 200
except Exception as exc:
    logger.error("HPKE config error: %s", exc, exc_info=True)
    return {"error": "Failed to retrieve HPKE config"}, 500
Comment on lines +214 to +220
from tee_gateway import __main__ as gateway_main

with self.assertLogs(gateway_main.logger, level="CRITICAL") as cm:
    with self.assertRaises(Exception):
        gateway_main._session_cost_calculator(
            {"response_json": {"id": "chatcmpl-x"}}
        )