diff --git a/docs/superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md b/docs/superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md new file mode 100644 index 000000000..9be1607dd --- /dev/null +++ b/docs/superpowers/specs/2026-04-22-rsl-ai-crawler-licensing-design.md @@ -0,0 +1,1055 @@ +# Trusted Server AI Crawler Licensing (RSL-compliant) + +*April 2026* + +--- + +## 1. Product Positioning & Scope + +### 1.1 What It Is + +An edge-deployed AI crawler detection and RSL licensing enforcement layer for +Trusted Server publishers. TS classifies incoming requests against published AI +crawler fingerprints, serves publisher-defined RSL licensing terms as +machine-readable XML, and enforces per-route access decisions with +standards-compliant HTTP responses (402 Payment Required for honest crawlers, +403 Forbidden for stealth or prohibited crawlers). + +### 1.2 What It Is Not + +- Not a payment/billing system (phase 2) +- Not a proprietary licensing protocol — RSL-native, standards-aligned +- Not a reverse proxy — runs in-line at the edge, no redirect, no subdomain +- Not a bot-blocking product in general — scoped to AI crawler licensing + +### 1.3 Audience + +**Primary buyer:** Large publishers and publisher consortiums (Hearst, Arena +Group, Condé Nast-tier) who already run TS or are evaluating it. + +**Secondary audience:** Entitlement platforms that want to transact with these +publishers at scale without bespoke per-publisher integrations. + +### 1.4 Value Proposition + +1. **RSL-native, standards-compliant** — publishes `license.xml`, honors OLP + where applicable, interoperates with any compliant entitlement platform. +2. **Edge-native, low-latency** — classification and enforcement at the CDN + layer, no proxy hop, no subdomain redirect. +3. **Multi-signal detection including JA4** — TLS-layer fingerprinting catches + crawlers that spoof user-agent strings. +4. **Publisher-owned config** — single `license.toml` file, version-controlled, + no lock-in to a vendor's dashboard. +5. **Open source** — publishers can audit the enforcement behavior. + +### 1.5 Reference to IAB Tech Lab CoMP + +The IAB Tech Lab Content Monetization Protocol (CoMP) is a complementary +commercial framework for AI content licensing. It is currently in working group +formation. This spec references CoMP only as a future-compatible framework; +this POC does not build against CoMP endpoints or schemas. When CoMP +stabilizes, the TS RSL implementation can extend to emit or consume CoMP +commercial signals without changing the core detection or enforcement layers. + +--- + +## 2. Scope Decisions + +### 2.1 In Scope (POC / MVP) + +1. AI crawler detection via six signals (see Section 4) +2. Publisher-authored public RSL terms via `license.toml` +3. Publisher-authored private enforcement rules via `license.private.toml` +4. `/license.xml` generation served from publisher's first-party domain +5. `/robots.txt` augmentation with `License:` directive +6. `Link: rel="license"` HTTP header on all responses +7. Enforcement actions: 200 (allow), 402 (honest crawler blocked), 403 (stealth + or prohibited) +8. Debug endpoints: `/_ts/debug/rsl/summary`, `/_ts/debug/rsl/recent`, + `/_ts/debug/rsl/license` +9. Structured logging of every classified request +10. Permissive-by-default mode with per-route or publisher-wide Strict override +11. Crawler-specific overrides (allow/deny/enforce-default) via private config +12. 
Integration with existing TS architecture without disrupting other + integrations (Monetize, Edge Cookie, consent, PBS, etc.) + +### 2.2 Out of Scope (Deferred) + +1. **OLP license server** — token issuance, `/token`/`/introspect`/`/key` + endpoints. Phase 2. +2. **Billing, invoicing, payment rails** — no money moves in the POC. RSL 402 + responses point to the publisher's contact information for out-of-band + negotiation. +3. **Encrypted Media Standard (EMS)** — content encryption requires the OLP + `/key` endpoint. Phase 2. +4. **Behavioral anomaly detection** — request rate/pattern analysis, path-depth + heuristics, referer-chain analysis. Future phase if POC metrics justify it. +5. **Publisher SaaS dashboard** — structured logs + debug endpoints only. + Publishers render their own visualizations from the log stream. +6. **AI-company cooperation agreements** — TS publishes standards-compliant + signals and assumes honest actors will respect them. TS does not negotiate + deals on publishers' behalf. +7. **Multi-publisher consortium management** — single publisher at a time for + POC. Consortium config patterns (Hearst-style) come post-POC. +8. **CoMP framework integration** — referenced only; not built against. +9. **Creative / content transformation for AI consumers** — TS does not rewrite + content for AI, generate summaries, or serve different versions. +10. **CAPTCHA / proof-of-human challenges** — Apple Private Access Tokens are + available in TS if a publisher wants cryptographic human attestation, but + not required for this POC. JA4 + IP + UA signals are sufficient for AI + crawler classification. + +--- + +## 3. Architecture + +### 3.1 High-Level + +```text +┌────────────────────────────────────────────────────────────────────┐ +│ Trusted Server — Edge Compute (WASM on Fastly today) │ +│ Planned: Akamai EdgeWorkers, Cloudflare Workers │ +│ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ Request arrives (human or crawler, any path) │ │ +│ └──────┬────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────┐ ┌──────────────────────┐ │ +│ │ RSL Classifier │ │ License.toml Loader │ │ +│ │ (JA4 + UA + IP + ASN)│ │ (route → terms) │ │ +│ └──────┬───────────────┘ └──────┬───────────────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌──────────────────────────────────────────────────────────┐ │ +│ │ Enforcement Decision │ │ +│ │ - classified as: {human | honest_ai | stealth_ai} │ │ +│ │ - mode: {permissive | strict} │ │ +│ │ - license terms for this route │ │ +│ └──────┬───────────────────────────────────────────────────┘ │ +│ │ │ +│ ├──> [allow] pass through to origin/integration │ +│ ├──> [402] RSL-compliant license-required response │ +│ └──> [403] forbidden (stealth or prohibited) │ +│ │ +│ Special routes (always served by TS): │ +│ /license.xml — generated from license.toml │ +│ /robots.txt — augmented with License: directive │ +│ /_ts/debug/rsl/summary — dashboard JSON │ +│ /_ts/debug/rsl/recent — recent classifications │ +│ /_ts/debug/rsl/license — verify published terms │ +└────────────────────────────────────────────────────────────────────┘ +``` + +### 3.2 Functional Units + +**RSL Classifier.** Given a request, returns a classification verdict: +`{category, confidence, signals_matched, crawler_identity}`. Pure function over +request features (TLS fingerprint, headers, source IP, ASN lookup, UA string). 
Consults static allowlists of published AI crawler fingerprints, refreshed
periodically from `openai.com/gptbot.json`, Anthropic's published ranges, and
equivalent operator feeds.

**License Resolver.** Given a request path, returns the applicable RSL terms
from `license.toml` (default + most-specific route override). Produces a
`LicenseTerms` struct used by both the enforcement layer and the
`/license.xml` generator.

**Enforcement Layer.** Takes the classification verdict, license terms, and
publisher mode; produces an `Action`: `{Allow, Challenge402(reason, link),
Forbid403(reason)}`. The action is applied before the request is forwarded to
origin, so blocked crawlers never trigger origin work.

### 3.3 Data Flow

```text
Request → Classifier → Verdict ──┐
                                 ├──> Enforcement → Action → Response
License.toml → Resolver → Terms ─┘                    │
                                                      ▼
                                               Structured log
                                                      +
                                            Debug endpoint state
```

### 3.4 State Footprint at Edge

- AI crawler IP/UA/JA4 allowlists: compiled into the WASM binary at build
  time, ~tens of KB, refreshed when TS is rebuilt with updated allowlists.
- `license.toml` and `license.private.toml`: compiled into the WASM binary at
  build time (same mechanism as other TS configs).
- Per-request classification: emitted as structured log lines, optionally
  counted in a small in-memory ring buffer for debug endpoints (no KV writes
  on the hot path).

### 3.5 Integration with Existing TS Infrastructure

| Existing capability | How RSL uses it |
|---|---|
| `IntegrationRegistration` builder | New hook types: `with_request_classifier`, `with_special_route_augmenter` |
| JA4 signal from edge TLS | Input to `classifier::classify()` |
| Bot gate (H2 + JA4) | Supporting signal for stealth detection |
| `/robots.txt` handling | Integration augments existing `robots.txt` response |
| `/_ts/debug/*` auth pattern | Debug endpoints reuse existing token auth |
| Structured logging (`log-fastly`) | Classification events emitted as structured log lines |
| Settings (`trusted-server.toml`) | RSL config block added to existing settings parser |

**No changes required to:** Edge Cookie, auction orchestrator / PBS integration,
Monetize ad-server client, consent handling, existing integrations (Permutive,
Lockr, Datadome, etc.), HTML processor, cache layer.

### 3.6 Request-Path Insertion Point

RSL classification runs **after** the bot gate (so RSL sees classified
bot-or-not state) but **before** any integration that might touch the response
body (so RSL can 402/403 before wasted work).

Integration registration:

```rust
IntegrationRegistration::builder("rsl")
    .with_request_classifier() // runs classifier, attaches verdict to request context
    .with_special_route("/license.xml")
    .with_special_route_augmenter("/robots.txt")
    .with_response_modifier() // adds Link header on all responses
    .with_debug_routes(&[
        "/_ts/debug/rsl/summary",
        "/_ts/debug/rsl/recent",
        "/_ts/debug/rsl/license",
    ])
    .build()
```

`with_request_classifier()` and `with_special_route_augmenter()` are new hook
types added to the integration framework. All other mechanisms are existing
patterns.
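To make the new hook concrete, here is a minimal sketch of the classifier
surface; the `RequestCtx`, `Verdict`, and `RequestClassifier` names and shapes
below are illustrative assumptions, not the shipped TS API:

```rust
/// Abbreviated from the full Classification enum in Section 4.2.
pub enum Classification {
    Human,
    HonestAiCrawler { operator: String, bot_name: String },
    StealthAiCrawler,
    Ambiguous,
}

/// Hypothetical bundle of request features the edge already has in hand.
pub struct RequestCtx {
    pub path: String,
    pub user_agent: Option<String>,
    pub client_ip: std::net::IpAddr,
    pub ja4: Option<String>, // TLS ClientHello fingerprint from the edge
    pub h2: bool,            // whether the client negotiated HTTP/2
}

/// Verdict attached to the request context by `with_request_classifier()`,
/// read later by the enforcement layer without re-running detection.
pub struct Verdict {
    pub classification: Classification,
    pub signals_matched: Vec<String>, // e.g. ["ua_honest", "ip_allowlist:openai"]
}

/// The hook contract itself: a pure function over request features (Section 3.2).
pub trait RequestClassifier {
    fn classify(&self, ctx: &RequestCtx) -> Verdict;
}
```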
### 3.7 Module Structure

```text
crates/trusted-server-core/src/rsl/
├── mod.rs              # public API + IntegrationRegistration
├── classifier.rs       # AI crawler detection logic
├── config/
│   ├── public.rs       # license.toml parser → LicenseTerms
│   └── private.rs      # license.private.toml parser → EnforcementRules
├── xml_generator.rs    # LicenseTerms → RSL XML document
├── fingerprints/
│   ├── ua_patterns.rs  # static honest-UA matchers
│   ├── ip_allowlists.rs # compiled-in crawler operator IP ranges
│   └── ja4_db.rs       # JA4 fingerprint database for LLM fetchers
├── enforcement.rs      # verdict + terms + mode → Action
├── endpoints.rs        # /license.xml, /robots.txt augmentation, debug routes
└── logging.rs          # structured log emission
```

### 3.8 Dependencies

- `quick-xml` (or an existing XML crate if one is already present) for
  `license.xml` generation
- No new heavy dependencies — the classifier is pure pattern-matching against
  compiled-in data structures; no ML, no external services

### 3.9 Binary Size Impact

- IP allowlists compiled in: ~30-50 KB (a few thousand CIDR ranges from major
  operators)
- JA4 fingerprint database: ~5-10 KB (a few hundred common LLM fetcher
  fingerprints)
- UA pattern table: negligible
- Total new code: estimated 2-3K lines of Rust, well within the existing crate
  structure

---

## 4. Bot Detection Signals

### 4.1 Six Signals

| # | Signal | Source | Strength | Coverage |
|---|---|---|---|---|
| 1 | **Honest User-Agent match** | HTTP `User-Agent` header | Definitive when paired with #2 | GPTBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, OAI-SearchBot, ChatGPT-User, Meta-ExternalAgent |
| 2 | **Published IP allowlist match** | JSON lists from crawler operators | Definitive when paired with #1 | openai.com/gptbot.json, openai.com/searchbot.json, openai.com/chatgpt-user.json, Anthropic published ranges, Perplexity ranges |
| 3 | **JA4 TLS fingerprint match** | TLS ClientHello at edge | Strong (catches spoofed UAs) | Common LLM fetcher libraries: Python `requests`, `aiohttp`, `httpx`, Go `net/http`, Node `fetch`, cURL, Scrapy, Playwright, Puppeteer |
| 4 | **ASN classification** | IP → ASN lookup | Supporting signal only (never decisive alone) | Datacenter/hosting ASNs (AWS, GCP, Azure, DigitalOcean, Hetzner, OVH), VPN/proxy ASNs, residential ASNs |
| 5 | **H2 handshake presence** | Edge TLS/HTTP layer | Supporting signal (humans nearly always H2; many scrapers still H1) | All traffic |
| 6 | **`/robots.txt` and `/license.xml` fetch correlation** | TS request logs | Supporting signal (honest bots fetch before crawling) | All traffic |

### 4.2 Classification Categories

```rust
pub enum Classification {
    /// Confirmed human or browser-class traffic.
    /// Signals: browser JA4 + H2 + consistent browsing patterns
    Human,

    /// Confirmed AI crawler with honest identity.
    /// Signals: UA match AND IP allowlist match (or JA4 match for known library)
    HonestAiCrawler {
        operator: String,   // "openai", "anthropic", "perplexity", etc.
        bot_name: String,   // "gptbot", "claude-user", "perplexitybot", etc.
        purpose: AiPurpose, // training, search, in-conversation, index
    },

    /// Strong AI crawler suspicion without honest identity.
    /// Signals: datacenter ASN + LLM-library JA4 + no H2 or irregular headers
    StealthAiCrawler {
        signals: Vec<Signal>,
        confidence: Confidence, // High | Medium | Low
    },

    /// Cannot classify with enough confidence.
    /// Default action depends on publisher mode (Permissive vs Strict).
    Ambiguous {
        signals: Vec<Signal>,
    },
}
```

### 4.3 Signal Refresh Cadence

- **IP allowlists from crawler operators:** fetched by a control-plane job
  from each operator's published JSON endpoint (e.g., `openai.com/gptbot.json`).
  Bundled into TS releases. Publishers pick up new IP ranges when they update
  to a newer TS version. Recommended publisher refresh cadence: weekly.
- **UA patterns:** static, updated via TS release.
- **JA4 fingerprint database:** static, updated via TS release
  (community-maintained list).
- **ASN database:** updated via MaxMind or equivalent on publisher's own
  schedule.

### 4.4 Example Log Entry for a Classified Request

```json
{
  "request_id": "01HXYZ...",
  "timestamp": "2026-04-22T14:32:11Z",
  "path": "/article/some-slug",
  "classification": "honest_ai_crawler",
  "operator": "anthropic",
  "bot_name": "claudebot",
  "purpose": "ai_train",
  "signals_matched": ["ua_honest", "ip_allowlist:anthropic"],
  "action": "403_forbidden",
  "action_reason": "license.toml prohibits ai-train for this route",
  "license_terms_applied": "default",
  "mode": "permissive"
}
```

### 4.5 Stealth Classification Example

A scraper running on AWS using Python `requests` with a Chrome user-agent
gets classified as `StealthAiCrawler`:

- Signal: `asn:aws` ✓
- Signal: `ja4:python_requests` ✓
- Signal: `ua_spoofed_chrome` (UA claims Chrome but JA4 says Python) ✓
- Confidence: High
- Action depends on mode: Permissive → allow through (but logged for publisher
  review); Strict → 403.

### 4.6 Detection Posture (Default vs. Override)

**Default:** Permissive. Block only confirmed crawlers whose license terms
prohibit access. Stealth crawlers and ambiguous traffic are allowed through
but logged for publisher review.

**Override:** Strict, configurable per-publisher or per-route. Stealth and
ambiguous traffic get 403. Use for high-value routes (premium content, APIs).

### 4.7 Transparency Model

- **Transparent to the publisher:** full classification detail in logs and
  debug endpoints (which signals matched, what the decision was, why).
- **Opaque to the crawler:** crawlers receive standards-compliant RSL responses
  (401/402/403 with `WWW-Authenticate: License` and `Link` header pointing to
  `license.xml`). Crawlers are not told which signals flagged them — this
  prevents adversarial training against TS detection.

---

## 5. Configuration

### 5.1 Two-File Split: Public vs Private

**`license.toml`** — PUBLIC RSL terms. Safe to commit to a public git repo.
Everything here ends up in `/license.xml`.

**`license.private.toml`** — PRIVATE enforcement rules. Never exposed via any
endpoint. Contains per-crawler commercial overrides, enforcement mode
configuration, and any NDA-bound IP allowlist extensions.

Both files are compiled into the WASM binary at build time (same mechanism as
existing TS configs). Any change requires a rebuild and redeploy via the
standard TS deploy pipeline. Deploy time is fast (standard
`fastly compute publish` flow — typically under a minute for the publish,
plus CI build time).

**Optional future enhancement (not POC):** runtime config loading from edge KV
or Fastly Config Store so terms can be updated without a redeploy. Deferred
because it introduces failure modes (KV availability, eventual consistency,
auth) that add complexity.
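To make the build-time mechanism concrete, here is a minimal sketch assuming
`include_str!` embedding and `serde`/`toml` parsing; the struct shapes, field
names, and file paths are illustrative, not the final TS schema:

```rust
use serde::Deserialize;
use std::collections::HashMap;

// Both TOML files are embedded into the WASM binary at compile time, so a
// terms change is a rebuild + redeploy and the hot path never reads KV.
// (Paths are hypothetical.)
const PUBLIC_TOML: &str = include_str!("../license.toml");
const PRIVATE_TOML: &str = include_str!("../license.private.toml");

#[derive(Deserialize)]
struct PublicConfig {
    publisher: Publisher,
    default: Terms,
    // Route-pattern overrides, e.g. [routes."/premium/*"]; unknown TOML
    // fields (amount, currency, ...) are ignored by default in serde.
    #[serde(default)]
    routes: HashMap<String, Terms>,
}

#[derive(Deserialize)]
struct Publisher {
    name: String,
    contact: String,
}

#[derive(Deserialize)]
struct Terms {
    #[serde(default)]
    permits: Vec<String>,
    #[serde(default)]
    prohibits: Vec<String>,
    payment: String,
}

// Parse once at startup from the embedded string; a malformed license.toml
// surfaces in CI or at deploy, not on a live request.
fn load_public_config() -> Result<PublicConfig, toml::de::Error> {
    toml::from_str(PUBLIC_TOML)
}
```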
+ +### 5.2 Public `license.toml` — Minimal Example + +```toml +# license.toml — publisher's RSL terms, read by Trusted Server +# Served as /license.xml (auto-generated), referenced in /robots.txt + +[publisher] +name = "Example Publisher, Inc." +contact = "licensing@example.com" +contact_url = "https://example.com/licensing" +copyright_holder = "Example Publisher, Inc." +copyright_type = "organization" + +# Default terms for all content not matching a more specific route +[default] +# What uses are allowed (RSL usage vocabulary) +permits = ["search", "ai-input"] +# What uses are explicitly prohibited +prohibits = ["ai-train"] +# Default payment model for permitted uses +payment = "attribution" +``` + +### 5.3 Public `license.toml` — Full Example With Route Overrides + +```toml +[publisher] +name = "Example Publisher, Inc." +contact = "licensing@example.com" +contact_url = "https://example.com/licensing" +copyright_holder = "Example Publisher, Inc." +copyright_type = "organization" + +# Default terms — homepage, category pages, public articles +[default] +permits = ["search", "ai-input"] +prohibits = ["ai-train", "ai-index"] +payment = "attribution" + +# Premium/paywalled content — strict, contact for licensing +[routes."/premium/*"] +permits = [] +prohibits = ["ai-all", "search"] +payment = "subscription" +amount = "10.00" +currency = "USD" + +# News archive — crawl fees apply +[routes."/archive/*"] +permits = ["ai-input", "ai-index"] +prohibits = ["ai-train"] +payment = "crawl" +amount = "0.005" +currency = "USD" + +# API endpoints — no AI use at all +[routes."/api/*"] +permits = [] +prohibits = ["ai-all"] +payment = "free" +``` + +### 5.4 Private `license.private.toml` — Example + +```toml +# Enforcement mode per route (not published — operational decision) +[enforcement] +default_mode = "permissive" + +[[enforcement.routes]] +pattern = "/premium/*" +mode = "strict" + +[[enforcement.routes]] +pattern = "/api/*" +mode = "strict" + +# Per-crawler commercial overrides — commercial secrets +[[crawler_overrides]] +bot_name = "gptbot" +action = "allow" +reason_internal = "Direct license agreement - contract #2026-OAI-001" + +[[crawler_overrides]] +bot_name = "perplexitybot" +action = "deny" +reason_internal = "Pending commercial agreement" + +[[crawler_overrides]] +bot_name = "claudebot" +action = "enforce_default" # apply license.toml terms as-is + +# Per-operator IP allowlist extensions (e.g., publisher has a direct feed from +# a crawler operator not in the default public list) +[[ip_allowlist_extensions]] +operator = "example_ai_partner" +cidrs = ["203.0.113.0/24", "198.51.100.0/24"] +note = "Partner under NDA — not publicly disclosed" +``` + +### 5.5 Key Config Design Points + +1. **`permits` / `prohibits` use RSL's usage vocabulary** — `search`, `ai-all`, + `ai-train`, `ai-input`, `ai-index`. Prohibition always wins when both apply + (per RSL spec). +2. **`payment` types match RSL's payment vocabulary** — `purchase`, + `subscription`, `training`, `crawl`, `use`, `contribution`, `attribution`, + `free`. +3. **Route patterns are RFC 9309-compliant** (same syntax as robots.txt) — + wildcards supported, more specific paths override less specific. +4. **`mode` per route lives in the private config** — permissive vs strict is + an operational decision, not a published term. +5. **`[crawler_overrides]` in private config** — per-bot exceptions for + publishers with direct commercial deals. 
`"allow"` bypasses enforcement
   entirely (they've paid via a separate contract); `"deny"` blocks regardless
   of terms; `"enforce_default"` applies the route's public terms (default
   behavior).
6. **No secrets in the public file** — `license.toml` can live in a public git
   repo if the publisher wants.

### 5.6 Trusted Server Settings

```toml
# trusted-server.toml
[integrations.rsl]
enabled = true
public_config = "license.toml"
private_config = "license.private.toml"
```

---

## 6. HTTP Response Behavior

### 6.1 Allowed Request (GPTBot on Route That Permits `ai-input`)

```http
GET /article/hello-world HTTP/2
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot
```

```http
HTTP/2 200 OK
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
Content-Type: text/html
...
...full content...
```

TS adds the `Link` header on every response so honest crawlers can discover
license terms on any request, not just by fetching `robots.txt` first.

### 6.2 Blocked Honest Crawler (ClaudeBot on Route That Prohibits `ai-train`)

```http
GET /premium/report HTTP/2
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ClaudeBot/1.0; +http://www.anthropic.com/claudebot
```

```http
HTTP/2 402 Payment Required
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
WWW-Authenticate: License realm="example.com", terms_url="https://example.com/licensing"
Content-Type: application/rsl+xml
Cache-Control: no-store

<rsl xmlns="https://rslstandard.org/rsl">
  <content url="https://example.com/premium/report">
    <license>
      <prohibits type="usage">ai-all</prohibits>
      <payment type="subscription">
        <amount currency="USD">10.00</amount>
      </payment>
    </license>
  </content>
</rsl>
```

The body is an inline RSL fragment describing exactly the terms the crawler
needs to satisfy. The RSL spec recommends this — crawlers don't have to
re-fetch `/license.xml` to understand what's required.

### 6.3 Blocked Stealth Crawler (Python Scraper With Spoofed Chrome UA)

```http
GET /article/hello-world HTTP/2
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36
```

```http
HTTP/2 403 Forbidden
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
Content-Type: text/plain

Forbidden.
```

No `WWW-Authenticate`, no explanation, no RSL fragment. Stealth crawlers get
minimum information. The `Link` header is still included for standards
compliance, but there's no negotiation invited.

### 6.4 Crawler-Specific Override — Direct License Deal

GPTBot with `[[crawler_overrides]] bot_name = "gptbot"` `action = "allow"` in
private config:

```http
GET /premium/report HTTP/2
User-Agent: Mozilla/5.0 ... GPTBot/1.1 ...
```

```http
HTTP/2 200 OK
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
Content-Type: text/html
...
```

GPTBot gets through even on the `/premium/*` route that prohibits `ai-all`,
because the private config has an override. The public `/license.xml` doesn't
reveal this — it still shows `/premium/*` as `prohibits: ai-all, subscription
$10`. Only the publisher's internal logs record that GPTBot was allowed due
to an override.

### 6.5 `/robots.txt` Response

```http
GET /robots.txt HTTP/2
```

```http
HTTP/2 200 OK
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
Content-Type: text/plain

License: https://example.com/license.xml

User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
```

TS preserves the publisher's existing `robots.txt` content and prepends the
`License:` directive. If the publisher doesn't have a `robots.txt` at origin,
TS generates a minimal one with just the License directive.

### 6.6 `/license.xml` Response (Public RSL Terms)

```http
GET /license.xml HTTP/2
```

```http
HTTP/2 200 OK
Content-Type: application/rsl+xml
Cache-Control: public, max-age=2592000
ETag: "v1-abc123"

<?xml version="1.0" encoding="UTF-8"?>
<rsl xmlns="https://rslstandard.org/rsl">
  <content url="https://example.com/">
    <license>
      <permits type="usage">search</permits>
      <permits type="usage">ai-input</permits>
      <prohibits type="usage">ai-train</prohibits>
      <prohibits type="usage">ai-index</prohibits>
      <payment type="attribution"/>
    </license>
    <copyright type="organization">Example Publisher, Inc.</copyright>
  </content>

  <content url="https://example.com/premium/*">
    <license>
      <prohibits type="usage">ai-all</prohibits>
      <prohibits type="usage">search</prohibits>
      <payment type="subscription">
        <amount currency="USD">10.00</amount>
      </payment>
    </license>
  </content>

  <content url="https://example.com/archive/*">
    <license>
      <permits type="usage">ai-input</permits>
      <permits type="usage">ai-index</permits>
      <prohibits type="usage">ai-train</prohibits>
      <payment type="crawl">
        <amount currency="USD">0.005</amount>
      </payment>
    </license>
  </content>

  <content url="https://example.com/api/*">
    <license>
      <prohibits type="usage">ai-all</prohibits>
      <payment type="free"/>
    </license>
  </content>
</rsl>
```

30-day cache — matches RSL's `max-age` default. Crawlers cache this and only
re-fetch when the ETag changes.

### 6.7 Full Response Matrix

| Classification | Route permits? | Action | Status | Body |
|---|---|---|---|---|
| Human | (n/a) | Allow | 200 | Normal content + `Link` header |
| Honest AI crawler | Yes | Allow | 200 | Normal content + `Link` header |
| Honest AI crawler | No | Block (polite) | 402 | RSL fragment with terms |
| Honest AI crawler | Private override: deny | Block | 403 | Minimal "Forbidden" |
| Honest AI crawler | Private override: allow | Allow | 200 | Normal content + `Link` header |
| Stealth AI crawler (strict mode) | (n/a) | Block | 403 | Minimal "Forbidden" |
| Stealth AI crawler (permissive mode) | (n/a) | Allow + log | 200 | Normal content, flagged for publisher review |
| Ambiguous (permissive mode) | (n/a) | Allow + log | 200 | Normal content |
| Ambiguous (strict mode) | (n/a) | Block | 403 | Minimal "Forbidden" |

---

## 7. Debug Endpoints & Structured Logs

### 7.1 Auth Model

All debug endpoints require `Authorization: Bearer $TS_DEBUG_TOKEN` — same
pattern as existing `/_ts/debug/*` routes. Publisher configures the token via
existing TS settings.

### 7.2 `GET /_ts/debug/rsl/summary`

Rolled-up classification counts for a configurable window.

```http
GET /_ts/debug/rsl/summary?window=24h HTTP/2
Authorization: Bearer $TS_DEBUG_TOKEN
```

```json
{
  "window": "last_24h",
  "generated_at": "2026-04-22T14:32:11Z",
  "totals": {
    "requests": 1843201,
    "human": 1832847,
    "honest_ai_crawler": 8932,
    "stealth_ai_crawler": 1217,
    "ambiguous": 205
  },
  "honest_crawlers": {
    "openai.gptbot": {"requests": 3412, "allowed": 3412, "blocked_402": 0, "blocked_403": 0},
    "openai.chatgpt-user": {"requests": 891, "allowed": 891, "blocked_402": 0, "blocked_403": 0},
    "anthropic.claudebot": {"requests": 2104, "allowed": 0, "blocked_402": 2104, "blocked_403": 0},
    "anthropic.claude-user": {"requests": 612, "allowed": 612, "blocked_402": 0, "blocked_403": 0},
    "perplexity.perplexitybot": {"requests": 1205, "allowed": 0, "blocked_402": 0, "blocked_403": 1205},
    "google.extended": {"requests": 478, "allowed": 478, "blocked_402": 0, "blocked_403": 0},
    "bytedance.bytespider": {"requests": 230, "allowed": 0, "blocked_402": 0, "blocked_403": 230}
  },
  "stealth_signals": {
    "ja4:python_requests_asn:aws": {"requests": 687, "action": "allowed_permissive"},
    "ja4:scrapy_asn:gcp": {"requests": 312, "action": "allowed_permissive"},
    "ja4:httpx_asn:digitalocean": {"requests": 218, "action": "blocked_403_strict"}
  },
  "top_paths_crawled": [
    {"path": "/article/*", "requests": 5831},
    {"path": "/archive/*", "requests": 2104},
    {"path": "/premium/*", "requests": 897}
  ]
}
```

### 7.3 `GET /_ts/debug/rsl/recent`

Last N classified requests, newest first. Backed by an in-process ring buffer
(no KV writes on hot path). Default 1000 entries, configurable.
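A minimal sketch of that buffer, assuming an execution model where it outlives
individual requests; the type and method names are illustrative:

```rust
use std::collections::VecDeque;

/// Illustrative in-process buffer backing /_ts/debug/rsl/recent.
/// Fixed capacity (default 1000); oldest entries are evicted, nothing is
/// persisted, so the request hot path never writes to KV.
pub struct RecentClassifications {
    capacity: usize,
    entries: VecDeque<String>, // serialized log lines, newest at the back
}

impl RecentClassifications {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::with_capacity(capacity) }
    }

    /// Record one classified request (called from the logging path).
    pub fn push(&mut self, log_line: String) {
        if self.entries.len() == self.capacity {
            self.entries.pop_front(); // evict the oldest entry
        }
        self.entries.push_back(log_line);
    }

    /// Newest first, as served by the debug endpoint's `limit` parameter.
    pub fn recent(&self, limit: usize) -> impl Iterator<Item = &String> {
        self.entries.iter().rev().take(limit)
    }
}
```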
```http
GET /_ts/debug/rsl/recent?limit=50&filter=honest_ai_crawler HTTP/2
Authorization: Bearer $TS_DEBUG_TOKEN
```

```json
{
  "generated_at": "2026-04-22T14:32:11Z",
  "entries": [
    {
      "request_id": "01HXYZ...",
      "timestamp": "2026-04-22T14:32:05Z",
      "path": "/article/hello-world",
      "classification": "honest_ai_crawler",
      "operator": "openai",
      "bot_name": "gptbot",
      "purpose": "ai_train",
      "signals_matched": ["ua_honest", "ip_allowlist:openai", "ja4:openai_fetcher"],
      "action": "200_allowed",
      "action_reason_public": "license permits ai-train on this route",
      "route_matched": "default",
      "mode": "permissive"
    },
    {
      "request_id": "01HXYZ...",
      "timestamp": "2026-04-22T14:32:03Z",
      "path": "/premium/q3-report",
      "classification": "honest_ai_crawler",
      "operator": "anthropic",
      "bot_name": "claudebot",
      "purpose": "ai_train",
      "signals_matched": ["ua_honest", "ip_allowlist:anthropic"],
      "action": "402_payment_required",
      "action_reason_public": "license prohibits ai-train on /premium/*",
      "route_matched": "/premium/*",
      "mode": "strict"
    }
  ]
}
```

### 7.4 `GET /_ts/debug/rsl/license`

Returns what TS is actually serving as `/license.xml`. Useful for debugging
"why isn't my route-specific term being applied?" without exposing the private
config.

```json
{
  "generated_at": "2026-04-22T14:32:11Z",
  "license_xml_etag": "v1-abc123",
  "source_file": "license.toml",
  "source_hash_sha256": "a1b2c3...",
  "rendered_xml": "<rsl xmlns=\"https://rslstandard.org/rsl\">…</rsl>"
}
```

---

## 9. Phase 2: OLP License Server

Deferred per Section 2.2; sketched here so the POC's RSL output stays
forward-compatible:

- Publisher declares `<license server="…">` in license.xml
- OLP server runs separately from the edge (not WASM) — typically a small
  service on Cloud Run, Fly.io, DigitalOcean, or equivalent
- Implements RSL OLP endpoints: `/token`, `/introspect`, `/key`
- Entitlement platform obtains tokens programmatically, presents
  `Authorization: License <token>` on crawl requests
- TS at the edge validates tokens via local HMAC check (fast path) or
  optional `/introspect` callback (stronger security, higher latency)

**Split architecture:**

- **Hot path (WASM at edge):** token validation only. HMAC check against a
  shared signing key. Sub-millisecond. No KV writes.
- **Cold path (separate service):** token issuance, billing, dashboards, key
  management for EMS-encrypted content. Writes to persistent storage.

This split means TS stays lightweight at the edge while the commercial layer
can scale independently. Phase 2 is out of POC scope but the POC's RSL output
already declares future-compatibility via the `server` attribute when enabled.

---

## 10. Success Criteria for POC

Set before running against real traffic:

1. **Classification accuracy:** 100% of honest AI crawlers (OpenAI, Anthropic,
   Perplexity, Google-Extended, CCBot) correctly identified by UA + IP
   allowlist signals. Verified against published crawler documentation.
2. **Zero human-traffic impact:** no latency regression, no CLS, no changes
   to Core Web Vitals, no false-positive blocks on human users.
3. **Standards compliance:** `/license.xml` validates against the RSL 1.0
   schema; 402 responses include `WWW-Authenticate: License` and an inline RSL
   fragment per spec.
4. **Publisher visibility:** dashboard summary populates within 5 minutes of
   deploy; recent endpoint shows live classifications.
5. **Deploy time:** onboarding from `[integrations.rsl] enabled = true` to
   live classification in under a day for an existing TS publisher.
6. **Binary size:** total RSL integration adds <100 KB to the WASM binary.

---

## 11. Open Questions

1. 
**Which JA4 fingerprint database do we bundle?** Community-maintained lists + exist but vary in quality. Recommend starting with a curated list of + ~100-200 known LLM fetcher fingerprints and expanding based on POC data. + +2. **How should `mode` be represented per-route in the public vs private + split?** The current design puts `mode` entirely in the private config. + Alternative: allow publishers to publish `mode` for transparency, under the + argument that stating "strict mode on /premium/*" is reasonable operational + disclosure. Default here is private for maximum commercial flexibility. + +3. **Should `/license.xml` include a `max-age` derived from config, or always + 30 days?** RSL default is 30 days. Publishers might want shorter (e.g., 1 + day) if they're actively iterating on terms. Recommend configurable with + 30-day default. + +4. **Do we include PSP/CoMP signals in license.xml today?** CoMP is in working + group formation. Safest: no CoMP-specific output today; add when the + framework stabilizes and there are real consumers of the signal. + +5. **How do we handle requests that hit `/license.xml` from obviously-stealth + clients?** Current design serves `/license.xml` to everyone (public + endpoint by definition). Consider rate-limiting or special handling if + abuse patterns emerge. + +6. **Which specific crawler operators get IP allowlists compiled into the + default POC release?** Proposed starting set: OpenAI (GPTBot, ChatGPT-User, + OAI-SearchBot), Anthropic (ClaudeBot, Claude-User, Claude-SearchBot), + Perplexity, Google-Extended, CCBot, Bytespider, Amazonbot, Applebot-Extended, + Meta-ExternalAgent. Others (CohereBot, xAI Grok, etc.) added in subsequent + releases as they publish verifiable IP ranges.