
feat(llm-obs): port non-org2 MCP trace and eval endpoints #489

Open
mbldatadog wants to merge 2 commits into DataDog:main from mbldatadog:worktree-port_mcp_functionality

Conversation


@mbldatadog mbldatadog commented May 11, 2026

Summary

Ports the non-org2-gated LLMObs MCP server endpoints to pup as first-class CLI commands, and adds evaluator CRUD via the unstable MCP endpoints and a `--summary` flag for span search.

Coverage-gap analysis that motivated this work: comparing the 26 MCP tools against existing pup commands post-v0.55.0 identified 15 missing; this PR implements the 8 non-org2-gated trace/span tools plus 7 eval commands.

Orphaned-span root cause writeup: https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516

(Notebook `edit` command split to #495 for notebook-team review.)

New commands

`pup llm-obs spans` — 6 new subcommands

| Command | What it does |
| --- | --- |
| `get-trace --trace-id` | Full span hierarchy tree with depth/error summary |
| `get-details --trace-id --span-ids` | Timing, cost metrics, children IDs; warns on stderr when a span is orphaned from the trace tree |
| `get-content --trace-id --span-id --field` | Raw content fields; `--field` is required (server returns 400 without it) |
| `find-errors --trace-id` | All error spans with type, message, and parent context |
| `expand --trace-id --span-ids` | Direct children with `has_input`/`has_output` flags |
| `get-agent-loop --trace-id` | Chronological agent execution steps |

`pup llm-obs evals` — 5 new subcommands (2 existing)

| Command | What it does |
| --- | --- |
| `get-evaluator` | Full LLM-judge config via MCP: span filters, sampling, scope, prompt, schema. Use before `create-or-update`; read/write body schemas differ by backend design |
| `get-config` | Prompt template, assessment criteria, output schema |
| `get-aggregate-stats` | Pass/fail rates and score distributions over a time window; `--ml-app` narrows to one app |
| `create-or-update --file` | Full-replace publish (flat body; see schema note below) |
| `delete` | Removes named evaluator |

`pup llm-obs spans search --summary`

Strips `tags`, `llm_info`, and content previews from each span, keeping 11 essential fields (`span_id`, `trace_id`, `apm_trace_id`, `name`, `span_kind`, `ml_app`, `service`, `status`, `duration_ms`, `start_ms`, `parent_id`). Reduces payload ~80% for bulk analysis phases.
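The projection behaves roughly like the following sketch (Python for brevity; pup itself is Rust, and `summarize_span` is a hypothetical helper, not part of the codebase — only the field list comes from this PR):

```python
# Sketch of the --summary projection: keep only the 11 essential fields,
# dropping tags, llm_info, and content previews.
SUMMARY_FIELDS = (
    "span_id", "trace_id", "apm_trace_id", "name", "span_kind",
    "ml_app", "service", "status", "duration_ms", "start_ms", "parent_id",
)

def summarize_span(span: dict) -> dict:
    """Project a full span record down to the summary fields."""
    return {k: span[k] for k in SUMMARY_FIELDS if k in span}

full = {
    "span_id": "abc", "trace_id": "t1", "name": "llm.call",
    "status": "ok", "tags": ["env:prod"], "llm_info": {"model": "m"},
}
slim = summarize_span(full)
assert "tags" not in slim and "llm_info" not in slim
```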

Changes

  • `src/commands/llm_obs.rs` — 13 new handler functions, 79 tests (26 new)
  • `src/main.rs` — new enum variants and routing for all new commands

Orphaned-span warning

`get-details` warns on stderr when fewer spans are returned than requested:

```
warning: 1 of 1 requested span(s) not found in trace hierarchy — the span may
exist but be orphaned (no path to a root span). Use 'spans get-content' to
retrieve its content directly.
```

Root cause: `spans search` returns spans by raw `@trace_id` (includes orphaned spans); `get-details` reconstructs the BFS tree and silently drops spans unreachable from any root. The response key path is `resp["spans"]` (raw API) not `resp["data"]["spans"]` (agent-mode formatter envelope) — regression test added.
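The orphan-dropping behavior can be illustrated with a small sketch (Python for brevity; the helper name is hypothetical): spans reachable by BFS from a root survive, while spans whose parent chain never reaches a root are silently dropped.

```python
from collections import deque

def reachable_from_roots(spans: list[dict]) -> set[str]:
    """BFS from root spans (parent_id is None); returns reachable span IDs.
    Spans whose parent chain never reaches a root are 'orphaned'."""
    children: dict = {}
    for s in spans:
        children.setdefault(s["parent_id"], []).append(s["span_id"])
    seen: set[str] = set()
    queue = deque(children.get(None, []))  # start from the roots
    while queue:
        sid = queue.popleft()
        if sid in seen:
            continue
        seen.add(sid)
        queue.extend(children.get(sid, []))
    return seen

spans = [
    {"span_id": "root", "parent_id": None},
    {"span_id": "child", "parent_id": "root"},
    {"span_id": "orphan", "parent_id": "missing-parent"},  # unreachable
]
assert reachable_from_roots(spans) == {"root", "child"}
```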

Schema note for `evals create-or-update`

The `--file` uses a flat body (all fields top-level: `application_name`, `enabled`, `prompt_template`, `output_schema`, etc.), which differs from the nested structure returned by `get-evaluator` (`target.application_name`, `llm_judge_config.prompt_template`, `llm_provider.*`). This is a backend design decision — the read and write APIs use different shapes.
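The shape mismatch looks roughly like this (illustrative only: the field paths come from the paragraph above, the full schemas contain more fields, and this lifting helper is not part of pup):

```python
# Nested read shape (get-evaluator) vs. flat write shape
# (create-or-update --file). Truncated to a few fields for illustration.
nested_read = {
    "target": {"application_name": "docs_ai"},
    "llm_judge_config": {"prompt_template": [{"role": "system", "content": "..."}]},
    "llm_provider": {"integration_provider": "openai", "model_name": "gpt-4.1-mini"},
}

# Hypothetical lifting: hoist nested fields to the flat top-level write body.
flat_write = {
    "application_name": nested_read["target"]["application_name"],
    "prompt_template": nested_read["llm_judge_config"]["prompt_template"],
    **nested_read["llm_provider"],
}
assert flat_write["application_name"] == "docs_ai"
```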

Testing

Automated

  • `cargo test` — 79 tests, all passing
  • `cargo clippy -- -D warnings` — clean
  • `cargo fmt --check` — clean

Manual smoke tests (run against org2)

Trace/span IDs age out of the index quickly — substitute fresh IDs from `pup llm-obs spans search --limit 1` as needed.

`evals get-evaluator`

  • `pup llm-obs evals get-evaluator failure-to-answer-verdict` — returns full LLM-judge config (prompt, schema, target, provider)
  • `pup llm-obs evals get-evaluator nonexistent-eval-xyz` — 404 "custom evaluator not found: …" (exit 1)

`evals get-config`

  • `pup llm-obs evals get-config failure-to-answer-verdict` — returns prompt template + output schema
  • `pup llm-obs evals get-config nonexistent-eval-xyz` — 404 (exit 1)

`evals get-aggregate-stats`

  • `pup llm-obs evals get-aggregate-stats failure-to-answer-verdict --ml-app docs_ai --from 24h` — returns pass rate, fail count, top categorical values
  • `pup llm-obs evals get-aggregate-stats failure-to-answer-verdict --from not-a-time` — local parse error before any network call (exit 1)

`evals create-or-update` + `delete` (creates and cleans up a disposable evaluator)
```bash
cat > /tmp/pup_smoke_eval.json << 'BODY'
{
"application_name": "docs_ai", "enabled": false, "sampling_percentage": 1,
"root_spans_only": true, "eval_scope": "span",
"integration_provider": "openai", "model_name": "gpt-4.1-mini",
"parsing_type": "structured_output",
"prompt_template": [
{"role": "system", "content": "Rate as pass or fail."},
{"role": "user", "content": "{{span_output}}"}
],
"assessment_criteria": {"pass_values": ["pass"]},
"output_schema": {
"name": "categorical_eval",
"schema": {
"additionalProperties": false,
"properties": {
"categorical_eval": {"anyOf": [{"const": "pass", "description": "Satisfactory."}, {"const": "fail", "description": "Unsatisfactory."}], "type": "string"},
"reasoning": {"description": "Brief explanation.", "type": "string"}
},
"required": ["categorical_eval", "reasoning"], "type": "object"
},
"strict": true
}
}
BODY
```

  • `pup llm-obs evals create-or-update pup-smoke-test-DELETE-ME --file /tmp/pup_smoke_eval.json` — prints confirmation (exit 0)
  • `pup llm-obs evals get-evaluator pup-smoke-test-DELETE-ME` — confirms it was created
  • `pup llm-obs evals delete pup-smoke-test-DELETE-ME` — prints confirmation (exit 0)
  • `echo '{}' | pup llm-obs evals create-or-update pup-smoke-bad --file /dev/stdin` — 400 "target.application_name cannot be empty" (exit 1)
  • `pup llm-obs evals delete nonexistent-eval-xyz` — 404 (exit 1)

`spans search --summary`

  • `pup llm-obs spans search --span-kind llm --root-spans-only --limit 3 --from 15m --summary` — each span has only 11 fields, no `tags`/`llm_info`/previews
  • Same without `--summary` — compare to see what's stripped

`spans get-trace`

  • `pup llm-obs spans get-trace --trace-id --include-tree` — returns span tree with `has_errors`, service list, span kind counts
  • `pup llm-obs spans get-trace --trace-id 0000000000000000` — 404 (exit 1)

`spans get-details`

  • `pup llm-obs spans get-details --trace-id --span-ids <span_id>` — returns metadata including cost metrics
  • With an orphaned span ID: stderr warning fires, spans array empty, exit 0

`spans get-content`

  • `pup llm-obs spans get-content --trace-id --span-id <span_id> --field input` — returns full content
  • Same without `--field` — clap error "required arguments not provided" (exit 2)

`spans find-errors`

  • `pup llm-obs spans find-errors --trace-id ` — returns error spans with type, message, parent context
  • `pup llm-obs spans find-errors --trace-id 0000000000000000` — empty `error_spans` array, exit 0

`spans expand`

  • `pup llm-obs spans expand --trace-id --span-ids <root_span_id>` — returns children with `has_input`/`has_output` flags
  • `pup llm-obs spans expand --trace-id 0000000000000000 --span-ids deadbeef` — empty, exit 0

`spans get-agent-loop`

  • `pup llm-obs spans get-agent-loop --trace-id ` — returns agent span name and iterations
  • On a trace with no agent span — 404 "no agent span (kind=agent) found; get_agent_loop requires an agent span with LLM children" (exit 1)

🤖 Generated with Claude Code

@mbldatadog mbldatadog requested a review from a team as a code owner May 11, 2026 20:06
@mbldatadog mbldatadog force-pushed the worktree-port_mcp_functionality branch from b693070 to 348d806 Compare May 11, 2026 23:26
platinummonkey
platinummonkey previously approved these changes May 12, 2026
@mbldatadog
Contributor Author

NB: please don't review/merge yet. I'm trying to get one clean, tested PR that has all the pup functionality needed to run a set of skills that can currently only run on MCP, in advance of shipping/publicizing those skills.

@mbldatadog mbldatadog force-pushed the worktree-port_mcp_functionality branch 2 times, most recently from 09be958 to ef4eab2 Compare May 12, 2026 15:41
mbldatadog and others added 2 commits May 13, 2026 08:11
…ary, notebooks edit

Ports the non-org2-gated LLMObs MCP server endpoints to pup, adds evaluator
CRUD commands, a --summary flag for span search, and append-only notebook edits.

## New commands

### pup llm-obs spans (6 new subcommands)
- get-trace --trace-id      Full span hierarchy tree with depth/error summary
- get-details --span-ids    Timing, cost metrics, children IDs; warns on stderr
                            when a span exists in raw storage but is orphaned
                            from the LLMObs trace tree (see note below)
- get-content --field       Raw content fields (input/output/messages/metadata)
                            --field is required; server returns 400 without it
- find-errors               All error spans with type, message, parent context
- expand --span-ids         Direct children with has_input/has_output flags
- get-agent-loop            Chronological agent execution steps

### pup llm-obs evals (7 subcommands, 5 new)
- list                      Org-wide evaluator list (existing)
- list-by-ml-app            Per-app evaluator list (existing)
- get-evaluator             Full LLM-judge config via MCP (span filters,
                            sampling, scope, prompt, schema) — use before
                            create-or-update; read/write schemas are flat vs
                            nested by backend design
- get-config                Prompt template, assessment criteria, output schema
- get-aggregate-stats       Pass/fail rates and score distributions over a
                            time window; --ml-app narrows to one app
- create-or-update          Full-replace publish (flat body; see --help)
- delete                    Removes named evaluator

### pup llm-obs spans search --summary
Strips tags, llm_info, and content previews from each span, keeping 11
essential fields. Reduces payload ~80% for bulk analysis phases.

### pup notebooks edit <id> --file <cells.json>
Append-only update: fetches current notebook, appends cells from file
(array of cell objects), writes back. Prevents clobbering existing content.

## Orphaned-span warning

get-details warns on stderr when fewer spans come back than were requested:

  warning: 1 of 1 requested span(s) not found in trace hierarchy — the span
  may exist but be orphaned (no path to a root span). Use 'spans get-content'
  to retrieve its content directly.

Root cause: spans search returns by raw @trace_id (includes orphaned spans);
get-details reconstructs the BFS tree and silently drops spans unreachable
from any root. Documented at:
https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The /api/unstable/llm-obs-mcp/v1/eval/config endpoint is being removed
upstream. All callers should use evals get-evaluator instead, which returns
a strict superset of what get-config returned (includes prompt template,
output schema, assessment criteria plus span filters, sampling, and scope).

Removes: evals_get_config function, GetConfig enum variant, routing, and tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@mbldatadog mbldatadog force-pushed the worktree-port_mcp_functionality branch from e0b3278 to 950c165 Compare May 13, 2026 12:12
@mbldatadog
Contributor Author

Okay @platinummonkey — this is fully tested on the set of skills I want it to work with, and they behave roughly the same as on the MCP endpoints. Mind merging it now?


Labels

enhancement New feature or request product:bits-ai
