
feat(llm-obs): port non-org2 MCP trace and eval endpoints #489

Open
mbldatadog wants to merge 2 commits into DataDog:main from mbldatadog:worktree-port_mcp_functionality

Conversation


@mbldatadog mbldatadog commented May 11, 2026

Summary

Ports the non-org2-gated LLMObs MCP server endpoints to pup as first-class CLI commands, and adds evaluator CRUD via the unstable MCP endpoints and a `--summary` flag for span search.

Coverage-gap analysis that motivated this work: comparing the 26 MCP tools against existing pup commands post-v0.55.0 identified 15 missing; this PR implements the 8 non-org2-gated trace/span tools plus 7 eval commands.

Orphaned-span root cause writeup: https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516

(Notebook `edit` command split to #495 for notebook-team review.)

New commands

`pup llm-obs spans` — 6 new subcommands

| Command | What it does |
| --- | --- |
| `get-trace --trace-id` | Full span hierarchy tree with depth/error summary |
| `get-details --trace-id --span-ids` | Timing, cost metrics, children IDs; warns on stderr when a span is orphaned from the trace tree |
| `get-content --trace-id --span-id --field` | Raw content fields; `--field` is required (server returns 400 without it) |
| `find-errors --trace-id` | All error spans with type, message, and parent context |
| `expand --trace-id --span-ids` | Direct children with `has_input`/`has_output` flags |
| `get-agent-loop --trace-id` | Chronological agent execution steps |

`pup llm-obs evals` — 5 new subcommands (2 existing)

| Command | What it does |
| --- | --- |
| `get-evaluator` | Full LLM-judge config via MCP: span filters, sampling, scope, prompt, schema. Use before `create-or-update`; read/write body schemas differ by backend design |
| `get-config` | Prompt template, assessment criteria, output schema |
| `get-aggregate-stats` | Pass/fail rates and score distributions over a time window; `--ml-app` narrows to one app |
| `create-or-update --file` | Full-replace publish (flat body; see schema note below) |
| `delete` | Removes named evaluator |

`pup llm-obs spans search --summary`

Strips `tags`, `llm_info`, and content previews from each span, keeping 11 essential fields (`span_id`, `trace_id`, `apm_trace_id`, `name`, `span_kind`, `ml_app`, `service`, `status`, `duration_ms`, `start_ms`, `parent_id`). Reduces payload ~80% for bulk analysis phases.
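The projection behaves roughly like the following sketch (Python for brevity; pup itself is Rust, and `summarize_span` is a hypothetical helper, not part of the codebase — only the field list comes from this PR):

```python
# Sketch of the --summary projection: keep only the 11 essential fields,
# dropping tags, llm_info, and content previews.
SUMMARY_FIELDS = (
    "span_id", "trace_id", "apm_trace_id", "name", "span_kind",
    "ml_app", "service", "status", "duration_ms", "start_ms", "parent_id",
)

def summarize_span(span: dict) -> dict:
    """Project a full span record down to the summary fields."""
    return {k: span[k] for k in SUMMARY_FIELDS if k in span}

full = {
    "span_id": "abc", "trace_id": "t1", "name": "llm.call",
    "status": "ok", "tags": ["env:prod"], "llm_info": {"model": "m"},
}
slim = summarize_span(full)
assert "tags" not in slim and "llm_info" not in slim
```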

Changes

  • `src/commands/llm_obs.rs` — 13 new handler functions, 79 tests (26 new)
  • `src/main.rs` — new enum variants and routing for all new commands

Orphaned-span warning

`get-details` warns on stderr when fewer spans are returned than requested:

```
warning: 1 of 1 requested span(s) not found in trace hierarchy — the span may
exist but be orphaned (no path to a root span). Use 'spans get-content' to
retrieve its content directly.
```

Root cause: `spans search` returns spans by raw `@trace_id` (includes orphaned spans); `get-details` reconstructs the BFS tree and silently drops spans unreachable from any root. The response key path is `resp["spans"]` (raw API) not `resp["data"]["spans"]` (agent-mode formatter envelope) — regression test added.
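The orphan-dropping behavior can be illustrated with a small sketch (Python for brevity; the helper name is hypothetical): spans reachable by BFS from a root survive, while spans whose parent chain never reaches a root are silently dropped.

```python
from collections import deque

def reachable_from_roots(spans: list[dict]) -> set[str]:
    """BFS from root spans (parent_id is None); returns reachable span IDs.
    Spans whose parent chain never reaches a root are 'orphaned'."""
    children: dict = {}
    for s in spans:
        children.setdefault(s["parent_id"], []).append(s["span_id"])
    seen: set[str] = set()
    queue = deque(children.get(None, []))  # start from the roots
    while queue:
        sid = queue.popleft()
        if sid in seen:
            continue
        seen.add(sid)
        queue.extend(children.get(sid, []))
    return seen

spans = [
    {"span_id": "root", "parent_id": None},
    {"span_id": "child", "parent_id": "root"},
    {"span_id": "orphan", "parent_id": "missing-parent"},  # unreachable
]
assert reachable_from_roots(spans) == {"root", "child"}
```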

Schema note for `evals create-or-update`

The `--file` uses a flat body (all fields top-level: `application_name`, `enabled`, `prompt_template`, `output_schema`, etc.), which differs from the nested structure returned by `get-evaluator` (`target.application_name`, `llm_judge_config.prompt_template`, `llm_provider.*`). This is a backend design decision — the read and write APIs use different shapes.
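The shape mismatch looks roughly like this (illustrative only: the field paths come from the paragraph above, the full schemas contain more fields, and this lifting helper is not part of pup):

```python
# Nested read shape (get-evaluator) vs. flat write shape
# (create-or-update --file). Truncated to a few fields for illustration.
nested_read = {
    "target": {"application_name": "docs_ai"},
    "llm_judge_config": {"prompt_template": [{"role": "system", "content": "..."}]},
    "llm_provider": {"integration_provider": "openai", "model_name": "gpt-4.1-mini"},
}

# Hypothetical lifting: hoist nested fields to the flat top-level write body.
flat_write = {
    "application_name": nested_read["target"]["application_name"],
    "prompt_template": nested_read["llm_judge_config"]["prompt_template"],
    **nested_read["llm_provider"],
}
assert flat_write["application_name"] == "docs_ai"
```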

Testing

Automated

  • `cargo test` — 79 tests, all passing
  • `cargo clippy -- -D warnings` — clean
  • `cargo fmt --check` — clean

Manual smoke tests (run against org2)

Trace/span IDs age out of the index quickly — substitute fresh IDs from `pup llm-obs spans search --limit 1` as needed.

`evals get-evaluator`

  • `pup llm-obs evals get-evaluator failure-to-answer-verdict` — returns full LLM-judge config (prompt, schema, target, provider)
  • `pup llm-obs evals get-evaluator nonexistent-eval-xyz` — 404 "custom evaluator not found: …" (exit 1)

`evals get-config`

  • `pup llm-obs evals get-config failure-to-answer-verdict` — returns prompt template + output schema
  • `pup llm-obs evals get-config nonexistent-eval-xyz` — 404 (exit 1)

`evals get-aggregate-stats`

  • `pup llm-obs evals get-aggregate-stats failure-to-answer-verdict --ml-app docs_ai --from 24h` — returns pass rate, fail count, top categorical values
  • `pup llm-obs evals get-aggregate-stats failure-to-answer-verdict --from not-a-time` — local parse error before any network call (exit 1)

`evals create-or-update` + `delete` (creates and cleans up a disposable evaluator)
```bash
cat > /tmp/pup_smoke_eval.json << 'BODY'
{
"application_name": "docs_ai", "enabled": false, "sampling_percentage": 1,
"root_spans_only": true, "eval_scope": "span",
"integration_provider": "openai", "model_name": "gpt-4.1-mini",
"parsing_type": "structured_output",
"prompt_template": [
{"role": "system", "content": "Rate as pass or fail."},
{"role": "user", "content": "{{span_output}}"}
],
"assessment_criteria": {"pass_values": ["pass"]},
"output_schema": {
"name": "categorical_eval",
"schema": {
"additionalProperties": false,
"properties": {
"categorical_eval": {"anyOf": [{"const": "pass", "description": "Satisfactory."}, {"const": "fail", "description": "Unsatisfactory."}], "type": "string"},
"reasoning": {"description": "Brief explanation.", "type": "string"}
},
"required": ["categorical_eval", "reasoning"], "type": "object"
},
"strict": true
}
}
BODY
```

  • `pup llm-obs evals create-or-update pup-smoke-test-DELETE-ME --file /tmp/pup_smoke_eval.json` — prints confirmation (exit 0)
  • `pup llm-obs evals get-evaluator pup-smoke-test-DELETE-ME` — confirms it was created
  • `pup llm-obs evals delete pup-smoke-test-DELETE-ME` — prints confirmation (exit 0)
  • `echo '{}' | pup llm-obs evals create-or-update pup-smoke-bad --file /dev/stdin` — 400 "target.application_name cannot be empty" (exit 1)
  • `pup llm-obs evals delete nonexistent-eval-xyz` — 404 (exit 1)

`spans search --summary`

  • `pup llm-obs spans search --span-kind llm --root-spans-only --limit 3 --from 15m --summary` — each span has only 11 fields, no `tags`/`llm_info`/previews
  • Same without `--summary` — compare to see what's stripped

`spans get-trace`

  • `pup llm-obs spans get-trace --trace-id --include-tree` — returns span tree with `has_errors`, service list, span kind counts
  • `pup llm-obs spans get-trace --trace-id 0000000000000000` — 404 (exit 1)

`spans get-details`

  • `pup llm-obs spans get-details --trace-id --span-ids <span_id>` — returns metadata including cost metrics
  • With an orphaned span ID: stderr warning fires, spans array empty, exit 0

`spans get-content`

  • `pup llm-obs spans get-content --trace-id --span-id <span_id> --field input` — returns full content
  • Same without `--field` — clap error "required arguments not provided" (exit 2)

`spans find-errors`

  • `pup llm-obs spans find-errors --trace-id ` — returns error spans with type, message, parent context
  • `pup llm-obs spans find-errors --trace-id 0000000000000000` — empty `error_spans` array, exit 0

`spans expand`

  • `pup llm-obs spans expand --trace-id --span-ids <root_span_id>` — returns children with `has_input`/`has_output` flags
  • `pup llm-obs spans expand --trace-id 0000000000000000 --span-ids deadbeef` — empty, exit 0

`spans get-agent-loop`

  • `pup llm-obs spans get-agent-loop --trace-id ` — returns agent span name and iterations
  • On a trace with no agent span — 404 "no agent span (kind=agent) found; get_agent_loop requires an agent span with LLM children" (exit 1)

🤖 Generated with Claude Code

@mbldatadog mbldatadog requested a review from a team as a code owner May 11, 2026 20:06
@mbldatadog mbldatadog force-pushed the worktree-port_mcp_functionality branch from b693070 to 348d806 Compare May 11, 2026 23:26
platinummonkey
platinummonkey previously approved these changes May 12, 2026
@mbldatadog
Contributor Author

NB: please don't review/merge yet. I'm trying to get one clean, tested PR that has all the pup functionality needed to run a set of skills that can currently only run on MCP, in advance of shipping/publicizing those skills.

@mbldatadog mbldatadog force-pushed the worktree-port_mcp_functionality branch 2 times, most recently from 09be958 to ef4eab2 Compare May 12, 2026 15:41
mbldatadog and others added 2 commits May 13, 2026 08:11
…ary, notebooks edit

Ports the non-org2-gated LLMObs MCP server endpoints to pup, adds evaluator
CRUD commands, a --summary flag for span search, and append-only notebook edits.

## New commands

### pup llm-obs spans (6 new subcommands)
- get-trace --trace-id      Full span hierarchy tree with depth/error summary
- get-details --span-ids    Timing, cost metrics, children IDs; warns on stderr
                            when a span exists in raw storage but is orphaned
                            from the LLMObs trace tree (see note below)
- get-content --field       Raw content fields (input/output/messages/metadata)
                            --field is required; server returns 400 without it
- find-errors               All error spans with type, message, parent context
- expand --span-ids         Direct children with has_input/has_output flags
- get-agent-loop            Chronological agent execution steps

### pup llm-obs evals (7 subcommands, 5 new)
- list                      Org-wide evaluator list (existing)
- list-by-ml-app            Per-app evaluator list (existing)
- get-evaluator             Full LLM-judge config via MCP (span filters,
                            sampling, scope, prompt, schema) — use before
                            create-or-update; read/write schemas are flat vs
                            nested by backend design
- get-config                Prompt template, assessment criteria, output schema
- get-aggregate-stats       Pass/fail rates and score distributions over a
                            time window; --ml-app narrows to one app
- create-or-update          Full-replace publish (flat body; see --help)
- delete                    Removes named evaluator

### pup llm-obs spans search --summary
Strips tags, llm_info, and content previews from each span, keeping 11
essential fields. Reduces payload ~80% for bulk analysis phases.

### pup notebooks edit <id> --file <cells.json>
Append-only update: fetches current notebook, appends cells from file
(array of cell objects), writes back. Prevents clobbering existing content.

## Orphaned-span warning

get-details warns on stderr when fewer spans come back than were requested:

  warning: 1 of 1 requested span(s) not found in trace hierarchy — the span
  may exist but be orphaned (no path to a root span). Use 'spans get-content'
  to retrieve its content directly.

Root cause: spans search returns by raw @trace_id (includes orphaned spans);
get-details reconstructs the BFS tree and silently drops spans unreachable
from any root. Documented at:
https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The /api/unstable/llm-obs-mcp/v1/eval/config endpoint is being removed
upstream. All callers should use evals get-evaluator instead, which returns
a strict superset of what get-config returned (includes prompt template,
output schema, assessment criteria plus span filters, sampling, and scope).

Removes: evals_get_config function, GetConfig enum variant, routing, and tests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@mbldatadog mbldatadog force-pushed the worktree-port_mcp_functionality branch from e0b3278 to 950c165 Compare May 13, 2026 12:12
@mbldatadog
Contributor Author

Okay @platinummonkey — this is fully tested on the set of skills I want it to work with, and they behave roughly the same as on the MCP endpoints. Mind merging it now?


Labels

enhancement New feature or request product:bits-ai
