feat(llm-obs): port non-org2 MCP trace and eval endpoints #489
Open
mbldatadog wants to merge 2 commits into
b693070 to
348d806
Compare
platinummonkey
previously approved these changes
May 12, 2026
Contributor
Author
NB: please don't review or merge yet. I'm trying to get one clean, tested PR that has all the pup functionality needed to run a set of skills that can currently only run on MCP, in advance of shipping and publicizing those skills.
09be958 to
ef4eab2
Compare
…ary, notebooks edit
Ports the non-org2-gated LLMObs MCP server endpoints to pup, adds evaluator
CRUD commands, a --summary flag for span search, and append-only notebook edits.
## New commands
### pup llm-obs spans (6 new subcommands)
- get-trace --trace-id     Full span hierarchy tree with depth/error summary
- get-details --span-ids   Timing, cost metrics, children IDs; warns on stderr
                           when a span exists in raw storage but is orphaned
                           from the LLMObs trace tree (see note below)
- get-content --field      Raw content fields (input/output/messages/metadata)
                           --field is required; server returns 400 without it
- find-errors              All error spans with type, message, parent context
- expand --span-ids        Direct children with has_input/has_output flags
- get-agent-loop           Chronological agent execution steps
### pup llm-obs evals (7 subcommands, 5 new)
- list                  Org-wide evaluator list (existing)
- list-by-ml-app        Per-app evaluator list (existing)
- get-evaluator         Full LLM-judge config via MCP (span filters,
                        sampling, scope, prompt, schema). Use before
                        create-or-update; read/write schemas are flat vs
                        nested by backend design
- get-config            Prompt template, assessment criteria, output schema
- get-aggregate-stats   Pass/fail rates and score distributions over a
                        time window; --ml-app narrows to one app
- create-or-update      Full-replace publish (flat body; see --help)
- delete                Removes named evaluator
### pup llm-obs spans search --summary
Strips tags, llm_info, and content previews from each span, keeping 11
essential fields. Reduces payload ~80% for bulk analysis phases.
### pup notebooks edit <id> --file <cells.json>
Append-only update: fetches current notebook, appends cells from file
(array of cell objects), writes back. Prevents clobbering existing content.
## Orphaned-span warning
get-details warns on stderr when fewer spans come back than were requested:
warning: 1 of 1 requested span(s) not found in trace hierarchy — the span
may exist but be orphaned (no path to a root span). Use 'spans get-content'
to retrieve its content directly.
Root cause: spans search returns by raw @trace_id (includes orphaned spans);
get-details reconstructs the BFS tree and silently drops spans unreachable
from any root. Documented at:
https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The /api/unstable/llm-obs-mcp/v1/eval/config endpoint is being removed upstream.
All callers should use evals get-evaluator instead, which returns a strict
superset of what get-config returned (includes prompt template, output schema,
assessment criteria plus span filters, sampling, and scope).
Removes: evals_get_config function, GetConfig enum variant, routing, and tests.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
e0b3278 to
950c165
Compare
Contributor
Author
Okay @platinummonkey, this is fully tested on the set of skills I want it to work with, and they behave roughly the same as on the MCP endpoints. Mind merging it now?
platinummonkey
approved these changes
May 14, 2026
Summary
Ports the non-org2-gated LLMObs MCP server endpoints to pup as first-class CLI commands, adds evaluator CRUD via the unstable MCP endpoints, and adds a `--summary` flag for span search.
Coverage gap analysis that motivated this work: compared 26 MCP tools against existing pup commands post-v0.55.0, identified 15 missing; this PR implements the 8 non-org2-gated trace/span tools plus 7 eval commands.
Orphaned-span root cause writeup: https://gist.github.com/mbldatadog/69eb7ebd162e155da9b6b9c3afbad516
(Notebook `edit` command split to #495 for notebook-team review.)
New commands
`pup llm-obs spans` — 6 new subcommands
`pup llm-obs evals` — 5 new subcommands (2 existing)
`pup llm-obs spans search --summary`
Strips `tags`, `llm_info`, and content previews from each span, keeping 11 essential fields (`span_id`, `trace_id`, `apm_trace_id`, `name`, `span_kind`, `ml_app`, `service`, `status`, `duration_ms`, `start_ms`, `parent_id`). Reduces payload ~80% for bulk analysis phases.
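The projection described above can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: the 11 field names come from the list above, while the function names and the sample span shape (including the `input_preview` key) are hypothetical.

```python
# The 11 fields that --summary keeps, as listed in the description above.
SUMMARY_FIELDS = (
    "span_id", "trace_id", "apm_trace_id", "name", "span_kind", "ml_app",
    "service", "status", "duration_ms", "start_ms", "parent_id",
)


def summarize_span(span: dict) -> dict:
    """Keep only the summary fields; anything else (tags, llm_info,
    content previews) is dropped. Fields absent from the span are skipped."""
    return {key: span[key] for key in SUMMARY_FIELDS if key in span}


def summarize_spans(spans: list) -> list:
    """Apply the projection to every span in a search result."""
    return [summarize_span(span) for span in spans]


if __name__ == "__main__":
    # Hypothetical span, for illustration only.
    example = {
        "span_id": "s1",
        "trace_id": "t1",
        "name": "llm.call",
        "status": "ok",
        "duration_ms": 12,
        "tags": ["env:prod"],
        "llm_info": {"model": "example-model"},
        "input_preview": "hello",
    }
    print(summarize_span(example))
```

An allow-list (rather than deleting known heavy keys) means any field added to the span payload later stays out of the summary by default.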
Changes
Orphaned-span warning
`get-details` warns on stderr when fewer spans are returned than requested:
```
warning: 1 of 1 requested span(s) not found in trace hierarchy — the span may
exist but be orphaned (no path to a root span). Use 'spans get-content' to
retrieve its content directly.
```
Root cause: `spans search` returns spans by raw `@trace_id`, which includes orphaned spans, whereas `get-details` rebuilds the trace tree with a BFS from the roots and silently drops any span unreachable from a root. The spans live at `resp["spans"]` (raw API), not `resp["data"]["spans"]` (agent-mode formatter envelope); a regression test covers this.
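To make the root cause concrete, here is a small sketch of how a BFS from the roots leaves orphaned spans out, and how a count mismatch can be turned into a warning. It is an illustration under assumptions, not the pup or backend code: spans are modeled as dicts with `span_id` and `parent_id`, a root is assumed to have `parent_id` set to `None`, and the function names are hypothetical. The warning string reproduces only the leading part of the message shown above.

```python
from collections import deque
from typing import Optional


def reachable_span_ids(spans: list) -> set:
    """Return the IDs of all spans reachable from a root via BFS.

    Assumption: a root span has parent_id None. A span whose parent is not
    in the set is never enqueued, so it (and its subtree) is dropped.
    """
    children = {}
    roots = []
    for span in spans:
        parent = span.get("parent_id")
        if parent is None:
            roots.append(span["span_id"])
        else:
            children.setdefault(parent, []).append(span["span_id"])

    seen = set(roots)
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def orphan_warning(requested_ids: list, spans: list) -> Optional[str]:
    """Build a warning when some requested spans are not in the hierarchy."""
    reachable = reachable_span_ids(spans)
    missing = [sid for sid in requested_ids if sid not in reachable]
    if not missing:
        return None
    return (
        f"warning: {len(missing)} of {len(requested_ids)} requested span(s) "
        "not found in trace hierarchy"
    )


if __name__ == "__main__":
    trace = [
        {"span_id": "root", "parent_id": None},
        {"span_id": "child", "parent_id": "root"},
        {"span_id": "orphan", "parent_id": "missing-parent"},
    ]
    print(orphan_warning(["orphan"], trace))
```

In this model the orphan shows up in a raw `@trace_id` search (it is in the list) but not in the BFS result, which matches the mismatch the warning reports.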
Schema note for `evals create-or-update`
The `--file` payload uses a flat body (all fields top-level: `application_name`, `enabled`, `prompt_template`, `output_schema`, etc.), which differs from the nested structure returned by `get-evaluator` (`target.application_name`, `llm_judge_config.prompt_template`, `llm_provider.*`). This is a backend design decision: the read and write APIs use different shapes.
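Because the read and write shapes differ, a round trip (read with `get-evaluator`, edit, publish with `create-or-update`) needs a flattening step. The sketch below shows one way to do that with a path table. It covers only the two nested paths named above; the mapping for `llm_provider.*` and the remaining fields is not documented here, so it is deliberately left out. The function and table names are hypothetical.

```python
# Flat write-side key -> nested read-side path.
# Only the two paths named in this description are included; everything
# else would have to be added once the real mapping is confirmed.
FLAT_TO_NESTED_PATH = {
    "application_name": ("target", "application_name"),
    "prompt_template": ("llm_judge_config", "prompt_template"),
}

_MISSING = object()


def _lookup(doc, path):
    """Walk a nested dict along path; return _MISSING if any key is absent."""
    current = doc
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def to_flat_body(nested: dict) -> dict:
    """Project a nested get-evaluator style document onto the flat shape."""
    flat = {}
    for flat_key, path in FLAT_TO_NESTED_PATH.items():
        value = _lookup(nested, path)
        if value is not _MISSING:
            flat[flat_key] = value
    return flat


if __name__ == "__main__":
    # Hypothetical nested document, for illustration only.
    nested = {
        "target": {"application_name": "docs_ai"},
        "llm_judge_config": {
            "prompt_template": [
                {"role": "system", "content": "Rate as pass or fail."}
            ]
        },
    }
    print(to_flat_body(nested))
```

Since `create-or-update` is a full replace, a body produced this way is incomplete until the rest of the mapping is filled in; treat it as a starting point rather than a ready-to-publish payload.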
Testing
Automated
Manual smoke tests (run against org2)
Trace/span IDs age out of the index quickly — substitute fresh IDs from `pup llm-obs spans search --limit 1` as needed.
`evals get-evaluator`
`evals get-config`
`evals get-aggregate-stats`
`evals create-or-update` + `delete` (creates and cleans up a disposable evaluator)
```bash
cat > /tmp/pup_smoke_eval.json << 'BODY'
{
  "application_name": "docs_ai", "enabled": false, "sampling_percentage": 1,
  "root_spans_only": true, "eval_scope": "span",
  "integration_provider": "openai", "model_name": "gpt-4.1-mini",
  "parsing_type": "structured_output",
  "prompt_template": [
    {"role": "system", "content": "Rate as pass or fail."},
    {"role": "user", "content": "{{span_output}}"}
  ],
  "assessment_criteria": {"pass_values": ["pass"]},
  "output_schema": {
    "name": "categorical_eval",
    "schema": {
      "additionalProperties": false,
      "properties": {
        "categorical_eval": {
          "anyOf": [
            {"const": "pass", "description": "Satisfactory."},
            {"const": "fail", "description": "Unsatisfactory."}
          ],
          "type": "string"
        },
        "reasoning": {"description": "Brief explanation.", "type": "string"}
      },
      "required": ["categorical_eval", "reasoning"], "type": "object"
    },
    "strict": true
  }
}
BODY
```
`spans search --summary`
`spans get-trace`
`spans get-details`
`spans get-content`
`spans find-errors`
`spans expand`
`spans get-agent-loop`
🤖 Generated with Claude Code