# @tangle-network/agent-eval

Evaluation infrastructure for agent systems.
agent-eval gives agent products a reusable way to record what happened,
verify outcomes, classify failures, compare variants, optimize prompts or
policies, and make release decisions from evidence instead of anecdotes.
It does not own your product state, credentials, UI, or model routing. Product teams keep those boundaries; this package standardizes how runs are recorded, checked, compared, and promoted.
- When To Use It
- Architecture
- Install
- Quick Start
- Core Primitives
- Adoption Path
- Examples
- Documentation
- Development
- Related Packages
## When To Use It

Use agent-eval when you need one or more of these:
- A reproducible eval harness for coding agents, builder agents, or multi-tool workflows.
- Structured traces for agent runs: spans, artifacts, events, budgets, tool calls, retrieval, judge output, and sandbox execution.
- Deterministic gates around build/test/deploy checks.
- LLM-as-judge or deterministic judge fleets with calibration and canaries.
- Dataset splits, holdouts, paired statistics, and release confidence gates.
- Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval, evaluator, and knowledge-readiness failures.
- Optimization loops over prompts, steering, code mutations, or full multi-shot trajectories.
- Report data for internal launch reviews, CI gates, and research analysis.
## Architecture

```
agent/product run
  -> TraceEmitter / TraceStore
  -> TraceAnalyst / failure taxonomy
  -> SandboxHarness / MultiLayerVerifier / JudgeRunner
  -> metrics + run records
  -> paired stats + held-out gates
  -> optimization + release confidence + reports
```

Package responsibilities:

- `agent-eval`: run evidence, eval contracts, verification, statistics, optimization, reporting.
- Product app: domain state, tools, credentials, UI, storage, deployment, model gateway.
- `@tangle-network/agent-runtime`: production agent-loop/session runtime.
- `@tangle-network/agent-knowledge`: evidence stores, claim/page synthesis, retrieval, knowledge readiness implementation.
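The first stage of that pipeline is trace recording. As a minimal sketch of what emitting evidence for one run might look like — the constructor options and method names (`startSpan`, `event`, `artifact`, `end`, `flush`) are assumptions for illustration, not the confirmed API; see the Trace Analysis doc for the real surface:

```ts
import { TraceEmitter, TraceStore } from '@tangle-network/agent-eval'

// Illustrative only: the option and method names below are assumptions,
// not the package's confirmed trace API.
const store = new TraceStore({ dir: './eval-traces' })
const emitter = new TraceEmitter({ store, runId: 'task-123' })

// Record one tool call as a span with its events, artifact, and budget usage.
const span = emitter.startSpan({ name: 'tool:apply_patch' })
span.event({ type: 'tool.request', payload: { file: 'src/app.ts' } })
span.artifact({ kind: 'diff', content: '--- a/src/app.ts\n+++ b/src/app.ts\n...' })
span.end({ status: 'ok', costUsd: 0.004 })

await emitter.flush()
```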
## Install

```bash
pnpm add @tangle-network/agent-eval
```

Wire protocol / CLI:

```bash
npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005
```

Python client source lives in `clients/python`. Until the PyPI package is published, install it from the repo:

```bash
cd clients/python
pip install -e .
```

## Quick Start

Wrap the real product loop first. Do not build a toy eval path that users never exercise.
```ts
import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
  async observe() {
    return productAdapter.readState(task.id)
  },
  async validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },
  async decide({ evals }) {
    return evals.every((evalResult) => evalResult.passed)
      ? { type: 'stop', reason: 'all critical checks passed' }
      : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
  },
  async act(action) {
    return productAdapter.runAgentStep(task.id, action)
  },
})

await productAdapter.storeControlResult(task.id, result)
```

Once this loop represents production behavior, convert completed runs into feedback trajectories, split them into train/dev/test/holdout sets, and run multi-shot optimization against the same adapter.
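As a rough sketch of that follow-on step, reusing `productAdapter` from the Quick Start — the dataset and option shapes below are assumptions for illustration, and `listControlResults` is a hypothetical accessor on your own adapter (the exact fields live in the Multi-Shot Optimization and Feedback Trajectories docs):

```ts
import {
  controlRunToRunRecord,
  runMultiShotOptimization,
} from '@tangle-network/agent-eval'

// Hypothetical accessor on your own adapter: fetch the control-loop
// results that were stored above.
const completedRuns = await productAdapter.listControlResults()

// Convert each completed control run into a strict run record.
const records = completedRuns.map((run) => controlRunToRunRecord(run))

// Split before optimizing, then run multi-shot optimization on train/dev
// and judge promotion on test/holdout. Field names are illustrative.
const outcome = await runMultiShotOptimization({
  dataset: {
    train: records.slice(0, 40),
    dev: records.slice(40, 55),
    test: records.slice(55, 70),
    holdout: records.slice(70),
  },
  runScenario: (scenario, candidate) =>
    productAdapter.runAgentStep(scenario.id, candidate),
})
```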
## Core Primitives

| Primitive | Purpose |
|---|---|
| `TraceEmitter`, `TraceStore` | Append-only run/span/event/artifact/budget records. |
| `TraceAnalyst` | Bounded investigation over trace corpora for systemic failure modes. |
| `SandboxHarness` | Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts. |
| `MultiLayerVerifier` | Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps. |
| `JudgeRunner` | Parallel deterministic or LLM-backed judges over the same artifact/run. |
| `runAgentControlLoop` | Observe/validate/decide/act loop with budgets, stop policies, and structured eval results. |
| `controlRunToRunRecord` | Converts control-loop evidence into strict promotion/report rows. |
| `Dataset`, `RunRecord`, `HeldOutGate` | Versioned corpora, reproducible run metadata, and held-out promotion decisions. |
| `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | Paired experiment statistics and multiple-comparison correction. |
| `classifyFailure` | Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures. |
| `runMultiShotOptimization` | Optimization over full agent trajectories with actionable side information. |
| `runPromptEvolution` | Prompt/steering/code evolution over scenario scores. |
| `evaluateReleaseConfidence` | Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates. |
| `renderReleaseReport`, `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready decision artifacts and chart specs. |
| `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
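For a feel of the statistics primitives — comparing a baseline and a candidate on the same scenario set — here is a minimal sketch; the argument and return shapes (for example `.pValue` on the Wilcoxon result) are assumptions for illustration rather than the exact signatures:

```ts
import { pairedBootstrap, pairedWilcoxon, bhAdjust } from '@tangle-network/agent-eval'

// Per-scenario scores for the same scenarios under two variants.
const baseline = [0.62, 0.71, 0.55, 0.8, 0.66]
const candidate = [0.7, 0.74, 0.61, 0.79, 0.72]

// Shapes below are illustrative assumptions, not the exact signatures.
const bootstrap = pairedBootstrap(baseline, candidate) // paired effect size + CI
const wilcoxon = pairedWilcoxon(baseline, candidate) // paired signed-rank test
const [adjustedP] = bhAdjust([wilcoxon.pValue]) // multiple-comparison correction

console.log({ bootstrap, adjustedP })
```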
`NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems should implement `Researcher` directly or use `CallbackResearcher`.
## Adoption Path

- Choose one real workflow: code generation, browser task, research task, workflow builder, voice interaction, or domain agent task.
- Write a product adapter that can observe state and execute one agent step (see the adapter sketch after this list).
- Add deterministic validators first: build, test, serve, schema, policy, permission, retrieval, and deployment checks.
- Add LLM judges only for subjective quality that deterministic checks cannot measure.
- Emit traces and convert successful and failed attempts into `FeedbackTrajectory` records.
- Build train/dev/test/holdout scenarios from those trajectories.
- Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
- Promote only when test/holdout gates and real product telemetry improve.
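The adapter shape is owned by your product, not by agent-eval. A minimal illustrative sketch, matching how the Quick Start uses `productAdapter` (the interface name and state fields here are assumptions):

```ts
// Illustrative adapter contract; agent-eval does not prescribe this interface.
interface ProductAdapter {
  readState(taskId: string): Promise<{
    build: { exitCode: number }
    preview: { httpStatus: number }
  }>
  runAgentStep(taskId: string, action: { type: string }): Promise<void>
  storeControlResult(taskId: string, result: unknown): Promise<void>
}

export const productAdapter: ProductAdapter = {
  async readState(taskId) {
    // Query your real build/test/preview infrastructure here.
    return { build: { exitCode: 0 }, preview: { httpStatus: 200 } }
  },
  async runAgentStep(taskId, action) {
    // Dispatch one real agent step through your production runtime.
  },
  async storeControlResult(taskId, result) {
    // Persist run evidence in your own storage.
  },
}
```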
For a complete product integration guide, see Product Eval Adoption.
## Examples

Runnable examples live in the repository's `examples/` directory. They are not part of the published npm package.

- `examples/same-sandbox-harness` – run multiple eval passes against the same workspace.
- `examples/multi-shot-optimization` – optimize full agent trajectories with held-out promotion.
- `examples/benchmarks` – benchmark adapter shape and reference benchmark wrappers.
The examples are intentionally kept outside the README so they can be expanded, tested, and copied without turning this page into a tutorial.
## Documentation

- Concepts
- Feature Guide
- Product Eval Adoption
- Trace Analysis
- Control Runtime
- Knowledge Readiness
- Integration Launch Gates
- Multi-Shot Optimization
- Feedback Trajectories
- Wire Protocol
## Development

```bash
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi
```

Run the local server:

```bash
pnpm build
node dist/cli.js serve --port 5005
```

Python client tests:

```bash
pnpm build
cd clients/python
pip install -e ".[dev]"
pytest
```

`@tangle-network/agent-eval` publishes to npm. The Python client lives under `clients/python` and is versioned from this repository.
## Related Packages

- `@tangle-network/agent-runtime`
- `@tangle-network/agent-knowledge`
- `@tangle-network/agent-integrations`
- `@tangle-network/agent-gateway`
- `@tangle-network/tcloud`
## License

MIT