# @tangle-network/agent-eval

Evaluation infrastructure for agent systems.
agent-eval gives agent products a reusable way to record what happened,
verify outcomes, classify failures, compare variants, optimize prompts or
policies, and make release decisions from evidence instead of anecdotes.
It does not own your product state, credentials, UI, or model routing. Product teams keep those boundaries; this package standardizes how runs are recorded, checked, compared, and promoted.
- When To Use It
- Architecture
- Install
- Quick Start
- Core Primitives
- Adoption Path
- Examples
- Documentation
- Development
- Related Packages
## When To Use It

Use agent-eval when you need one or more of these:
- A reproducible eval harness for coding agents, builder agents, or multi-tool workflows.
- Structured traces for agent runs: spans, artifacts, events, budgets, tool calls, retrieval, judge output, and sandbox execution.
- Deterministic gates around build/test/deploy checks.
- LLM-as-judge or deterministic judge fleets with calibration and canaries.
- Dataset splits, holdouts, paired statistics, and release confidence gates.
- Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval, evaluator, and knowledge-readiness failures.
- Optimization loops over prompts, steering, code mutations, or full multi-shot trajectories.
- Report data for internal launch reviews, CI gates, and research analysis.
## Architecture

```
agent/product run
  -> TraceEmitter / TraceStore
  -> TraceAnalyst / failure taxonomy
  -> SandboxHarness / MultiLayerVerifier / JudgeRunner
  -> metrics + run records
  -> paired stats + held-out gates
  -> optimization + release confidence + reports
```

Package responsibilities:

- `agent-eval`: run evidence, eval contracts, verification, statistics, optimization, reporting.
- Product app: domain state, tools, credentials, UI, storage, deployment, model gateway.
- `@tangle-network/agent-runtime`: production agent-loop/session runtime.
- `@tangle-network/agent-knowledge`: evidence stores, claim/page synthesis, retrieval, knowledge readiness implementation.
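The first stage of that pipeline is trace recording. As a minimal sketch of what emitting evidence for one run might look like — the constructor options and method names (`startSpan`, `event`, `artifact`, `end`, `flush`) are assumptions for illustration, not the confirmed API; see the Trace Analysis doc for the real surface:

```ts
import { TraceEmitter, TraceStore } from '@tangle-network/agent-eval'

// Illustrative only: the option and method names below are assumptions,
// not the package's confirmed trace API.
const store = new TraceStore({ dir: './eval-traces' })
const emitter = new TraceEmitter({ store, runId: 'task-123' })

// Record one tool call as a span with its events, artifact, and budget usage.
const span = emitter.startSpan({ name: 'tool:apply_patch' })
span.event({ type: 'tool.request', payload: { file: 'src/app.ts' } })
span.artifact({ kind: 'diff', content: '--- a/src/app.ts\n+++ b/src/app.ts\n...' })
span.end({ status: 'ok', costUsd: 0.004 })

await emitter.flush()
```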
## Install

```bash
pnpm add @tangle-network/agent-eval
```

Wire protocol / CLI:

```bash
npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005
```

Python client source lives in `clients/python`. Until the PyPI package is published, install it from the repo:

```bash
cd clients/python
pip install -e .
```

## Quick Start

Wrap the real product loop first. Do not build a toy eval path that users never exercise.
```ts
import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
  async observe() {
    return productAdapter.readState(task.id)
  },
  async validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },
  async decide({ evals }) {
    return evals.every((evalResult) => evalResult.passed)
      ? { type: 'stop', reason: 'all critical checks passed' }
      : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
  },
  async act(action) {
    return productAdapter.runAgentStep(task.id, action)
  },
})

await productAdapter.storeControlResult(task.id, result)
```

Once this loop represents production behavior, convert completed runs into feedback trajectories, split them into train/dev/test/holdout sets, and run multi-shot optimization against the same adapter.
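As a rough sketch of that follow-on step, reusing `productAdapter` from the Quick Start — the dataset and option shapes below are assumptions for illustration, and `listControlResults` is a hypothetical accessor on your own adapter (the exact fields live in the Multi-Shot Optimization and Feedback Trajectories docs):

```ts
import {
  controlRunToRunRecord,
  runMultiShotOptimization,
} from '@tangle-network/agent-eval'

// Hypothetical accessor on your own adapter: fetch the control-loop
// results that were stored above.
const completedRuns = await productAdapter.listControlResults()

// Convert each completed control run into a strict run record.
const records = completedRuns.map((run) => controlRunToRunRecord(run))

// Split before optimizing, then run multi-shot optimization on train/dev
// and judge promotion on test/holdout. Field names are illustrative.
const outcome = await runMultiShotOptimization({
  dataset: {
    train: records.slice(0, 40),
    dev: records.slice(40, 55),
    test: records.slice(55, 70),
    holdout: records.slice(70),
  },
  runScenario: (scenario, candidate) =>
    productAdapter.runAgentStep(scenario.id, candidate),
})
```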
## Core Primitives

| Primitive | Purpose |
|---|---|
| `TraceEmitter`, `TraceStore` | Append-only run/span/event/artifact/budget records. |
| `TraceAnalyst` | Bounded investigation over trace corpora for systemic failure modes. |
| `SandboxHarness` | Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts. |
| `MultiLayerVerifier` | Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps. |
| `JudgeRunner` | Parallel deterministic or LLM-backed judges over the same artifact/run. |
| `runAgentControlLoop` | Observe/validate/decide/act loop with budgets, stop policies, and structured eval results. |
| `controlRunToRunRecord` | Converts control-loop evidence into strict promotion/report rows. |
| `Dataset`, `RunRecord`, `HeldOutGate` | Versioned corpora, reproducible run metadata, and held-out promotion decisions. |
| `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | Paired experiment statistics and multiple-comparison correction. |
| `classifyFailure` | Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures. |
| `runMultiShotOptimization` | Optimization over full agent trajectories with actionable side information. |
| `runPromptEvolution` | Prompt/steering/code evolution over scenario scores. |
| `evaluateReleaseConfidence` | Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates. |
| `renderReleaseReport`, `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready decision artifacts and chart specs. |
| `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
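For a feel of the statistics primitives — comparing a baseline and a candidate on the same scenario set — here is a minimal sketch; the argument and return shapes (for example `.pValue` on the Wilcoxon result) are assumptions for illustration rather than the exact signatures:

```ts
import { pairedBootstrap, pairedWilcoxon, bhAdjust } from '@tangle-network/agent-eval'

// Per-scenario scores for the same scenarios under two variants.
const baseline = [0.62, 0.71, 0.55, 0.8, 0.66]
const candidate = [0.7, 0.74, 0.61, 0.79, 0.72]

// Shapes below are illustrative assumptions, not the exact signatures.
const bootstrap = pairedBootstrap(baseline, candidate) // paired effect size + CI
const wilcoxon = pairedWilcoxon(baseline, candidate) // paired signed-rank test
const [adjustedP] = bhAdjust([wilcoxon.pValue]) // multiple-comparison correction

console.log({ bootstrap, adjustedP })
```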
`NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems should implement `Researcher` directly or use `CallbackResearcher`.
## Adoption Path

- Choose one real workflow: code generation, browser task, research task, workflow builder, voice interaction, or domain agent task.
- Write a product adapter that can observe state and execute one agent step (see the adapter sketch after this list).
- Add deterministic validators first: build, test, serve, schema, policy, permission, retrieval, and deployment checks.
- Add LLM judges only for subjective quality that deterministic checks cannot measure.
- Emit traces and convert successful and failed attempts into `FeedbackTrajectory` records.
- Build train/dev/test/holdout scenarios from those trajectories.
- Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
- Promote only when test/holdout gates and real product telemetry improve.
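The adapter shape is owned by your product, not by agent-eval. A minimal illustrative sketch, matching how the Quick Start uses `productAdapter` (the interface name and state fields here are assumptions):

```ts
// Illustrative adapter contract; agent-eval does not prescribe this interface.
interface ProductAdapter {
  readState(taskId: string): Promise<{
    build: { exitCode: number }
    preview: { httpStatus: number }
  }>
  runAgentStep(taskId: string, action: { type: string }): Promise<void>
  storeControlResult(taskId: string, result: unknown): Promise<void>
}

export const productAdapter: ProductAdapter = {
  async readState(taskId) {
    // Query your real build/test/preview infrastructure here.
    return { build: { exitCode: 0 }, preview: { httpStatus: 200 } }
  },
  async runAgentStep(taskId, action) {
    // Dispatch one real agent step through your production runtime.
  },
  async storeControlResult(taskId, result) {
    // Persist run evidence in your own storage.
  },
}
```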
For a complete product integration guide, see Product Eval Adoption.
## Examples

Runnable examples live in the repository's `examples/` directory. They are not part of the published npm package.

- `examples/same-sandbox-harness` – run multiple eval passes against the same workspace.
- `examples/multi-shot-optimization` – optimize full agent trajectories with held-out promotion.
- `examples/benchmarks` – benchmark adapter shape and reference benchmark wrappers.
The examples are intentionally kept outside the README so they can be expanded, tested, and copied without turning this page into a tutorial.
## Documentation

- Concepts
- Feature Guide
- Product Eval Adoption
- Trace Analysis
- Control Runtime
- Knowledge Readiness
- Integration Launch Gates
- Multi-Shot Optimization
- Feedback Trajectories
- Wire Protocol
## Development

```bash
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi
```

Run the local server:

```bash
pnpm build
node dist/cli.js serve --port 5005
```

Python client tests:

```bash
pnpm build
cd clients/python
pip install -e ".[dev]"
pytest
```

`@tangle-network/agent-eval` publishes to npm. The Python client lives under `clients/python` and is versioned from this repository.
## Related Packages

- `@tangle-network/agent-runtime`
- `@tangle-network/agent-knowledge`
- `@tangle-network/agent-integrations`
- `@tangle-network/agent-gateway`
- `@tangle-network/tcloud`
## License

MIT