@tangle-network/agent-eval

Evaluation infrastructure for agent systems.

agent-eval gives agent products a reusable way to record what happened, verify outcomes, classify failures, compare variants, optimize prompts or policies, and make release decisions from evidence instead of anecdotes.

It does not own your product state, credentials, UI, or model routing. Product teams keep those boundaries; this package standardizes how runs are recorded, checked, compared, and promoted.

When To Use It

Use agent-eval when you need one or more of these:

  • A reproducible eval harness for coding agents, builder agents, or multi-tool workflows.
  • Structured traces for agent runs: spans, artifacts, events, budgets, tool calls, retrieval, judge output, and sandbox execution.
  • Deterministic gates around build/test/deploy checks.
  • LLM-as-judge or deterministic judge fleets with calibration and canaries.
  • Dataset splits, holdouts, paired statistics, and release confidence gates.
  • Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval, evaluator, and knowledge-readiness failures.
  • Optimization loops over prompts, steering, code mutations, or full multi-shot trajectories.
  • Report data for internal launch reviews, CI gates, and research analysis.

Architecture

agent/product run
  -> TraceEmitter / TraceStore
  -> TraceAnalyst / failure taxonomy
  -> SandboxHarness / MultiLayerVerifier / JudgeRunner
  -> metrics + run records
  -> paired stats + held-out gates
  -> optimization + release confidence + reports

Package responsibilities:

  • agent-eval: run evidence, eval contracts, verification, statistics, optimization, reporting.
  • Product app: domain state, tools, credentials, UI, storage, deployment, model gateway.
  • @tangle-network/agent-runtime: production agent-loop/session runtime.
  • @tangle-network/agent-knowledge: evidence stores, claim/page synthesis, retrieval, knowledge readiness implementation.

Install

pnpm add @tangle-network/agent-eval

Wire protocol / CLI:

npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005

Python client source lives in clients/python. Until the PyPI package is published, install it from the repo:

cd clients/python
pip install -e .

Quick Start

Wrap the real product loop first. Do not build a toy eval path that users never exercise.

import {
  objectiveEval,
  runAgentControlLoop,
} from '@tangle-network/agent-eval'

const result = await runAgentControlLoop({
  intent: task.prompt,
  budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },

  async observe() {
    return productAdapter.readState(task.id)
  },

  async validate({ state }) {
    return [
      objectiveEval({
        id: 'build-passes',
        passed: state.build.exitCode === 0,
        severity: 'critical',
        metadata: state.build,
      }),
      objectiveEval({
        id: 'preview-serves',
        passed: state.preview.httpStatus === 200,
        severity: 'critical',
      }),
    ]
  },

  async decide({ evals }) {
    return evals.every((evalResult) => evalResult.passed)
      ? { type: 'stop', reason: 'all critical checks passed' }
      : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
  },

  async act(action) {
    return productAdapter.runAgentStep(task.id, action)
  },
})

await productAdapter.storeControlResult(task.id, result)

Once this loop represents production behavior, convert completed runs into feedback trajectories, split them into train/dev/test/holdout sets, and run multi-shot optimization against the same adapter.
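
A rough sketch of that handoff, assuming controlRunToRunRecord accepts the control-loop result plus dataset metadata and runMultiShotOptimization is driven by labelled scenario splits. The option names below are guesses; check the exported types for the real shapes.

import {
  controlRunToRunRecord,
  runMultiShotOptimization,
} from '@tangle-network/agent-eval'

// completedRuns: results collected from runAgentControlLoop in production.
// The datasetId option is an assumption, not the published signature.
const records = completedRuns.map((run) =>
  controlRunToRunRecord(run, { datasetId: 'builder-tasks-v1' }),
)

// Split the derived scenarios before optimizing so test/holdout stay untouched,
// then drive optimization through the same product adapter the loop used.
const outcome = await runMultiShotOptimization({
  scenarios: { train, dev, test, holdout },
  runScenario: (scenario) => productAdapter.runAgentStep(scenario.taskId, scenario.action),
})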

Core Primitives

  • TraceEmitter, TraceStore: Append-only run/span/event/artifact/budget records.
  • TraceAnalyst: Bounded investigation over trace corpora for systemic failure modes.
  • SandboxHarness: Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts.
  • MultiLayerVerifier: Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps.
  • JudgeRunner: Parallel deterministic or LLM-backed judges over the same artifact/run.
  • runAgentControlLoop: Observe/validate/decide/act loop with budgets, stop policies, and structured eval results.
  • controlRunToRunRecord: Converts control-loop evidence into strict promotion/report rows.
  • Dataset, RunRecord, HeldOutGate: Versioned corpora, reproducible run metadata, and held-out promotion decisions.
  • pairedBootstrap, pairedWilcoxon, bhAdjust: Paired experiment statistics and multiple-comparison correction.
  • classifyFailure: Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures.
  • runMultiShotOptimization: Optimization over full agent trajectories with actionable side information.
  • runPromptEvolution: Prompt/steering/code evolution over scenario scores.
  • evaluateReleaseConfidence: Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates.
  • renderReleaseReport, summaryTable, paretoChart, gainHistogram: Report-ready decision artifacts and chart specs.
  • KnowledgeRequirement, KnowledgeBundle: Shared contracts for knowledge readiness.

NoopResearcher is a fail-loud sentinel for wiring tests. Production systems should implement Researcher directly or use CallbackResearcher.
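
To show how the verification primitives compose, here is a hypothetical sketch. The constructor options and method names are assumptions rather than the published API; treat it as orientation, not reference.

import { MultiLayerVerifier, classifyFailure } from '@tangle-network/agent-eval'

// sandbox: your SandboxHarness or product build/test runner.
// Hypothetical stage wiring: deterministic checks first, each with a time cap,
// and later stages skipped when a dependency fails.
const verifier = new MultiLayerVerifier({
  stages: [
    { id: 'build', dependsOn: [], maxMs: 120_000, run: () => sandbox.build() },
    { id: 'tests', dependsOn: ['build'], maxMs: 300_000, run: () => sandbox.test() },
  ],
})

const verdict = await verifier.run(runArtifacts)

// Route failing runs through the rule-based taxonomy so prompt, tool, sandbox,
// retrieval, and evaluator failures are counted as separate failure modes.
const failure = verdict.passed ? null : classifyFailure(verdict)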

Adoption Path

  1. Choose one real workflow: code generation, browser task, research task, workflow builder, voice interaction, or domain agent task.
  2. Write a product adapter that can observe state and execute one agent step (see the adapter sketch after this list).
  3. Add deterministic validators first: build, test, serve, schema, policy, permission, retrieval, and deployment checks.
  4. Add LLM judges only for subjective quality that deterministic checks cannot measure.
  5. Emit traces and convert successful and failed attempts into FeedbackTrajectory records.
  6. Build train/dev/test/holdout scenarios from those trajectories.
  7. Run runMultiShotOptimization() or prompt/code evolution on train/dev.
  8. Promote only when test/holdout gates and real product telemetry improve.
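
Steps 2 and 3 usually reduce to a small adapter plus a handful of deterministic validators. The sketch below reuses readState and runAgentStep from the Quick Start; everything else about the shape is illustrative, not a published contract.

import { objectiveEval } from '@tangle-network/agent-eval'

// Minimal state the validators below need; real product state will be richer.
type ProductState = {
  build: { exitCode: number }
  schemaErrors: string[]
}

// Adapter shape assumed by the Quick Start loop: observe state, run one step,
// persist the control result. Names beyond readState/runAgentStep are illustrative.
interface ProductAdapter {
  readState(taskId: string): Promise<ProductState>
  runAgentStep(taskId: string, action: { type: string }): Promise<void>
  storeControlResult(taskId: string, result: unknown): Promise<void>
}

// Deterministic validators derive directly from observed state, so the same
// checks can back the control loop, CI gates, and eval datasets.
const validate = (state: ProductState) => [
  objectiveEval({ id: 'build-passes', passed: state.build.exitCode === 0, severity: 'critical' }),
  objectiveEval({ id: 'schema-valid', passed: state.schemaErrors.length === 0, severity: 'critical' }),
]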

For a complete product integration guide, see Product Eval Adoption.

Examples

Runnable examples live in the repository's examples/ directory. They are not part of the published npm package.

The examples are intentionally kept outside the README so they can be expanded, tested, and copied without turning this page into a tutorial.

Development

pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi

Run the local server:

pnpm build
node dist/cli.js serve --port 5005

Python client tests:

pnpm build
cd clients/python
pip install -e ".[dev]"
pytest

Release

@tangle-network/agent-eval publishes to npm. The Python client lives under clients/python and is versioned from this repository.

License

MIT
