fraware/LabTrust-Gym

LabTrust-Gym

License: Apache-2.0 Python 3.11+

A multi-agent environment (PettingZoo/Gym) for hospital lab automation

Install: pip install labtrust-gym[env,plots] then labtrust --version and labtrust quick-eval.

Documentation — The published site at https://fraware.github.io/LabTrust-Gym/ covers getting started, benchmarks, coordination, PCS, and operations.

Research and simulation only: This project targets research, benchmarks, and regression testing inside a simulated hospital lab environment. It is not production medical or laboratory software. See SECURITY.md for scope and for reporting issues.

What is LabTrust-Gym?

| Pillar | Goal |
| --- | --- |
| Environment | Pip-installable, standard multi-agent API (PettingZoo AEC or parallel). |
| Trust skeleton | Roles/permissions, signed actions, hash-chained audit log, invariants, reason codes. |
| Benchmarks | Tasks (throughput_sla, adversarial_disruption, insider_key_misuse, coord_scale, coord_risk) and baselines (scripted, MARL, LLM). The golden suite defines correctness; regression means passing the suite. Safety/throughput trade-offs are measurable. |
| Coordination | Pluggable coordination methods; coord_scale (scale stress) and coord_risk (under injection). Method–risk matrix and coordination security pack with gate thresholds; SOTA and method-class comparison. |
| Security & safety | Security attack suite (prompt injection, tool, memory, detector, coordination-under-attack); risk register bundle with evidence and gaps; coverage gate (required_bench); safety case. Evidence bundles and verify-release chain for auditability. |
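
To make the "hash-chained audit log" idea in the trust skeleton concrete, here is a minimal sketch of the general technique. It is not LabTrust-Gym's actual log format: the field names (`prev`, `payload`, `hash`), the all-zero genesis value, and the agent and action names are assumptions chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], payload: dict) -> None:
    """Append a payload, linking it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload,
                "hash": entry_hash(prev_hash, payload)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"agent": "reception", "action": "ACCEPT_SPECIMEN"})  # hypothetical names
append_entry(log, {"agent": "analyzer", "action": "START_RUN"})
assert verify_chain(log)

log[0]["payload"]["action"] = "DISCARD_SPECIMEN"  # tamper with an early entry
assert not verify_chain(log)
```

Because each hash covers the previous one, changing any earlier entry invalidates every later link, which is what makes such a log useful as audit evidence.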

Principles

  • Golden scenarios drive development — Correctness is defined as passing the golden suite, which acts as the specification for regression. The suite covers the scenarios encoded in policy; uncovered failure modes remain outside assured behavior until added to golden policy.
  • Policy is data — Invariants, tokens, reason codes, catalogue, zones live in versioned files under policy/.
  • Explicit failures — Missing hooks or invalid data raise reason codes instead of continuing silently.
  • Evidence over claims — Security and safety are evidenced by the attack suite, coordination security pack, and risk register; required_bench cells must be covered or explicitly waived.
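
The "explicit failures" principle can be illustrated with a small sketch. Everything here is hypothetical: the reason-code strings, the `Outcome` type, and the hook registry are stand-ins, since the real codes live in the policy files and are not listed in this README.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical reason codes for illustration only.
HOOK_MISSING = "HOOK_MISSING"
INVALID_PAYLOAD = "INVALID_PAYLOAD"


@dataclass(frozen=True)
class Outcome:
    accepted: bool
    reason_code: Optional[str] = None


def apply_action(action: dict, hooks: dict[str, Callable[[dict], None]]) -> Outcome:
    """Reject with a reason code instead of silently skipping the action."""
    kind = action.get("kind")
    if not isinstance(kind, str):
        return Outcome(False, INVALID_PAYLOAD)
    hook = hooks.get(kind)
    if hook is None:
        return Outcome(False, HOOK_MISSING)
    hook(action)
    return Outcome(True)


seen: list[dict] = []
hooks = {"MOVE": seen.append}  # one registered hook that records the action

assert apply_action({"kind": "MOVE"}, hooks) == Outcome(True)
assert apply_action({"kind": "CENTRIFUGE"}, hooks) == Outcome(False, HOOK_MISSING)
assert apply_action({}, hooks) == Outcome(False, INVALID_PAYLOAD)
```

The point of the pattern is that a caller always receives either an accepted outcome or a named reason for rejection, so a missing hook shows up in the results rather than as a silent no-op.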

The system and threat model are described in Systems and threat model.

Limitation — Passing every simulation test and gate demonstrates behavior inside the benchmark harness. Production deployments still require the integrator to account for distribution shift, live adversaries, key and operations failures, and environment drift. Use the simulation for development and regression; production assurance remains the integrator's responsibility.


Who is this for? / I want to...

| I want to... | First step |
| --- | --- |
| Run benchmarks only | pip install labtrust-gym[env,plots] then labtrust quick-eval |
| Add my coordination method (or task) | Extension development + entry_points; see examples/extension_example |
| Fork and customize policy | Forker guide and labtrust forker-quickstart |
| Use as a library without forking | Extension development + --profile + extension_packages in a lab profile |
| Run the full security suite | labtrust run-security-suite; needs .[env]; use --skip-system-level when env is not installed |
| Run the PCS QC-release demo (proof-carrying science) | PCS docs and examples/pcs_qc_release; scripts/setup_pcs_dev.ps1, then labtrust run-demo qc-release |

Stable surface for extensions: Public API.


Installation (pip)

From PyPI (env + plots for benchmarks and quick-eval)

pip install labtrust-gym[env,plots]
labtrust --version
labtrust quick-eval

Runs one episode each of throughput_sla, adversarial_disruption, and multi_site_stat with scripted baselines; summary and logs under ./labtrust_runs/.
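
Each episode is a loop over a multi-agent environment. The sketch below shows the shape of that loop for the PettingZoo parallel API (`reset` returns observations and infos; `step` takes a dict of actions and returns observations, rewards, terminations, truncations, and infos). The README does not show how a LabTrust-Gym environment is constructed, so a tiny stub class stands in for it; `TwoStepStubEnv` and `run_episode` are illustrative names, not part of the package.

```python
from typing import Any, Callable


class TwoStepStubEnv:
    """Stand-in that mimics parallel-API method shapes (not a LabTrust-Gym env)."""

    possible_agents = ["agent_0", "agent_1"]

    def reset(self, seed=None, options=None):
        self.agents = list(self.possible_agents)
        self._t = 0
        return {a: 0 for a in self.agents}, {a: {} for a in self.agents}

    def step(self, actions):
        self._t += 1
        live = list(self.agents)
        done = self._t >= 2  # the stub ends every episode after two steps
        observations = {a: self._t for a in live}
        rewards = {a: 1.0 for a in live}
        terminations = {a: done for a in live}
        truncations = {a: False for a in live}
        infos = {a: {} for a in live}
        if done:
            self.agents = []  # parallel envs drop agents once they are finished
        return observations, rewards, terminations, truncations, infos


def run_episode(env, policy: Callable[[str, Any], Any], seed: int = 0) -> dict[str, float]:
    """Step until no agents remain and return the summed reward per agent."""
    observations, _infos = env.reset(seed=seed)
    totals = {agent: 0.0 for agent in env.agents}
    while env.agents:
        actions = {agent: policy(agent, observations[agent]) for agent in env.agents}
        observations, rewards, _terms, _truncs, _infos = env.step(actions)
        for agent, reward in rewards.items():
            totals[agent] += reward
    return totals


totals = run_episode(TwoStepStubEnv(), policy=lambda agent, obs: 0)
print(totals)  # {'agent_0': 2.0, 'agent_1': 2.0}
```

A scripted baseline fits the `policy` slot directly: it is a function from agent name and observation to an action.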

From source (development)

git clone https://github.com/fraware/LabTrust-Gym.git
cd LabTrust-Gym
pip install -e ".[dev]"
labtrust validate-policy
pytest -q

Full stack (benchmarks, studies, plots)

pip install -e ".[dev,env,plots]"
labtrust run-benchmark --task throughput_sla --episodes 5 --out results.json
labtrust reproduce --profile minimal

New to the repo? See the Forker guide for customizing and Quick demos for running commands end-to-end.

Extending without forking

  • Option A — Fork and customize via partner overlay and policy. Forker guide.
  • Option B — Install labtrust-gym and ship your own pip package (domains, tasks, coordination methods, etc. via register_* or entry_points; --profile and extension_packages). Extension development – Option B.

Optional extras

| Extra | Purpose |
| --- | --- |
| [env] | PettingZoo/Gymnasium (benchmarks and full security suite including coord_pack_ref) |
| [plots] | Matplotlib and Pillow (study figures, data tables) |
| [llm_openai] | OpenAI live backend (openai_live) |
| [llm_anthropic] | Anthropic live backend (anthropic_live) |
| [marl] | Stable-Baselines3 (PPO train/eval) |
| [marl_hpo] | Optuna (HPO for PPO) |
| [docs] | MkDocs + mkdocstrings |

Full security suite (including coord_pack_ref) requires [env]; use --skip-system-level when env is not installed.


Pipelines

Benchmarks run in one of three modes: deterministic | llm_offline | llm_live (Live LLM). Defaults are offline (no network, no API cost).

flowchart LR
    Run["Run benchmark"]
    Run --> D["deterministic (default)"]
    Run --> O["llm_offline"]
    Run --> L["llm_live + --allow-network"]
    D --> NoNet["No network"]
    O --> NoNet
    L --> Net["Network / API"]

| Mode | Network | Agents | Use case |
| --- | --- | --- | --- |
| deterministic | No | Scripted only | CI, regression, reproduce, paper artifact (default) |
| llm_offline | No | LLM interface, deterministic backend only | Offline LLM evaluation, no API calls |
| llm_live | Yes (opt-in) | Live OpenAI/Ollama | Interactive or cost-accepting runs; requires --allow-network |

Set mode with --pipeline-mode; for live LLM add --allow-network or LABTRUST_ALLOW_NETWORK=1.
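
The opt-in rule above can be sketched as a small predicate. This is an illustration of the documented behavior, not the CLI's implementation: the function name and signature are assumptions, and how the real CLI reacts when llm_live is requested without the opt-in (for example, by exiting with an error) is not specified in this README.

```python
import os
from typing import Mapping, Optional

PIPELINE_MODES = ("deterministic", "llm_offline", "llm_live")


def network_permitted(mode: str, allow_network_flag: bool = False,
                      environ: Optional[Mapping[str, str]] = None) -> bool:
    """True only for llm_live combined with an explicit opt-in."""
    if mode not in PIPELINE_MODES:
        raise ValueError(f"unknown pipeline mode: {mode!r}")
    env = os.environ if environ is None else environ
    opted_in = allow_network_flag or env.get("LABTRUST_ALLOW_NETWORK") == "1"
    return mode == "llm_live" and opted_in


assert not network_permitted("deterministic", allow_network_flag=True, environ={})
assert not network_permitted("llm_offline", environ={"LABTRUST_ALLOW_NETWORK": "1"})
assert not network_permitted("llm_live", environ={})
assert network_permitted("llm_live", allow_network_flag=True, environ={})
assert network_permitted("llm_live", environ={"LABTRUST_ALLOW_NETWORK": "1"})
```

Keeping the two offline modes unable to reach the network even when the flag is set is what makes the defaults safe for CI.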


Quick eval

labtrust quick-eval

Output: markdown summary (throughput, violations, blocked counts) and logs under ./labtrust_runs/quick_eval_<timestamp>/. Use --seed and --out-dir to customize.

Canonical demos: labtrust forker-quickstart, labtrust quick-eval, labtrust run-summary --run <dir>, labtrust run-official-pack (add --include-coordination-pack for coordination and security evidence). Quick demos lists "if you want to see X, run Y."

Example agents: Example experiments; agents and configs in examples/. Optional notebook examples/quick_eval.ipynb (requires .[env,plots]). External agent:

labtrust eval-agent --agent 'examples.external_agent_demo:SafeNoOpAgent' --task throughput_sla --episodes 2 --out out.json
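
The `--agent` value uses the common `module:attribute` form. As a rough sketch of how such a string can be resolved (the helper name `load_object` is made up here, and the interface the CLI expects from the agent class is not shown in this README), the standard library is enough:

```python
import importlib


def load_object(spec: str):
    """Resolve a 'package.module:Attribute' string to the object it names."""
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    obj = importlib.import_module(module_name)
    for part in attr_path.split("."):  # allow nested attributes such as Outer.Inner
        obj = getattr(obj, part)
    return obj


# Demonstrated with a standard-library class so the example runs anywhere.
ordered_dict_cls = load_object("collections:OrderedDict")
print(ordered_dict_cls.__name__)  # OrderedDict
```

With the repository on the import path, the same call with `examples.external_agent_demo:SafeNoOpAgent` would return that class.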

CLI

CLI commands write their outputs under labtrust_runs/ or to the path given by --out. Exit codes, minimal smoke arguments, and output paths are specified in the CLI output contract. Commands are smoke-tested in tests/test_cli_smoke_matrix.py.

Policy and validation

| Command | Description |
| --- | --- |
| validate-policy | Validate policy YAML/JSON. --domain <domain_id> merges base + policy/domains/<domain_id>/; --partner <id> for overlay. |
| forker-quickstart | One-command forker: validate-policy, coordination pack, lab report, risk register export. Forker guide. |

Benchmarking and evaluation

| Command | Description |
| --- | --- |
| quick-eval | One episode each of throughput_sla, adversarial_disruption, multi_site_stat; summary + logs under ./labtrust_runs/. |
| run-benchmark | Run tasks (throughput_sla, stat_insertion, qc_cascade, adversarial_disruption, multi_site_stat, insider_key_misuse, coord_scale, coord_risk). Requires --task, --out. Options: --episodes, --seed, --coord-method, --injection, --scale, --timing, --llm-backend, --llm-agents, --always-step-timing, --approval-hook. Agent-centric: --agent-driven, --multi-agentic; optional --use-parallel-multi-agentic. Live LLM, Scale limits. |
| run-summary | One-line stats for a run dir. --run <dir>, --format json. |
| eval-agent | Benchmark with external agent (e.g. examples.external_agent_demo:SafeNoOpAgent or PPO via LABTRUST_PPO_MODEL and labtrust_gym.baselines.marl.ppo_agent:PPOAgent). |
| bench-smoke | One episode per task (throughput_sla, stat_insertion, qc_cascade). |
| determinism-report | Run the same configuration twice and assert that the v0.2 metrics and the episode log hash match. Requires --task, --episodes, --seed, --out. |
| train-ppo, eval-ppo | PPO train/eval (.[marl]). Writes train_config.json. Optional HPO: .[marl_hpo]. MARL baselines. |
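
The idea behind determinism-report (run twice, compare a hash of the episode log) can be sketched as follows. The toy episode generator and the JSON-lines canonicalization are assumptions made for this example; the real log schema and hashing scheme are defined by the project, not by this snippet.

```python
import hashlib
import json
import random


def toy_episode(seed: int, steps: int = 5) -> list[dict]:
    """Toy stand-in for a benchmark episode: all randomness comes from `seed`."""
    rng = random.Random(seed)
    return [{"step": t, "action": rng.randrange(4)} for t in range(steps)]


def log_hash(events: list[dict]) -> str:
    """SHA-256 over a canonical JSON-lines rendering of the episode log."""
    lines = (json.dumps(e, sort_keys=True, separators=(",", ":")) for e in events)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


first = log_hash(toy_episode(seed=123))
second = log_hash(toy_episode(seed=123))
assert first == second, "same seed should reproduce the same episode log"
```

Canonical serialization (sorted keys, fixed separators) matters here: without it, two logically identical logs could hash differently.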

Export and verification

| Command | Description |
| --- | --- |
| export-receipts | Receipt.v0.1 and EvidenceBundle.v0.1 from episode log. |
| export-fhir | HL7 FHIR R4 Bundle from receipts (data-absent-reason, no placeholder IDs). FHIR export. |
| validate-fhir | Validate bundle codes: --bundle <path> --terminology <value_set_json> [--strict]. FHIR export. |
| verify-bundle | Verify one EvidenceBundle.v0.1. --strict-fingerprints for coordination, memory, rbac, tool_registry. |
| verify-release | Verify release: EvidenceBundles, risk register, RELEASE_MANIFEST hashes. --strict-fingerprints for releases. Trust verification. |
| build-release-manifest | Write RELEASE_MANIFEST.v0.1.json into --release-dir. Run after export-risk-register; then verify-release. |
| ui-export | UI-ready zip (index, events, receipts_index, reason_codes). UI data contract. |
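
The build-manifest-then-verify flow boils down to recording a hash per file and re-checking it later. The sketch below illustrates that general pattern only. The manifest filename and its flat path-to-digest layout are invented for the example and do not reproduce the RELEASE_MANIFEST.v0.1 schema.

```python
import hashlib
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "EXAMPLE_MANIFEST.json"  # hypothetical name and layout


def sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(release_dir: Path) -> dict[str, str]:
    """Record a SHA-256 digest for every file under release_dir."""
    files = sorted(p for p in release_dir.rglob("*")
                   if p.is_file() and p.name != MANIFEST_NAME)
    manifest = {p.relative_to(release_dir).as_posix(): sha256_file(p) for p in files}
    (release_dir / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def verify_manifest(release_dir: Path) -> list[str]:
    """Return relative paths that are missing or whose digest no longer matches."""
    manifest = json.loads((release_dir / MANIFEST_NAME).read_text())
    return [rel for rel, digest in manifest.items()
            if not (release_dir / rel).is_file()
            or sha256_file(release_dir / rel) != digest]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "receipts.json").write_text('{"ok": true}')
    build_manifest(root)
    assert verify_manifest(root) == []
    (root / "receipts.json").write_text('{"ok": false}')  # modify after signing off
    assert verify_manifest(root) == ["receipts.json"]
```

This sketch only detects changed or missing files; it does not flag files added after the manifest was written, which a stricter verifier would also check.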

Security and safety

| Command | Description |
| --- | --- |
| run-security-suite | Smoke/full; SECURITY/attack_results.json. Options: `--agent-driven-mode single |
| safety-case | Generate SAFETY_CASE/. Risk register. |
| run-official-pack | Official pack (baselines, coordination, security, safety, transparency). --out <dir>, --seed-base, --include-coordination-pack for coordination_pack/ and lab report. Official benchmark pack. |

Risk register

| Command | Description |
| --- | --- |
| export-risk-register | RiskRegisterBundle.v0.1 to --out; --runs (repeatable) for evidence dirs. Gaps as first-class. Risk register. |
| build-risk-register-bundle | Same bundle to explicit path. |
| validate-coverage | Required_bench evidenced or waived. --strict to fail on missing. |
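
The coverage rule (every required_bench cell is either evidenced or explicitly waived) is a set difference at heart. The sketch below assumes cells can be identified by plain strings; the cell identifiers and function names are hypothetical, and the real inputs come from the risk registry, waivers, and required_bench plan under policy/.

```python
def coverage_gaps(required: set[str], evidenced: set[str], waived: set[str]) -> list[str]:
    """Required cells that have neither evidence nor an explicit waiver."""
    return sorted(required - evidenced - waived)


def coverage_gate(required: set[str], evidenced: set[str], waived: set[str],
                  strict: bool = False) -> bool:
    """Return True when fully covered; in strict mode, raise on any gap."""
    gaps = coverage_gaps(required, evidenced, waived)
    if gaps and strict:
        raise SystemExit(f"coverage gate failed; uncovered cells: {gaps}")
    return not gaps


# Hypothetical cell identifiers of the form "<risk>:<benchmark>".
required = {"risk_a:coord_risk", "risk_b:coord_risk", "risk_c:throughput_sla"}
evidenced = {"risk_a:coord_risk"}
waived = {"risk_c:throughput_sla"}

assert coverage_gaps(required, evidenced, waived) == ["risk_b:coord_risk"]
assert coverage_gate(required, evidenced | {"risk_b:coord_risk"}, waived, strict=True)
```

Treating waivers as a separate set keeps gaps visible: a cell is never silently dropped, it is either proven or waived on the record.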

Coordination and studies

| Command | Description |
| --- | --- |
| run-coordination-study | Scale x method x injection; summary_coord.csv, pareto.md, SOTA leaderboard. Coordination studies. |
| run-coordination-security-pack | Regression pack. --out, --matrix-preset (hospital_lab, hospital_lab_full, full_matrix, exploratory_*). pack_results/, pack_summary.csv, pack_gate.md. Security attack suite. |
| summarize-coordination | SOTA leaderboard, method-class comparison. |
| recommend-coordination-method | COORDINATION_DECISION.v0.1.json from run dir. |
| build-coordination-matrix | CoordinationMatrix v0.1 from llm_live run. |
| run-study | Study from spec (--spec, --out). |
| make-plots | Figures and data tables from study run. |

Release and reproducibility

| Command | Description |
| --- | --- |
| reproduce | Minimal/full results + figures (`--profile minimal |
| package-release | Release artifact: receipts, FHIR, MANIFEST, BENCHMARK_CARD. --profile paper_v0.1 for paper-ready. Paper provenance. |
| generate-official-baselines | Core tasks with official baselines. Registry: benchmarks/baseline_registry.v0.1.yaml. |
| summarize-results | summary_v0.2.csv, summary_v0.3.csv, summary.md (bounded memory). Metrics contract. |
| serve | HTTP server (auth, rate limits). Security controls. |

Repository structure

| Path | Description |
| --- | --- |
| policy/ | YAML/JSON: schemas, emits, invariants, tokens, reason_codes, zones, catalogue, coordination, golden, official, llm, partners, risks (risk_registry, waivers, required_bench_plan.v0.1). labtrust validate-policy. |
| src/labtrust_gym/ | Package: config, engine/, envs/, baselines/, benchmarks/, policy/, security/, studies/, export/, online/, runner/, cli/. |
| tests/ | Pytest: golden suite, policy, benchmarks, coordination, risk_injections, studies, export, online, CLI smoke (test_cli_smoke_matrix.py). |
| benchmarks/ | Baseline registry, official baselines (v0.1, v0.2). |
| examples/ | Example agents (external_agent_demo, scripted_ops_agent, llm_agent_mock_demo, etc.). |
| docs/ | Published site: fraware.github.io/LabTrust-Gym. Source under docs/ (MkDocs): architecture, benchmarks, coordination, PCS, contracts, security, agents. Forker guide. docs/assets/: repo logo (Logo.png). |
| scripts/ | run_hospital_lab_full_pipeline.py (orchestrator; --include-coordination-pack, --providers), check_llm_backends_live.py, quickstart, run_required_bench_matrix, extract_paper_claims_snapshot, build_release_fixture, build_viewer_data_from_release, run_external_reviewer_checks. |
| tests/fixtures/ui_fixtures/ | Minimal results, episode log, evidence bundle for offline UI. |

Reproducibility and citation

Cite using CITATION.cff.

| Action | Command / reference |
| --- | --- |
| Reproduce | labtrust reproduce --profile minimal (see Reproduce). |
| Release artifact | labtrust package-release --profile minimal --out /tmp/labtrust_release. Paper-ready: --profile paper_v0.1 (see Paper provenance). |
| Research and audit | Paper-ready artifact + verify-release (see Quick demos, Paper provenance). |
| Standardized evaluation | Benchmark card, official baselines v0.2 (see Use cases and impact). |
| Official baselines | v0.2 in benchmarks/baselines_official/v0.2/. Regenerate: labtrust generate-official-baselines --out benchmarks/baselines_official/v0.2/ --episodes 3 --seed 123 --force. Compare: labtrust summarize-results --in benchmarks/baselines_official/v0.2/results/ your_results.json --out /tmp/compare. |
| Cite | CITATION.cff or LabTrust-Gym: a multi-agent environment for hospital lab automation (pathology lab / blood sciences) with a trust skeleton. https://github.com/fraware/LabTrust-Gym. |

License

Apache-2.0.
