feat(checkpoint): wire checkpointing into agent event loop #2190

Open
JackYPCOnline wants to merge 7 commits into strands-agents:main from JackYPCOnline:checkpoint_1

Contributor

JackYPCOnline commented Apr 22, 2026

Description

Wires the Checkpoint data model (landed in #2181) into the agent runtime so an opt-in checkpointing=True agent pauses at ReAct cycle boundaries and resumes cleanly from persisted checkpoints — including across fresh process boundaries, which is what makes durability providers like Temporal, Dapr, and AWS Step Functions usable with Strands.

The design mirrors the existing interrupt pattern by construction: stop_reason="checkpoint", a checkpointResume content block for resume, and snapshot-based state transfer. Users who already know interrupts will find the flow familiar.

User-facing API (zero breaking changes — opt-in only):

agent = Agent(tools=[...], checkpointing=True)
result = await agent.invoke_async("do the thing")

while result.stop_reason == "checkpoint":
    # persist anywhere: Temporal Event History, DB, file, etc.
    save_somewhere(result.checkpoint.to_dict())

    # resume in a fresh process / Agent instance
    result = await fresh_agent.invoke_async(
        [{"checkpointResume": {"checkpoint": result.checkpoint.to_dict()}}]
    )

print(result.message)  # stop_reason == "end_turn"

What changed:

  • Agent.__init__ — new checkpointing: bool = False parameter and two internal fields (_checkpointing, _checkpoint_resume_context). Default False: zero behavioral change for existing callers.
  • Agent._try_consume_checkpoint_resume — new helper extracted from _convert_prompt_to_messages. Detects checkpointResume content blocks, validates shape (mirrors _InterruptState.resume() conventions: TypeError for shape, KeyError for lookup, ValueError for misconfig, CheckpointException for schema mismatch), loads the snapshot, and stashes the resume context.
  • event_loop_cycle — one priming block (reads + one-shot clears resume context) plus two checkpoint emission points (after_model and after_tools) factored through _build_checkpoint_stop_event. All gated on agent._checkpointing; non-checkpointing callers see no behavioral change, including the cancel-during-tool-execution path.
  • AgentResult — new checkpoint: Checkpoint | None = None field; to_dict / from_dict round-trip it.
  • EventLoopStopEvent — extended constructor with checkpoint kwarg; the 7-tuple matches AgentResult field order for positional unpacking.
  • strands.experimental.checkpoint — exports Checkpoint, CheckpointPosition, CHECKPOINT_SCHEMA_VERSION. Checkpoint is @dataclass(frozen=True).
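To make the round-trip and schema checks described above concrete, here is a minimal sketch of a frozen checkpoint dataclass. It is not the SDK's `Checkpoint`: the class name `SketchCheckpoint`, the exception `SketchCheckpointError`, the field set, and the version string are all assumptions chosen for illustration.

```python
from __future__ import annotations

import warnings
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, Literal

SKETCH_SCHEMA_VERSION = "1.0"  # hypothetical value, not the SDK's constant


class SketchCheckpointError(Exception):
    """Stand-in for the SDK's CheckpointException (hypothetical)."""


@dataclass(frozen=True)
class SketchCheckpoint:
    """Illustrative checkpoint; the real field set may differ."""

    position: Literal["after_model", "after_tools"]
    cycle_index: int = 0
    snapshot: Dict[str, Any] = field(default_factory=dict)
    schema_version: str = SKETCH_SCHEMA_VERSION

    def to_dict(self) -> Dict[str, Any]:
        # asdict() deep-copies nested containers, so callers cannot
        # mutate the checkpoint through the returned dict.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchCheckpoint":
        version = data.get("schema_version")
        if version != SKETCH_SCHEMA_VERSION:
            # Covers both a missing and a mismatched schema version.
            raise SketchCheckpointError(f"unsupported schema_version: {version!r}")
        known = {f.name for f in fields(cls)}
        unknown = sorted(set(data) - known)
        if unknown:
            warnings.warn(f"ignoring unknown checkpoint fields: {unknown}")
        return cls(**{k: v for k, v in data.items() if k in known})
```

Because the dataclass is frozen, assigning to `cp.cycle_index` raises `dataclasses.FrozenInstanceError`, which is the property the reviews later cite as preventing accidental checkpoint mutation.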

State-machine verification. Four scenarios traced against the code and covered by tests:

  1. Fresh call, checkpointing=False → identical to pre-change.
  2. Fresh call, checkpointing=True, tool_use → after_model checkpoint at cycle_index=0.
  3. Resume from after_model → snapshot restored, model call skipped (assistant tool_use is already last message), tools run, after_tools checkpoint at cycle_index=0.
  4. Resume from after_tools at cycle_index=N → primes invocation_state["_checkpoint_cycle_index"]=N+1, model runs, next after_model checkpoint carries cycle_index=N+1.
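The four scenarios reduce to a small decision about whether the model runs first and which cycle index the resumed cycle uses. The sketch below captures only that decision; `plan_resume` is a hypothetical helper, not a function in the PR.

```python
from typing import Optional, Tuple


def plan_resume(position: Optional[str], cycle_index: int = 0) -> Tuple[bool, int]:
    """Return (run_model_first, cycle_index_for_this_cycle).

    position is None for a fresh call, otherwise the position stored in
    the checkpoint being resumed.
    """
    if position is None:
        # Fresh call: the model runs and counting starts at 0.
        return True, 0
    if position == "after_model":
        # The assistant tool_use message is already the last message,
        # so the model call is skipped and tools run in the same cycle.
        return False, cycle_index
    if position == "after_tools":
        # Tools for cycle N are done; the next model call belongs to N + 1.
        return True, cycle_index + 1
    raise ValueError(f"unknown checkpoint position: {position!r}")
```

For example, `plan_resume("after_tools", 3)` yields `(True, 4)`, matching scenario 4.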

Precedence rules (documented in checkpoint.py module docstring):

  • Interrupts > checkpoint: an interrupt raised during a checkpointing cycle returns stop_reason="interrupt" and skips the after_tools checkpoint.
  • Cancel > checkpoint: a cancel signal set at either checkpoint boundary suppresses emission and surfaces as stop_reason="cancelled". Non-checkpointing cancel paths are unchanged from main.
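As a rough illustration of these precedence rules, the function below picks a stop reason from three flags. `resolve_stop_reason` is a hypothetical helper; the relative order of cancel and interrupt is an assumption taken from the "cancel > interrupt > checkpoint" wording in the final review round, since the two rules above only rank each against checkpoint.

```python
from typing import Optional


def resolve_stop_reason(
    *, cancelled: bool, interrupted: bool, checkpointing: bool
) -> Optional[str]:
    """Pick the stop reason at a cycle boundary, or None to keep looping."""
    if cancelled:
        # Cancel suppresses checkpoint emission.
        return "cancelled"
    if interrupted:
        # An interrupt skips the after_tools checkpoint.
        return "interrupt"
    if checkpointing:
        return "checkpoint"
    # Non-checkpointing agents continue into the next cycle.
    return None
```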

Durability proof — the killer test. test_crash_after_tools_does_not_rerun_completed_tools: three tools with independent call counters, agent runs through after_tools, the Agent instance is discarded entirely (del), a fresh Agent resumes from the persisted checkpoint, and the post-crash model returns end_turn. Assertion: each tool's counter is exactly 1. Completed work survives worker loss.
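The shape of that test can be sketched without the SDK. The toy below replaces the agent with two plain functions and the checkpoint with a JSON string; every name in it (`run_until_after_tools`, `resume_in_fresh_worker`, the snapshot keys) is hypothetical and only mirrors the idea that tool results travel inside the snapshot.

```python
import json
from collections import Counter
from typing import Any, Dict, List

tool_calls: Counter = Counter()  # independent call counter per tool name


def run_tool(name: str) -> str:
    tool_calls[name] += 1
    return f"{name}: ok"


def run_until_after_tools(requested: List[str]) -> str:
    """First worker: run the requested tools, then emit a serialized checkpoint."""
    snapshot: Dict[str, Any] = {
        "requested": requested,
        "tool_results": [run_tool(name) for name in requested],
    }
    return json.dumps(
        {"position": "after_tools", "cycle_index": 0, "snapshot": snapshot}
    )


def resume_in_fresh_worker(persisted: str) -> str:
    """Second worker: shares nothing with the first except the persisted string."""
    checkpoint = json.loads(persisted)
    if checkpoint["position"] != "after_tools":
        raise ValueError("this sketch only resumes from after_tools")
    snapshot = checkpoint["snapshot"]
    # Tool results are already in the snapshot, so nothing is re-executed.
    # A real agent would call the model here; the toy just reports end_turn
    # once every requested tool has a recorded result.
    done = len(snapshot["tool_results"]) == len(snapshot["requested"])
    return "end_turn" if done else "tool_use"
```

Calling `run_until_after_tools(["search", "fetch", "summarize"])`, discarding everything but the returned string, and passing it to `resume_in_fresh_worker` leaves each counter at exactly 1.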

V0 known limitations (documented in checkpoint.py module docstring, not blockers):

  • Metrics reset on each resume call — the orchestrator is responsible for aggregating metrics across a durable run.
  • OpenAIResponsesModel(stateful=True) not supported — _model_state is not in take_snapshot(preset="session"). Follow-up issue to extend the snapshot preset.
  • AgentResult.message at after_tools is the assistant message that requested the tools (tool results are inside checkpoint.snapshot).
  • BeforeInvocationEvent / AfterInvocationEvent fire on every resume call (same as interrupts — hooks counting invocations see each resume as a separate invocation).
  • Per-tool granularity within a cycle requires a custom ToolExecutor (e.g. a future TemporalToolExecutor). The SDK checkpoint operates at cycle boundaries.
  • Streaming callbacks do not re-emit on replay.
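For the first limitation, an orchestrator could sum per-call metrics itself. The sketch below assumes each resume call yields a flat mapping of integer counters; the key names are hypothetical and the SDK's real metrics object may be structured differently.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def aggregate_metrics(per_call: Iterable[Mapping[str, int]]) -> Dict[str, int]:
    """Sum integer metrics collected from each invoke/resume call."""
    total: Counter = Counter()
    for metrics in per_call:
        # Counter.update adds counts for a mapping instead of replacing them.
        total.update(metrics)
    return dict(total)


# Hypothetical metric names, one mapping per resume call.
run_total = aggregate_metrics(
    [{"inputTokens": 10, "outputTokens": 5}, {"inputTokens": 7}]
)
```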

Related Issues

Documentation PR

Type of Change

New feature

Testing

Verified the changes do not break functionality or introduce warnings in consuming repositories.

  • I ran hatch run prepare

Evidence from fresh runs:

  • hatch test — 2673 passed, 0 failed.
  • hatch run hatch-static-analysis:lint-check - ruff check and mypy both clean.
  • hatch run hatch-static-analysis:format-check — all files formatted.

New tests added (checkpoint-scope):

  • Unit — Checkpoint dataclass: tests/strands/experimental/checkpoint/test_checkpoint.py (6 unit tests — round-trip, frozen schema version, defaults, schema mismatch, missing schema version, unknown-fields warning).
  • Unit — AgentResult.checkpoint: tests/strands/agent/test_agent_result.py (6 new tests — field default, accepts checkpoint, to_dict includes/omits checkpoint, from_dict round-trip, missing-checkpoint resilience).
  • Unit — EventLoopStopEvent checkpoint kwarg: tests/strands/types/test__events.py (2 new tests — tuple length, default None).
  • Unit — Agent.__init__ flag: tests/strands/agent/test_agent.py (2 new tests — default False, flag stored).
  • Unit — _try_consume_checkpoint_resume validation: tests/strands/agent/test_agent.py (5 new tests — checkpointing=False error, mixed content, multiple blocks, missing key, schema mismatch).
  • Cycle-level — event loop emission: tests/strands/event_loop/test_event_loop.py (7 new tests — after_model emission, after_tools emission, cycle-index continuity across resume, and four cancel-precedence tests including a regression test that pins non-checkpointing cancel-after-tools still recurses through the existing cancel-during-model-stream path).
  • Integration — end-to-end durability: tests/strands/experimental/checkpoint/test_checkpoint.py (2 new async tests — round-trip across three cycles through fresh Agent instances, and the killer crash-after-tools test).
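The validation cases listed for _try_consume_checkpoint_resume can be pictured roughly as follows. This is a sketch, not the PR's helper: `extract_checkpoint_resume` is a hypothetical name, and the mapping of each case to TypeError, KeyError, or ValueError is an assumption based on the "shape / lookup / misconfig" convention stated in the description. Schema-mismatch handling is left out.

```python
from typing import Any, Dict, Optional


def extract_checkpoint_resume(
    prompt: Any, *, checkpointing: bool
) -> Optional[Dict[str, Any]]:
    """Return the checkpoint payload from a resume prompt, or None."""
    if not isinstance(prompt, list):
        return None  # plain string prompts are never resume requests
    blocks = [b for b in prompt if isinstance(b, dict) and "checkpointResume" in b]
    if not blocks:
        return None
    if not checkpointing:
        # Misconfiguration: resume block sent to a non-checkpointing agent.
        raise ValueError("checkpointResume requires checkpointing=True")
    if len(blocks) != 1 or len(prompt) != 1:
        # Shape error: mixed content or multiple resume blocks.
        raise TypeError("checkpointResume must be the only content block")
    payload = blocks[0]["checkpointResume"]
    if not isinstance(payload, dict):
        raise TypeError("checkpointResume must be a dict")
    if "checkpoint" not in payload:
        # Lookup error: the required key is missing.
        raise KeyError("checkpoint")
    return payload["checkpoint"]
```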

The 7-tuple shape change to EventLoopStopEvent required updating pre-existing test-side tuple unpackers. Done mechanically (add one slot each); all pre-existing tests still pass.

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly (user-guide page is a follow-up PR in agent-docs; module-level docstring in checkpoint.py covers V0 limitations, precedence rules, and usage)
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed (reference Temporal / Dapr / Step Functions examples are the next milestone — M1/M2/M3 in the durable-execution tracking plan)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published (Part A — feat: introduce checkpoint in experimental #2181 — is merged on main; this PR builds on it)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

JackYPCOnline marked this pull request as draft April 22, 2026 20:39

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 98.38710% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/strands/event_loop/event_loop.py | 96.42% | 0 Missing and 1 partial ⚠️ |


@github-actions

Assessment: Comment

This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Four themes warrant attention before merge:

Review Themes
  • API Review Required: This introduces meaningful new public API surface (Agent parameter, AgentResult field, new StopReason, content block types). Per the API Bar Raising process, it needs a needs-api-review label and reviewer sign-off. Key design questions: is checkpointing: bool the right level of configurability, and should there be a high-level resume_from_checkpoint() method alongside the content-block primitive?

  • Error Contract Consistency: The resume validation comments claim to mirror _InterruptState.resume() conventions (TypeError/KeyError/ValueError), but Checkpoint.from_dict now raises CheckpointException. The exception hierarchy should be consistent and documented.

  • Coupling Pattern: The event loop directly accesses private agent attributes (_checkpointing, _checkpoint_resume_context). This mirrors the existing interrupt pattern but extends the coupling surface. Consider exposing checkpoint config as an explicit parameter or read-only property.

  • Test Coverage Gap: Missing a test for the checkpointing=True + end_turn (no tool use) path, and the Codecov report shows 1 partial line in event_loop.py.

The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed.

@github-actions

Assessment: Comment

Good progress since the last round — the frozen=True dataclass, _build_checkpoint_stop_event extraction, and updated error convention documentation address several prior concerns. A few new items surfaced:

New Review Items
  • Docstring accuracy: event_loop_cycle Yields docstring still documents a 4-element tuple but the actual event is now 7 elements. The cancel() docstring uses "checkpoint" in a way that now conflicts with the durable-execution Checkpoint concept introduced here.
  • AGENTS.md update: The directory structure section needs to be updated to include experimental/checkpoint/ per the repo's own guidelines.
  • Cancel + checkpoint interaction: When both checkpointing=True and cancel_signal are set, checkpoint emission takes precedence over cancel. This is probably correct but should be documented or tested.

The prior-round items around API review (needs-api-review label, high-level resume method question) remain open for maintainer decision. The core state-machine logic and test coverage are solid.

github-actions Bot added size/xl and removed size/l labels Apr 23, 2026
Address second-round review feedback:

- Tighten docstrings across checkpoint.py, event_loop.py, and agent.py.
  Remove redundant explanations; keep what the reader needs.
- Extract checkpointResume validation in agent.py into
  _try_consume_checkpoint_resume helper for readability.
- Remove unused CheckpointResumeDict and CheckpointResumeContent TypedDicts
  (fold types.py into checkpoint.py) and drop the now-empty test_types.py.
- Update __init__.py exports and AGENTS.md to reflect the new layout.
@github-actions

Assessment: Comment

All prior-round feedback has been addressed with high quality. The code is significantly improved: cancel > checkpoint precedence is properly implemented and tested, docstrings are accurate, AGENTS.md is updated, and the resume logic is cleanly extracted into _try_consume_checkpoint_resume.

One new item
  • Post-tool cancel check affects all users: The new cancel fallback at lines 666-674 of event_loop.py runs regardless of checkpointing flag, introducing a behavioral change for existing non-checkpointing users (faster cancel response after tool execution). This is likely an improvement but should be documented in the PR description and tested with checkpointing=False.

The remaining open thread (CheckpointException hierarchy) is a design choice the author has reasonably justified. This is looking close to merge-ready pending the API review label decision from maintainers.

The post-tool cancel check in _handle_tool_execution runs regardless of
the checkpointing flag. Previously, a cancel during tool execution surfaced
via the next model streaming call; now it short-circuits immediately. Add a
test that asserts the cancel stops the loop without triggering the
recursive model call (model.stream.call_count stays at 1).
@github-actions

Assessment: Approve

All prior-round feedback has been addressed. The R3 concern about the post-tool cancel check affecting non-checkpointing users was fixed cleanly — the fallback is now gated on agent._checkpointing (line 670) with a clear explanatory comment, and a new test (test_event_loop_cycle_cancel_mid_cycle_after_tools_non_checkpointing) pins the invariant that non-checkpointing callers still fall through to recurse_event_loop.

No new issues found in this round. The implementation is well-structured: cancel > interrupt > checkpoint precedence is correct, tested, and documented; the _try_consume_checkpoint_resume helper is cleanly extracted; frozen=True prevents accidental checkpoint mutation; and the integration tests convincingly demonstrate cross-process durability.

JackYPCOnline marked this pull request as ready for review April 23, 2026 19:06