feat(checkpoint): wire checkpointing into agent event loop#2190
feat(checkpoint): wire checkpointing into agent event loop#2190JackYPCOnline wants to merge 7 commits intostrands-agents:mainfrom
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Assessment: Comment This is a well-structured PR that wires checkpoint functionality into the agent loop with a clean opt-in design. The state machine is carefully reasoned and the integration tests (especially the crash-after-tools test) are compelling. Two themes warrant attention before merge: Review Themes
The feature design, state-machine logic, and durability proof are solid. The integration tests are particularly well-designed. |
|
Assessment: Comment Good progress since the last round — the New Review Items
The prior-round items around API review ( |
Address second-round review feedback: - Tighten docstrings across checkpoint.py, event_loop.py, and agent.py. Remove redundant explanations; keep what the reader needs. - Extract checkpointResume validation in agent.py into _try_consume_checkpoint_resume helper for readability. - Remove unused CheckpointResumeDict and CheckpointResumeContent TypedDicts (fold types.py into checkpoint.py) and drop the now-empty test_types.py. - Update __init__.py exports and AGENTS.md to reflect the new layout.
|
Assessment: Comment All prior-round feedback has been addressed with high quality. The code is significantly improved: One new item
The remaining open thread (CheckpointException hierarchy) is a design choice the author has reasonably justified. This is looking close to merge-ready pending the API review label decision from maintainers. |
The post-tool cancel check in _handle_tool_execution runs regardless of checkpointing=True. Previously, cancel during tool execution surfaced via the next model streaming call; now it short-circuits immediately. Add a test that asserts the cancel stops the loop without triggering the recursive model call (model.stream.call_count stays at 1).
|
Assessment: Approve All prior-round feedback has been addressed. The R3 concern about the post-tool cancel check affecting non-checkpointing users was fixed cleanly — the fallback is now gated on No new issues found in this round. The implementation is well-structured: cancel > interrupt > checkpoint precedence is correct, tested, and documented; the |
Description
Wires the
Checkpointdata model (landed in #2181) into the agent runtime so an opt-incheckpointing=Trueagent pauses at ReAct cycle boundaries and resumes cleanly from persisted checkpoints — including across fresh process boundaries, which is what makes durability providers like Temporal, Dapr, and AWS Step Functions usable with Strands.The design mirrors the existing interrupt pattern by construction —
stop_reason="checkpoint",checkpointResumecontent block for resume, snapshot-based state transfer. Users who know interrupts know this.User-facing API (zero breaking changes — opt-in only):
What changed:
Agent.__init__— newcheckpointing: bool = Falseparameter and two internal fields (_checkpointing,_checkpoint_resume_context). Default False: zero behavioral change for existing callers.Agent._try_consume_checkpoint_resume— new helper extracted from_convert_prompt_to_messages. DetectscheckpointResumecontent blocks, validates shape (mirrors_InterruptState.resume()conventions:TypeErrorfor shape,KeyErrorfor lookup,ValueErrorfor misconfig,CheckpointExceptionfor schema mismatch), loads the snapshot, and stashes the resume context.event_loop_cycle— one priming block (reads + one-shot clears resume context) plus two checkpoint emission points (after_modelandafter_tools) factored through_build_checkpoint_stop_event. All gated onagent._checkpointing; non-checkpointing callers see no behavioral change, including the cancel-during-tool-execution path.AgentResult— newcheckpoint: Checkpoint | None = Nonefield;to_dict/from_dictround-trip it.EventLoopStopEvent— extended constructor withcheckpointkwarg; the 7-tuple matchesAgentResultfield order for positional unpacking.strands.experimental.checkpoint— exportsCheckpoint,CheckpointPosition,CHECKPOINT_SCHEMA_VERSION.Checkpointis@dataclass(frozen=True).State-machine verification. Four scenarios traced against the code and covered by tests:
checkpointing=False→ identical to pre-change.checkpointing=True, tool_use →after_modelcheckpoint atcycle_index=0.after_model→ snapshot restored, model call skipped (assistant tool_use is already last message), tools run,after_toolscheckpoint atcycle_index=0.after_toolsatcycle_index=N→ primesinvocation_state["_checkpoint_cycle_index"]=N+1, model runs, nextafter_modelcheckpoint carriescycle_index=N+1.Precedence rules (documented in
checkpoint.pymodule docstring):stop_reason="interrupt"and skips theafter_toolscheckpoint.stop_reason="cancelled". Non-checkpointing cancel paths are unchanged frommain.Durability proof — the killer test.
test_crash_after_tools_does_not_rerun_completed_tools: three tools with independent call counters, agent runs throughafter_tools, the Agent instance is discarded entirely (del), a fresh Agent resumes from the persisted checkpoint, and the post-crash model returnsend_turn. Assertion: each tool's counter is exactly 1. Completed work survives worker loss.V0 known limitations (documented in
checkpoint.pymodule docstring, not blockers):OpenAIResponsesModel(stateful=True)not supported —_model_stateis not intake_snapshot(preset="session"). Follow-up issue to extend the snapshot preset.AgentResult.messageatafter_toolsis the assistant message that requested the tools (tool results are insidecheckpoint.snapshot).BeforeInvocationEvent/AfterInvocationEventfire on every resume call (same as interrupts — hooks counting invocations see each resume as a separate invocation).ToolExecutor(e.g. a futureTemporalToolExecutor). The SDK checkpoint operates at cycle boundaries.Related Issues
Documentation PR
Type of Change
New feature
Testing
Verified the changes do not break functionality or introduce warnings in consuming repositories.
hatch run prepareEvidence from fresh runs:
hatch test— 2673 passed, 0 failed.hatch run hatch-static-analysis:lint-check—ruff checkandmypyboth clean.hatch run hatch-static-analysis:format-check— all files formatted.New tests added (checkpoint-scope):
Checkpointdataclass:tests/strands/experimental/checkpoint/test_checkpoint.py(6 unit tests — round-trip, frozen schema version, defaults, schema mismatch, missing schema version, unknown-fields warning).AgentResult.checkpoint:tests/strands/agent/test_agent_result.py(6 new tests — field default, accepts checkpoint,to_dictincludes/omits checkpoint,from_dictround-trip, missing-checkpoint resilience).EventLoopStopEventcheckpoint kwarg:tests/strands/types/test__events.py(2 new tests — tuple length, default None).Agent.__init__flag:tests/strands/agent/test_agent.py(2 new tests — default False, flag stored)._try_consume_checkpoint_resumevalidation:tests/strands/agent/test_agent.py(5 new tests —checkpointing=Falseerror, mixed content, multiple blocks, missing key, schema mismatch).tests/strands/event_loop/test_event_loop.py(7 new tests —after_modelemission,after_toolsemission, cycle-index continuity across resume, and four cancel-precedence tests including a regression test that pins non-checkpointing cancel-after-tools still recurses through the existing cancel-during-model-stream path).tests/strands/experimental/checkpoint/test_checkpoint.py(2 new async tests — round-trip across three cycles through fresh Agent instances, and the killer crash-after-tools test).The 7-tuple shape change to
EventLoopStopEventrequired updating pre-existing test-side tuple unpackers. Done mechanically (add one slot each); all pre-existing tests still pass.Checklist
agent-docs; module-level docstring incheckpoint.pycovers V0 limitations, precedence rules, and usage)main; this PR builds on it)By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.