feat: annotation queue testset export#4261
Conversation
Adds PRD, RFC, and README to docs/designs/testset-annotation-queue/. The design covers: - Per-scenario "Add to Testset" button in the annotate tab - "Done with queue" screen supporting both trace and testcase queues - All-annotations tab with row selection and upgraded commit modal - New AddToTestsetModal component with EntityPicker integration - Controller actions for addScenariosToTestset (trace + testcase paths) - Default target testset heuristics and last-used persistence https://claude.ai/code/session_01B2uQKidAr1KJ4CR9sroY2H
Key corrections: - Replace new AddToTestsetModal with EntityCommitModal + renderModeContent - Replace useState for selected testset with pendingTestsetSelectionAtom in annotationSessionController — survives re-renders and can be read imperatively by addScenariosToTestset without closure capture - selectedScenarioIdsAtom for row selection (also atom, not useState) - openAddToTestsetModal seeds pendingTestsetSelectionAtom from default - addScenariosToTestset reads target testset from atom, not payload - Add state atom summary table documenting all atom lifecycles https://claude.ai/code/session_01B2uQKidAr1KJ4CR9sroY2H
Inputs (agData.inputs) spread into N columns — one per input key. Outputs (agData.outputs) always map to a single "output" column regardless of value shape, matching extractOutputs() behaviour in trace/utils/selectors.ts which treats outputs as a leaf. Annotation values add one column per evaluator slug.
…n reset 1. Trace annotation resolution: use scenarioAnnotationsAtomFamily(scenarioId) not a raw query by traceId. The atom handles step-based resolution to avoid cross-queue bleed (as documented in the controller). 2. Scope label when scope="all": addToTestsetScenarioIds() is empty for the "all" case — read actual count from scenarioIds() instead of falling back to the "all" string. 3. Row selection reset: addScenariosToTestset clears selectedScenarioIdsAtom after a successful export so stale selections don't persist.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughThis PR adds a design and implementation for creating testsets directly from annotation queues. It includes design documentation (PRD, RFC, README) outlining requirements and architecture, new controller atoms/actions for managing modal state and background export jobs, UI components for "Add to Testset" buttons across annotation views, testset row-builder utilities for traces and testcases, and entity UI enhancements (modal, picker) to support testset selection and creation during export. ChangesDesign Documentation
Add-to-Testset Feature Implementation
Sequence DiagramsequenceDiagram
participant User
participant AnnotationSession
participant EntityCommitModal
participant EntityPicker
participant annotationSessionController
participant TestsetAPI
participant QueryClient
User->>AnnotationSession: Click "Add to Testset"
AnnotationSession->>annotationSessionController: openAddToTestsetModal({scope, scenarioIds})
annotationSessionController->>EntityCommitModal: Set isAddToTestsetModalOpen = true
EntityCommitModal->>User: Display modal with commit mode selector
alt Mode = "select"
User->>EntityCommitModal: Choose target testset
EntityCommitModal->>EntityPicker: Load testsets list
EntityPicker->>TestsetAPI: Fetch testsets
TestsetAPI-->>EntityPicker: Return testsets
EntityPicker-->>User: Display testset options
User->>EntityCommitModal: Select testset
EntityCommitModal->>annotationSessionController: setPendingTestsetSelection(testsetId)
else Mode = "create"
User->>EntityCommitModal: Enter new testset name
EntityCommitModal->>annotationSessionController: setPendingTestsetSelection({name})
end
User->>EntityCommitModal: Click "Confirm"
EntityCommitModal->>annotationSessionController: addScenariosToTestset(scope, mode, fields)
annotationSessionController->>annotationSessionController: Resolve scope (single/selected/all/complete)
annotationSessionController->>annotationSessionController: Build export rows from scenarios
alt Target testset exists
annotationSessionController->>TestsetAPI: patchRevision(testsetId, {addedColumns, addedRows})
else Create new testset
annotationSessionController->>TestsetAPI: createTestset({name, rows})
end
TestsetAPI-->>annotationSessionController: Success response
annotationSessionController->>QueryClient: Invalidate testset queries
annotationSessionController->>annotationSessionController: Update job status to success
EntityCommitModal->>AnnotationSession: Close modal
AnnotationSession->>User: Display success toast
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related Issues
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
docs/designs/testset-annotation-queue/PRD.md (1)
42-48: 💤 Low valueAdd blank lines around tables for markdown compliance.
The tables in this document are missing surrounding blank lines, which causes markdownlint warnings. This affects tables at lines 42, 89, 107, and 118.
📝 Example fix for the competitive reference table
## Competitive Reference ### LangSmith + | Capability | Supported | |------------|-----------| | Create or extend a testset from both test cases and traces | Yes | | Add a single scenario to a testset | Yes | | Add to a testset from the focused/annotate tab | Yes | | Select table rows and add them to a testset | Yes | + LangSmith supports all four capabilities; Agenta currently supports none of them in full.docs/designs/testset-annotation-queue/RFC.md (1)
18-57: 💤 Low valueAdd language specifiers to fenced code blocks.
Several code blocks in this RFC lack language specifiers, which affects syntax highlighting and readability. The architecture diagram at line 18 and examples at lines 244, 361, 366, 385, 391, 485, 500, 517 should specify a language (use
textorplaintextfor ASCII diagrams).📝 Example fix for ASCII diagram
-``` +```text User action (button click) │ ▼
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 22636d88-f5f8-408b-a088-51bb814f81e5
📒 Files selected for processing (11)
docs/designs/testset-annotation-queue/PRD.mddocs/designs/testset-annotation-queue/README.mddocs/designs/testset-annotation-queue/RFC.mdweb/packages/agenta-annotation-ui/src/components/AnnotationSession/FocusView.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/ScenarioListView.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/SessionNavigation.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/index.tsxweb/packages/agenta-entity-ui/src/modals/commit/components/EntityCommitContent.tsxweb/packages/agenta-entity-ui/src/selection/components/UnifiedEntityPicker/types.tsweb/packages/agenta-entity-ui/src/selection/components/UnifiedEntityPicker/variants/CascadingVariant.tsxweb/packages/agenta-entity-ui/src/selection/hooks/modes/useCascadingMode.ts
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
web/packages/agenta-annotation-ui/src/components/AnnotationSession/ScenarioListView.tsx (1)
552-569:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep the testcase fallback key separate from the metric path.
AnnotationOutputKeyCellpassesoutputKeyintouseAnnotationCellFallback(), and after Line 912 that value is often the full mapped path (attributes.ag.data.outputs.score, etc.). The fallback hook indexes the testcase container object by that same string, so grouped annotation cells now miss synced testcase values and render—until an annotation exists.Possible fix
const AnnotationOutputKeyCell = memo(function AnnotationOutputKeyCell({ scenarioId, def, outputKey, + fallbackOutputKey, fallbackDataKey, }: { scenarioId: string def: AnnotationColumnDef outputKey: string + fallbackOutputKey?: string fallbackDataKey?: string | null }) { const runId = useAtomValue(annotationSessionController.selectors.activeRunId()) ?? undefined const PopoverWrapper = useMetricPopoverWrapper() const {fallbackValue, isPending} = useAnnotationCellFallback( scenarioId, fallbackDataKey, - outputKey, + fallbackOutputKey ?? outputKey, )render: (_value: unknown, record: ScenarioTableRow) => ( <AnnotationOutputKeyCell scenarioId={record.scenarioId} def={outputColumn.annotationDef} outputKey={outputColumn.annotationDef.path ?? outputColumn.title} + fallbackOutputKey={outputColumn.title} fallbackDataKey={def.fallbackDataKey} /> ),Also applies to: 899-913
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: e691f741-e404-4527-b6c9-1dbc7e659952
📒 Files selected for processing (12)
web/packages/agenta-annotation-ui/src/components/AnnotationSession/ScenarioListView.tsxweb/packages/agenta-annotation/src/index.tsweb/packages/agenta-annotation/src/state/controllers/annotationSessionController.tsweb/packages/agenta-annotation/src/state/controllers/index.tsweb/packages/agenta-annotation/src/state/index.tsweb/packages/agenta-annotation/src/state/testsetSync.tsweb/packages/agenta-annotation/src/state/types.tsweb/packages/agenta-entities/src/testset/api/api.tsweb/packages/agenta-entities/src/testset/api/index.tsweb/packages/agenta-entities/src/testset/index.tsweb/packages/agenta-entity-ui/src/modals/commit/components/EntityCommitModal.tsxweb/packages/agenta-entity-ui/src/modals/commit/state.ts
💤 Files with no reviewable changes (2)
- web/packages/agenta-entity-ui/src/modals/commit/state.ts
- web/packages/agenta-entity-ui/src/modals/commit/components/EntityCommitModal.tsx
✅ Files skipped from review due to trivial changes (1)
- web/packages/agenta-annotation/src/state/index.ts
| async function waitForStoreAtomValue<T>( | ||
| atomToWatch: unknown, | ||
| isReady: (value: T) => boolean, | ||
| timeoutMs = 5000, | ||
| ): Promise<T> { | ||
| const store = getStore() | ||
| const atomRef = atomToWatch as unknown as Parameters<typeof store.get>[0] | ||
| const subRef = atomToWatch as unknown as Parameters<typeof store.sub>[0] | ||
| const current = store.get(atomRef) as T | ||
| if (isReady(current)) return current | ||
|
|
||
| return await new Promise<T>((resolve) => { | ||
| const timeout = setTimeout(() => { | ||
| unsubscribe() | ||
| resolve(store.get(atomRef) as T) | ||
| }, timeoutMs) | ||
|
|
||
| const unsubscribe = store.sub(subRef, () => { | ||
| const next = store.get(atomRef) as T | ||
| if (isReady(next)) { | ||
| clearTimeout(timeout) | ||
| unsubscribe() | ||
| resolve(next) | ||
| } | ||
| }) | ||
| }) | ||
| } |
There was a problem hiding this comment.
Fail the export when annotation queries never settle.
Unlike the trace query, the waits on Lines 2633-2644 never validate the returned state. If scenario steps or annotations are still pending — or already errored — Lines 2646-2653 still build the export row from the current store value, which collapses to empty annotations. That turns a slow/failing fetch into a successful export with missing evaluator outputs.
Also applies to: 2633-2653
| async function fetchLatestRevisionWithRows(params: { | ||
| projectId: string | ||
| testsetId: string | ||
| }): Promise<LatestRevisionWithRows> { | ||
| const latestRevision = await fetchLatestRevisionWithTestcases({ | ||
| projectId: params.projectId, | ||
| testsetId: params.testsetId, | ||
| testcaseLimit: 1, | ||
| }) | ||
| if (!latestRevision?.id) { | ||
| throw new Error("The latest revision for the selected testset could not be resolved.") | ||
| } | ||
|
|
||
| return latestRevision as LatestRevisionWithRows |
There was a problem hiding this comment.
Don't infer the target schema from a single testcase row.
This branch builds existingColumns from fetchLatestRevisionWithRows(), but that helper requests testcaseLimit: 1. If the sampled row does not contain every leaf path already present in the target testset, remapRowsToExistingLeafColumns() can miss valid existing columns and patchRevision() will add duplicate/misaligned paths instead of reusing the current schema.
Also applies to: 2824-2830
| function applyAnnotationOutputEntries( | ||
| data: Record<string, unknown>, | ||
| entries: {columnKey: string; outputs: Record<string, unknown>}[], | ||
| ): string[] { | ||
| const evaluatorColumnKeys: string[] = [] | ||
|
|
||
| for (const entry of entries) { | ||
| const existingValue = data[entry.columnKey] | ||
| data[entry.columnKey] = { | ||
| ...(isPlainRecord(existingValue) ? existingValue : {}), | ||
| ...entry.outputs, | ||
| } | ||
|
|
||
| for (const [fieldKey, fieldValue] of Object.entries(entry.outputs)) { | ||
| if (fieldValue !== undefined) { | ||
| evaluatorColumnKeys.push(`${entry.columnKey}.${fieldKey}`) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return evaluatorColumnKeys | ||
| } |
There was a problem hiding this comment.
Flatten nested annotation outputs before recording evaluatorColumnKeys.
Line 219 only appends ${entry.columnKey}.${fieldKey} for first-level properties. If an evaluator emits structured output like {rubric: {score: 1}}, the sync preview will declare judge.rubric as the column while the row data still contains deeper leaf paths. That leaves nested annotation fields undeclared/remapped incorrectly during testcase sync.
| }: RevisionDetailParams): Promise<Revision | null> { | ||
| testcaseLimit, | ||
| }: RevisionDetailParams & {testcaseLimit?: number}): Promise<Revision | null> { | ||
| if (testcaseLimit) { |
There was a problem hiding this comment.
Use explicit undefined checks for testcaseLimit and validate bounds.
Line 62 and Line 108 use truthy checks. That treats 0 as “not provided” and can still allow invalid numeric values through, leading to unintended full testcase fetches or bad API requests.
Suggested patch
export async function fetchRevisionWithTestcases({
id,
projectId,
testcaseLimit,
}: RevisionDetailParams & {testcaseLimit?: number}): Promise<Revision | null> {
- if (testcaseLimit) {
+ const hasTestcaseLimit = testcaseLimit !== undefined && testcaseLimit !== null
+ if (hasTestcaseLimit) {
+ if (!Number.isInteger(testcaseLimit) || testcaseLimit < 0) {
+ throw new Error(
+ "[fetchRevisionWithTestcases] testcaseLimit must be a non-negative integer",
+ )
+ }
const response = await axios.post(
`${getAgentaApiUrl()}/testsets/revisions/retrieve`,
{
testset_revision_ref: {id},
include_testcases: true,
windowing: {limit: testcaseLimit},
},
{params: {project_id: projectId}},
)
@@
export async function fetchLatestRevisionWithTestcases({
projectId,
testsetId,
testcaseLimit,
}: RevisionListParams & {testcaseLimit?: number}): Promise<Revision | null> {
+ const hasTestcaseLimit = testcaseLimit !== undefined && testcaseLimit !== null
+ if (hasTestcaseLimit && (!Number.isInteger(testcaseLimit) || testcaseLimit < 0)) {
+ throw new Error(
+ "[fetchLatestRevisionWithTestcases] testcaseLimit must be a non-negative integer",
+ )
+ }
const response = await axios.post(
`${getAgentaApiUrl()}/testsets/revisions/retrieve`,
{
testset_ref: {id: testsetId},
include_testcases: true,
- ...(testcaseLimit ? {windowing: {limit: testcaseLimit}} : {}),
+ ...(hasTestcaseLimit ? {windowing: {limit: testcaseLimit}} : {}),
},
{params: {project_id: projectId}},
)Also applies to: 108-108
Railway Preview Environment
|
No description provided.