Skip to content

ENG-1699 - Implement batch upsert for concepts in Supabase#1000

Merged
mdroidian merged 1 commit intomainfrom
eng-1699-batch-upsert_concepts
May 5, 2026
Merged

ENG-1699 - Implement batch upsert for concepts in Supabase#1000
mdroidian merged 1 commit intomainfrom
eng-1699-batch-upsert_concepts

Conversation

@mdroidian
Copy link
Copy Markdown
Member

@mdroidian mdroidian commented May 5, 2026

*Written by codex 5.5 xhigh after long debugging session.

Summary

Fixes large first-sync failures by batching upsert_concepts calls during Roam Supabase sync.

Previously, concept sync sent the entire concept payload in one upsert_concepts RPC. On larger graphs, this can exceed Supabase/Postgres statement
timeout, causing the whole concept transaction to roll back and leaving the graph with synced Content/embeddings but zero Concept rows.

Changes

  • Added sequential batching for upsert_concepts with a batch size of 200.
  • Reused a shared chunk helper for sync batching.
  • Preserved dependency ordering by ordering concepts before batching.
  • Added explicit handling for negative upsert_concepts sentinel return values (-1, -2) so row-level failures fail the sync with batch context.
  • Added warning output for concept dependencies missing from the current sync batch.

Why

A customer graph had ~6,700 synced content rows and embeddings, but upsert_concepts failed with:

canceling statement due to statement timeout

Because Concept upsert was one large RPC, timeout rolled back the entire concept sync, leaving Concept empty. Batching makes the work fit within
normal statement timeout limits and allows retries to continue idempotently.

Summary by CodeRabbit

  • New Features

    • Optimized data synchronization with batch processing for concept and content operations to improve efficiency.
  • Bug Fixes

    • Enhanced error handling with detailed per-batch failure reporting for improved visibility into sync issues.

- Introduced a new function `upsertConceptBatches` to handle the upsert of concepts in batches, improving efficiency and error handling.
- Added utility functions for chunking arrays and summarizing failed upsert IDs.
- Updated the `convertDgToSupabaseConcepts` function to utilize the new batching logic, ensuring better management of concept dependencies during synchronization.
- Removed redundant chunking logic from `upsertNodesToSupabaseAsContentWithEmbeddings` for cleaner code.

This change enhances the synchronization process with Supabase by allowing for more manageable data handling and clearer error reporting.
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 5, 2026

@supabase
Copy link
Copy Markdown

supabase Bot commented May 5, 2026

This pull request has been ignored for the connected project zytfjzqyijgagqxrzbmz because there are no changes detected in packages/database/supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.

@mdroidian mdroidian requested a review from maparent May 5, 2026 20:56
@mdroidian
Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

This PR introduces batching support for concept and content upserts in the sync utility. It replaces single RPC calls with fixed-size batch operations via new helper functions, adds per-batch error reporting, and refactors the concept upsertion flow to include dependency ordering and batch-aware error handling.

Changes

Concept and Content Upsert Batching

Layer / File(s) Summary
Constants & Type Definitions
apps/roam/src/utils/syncDgNodesToSupabase.ts (lines 37, 46)
CONCEPT_BATCH_SIZE constant and LocalConceptDataInput type alias introduced for batching strategy and data shape.
Batch Utility Functions
apps/roam/src/utils/syncDgNodesToSupabase.ts (lines 48–66)
Generic chunk helper partitions arrays by batch size; summarizeFailedConceptUpsertIds formats per-batch failure summaries for RPC results.
Batch Processing Logic
apps/roam/src/utils/syncDgNodesToSupabase.ts (lines 67–99)
upsertConceptBatches executes concept upserts in batches via RPC with per-batch error collection and aggregated failure reporting.
Integration & Refactoring
apps/roam/src/utils/syncDgNodesToSupabase.ts (lines 320–333, 372–383)
convertDgToSupabaseConcepts refactored to compute ordered concepts, log missing dependencies, and delegate to upsertConceptBatches; uploadBatches added to batch content upserts with BATCH_SIZE control.

Sequence Diagram

sequenceDiagram
    participant Client as Sync Utility
    participant Batcher as Batch Processor
    participant RPC as Supabase RPC

    rect rgba(100, 150, 255, 0.5)
    Note over Client,RPC: Concept Upsertion (Batched)
    Client->>Batcher: convertDgToSupabaseConcepts(nodes)
    Batcher->>Batcher: orderConcepts (dependency resolution)
    Batcher->>Batcher: chunk into CONCEPT_BATCH_SIZE
    loop For each batch
        Batcher->>RPC: upsert_concepts RPC call
        RPC-->>Batcher: {success: [...], failed_ids: [...]}
        Batcher->>Batcher: Collect per-batch failures
    end
    Batcher->>Batcher: summarizeFailedConceptUpsertIds
    Batcher-->>Client: Aggregated failure report
    end

    rect rgba(100, 255, 150, 0.5)
    Note over Client,RPC: Content Upsertion (Batched)
    Client->>Batcher: uploadBatches(embeddings)
    Batcher->>Batcher: chunk into BATCH_SIZE
    loop For each batch
        Batcher->>RPC: upsert_content RPC call
        RPC-->>Batcher: Response
    end
    Batcher-->>Client: Batched upload complete
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • DiscourseGraphs/discourse-graph#202: Client-side batching logic directly calls upsert_content and upsert_concepts RPCs that were added or modified in that PR.
  • DiscourseGraphs/discourse-graph#343: Both PRs modify the same file and functions (syncDgNodesToSupabase and embedding utilities), with this PR building on prior batching or upsert logic changes.
  • DiscourseGraphs/discourse-graph#215: This PR refactors client-side calls to upsert_concepts RPC in batches; that PR introduced the database-side upsert_concepts function being targeted.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The PR title directly and clearly describes the main change: implementing batch upsert for concepts in Supabase, which aligns with the primary objective of fixing large first-sync failures by batching upsert_concepts RPC calls.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
apps/roam/src/utils/syncDgNodesToSupabase.ts (2)

320-327: ⚡ Quick win

This warning will likely fire on healthy incremental syncs.

missing here only means “not in the current batch,” and nodeTypeSince(...) intentionally excludes unchanged schema concepts. That makes warning-level logging noisy for blocks whose dependencies were already synced earlier. Consider downgrading this to debug or only warning after confirming the dependency is absent remotely.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/roam/src/utils/syncDgNodesToSupabase.ts` around lines 320 - 327, The
current console.warn that logs when orderConceptsByDependency(conceptsToUpsert)
returns missing is too noisy because missing only means "not in the current
batch" (nodeTypeSince(...) excludes unchanged schema concepts); change the
behavior to either call a debug-level logger instead of console.warn for the
missing array, or conditionally promote to a warning only after verifying the
dependency is actually absent in Supabase (e.g., query the remote store for the
missing concept IDs before logging). Update the block that destructures const {
ordered, missing } = orderConceptsByDependency(conceptsToUpsert) to use the
chosen approach and keep the descriptive message about missing dependencies when
logging.

56-65: ⚡ Quick win

Rename the sentinel-summary helper to match what the RPC returns.

These are failure codes, not concept IDs, so summarizeFailedConceptUpsertIds / failedIds reads misleadingly in the path you most need to debug quickly.

Proposed rename
-const summarizeFailedConceptUpsertIds = (failedIds: number[]): string => {
-  const counts = failedIds.reduce<Record<string, number>>((acc, id) => {
-    acc[id] = (acc[id] ?? 0) + 1;
+const summarizeFailedConceptUpsertStatuses = (
+  failedStatuses: number[],
+): string => {
+  const counts = failedStatuses.reduce<Record<string, number>>((acc, status) => {
+    acc[status] = (acc[status] ?? 0) + 1;
     return acc;
   }, {});
-    const failedIds = (data || []).filter((id) => id < 0);
-    if (failedIds.length > 0) {
+    const failedStatuses = (data || []).filter((status) => status < 0);
+    if (failedStatuses.length > 0) {
       throw new Error(
-        `upsert_concepts returned row failures for batch ${idx + 1}/${batches.length}: ${summarizeFailedConceptUpsertIds(failedIds)}`,
+        `upsert_concepts returned row failures for batch ${idx + 1}/${batches.length}: ${summarizeFailedConceptUpsertStatuses(failedStatuses)}`,
       );
     }
As per coding guidelines, "Function names should describe their purpose clearly" and "Choose descriptive function names that make comments unnecessary".

Also applies to: 92-95

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/roam/src/utils/syncDgNodesToSupabase.ts` around lines 56 - 65, The
helper summarizeFailedConceptUpsertIds is misnamed because it summarizes failure
codes (sentinels) not concept IDs; rename the function to something like
summarizeFailureCodes (and its parameter failedIds to failureCodes) and update
all references (including the other occurrence around the code that handles
lines 92-95) so the name and parameter clearly indicate they are
failure/sentinel codes returned by the RPC; keep the same implementation and
type (number[] -> number[]) but change identifiers and any inline comments to
match the new names to avoid confusion when debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/roam/src/utils/syncDgNodesToSupabase.ts`:
- Around line 320-327: The current console.warn that logs when
orderConceptsByDependency(conceptsToUpsert) returns missing is too noisy because
missing only means "not in the current batch" (nodeTypeSince(...) excludes
unchanged schema concepts); change the behavior to either call a debug-level
logger instead of console.warn for the missing array, or conditionally promote
to a warning only after verifying the dependency is actually absent in Supabase
(e.g., query the remote store for the missing concept IDs before logging).
Update the block that destructures const { ordered, missing } =
orderConceptsByDependency(conceptsToUpsert) to use the chosen approach and keep
the descriptive message about missing dependencies when logging.
- Around line 56-65: The helper summarizeFailedConceptUpsertIds is misnamed
because it summarizes failure codes (sentinels) not concept IDs; rename the
function to something like summarizeFailureCodes (and its parameter failedIds to
failureCodes) and update all references (including the other occurrence around
the code that handles lines 92-95) so the name and parameter clearly indicate
they are failure/sentinel codes returned by the RPC; keep the same
implementation and type (number[] -> number[]) but change identifiers and any
inline comments to match the new names to avoid confusion when debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 29382a68-9bf8-44f9-8e46-1d7df9e88c0d

📥 Commits

Reviewing files that changed from the base of the PR and between bb5607f and 51af028.

📒 Files selected for processing (1)
  • apps/roam/src/utils/syncDgNodesToSupabase.ts

Copy link
Copy Markdown
Collaborator

@maparent maparent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM. It fails globally if any batch fails, which I think is the correct behaviour so we can retry globally.
Tested and the sync completed.

@mdroidian mdroidian merged commit cc26688 into main May 5, 2026
9 checks passed
@mdroidian mdroidian deleted the eng-1699-batch-upsert_concepts branch May 5, 2026 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants