From 22f54991c90ed23a0232725b7a9ab8f7c3c300d4 Mon Sep 17 00:00:00 2001 From: bean1352 Date: Mon, 20 Apr 2026 13:27:07 +0200 Subject: [PATCH 1/5] Added replication-lag.mdx with references --- docs.json | 1 + maintenance-ops/monitoring-and-alerting.mdx | 4 + .../production-readiness-guide.mdx | 4 + maintenance-ops/replication-lag.mdx | 192 ++++++++++++++++++ 4 files changed, 201 insertions(+) create mode 100644 maintenance-ops/replication-lag.mdx diff --git a/docs.json b/docs.json index d55f044e..a702d3a1 100644 --- a/docs.json +++ b/docs.json @@ -418,6 +418,7 @@ ] }, "maintenance-ops/monitoring-and-alerting", + "maintenance-ops/replication-lag", "maintenance-ops/production-readiness-guide", "maintenance-ops/compacting-buckets", { diff --git a/maintenance-ops/monitoring-and-alerting.mdx b/maintenance-ops/monitoring-and-alerting.mdx index 2d47e63d..e984aacb 100644 --- a/maintenance-ops/monitoring-and-alerting.mdx +++ b/maintenance-ops/monitoring-and-alerting.mdx @@ -17,6 +17,10 @@ You can monitor activity and alert on issues and usage for your PowerSync Cloud These features can assist with troubleshooting common issues (e.g. replication errors due to a logical replication slot problem), investigating usage spikes, or being notified when usage exceeds a specific threshold. + + Investigating replication lag specifically? See [Replication Lag](/maintenance-ops/replication-lag) for what it is, how to monitor it, and common causes. + + \* The availability of these features depends on your PowerSync Cloud plan. See the table below for a summary. More details are provided further below. 
### Summary of Feature Availability (by PowerSync Cloud Plan) diff --git a/maintenance-ops/production-readiness-guide.mdx b/maintenance-ops/production-readiness-guide.mdx index 1ce30c9d..61ae5299 100644 --- a/maintenance-ops/production-readiness-guide.mdx +++ b/maintenance-ops/production-readiness-guide.mdx @@ -256,6 +256,10 @@ The easiest way to check for replication issues is to look at the Diagnostics en ### Managing & Monitoring Replication Lag + + For a broader overview of replication lag across source databases, how to monitor it, and common causes, see [Replication Lag](/maintenance-ops/replication-lag). + + Because PowerSync relies on Postgres logical replication, it's important to consider the size of the `max_slot_wal_keep_size` and monitoring lag of replication slots used by PowerSync in a production environment to ensure lag of replication slots do not exceed the `max_slot_wal_keep_size`. The `max_slot_wal_keep_size` Postgres [configuration parameter](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE) limits the size of the Write-Ahead Log (WAL) files that replication slots can hold. diff --git a/maintenance-ops/replication-lag.mdx b/maintenance-ops/replication-lag.mdx new file mode 100644 index 00000000..c9c02424 --- /dev/null +++ b/maintenance-ops/replication-lag.mdx @@ -0,0 +1,192 @@ +--- +title: "Replication Lag" +description: "Understand, monitor, and reduce replication lag between your source database and the PowerSync Service." +--- + +Replication lag is the delay between a change being committed in your source database (Postgres, MongoDB, MySQL, SQL Server) and that change being available in the PowerSync Service for clients to sync. A small amount of lag is normal. Sustained or growing lag usually points to a specific cause that you can investigate and act on. + +This page covers what replication lag is, how to monitor it, what commonly causes it, and how to reduce it. 
+ +## Overview + +A change committed in the source database goes through roughly three stages before a client sees it: + +1. The source database writes the change to its replication stream. The exact mechanism differs per source: + - **Postgres**: logical replication via the Write-Ahead Log (WAL), read through a replication slot. + - **MongoDB**: change streams backed by the oplog. + - **MySQL**: the binary log (binlog), read using GTIDs. + - **SQL Server**: Change Data Capture (CDC) change tables, populated by a capture job that scans the transaction log. +2. The PowerSync Service reads the change from that stream and processes it into its internal bucket storage. +3. Connected clients receive the change on their next checkpoint. + +Replication lag refers specifically to stage 2: the time or volume of changes that have been committed to the source but not yet processed by the PowerSync Service. On Postgres, this is reported as `replication_lag_bytes` (bytes of WAL ahead of the PowerSync replication slot). + + + SQL Server has an additional source of latency inside stage 1: the CDC capture job itself runs on an interval (default 5 seconds on SQL Server, fixed at 20 seconds on Azure SQL), so changes do not appear in the CDC change tables instantly. See [SQL Server](#sql-server) below. + + +## How to monitor replication lag + +### PowerSync Dashboard + +The [PowerSync Dashboard](https://dashboard.powersync.com/) exposes a **Replication Lag** chart in the **Metrics** view of each instance. Use it to spot spikes and trends over time. + +See [Monitoring and Alerting](/maintenance-ops/monitoring-and-alerting) for alert and notification options available on your plan. + +### Management API Diagnostics + +The Diagnostics endpoint returns the current replication state, including `replication_lag_bytes` for each active Sync Streams/Sync Rules connection. 
See [Managing & Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing--monitoring-replication-lag) in the production readiness guide for an example response and Postgres queries that check replication slot lag directly. + +### Instance Logs + +[Instance Logs](/maintenance-ops/monitoring-and-alerting#instance-logs) include **Replicator** entries that reflect replication activity from your source database to the PowerSync Service. Replication errors and restarts appear here and are often the first signal when lag starts climbing. + +## What "*normal*" looks like + +Replication lag is not expected to be exactly zero at all times. Short fluctuations are routine and generally not a concern. As a rough guide: + +* **Steady state**: lag stays low (typically in the single-digit seconds, or a few MB of WAL on Postgres) and returns to near-zero between bursts. +* **Write bursts**: a batch of writes in the source database causes a short spike while the service catches up. Lag should recover within seconds to a minute once the burst ends. +* **Sustained or growing lag**: lag that keeps climbing, or does not recover after a burst, indicates a problem worth investigating. + +## Common causes + +The causes below are grouped into ones that apply to any source, and ones that are specific to a given source database. + + + Replication lag is separate from client sync lag. A client can be behind the PowerSync Service because of its own connection or app state, even when replication lag is zero. + + +### All sources + +#### Initial replication of a large dataset + +When you first connect a source database, or when you deploy Sync Config changes that trigger reprocessing, the PowerSync Service replicates the full set of matching rows. During this period: + +* Replication lag will be elevated until the initial snapshot completes. 
+* The source-side replication buffer (WAL on Postgres, oplog on MongoDB, binlog on MySQL, CDC change tables on SQL Server) grows because the service has not yet acknowledged those changes. + +This is expected. Plan for it by sizing the relevant retention setting appropriately (see the source-specific sections below) and by coordinating large Sync Config changes during lower-traffic windows. + +#### Source database load + +Replication lag is sensitive to activity on the source database: + +* Long-running transactions on the source hold back the replication position until they commit. +* CPU, IO, or connection saturation on the source slows how fast changes are written to the replication stream in the first place. + +If lag correlates with specific workloads, profile those workloads on the source database before looking at the PowerSync Service. + +#### Bursty write workloads exceeding replication throughput + +Replication lag is a function of how fast changes arrive vs. how fast PowerSync can consume them. If a workload produces changes faster than the service can replicate, lag will accumulate until the burst ends and then drain as the service catches up. The service's published throughput (see [Performance and Limits](/resources/performance-and-limits)) is roughly: + +* **2,000-4,000 operations per second** for small rows +* **Up to 5 MB per second** for large rows +* **~60 transactions per second** for smaller transactions + +Workloads that commonly push past these rates, and therefore commonly cause visible lag spikes, include: + +* **Scheduled jobs**: cron jobs, nightly batches, or queue workers that flush on a timer. These tend to produce very sharp lag spikes at predictable times. +* **Bulk `UPDATE`s across indexed columns**: a single statement can generate millions of row-change events in the replication stream, even if the SQL itself runs quickly on the source. +* **Backfills and data migrations**: schema changes, column backfills, or re-keying jobs. 
On Postgres these can also rewrite large portions of a table, multiplying WAL volume. +* **Bulk imports** (`COPY`, `LOAD DATA`, `BULK INSERT`, `insertMany`): import throughput on the source is often far higher than replication throughput. + + + If a burst is unavoidable, prefer to run it during lower-traffic windows, batch it into smaller chunks rather than one large transaction, and make sure your source-side retention setting is large enough to cover the time it takes PowerSync to catch up afterwards. See the source-specific sections below: [Postgres](#postgres), [MongoDB](#mongodb), [MySQL](#mysql), [SQL Server](#sql-server). + + +#### Sync Config complexity + +Complex Sync Streams/Sync Rules (large numbers of buckets, heavy parameter queries, or joins against large tables) increase the amount of work required per replicated change. If you see lag climb after a Sync Config deploy and stay elevated, review the new configuration for expensive patterns. See [Performance and Limits](/resources/performance-and-limits) for limits that are worth staying well inside of. + +### Postgres + +#### WAL retention (`max_slot_wal_keep_size`) + +If the WAL grows faster than the PowerSync Service can consume it, and the total unconsumed WAL exceeds `max_slot_wal_keep_size`, Postgres will invalidate the replication slot. PowerSync then has to restart replication from scratch, which extends the period of elevated lag. + + + The [`max_slot_wal_keep_size`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE) Postgres parameter limits how much WAL a replication slot can retain. Setting it too low on a write-heavy database risks slot invalidation during bursts or during initial replication. + + +See [Managing & Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing--monitoring-replication-lag) for queries to check the current setting and the current slot lag, and for guidance on sizing it. 
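+
+As a rough sketch of what to look at (the production readiness guide has the canonical queries; slot names vary per instance), the current limit and per-slot WAL lag can be checked directly on the source database:
+
+```sql
+-- Current WAL retention limit for replication slots ('-1' means unlimited)
+SHOW max_slot_wal_keep_size;
+
+-- Approximate WAL each slot is holding back on disk
+SELECT
+  slot_name,
+  active,
+  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
+FROM pg_replication_slots;
+```
+
+If `retained_wal` for the PowerSync slot trends toward `max_slot_wal_keep_size`, increase the limit before the slot is invalidated.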
+ +#### Inactive replication slots holding WAL + +When Sync Streams/Sync Rules are redeployed, PowerSync creates a new replication slot and retires the old one once reprocessing completes. If an instance is stopped, deprovisioned, or hits an error before that handover finishes, an inactive slot can remain on the source database and continue to hold WAL, which can contribute to disk pressure and can mask what "real" lag looks like. + +See [Managing Replication Slots](/maintenance-ops/production-readiness-guide#managing-replication-slots) for queries to find and drop inactive slots, and for notes on the Postgres 18+ `idle_replication_slot_timeout` parameter. + +### MongoDB + +* An oplog that is undersized relative to write volume can cause change stream cursors to fall behind. +* Long-running change stream operations that time out have to be re-established, which can extend lag if it happens repeatedly. +* Documents with deep nesting or very large arrays take longer to transform into the PowerSync internal format. + +Make sure the oplog is sized to retain enough history to cover your expected replication windows, especially during initial replication. + +### MySQL + +* **Binlog expiry**: if required binlog files are purged before PowerSync has read them (for example, after extended downtime or sustained lag), replication has to restart from scratch. Make sure `binlog_expire_logs_seconds` is long enough to cover expected downtime and lag bursts. This is the MySQL analogue to Postgres slot invalidation. +* **`binlog-do-db` / `binlog-ignore-db` filters**: if these filters are set, every database referenced by your Sync Config must be included. Tables in excluded databases will not produce binlog events for PowerSync to replicate. + +See [MySQL setup](/configuration/source-db/setup#mysql) for required binlog settings. + +### SQL Server + +* **CDC retention**: the CDC cleanup job expires data from CDC change tables after a retention window (default 3 days). 
If the PowerSync Service is offline longer than this, data will need to be fully re-synced. +* **Capture job interval**: the SQL Server capture job scans the transaction log every 5 seconds by default; on Azure SQL Database this is fixed at 20 seconds. This interval is a floor on end-to-end lag. +* **`_powersync_checkpoints` table**: CDC must remain enabled on `dbo._powersync_checkpoints` for PowerSync to generate regular checkpoints. If CDC is disabled on this table, checkpoints stop advancing even when the rest of replication is healthy. + +See [SQL Server setup](/configuration/source-db/setup#sql-server) for CDC configuration and recommended capture job tuning. + +## Reducing replication lag + +Work through the checks below in order. The source-specific steps only apply if you are using that source database - skip the others. + + + + Check CPU, IO, connection count, and long-running transactions on the source. A saturated source will cause replication lag that no amount of tuning on the PowerSync side can fix. + + + Run the queries in [Managing & Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing--monitoring-replication-lag) to see current slot lag and `max_slot_wal_keep_size`. Increase `max_slot_wal_keep_size` if lag routinely approaches it, especially before deploying Sync Config changes against large datasets. + + + If WAL is growing on the source but lag reported by the PowerSync Service is low, look for inactive slots. See [Managing Replication Slots](/maintenance-ops/production-readiness-guide#managing-replication-slots) to identify and drop them. + + + Confirm the oplog is sized to cover your expected replication window, especially around initial replication. An undersized oplog causes change stream cursors to fall behind and then have to be re-established. 
+ + + Confirm `binlog_expire_logs_seconds` is long enough to tolerate expected downtime or lag bursts, and that any `binlog-do-db` / `binlog-ignore-db` filters include every database referenced by your Sync Config. See [MySQL](#mysql) above. + + + Confirm the CDC capture job is running and has not exceeded its retention window (default 3 days), and that CDC is still enabled on `dbo._powersync_checkpoints`. See [SQL Server](#sql-server) above and [SQL Server setup](/configuration/source-db/setup#sql-server) for capture job tuning. + + + Look for Sync Config changes that could be producing significantly more buckets or heavier parameter queries than before. Simplify where possible and deploy large changes during lower-traffic windows. + + + [Replicator logs](/maintenance-ops/monitoring-and-alerting#instance-logs) often contain the specific error (slot invalidation, change stream failure, binlog purge, CDC retention expiry, source connectivity) behind a lag incident. + + + +If lag persists after these steps, reach out on the PowerSync [Discord](https://discord.gg/powersync) or contact support with your instance ID, the time range of the incident, and a screenshot of the Replication Lag chart. + +## Related + + + + Configure usage metrics, logs, issue alerts, and notifications. + + + Database best practices, including replication slot management. + + + Common issues and pointers for debugging sync and replication. + + + Service limits that are worth staying well inside of. 
+ + From b5672a0432ab266e15a6f2535ab8b79b2efa997b Mon Sep 17 00:00:00 2001 From: bean1352 Date: Mon, 20 Apr 2026 14:05:49 +0200 Subject: [PATCH 2/5] Improved doc flow and added Supabase defaults --- .../production-readiness-guide.mdx | 2 +- maintenance-ops/replication-lag.mdx | 100 +++++++++--------- 2 files changed, 53 insertions(+), 49 deletions(-) diff --git a/maintenance-ops/production-readiness-guide.mdx b/maintenance-ops/production-readiness-guide.mdx index 61ae5299..2f780a89 100644 --- a/maintenance-ops/production-readiness-guide.mdx +++ b/maintenance-ops/production-readiness-guide.mdx @@ -254,7 +254,7 @@ The easiest way to check for replication issues is to look at the Diagnostics en ## Postgres -### Managing & Monitoring Replication Lag +### Managing and Monitoring Replication Lag For a broader overview of replication lag across source databases, how to monitor it, and common causes, see [Replication Lag](/maintenance-ops/replication-lag). diff --git a/maintenance-ops/replication-lag.mdx b/maintenance-ops/replication-lag.mdx index c9c02424..7e44a61b 100644 --- a/maintenance-ops/replication-lag.mdx +++ b/maintenance-ops/replication-lag.mdx @@ -33,10 +33,6 @@ The [PowerSync Dashboard](https://dashboard.powersync.com/) exposes a **Replicat See [Monitoring and Alerting](/maintenance-ops/monitoring-and-alerting) for alert and notification options available on your plan. -### Management API Diagnostics - -The Diagnostics endpoint returns the current replication state, including `replication_lag_bytes` for each active Sync Streams/Sync Rules connection. See [Managing & Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing--monitoring-replication-lag) in the production readiness guide for an example response and Postgres queries that check replication slot lag directly. 
- ### Instance Logs [Instance Logs](/maintenance-ops/monitoring-and-alerting#instance-logs) include **Replicator** entries that reflect replication activity from your source database to the PowerSync Service. Replication errors and restarts appear here and are often the first signal when lag starts climbing. @@ -47,7 +43,8 @@ Replication lag is not expected to be exactly zero at all times. Short fluctuati * **Steady state**: lag stays low (typically in the single-digit seconds, or a few MB of WAL on Postgres) and returns to near-zero between bursts. * **Write bursts**: a batch of writes in the source database causes a short spike while the service catches up. Lag should recover within seconds to a minute once the burst ends. -* **Sustained or growing lag**: lag that keeps climbing, or does not recover after a burst, indicates a problem worth investigating. +* **PowerSync infrastructure events**: brief replication lag can also appear during internal PowerSync scaling events. These are expected to recover on their own within a few minutes without any action from you. This is most likely to affect instances on **Free** and **Pro** plans, which run on shared infrastructure; **Team** and **Enterprise** plans are less affected. +* **Sustained or growing lag**: lag that keeps climbing, or does not recover after a burst or infrastructure event, indicates a problem worth investigating. ## Common causes @@ -79,7 +76,7 @@ If lag correlates with specific workloads, profile those workloads on the source #### Bursty write workloads exceeding replication throughput -Replication lag is a function of how fast changes arrive vs. how fast PowerSync can consume them. If a workload produces changes faster than the service can replicate, lag will accumulate until the burst ends and then drain as the service catches up. 
The service's published throughput (see [Performance and Limits](/resources/performance-and-limits)) is roughly: +Replication lag is a function of how fast changes arrive vs. how fast PowerSync can consume them. If a workload produces changes faster than the service can replicate, lag will accumulate until the burst ends and then drain as the service catches up. The service's published throughput (see [Performance and Limits](/resources/performance-and-limits#performance-expectations)) is roughly: * **2,000-4,000 operations per second** for small rows * **Up to 5 MB per second** for large rows @@ -110,7 +107,15 @@ If the WAL grows faster than the PowerSync Service can consume it, and the total The [`max_slot_wal_keep_size`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE) Postgres parameter limits how much WAL a replication slot can retain. Setting it too low on a write-heavy database risks slot invalidation during bursts or during initial replication. -See [Managing & Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing--monitoring-replication-lag) for queries to check the current setting and the current slot lag, and for guidance on sizing it. + + **Supabase defaults**: Supabase projects ship with `max_slot_wal_keep_size = 4GB` and a limit of 5 replication slots. The 4GB cap is easy to exceed during initial replication of a large dataset or a sustained write burst, after which the slot will be invalidated and PowerSync has to restart replication from scratch. Raise this value before connecting a large Supabase database to PowerSync. + + +See [Managing and Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing-and-monitoring-replication-lag) for queries to check the current setting and the current slot lag, and for guidance on sizing it. 
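+
+If the limit needs raising, `max_slot_wal_keep_size` is reloadable without a restart on self-managed Postgres (managed providers, including Supabase, typically expose this through their own configuration interface instead). The value below is illustrative; size it to cover your largest expected write burst plus catch-up time:
+
+```sql
+-- Raise the WAL retention cap (illustrative value) and reload the configuration
+ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
+SELECT pg_reload_conf();
+```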
+ +#### `TRUNCATE` on replicated tables + +A `TRUNCATE` on a table in your Sync Config is treated as a change event for every row in that table, which can force the service to re-process large amounts of bucket data. If `TRUNCATE` runs on a regular schedule (for example, a cron that truncates-and-reloads a table), each run will produce a visible lag spike. Prefer `DELETE` with a filter, or redesign the job so it does not truncate a replicated table. #### Inactive replication slots holding WAL @@ -120,59 +125,58 @@ See [Managing Replication Slots](/maintenance-ops/production-readiness-guide#man ### MongoDB -* An oplog that is undersized relative to write volume can cause change stream cursors to fall behind. -* Long-running change stream operations that time out have to be re-established, which can extend lag if it happens repeatedly. -* Documents with deep nesting or very large arrays take longer to transform into the PowerSync internal format. - -Make sure the oplog is sized to retain enough history to cover your expected replication windows, especially during initial replication. +* **Change stream timeouts**: a significant delay on the source database in reading the change stream can cause timeouts (see [`PSYNC_S1345`](/debugging/error-codes#psync-s13xx-mongodb-replication-issues)). If this is not resolved after retries, replication may need to be restarted from scratch. +* **Change stream invalidation**: replication restarts with a new change stream if the existing one is invalidated, for example if the `startAfter`/`resumeToken` is no longer valid, if the replication connection changes, or if the database is dropped (see [`PSYNC_S1344`](/debugging/error-codes#psync-s13xx-mongodb-replication-issues)). +* **Deeply nested documents**: JSON or embedded-document nesting deeper than 20 levels will fail replication with [`PSYNC_S1004`](/debugging/error-codes#psync-s1xxx-replication-issues). 
+* **Post-image configuration**: if post-images are set to `read_only`, every replicated collection must have `changeStreamPreAndPostImages: { enabled: true }` set or replication will error. See [Post Images](/configuration/source-db/setup#post-images). ### MySQL -* **Binlog expiry**: if required binlog files are purged before PowerSync has read them (for example, after extended downtime or sustained lag), replication has to restart from scratch. Make sure `binlog_expire_logs_seconds` is long enough to cover expected downtime and lag bursts. This is the MySQL analogue to Postgres slot invalidation. -* **`binlog-do-db` / `binlog-ignore-db` filters**: if these filters are set, every database referenced by your Sync Config must be included. Tables in excluded databases will not produce binlog events for PowerSync to replicate. +* **Binlog retention**: PowerSync reads from the MySQL binary log. If required binlog files are purged before PowerSync has read them (for example, after extended downtime or sustained lag), replication has to restart from scratch. Configure MySQL binlog retention to be long enough to cover expected downtime and lag bursts. +* **`binlog-do-db` / `binlog-ignore-db` filters**: these filters are optional, but if set, every database referenced by your Sync Config must be included. Tables in excluded databases will not produce binlog events for PowerSync to replicate. See [Additional Configuration (Optional) → Binlog](/configuration/source-db/setup#additional-configuration-optional) in the MySQL setup docs. See [MySQL setup](/configuration/source-db/setup#mysql) for required binlog settings. ### SQL Server -* **CDC retention**: the CDC cleanup job expires data from CDC change tables after a retention window (default 3 days). If the PowerSync Service is offline longer than this, data will need to be fully re-synced. 
-* **Capture job interval**: the SQL Server capture job scans the transaction log every 5 seconds by default; on Azure SQL Database this is fixed at 20 seconds. This interval is a floor on end-to-end lag. -* **`_powersync_checkpoints` table**: CDC must remain enabled on `dbo._powersync_checkpoints` for PowerSync to generate regular checkpoints. If CDC is disabled on this table, checkpoints stop advancing even when the rest of replication is healthy. +* **CDC retention**: the CDC cleanup job expires data from CDC change tables after a retention window (default 3 days). If the PowerSync Service is offline longer than this period, data will need to be fully re-synced. +* **Latency from CDC polling**: end-to-end latency has two components. First, the SQL Server capture job's transaction log scan interval (default 5 seconds, recommended 1 second; fixed at 20 seconds on Azure SQL Database). Second, PowerSync's own polling interval (`pollingIntervalMs`, default 1000ms, self-hosted only). Both contribute to the minimum achievable lag. +* **`_powersync_checkpoints` table**: CDC must be enabled on `dbo._powersync_checkpoints` for PowerSync to generate regular checkpoints. -See [SQL Server setup](/configuration/source-db/setup#sql-server) for CDC configuration and recommended capture job tuning. +See [SQL Server setup](/configuration/source-db/setup#sql-server) for CDC configuration, recommended capture job tuning, and the [Latency](/configuration/source-db/setup#latency) section. ## Reducing replication lag -Work through the checks below in order. The source-specific steps only apply if you are using that source database - skip the others. - - - - Check CPU, IO, connection count, and long-running transactions on the source. A saturated source will cause replication lag that no amount of tuning on the PowerSync side can fix. 
- - - Run the queries in [Managing & Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing--monitoring-replication-lag) to see current slot lag and `max_slot_wal_keep_size`. Increase `max_slot_wal_keep_size` if lag routinely approaches it, especially before deploying Sync Config changes against large datasets. - - - If WAL is growing on the source but lag reported by the PowerSync Service is low, look for inactive slots. See [Managing Replication Slots](/maintenance-ops/production-readiness-guide#managing-replication-slots) to identify and drop them. - - - Confirm the oplog is sized to cover your expected replication window, especially around initial replication. An undersized oplog causes change stream cursors to fall behind and then have to be re-established. - - - Confirm `binlog_expire_logs_seconds` is long enough to tolerate expected downtime or lag bursts, and that any `binlog-do-db` / `binlog-ignore-db` filters include every database referenced by your Sync Config. See [MySQL](#mysql) above. - - - Confirm the CDC capture job is running and has not exceeded its retention window (default 3 days), and that CDC is still enabled on `dbo._powersync_checkpoints`. See [SQL Server](#sql-server) above and [SQL Server setup](/configuration/source-db/setup#sql-server) for capture job tuning. - - - Look for Sync Config changes that could be producing significantly more buckets or heavier parameter queries than before. Simplify where possible and deploy large changes during lower-traffic windows. - - - [Replicator logs](/maintenance-ops/monitoring-and-alerting#instance-logs) often contain the specific error (slot invalidation, change stream failure, binlog purge, CDC retention expiry, source connectivity) behind a lag incident. 
- - - -If lag persists after these steps, reach out on the PowerSync [Discord](https://discord.gg/powersync) or contact support with your instance ID, the time range of the incident, and a screenshot of the Replication Lag chart. +Start with the "All sources" checks, then go to the section for your source database. + +### All sources + +* **Confirm the source database is healthy**: check CPU, IO, connection count, and long-running transactions on the source. A saturated source will cause replication lag that no amount of tuning on the PowerSync side can fix. +* **Pause or reduce large writes while the service catches up**: if lag is already elevated, holding off on scheduled jobs, bulk updates, migrations, and backfills is usually the fastest way to let it drain. If a large write is unavoidable, batch it into smaller transactions and pace them so the service has time to drain between batches, rather than running it as one large transaction. +* **Review Sync Config**: look for Sync Config changes that could be producing significantly more buckets or heavier parameter queries than before. Simplify where possible and deploy large changes during lower-traffic windows. +* **Check for source schema changes**: `ALTER TABLE` and similar changes on replicated tables can stall or invalidate replication until reconfigured. See [Implementing Schema Changes](/maintenance-ops/implementing-schema-changes) for the recommended flow. +* **Check instance logs for errors**: [Replicator logs](/maintenance-ops/monitoring-and-alerting#instance-logs) often contain the specific error (slot invalidation, change stream failure, binlog purge, CDC retention expiry, source connectivity) behind a lag incident. + +### Postgres + +* Run the queries in [Managing and Monitoring Replication Lag](/maintenance-ops/production-readiness-guide#managing-and-monitoring-replication-lag) to see current slot lag and `max_slot_wal_keep_size`. 
Increase `max_slot_wal_keep_size` if lag routinely approaches it, especially before deploying Sync Config changes against large datasets. On Supabase, raise the default 4GB cap before connecting a large database. +* If WAL is growing on the source but lag reported by the PowerSync Service is low, look for inactive slots. See [Managing Replication Slots](/maintenance-ops/production-readiness-guide#managing-replication-slots) to identify and drop them. +* Avoid `TRUNCATE` on tables in your Sync Config. See [`TRUNCATE` on replicated tables](#truncate-on-replicated-tables) above. + +### MongoDB + +* Check [Replicator logs](/maintenance-ops/monitoring-and-alerting#instance-logs) for change-stream errors ([`PSYNC_S1344`](/debugging/error-codes#psync-s13xx-mongodb-replication-issues), [`PSYNC_S1345`](/debugging/error-codes#psync-s13xx-mongodb-replication-issues)). Persistent timeouts or invalidation generally require the change stream to be re-established, which may restart replication. +* If you are using `read_only` post-images, confirm every replicated collection has `changeStreamPreAndPostImages` enabled. See [Post Images](/configuration/source-db/setup#post-images). + +### MySQL + +* Confirm MySQL binlog retention is long enough to tolerate expected downtime or lag bursts, and that any `binlog-do-db` / `binlog-ignore-db` filters include every database referenced by your Sync Config. See [MySQL](#mysql) above. + +### SQL Server + +* Confirm the CDC capture job is running and has not exceeded its retention window (default 3 days), and that CDC is still enabled on `dbo._powersync_checkpoints`. See [SQL Server](#sql-server) above and [SQL Server setup](/configuration/source-db/setup#sql-server) for capture job tuning. + +If lag persists after these checks, reach out on the PowerSync [Discord](https://discord.gg/powersync) or contact support with your instance ID, the time range of the incident, and a screenshot of the Replication Lag chart. 
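+
+As a concrete starting point for the SQL Server checks above (a sketch using the CDC metadata views and procedures shipped with SQL Server; the checkpoints table name assumes the default `dbo` schema):
+
+```sql
+-- Capture/cleanup job settings, including pollinginterval and retention (in minutes)
+EXEC sys.sp_cdc_help_jobs;
+
+-- Confirm CDC is still enabled on the checkpoints table
+SELECT name, is_tracked_by_cdc
+FROM sys.tables
+WHERE name = '_powersync_checkpoints';
+```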
 ## Related

From 87ac374986dbaabb29c922d8b1415067d1610b15 Mon Sep 17 00:00:00 2001
From: bean1352
Date: Mon, 20 Apr 2026 14:16:36 +0200
Subject: [PATCH 3/5] Clarified description of PowerSync infrastructure events

---
 maintenance-ops/replication-lag.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/maintenance-ops/replication-lag.mdx b/maintenance-ops/replication-lag.mdx
index 7e44a61b..3bb646a8 100644
--- a/maintenance-ops/replication-lag.mdx
+++ b/maintenance-ops/replication-lag.mdx
@@ -43,7 +43,7 @@ Replication lag is not expected to be exactly zero at all times. Short fluctuati
 
 * **Steady state**: lag stays low (typically in the single-digit seconds, or a few MB of WAL on Postgres) and returns to near-zero between bursts.
 * **Write bursts**: a batch of writes in the source database causes a short spike while the service catches up. Lag should recover within seconds to a minute once the burst ends.
-* **PowerSync infrastructure events**: brief replication lag can also appear during internal PowerSync scaling events. These are expected to recover on their own within a few minutes without any action from you. This is most likely to affect instances on **Free** and **Pro** plans, which run on shared infrastructure; **Team** and **Enterprise** plans are less affected.
+* **PowerSync infrastructure events**: brief replication lag can also occur during internal PowerSync scaling events. These are expected to recover on their own within a few minutes without any action from you.
 * **Sustained or growing lag**: lag that keeps climbing, or does not recover after a burst or infrastructure event, indicates a problem worth investigating.
 ## Common causes

From 4da46e2d78a615cfad236dd80cd3162a1ffc44c7 Mon Sep 17 00:00:00 2001
From: bean1352
Date: Mon, 20 Apr 2026 14:59:14 +0200
Subject: [PATCH 4/5] Drop replication_lag_bytes claim per review

---
 maintenance-ops/replication-lag.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/maintenance-ops/replication-lag.mdx b/maintenance-ops/replication-lag.mdx
index 3bb646a8..cc5b9291 100644
--- a/maintenance-ops/replication-lag.mdx
+++ b/maintenance-ops/replication-lag.mdx
@@ -19,7 +19,7 @@ A change committed in the source database goes through roughly three stages befo
 2. The PowerSync Service reads the change from that stream and processes it into its internal bucket storage.
 3. Connected clients receive the change on their next checkpoint.
 
-Replication lag refers specifically to stage 2: the time or volume of changes that have been committed to the source but not yet processed by the PowerSync Service. On Postgres, this is reported as `replication_lag_bytes` (bytes of WAL ahead of the PowerSync replication slot).
+Replication lag refers specifically to stage 2: the time or volume of changes that have been committed to the source but not yet processed by the PowerSync Service.
 
 SQL Server has an additional source of latency inside stage 1: the CDC capture job itself runs on an interval (default 5 seconds on SQL Server, fixed at 20 seconds on Azure SQL), so changes do not appear in the CDC change tables instantly. See [SQL Server](#sql-server) below.

From fbee363b20f37396bdaebed7c1a9f8a710f17682 Mon Sep 17 00:00:00 2001
From: bean1352
Date: Mon, 20 Apr 2026 15:22:42 +0200
Subject: [PATCH 5/5] Clarify that slower throughput correlates with the number
 of buckets per replicated row, not with query complexity.
---
 maintenance-ops/replication-lag.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/maintenance-ops/replication-lag.mdx b/maintenance-ops/replication-lag.mdx
index cc5b9291..f63c9ca3 100644
--- a/maintenance-ops/replication-lag.mdx
+++ b/maintenance-ops/replication-lag.mdx
@@ -95,7 +95,7 @@ Workloads that commonly push past these rates, and therefore commonly cause visi
 
 #### Sync Config complexity
 
-Complex Sync Streams/Sync Rules (large numbers of buckets, heavy parameter queries, or joins against large tables) increase the amount of work required per replicated change. If you see lag climb after a Sync Config deploy and stay elevated, review the new configuration for expensive patterns. See [Performance and Limits](/resources/performance-and-limits) for limits that are worth staying well inside of.
+Slower replication performance is correlated with the number of buckets a replicated row ends up in, i.e. a row written once to the source database can be replicated to many buckets if many queries in your Sync Config reference it. If lag climbs after a Sync Config deploy and stays elevated, review the new configuration for rows that end up in many buckets. See [Performance and Limits](/resources/performance-and-limits) for limits that are worth staying well inside of.
 
 ### Postgres
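To make the buckets-per-row effect from the last patch concrete, here is a hypothetical Sync Rules fragment (table, column, and bucket names are invented for illustration, not taken from this patch) in which a single `todos` row fans out into one bucket per subscriber of its list:

```yaml
bucket_definitions:
  shared_lists:
    # One bucket instance per (user, list) pair. A todos row written once to
    # the source database is copied into as many buckets as its list has
    # subscribers, so one source write becomes many bucket writes.
    parameters: SELECT list_id FROM subscriptions WHERE user_id = request.user_id()
    data:
      - SELECT * FROM todos WHERE list_id = bucket.list_id
```

Reviewing a Sync Config deploy for rows like this — rows referenced by many bucket instances — is the review step the patch recommends when lag climbs and stays elevated.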