Skip to content

Make transaction log read paths lock-free under concurrency#545

Draft
harper-joseph wants to merge 2 commits into
mainfrom
transaction-log-lock-free-reads
Draft

Make transaction log read paths lock-free under concurrency#545
harper-joseph wants to merge 2 commits into
mainfrom
transaction-log-lock-free-reads

Conversation

@harper-joseph
Copy link
Copy Markdown

@harper-joseph harper-joseph commented May 8, 2026

Eliminates dataSetsMutex and weak_ptr::lock() contention from the steady-state read path, unlocking high-QPS workloads with many concurrent subscribers (e.g., CDC replication).

Read-path changes (TransactionLogStore + TransactionLogHandle):

  • Per-handle file snapshot cache (cachedFiles + filesVersion). Handles refresh the snapshot only when the store's filesVersion atomic advances (rotation, registration, purge); steady-state reads walk the local snapshot lock-free.
  • Lock-free findPosition: walks cachedFiles newest-to-oldest, skipping files by their stored timestamp. Falls back to the store-level slow path only when the snapshot is empty.
  • Lock-free getLogFileSize fast path: reads logFile->size atomic via the cached snapshot. Falls through to the slow path only when sequenceNumber=0 or the file is in the snapshot but not yet opened.
  • Cached shared_ptr on the handle replaces the per-call weak_ptr::lock() CAS. Each read-method now does an isClosing.load() check instead. addEntry re-resolves to a fresh store and clears cachedFiles when isClosing is observed.
  • currentSequenceNumber is now atomic (was plain uint32_t) so handle fast-path readers can compare without acquiring dataSetsMutex.

Per-file index changes (TransactionLogFile):

  • Lock-free in-file timestamp index using a stable buffer + packed atomic state (low 32 bits = entry count, high 32 bits = position indexed up to). Replaces the std::map + indexMutex serialization. Acquire/release ordering on indexState publishes new entries safely to lock-free readers.
  • Slow-path index extension serializes only on indexExtendMutex (per-file), not the global dataSetsMutex.
  • Removed eager ensureIndexUpToDate at registerLogFile — saves up to maxFileSize/13 × 16 bytes per recovered file at startup. The lazy slow path handles the first reader instead.
  • Inlined extendIndexLocked into findPositionByTimestamp (only caller).

Test coverage:

Adds 9 new regression tests covering the cases the changes affected:

  • per-handle cache invalidation across many rotations
  • concurrent first-time readers on a freshly-opened (unindexed) log
  • multiple handles on the same log staying consistent
  • iterator resume across rotations
  • crash-free behavior under concurrent reads + purgeLogs(destroy:true)
  • queries after attempting to purge earlier files
  • ...and more

Bench:

Adds benchmark/worker-transaction-log-read.bench.ts — 4 access patterns (bulk forward scan, bulk forward scan with concurrent writer, high-frequency short-range queries, cursor-advance tail scan with writer) at 8 workers, comparing rocksdb-js against lmdb with matched lazy-durability semantics (noSync) and Harper-realistic LMDB options (snapshot:false, numeric keys).

Eliminates dataSetsMutex and weak_ptr::lock() contention from the steady-state
read path, unlocking high-QPS workloads with many concurrent subscribers
(e.g., CDC replication).

Read-path changes (TransactionLogStore + TransactionLogHandle):

- Per-handle file snapshot cache (cachedFiles + filesVersion). Handles refresh
  the snapshot only when the store's filesVersion atomic advances (rotation,
  registration, purge); steady-state reads walk the local snapshot lock-free.
- Lock-free findPosition: walks cachedFiles newest-to-oldest, skipping files
  by their stored timestamp. Falls back to the store-level slow path only
  when the snapshot is empty.
- Lock-free getLogFileSize fast path: reads logFile->size atomic via the
  cached snapshot. Falls through to the slow path only when sequenceNumber=0
  or the file is in the snapshot but not yet opened.
- Cached shared_ptr<TransactionLogStore> on the handle replaces the per-call
  weak_ptr::lock() CAS. Each read-method now does an isClosing.load() check
  instead. addEntry re-resolves to a fresh store and clears cachedFiles when
  isClosing is observed.
- currentSequenceNumber is now atomic (was plain uint32_t) so handle
  fast-path readers can compare without acquiring dataSetsMutex.

Per-file index changes (TransactionLogFile):

- Lock-free in-file timestamp index using a stable buffer + packed atomic
  state (low 32 bits = entry count, high 32 bits = position indexed up to).
  Replaces the std::map + indexMutex serialization. Acquire/release ordering
  on indexState publishes new entries safely to lock-free readers.
- Slow-path index extension serializes only on indexExtendMutex (per-file),
  not the global dataSetsMutex.
- Removed eager ensureIndexUpToDate at registerLogFile — saves up to
  maxFileSize/13 × 16 bytes per recovered file at startup. The lazy slow
  path handles the first reader instead.
- Inlined extendIndexLocked into findPositionByTimestamp (only caller).

Test coverage:

Adds 9 new regression tests covering the cases the changes affected:
- per-handle cache invalidation across many rotations
- concurrent first-time readers on a freshly-opened (unindexed) log
- multiple handles on the same log staying consistent
- iterator resume across rotations
- crash-free behavior under concurrent reads + purgeLogs(destroy:true)
- queries after attempting to purge earlier files
- ...and more

Bench:

Adds benchmark/worker-transaction-log-read.bench.ts — 4 access patterns
(bulk forward scan, bulk forward scan with concurrent writer, high-frequency
short-range queries, cursor-advance tail scan with writer) at 8 workers,
comparing rocksdb-js against lmdb with matched lazy-durability semantics
(noSync) and Harper-realistic LMDB options (snapshot:false, numeric keys).

Headline impact (8 workers, 3-run averages):

  Short-range queries (1k iterators × 8 workers):
    baseline: 114 hz | optimized: 2,576 hz (22.6× speedup, 5.0× faster than lmdb)

  Bulk forward scan with concurrent writer:
    baseline: 2,248 hz | optimized: 2,502 hz (+11%, 26.6× faster than lmdb)

  Bulk forward scan, no writes:
    baseline: 3,136 hz | optimized: 3,168 hz (+1%, 38× faster than lmdb)

The high-QPS short-range pattern is most representative of CDC subscriber
polling — that workload moves from a regression vs lmdb to a 5× lead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harper-joseph harper-joseph linked an issue May 8, 2026 that may be closed by this pull request
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 24.38K ops/sec 41.02 39.74 694.525 0.116 121,888
🥈 rocksdb 2 12.54K ops/sec 79.77 77.37 22,482.244 0.890 62,679

getSync() > sequential keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 28.38K ops/sec 35.23 34.16 754.692 0.106 141,912
🥈 rocksdb 2 13.18K ops/sec 75.87 74.78 567.205 0.048 65,900

ranges.bench.ts

getRange() > small range (100 records, 50 range)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 26.42K ops/sec 37.84 35.23 1,628.831 0.281 132,123
🥈 rocksdb 2 17.16K ops/sec 58.27 51.77 2,132.248 0.144 85,811

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 197.30 ops/sec 5,068.55 71.28 141,022.184 38.61 395
🥈 lmdb 2 26.43 ops/sec 37,831.511 423.412 1,191,509.07 136.374 64.00

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 35.38K ops/sec 28.26 12.96 14,199.917 0.601 176,903
🥈 lmdb 2 439.76 ops/sec 2,273.972 175.537 28,497.843 1.65 2,199

Transaction log > read one entry from random position from log with 1000 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 741.86K ops/sec 1.35 1.18 3,290.04 0.146 3,709,315
🥈 lmdb 2 423.72K ops/sec 2.36 1.22 2,841.055 0.316 2,118,582

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 862.62 ops/sec 1,159.252 992.12 1,806.368 0.295 1,726
🥈 lmdb 2 1.17 ops/sec 856,476.561 812,851.925 883,536.324 1.75 10.00

worker-transaction-log-read.bench.ts

Transaction log read access patterns (8 workers) > Bulk forward scan, no writes: 8 workers each scan ~8000 entries

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 966.66 ops/sec 1,034.493 933.384 2,499.365 0.337 1,934
🥈 lmdb 2 100.10 ops/sec 9,990.446 7,142.306 14,021.624 1.61 201

Transaction log read access patterns (8 workers) > Bulk forward scan with 1 concurrent writer (7 readers)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 1.01K ops/sec 990.958 815.75 1,474.379 0.472 2,019
🥈 lmdb 2 107.45 ops/sec 9,306.826 6,715.124 12,684.923 1.67 215

Transaction log read access patterns (8 workers) > Short-range queries: 8 workers each open 1000 iterators per tick

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 1.16K ops/sec 861.938 679.434 1,581.451 1.11 2,321
🥈 lmdb 2 434.90 ops/sec 2,299.392 1,406.1 5,185.222 1.43 870

Transaction log read access patterns (8 workers) > Cursor-advance tail scan with 1 concurrent writer (7 readers, 50 writes/tick)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 801.67 ops/sec 1,247.402 1,041.197 17,238.003 2.90 1,604
🥈 rocksdb 2 501.69 ops/sec 1,993.266 1,542.796 3,387.585 0.725 1,004

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 18.19K ops/sec 54.98 30.20 558.912 0.499 36,377
🥈 lmdb 2 817.20 ops/sec 1,223.685 293.444 11,213.584 5.34 1,635

Results from commit 6ba95df

* to a fresh store if this one has been marked closing.
*/
std::weak_ptr<TransactionLogStore> store;
std::shared_ptr<TransactionLogStore> store;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm skeptical that this will work. You are now bounding the life of the TransactionLogStore to the life of the TransactionLogHandle. When you call db.purgeLogs({ destroy: true }), it won't be able to fully close a TransactionLogStore and almost certainly the files will fail to be cleaned up on Windows.

The current design allows the TransactionLogStore to be destroyed while TransactionLogHandles reference it. The TransactionLogHandle life is bound to V8's garbage collection and TransactionLogStore is not.

With that said, there were some recent race conditions that where fixed by adding isClosing checks. I suppose it's possible the initial reason for the weak_ptr no longer exists.

@cb1kenobi
Copy link
Copy Markdown
Member

cb1kenobi commented May 8, 2026

Ignore the failed tests. They are failing due to pnpm 11 dropping Node 20 and Bun/Deno not supporting sqlite. I fixed it in #548 and so it'll fix your tests when it lands.

Copy link
Copy Markdown
Member

@kriszyp kriszyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious what is meant by "read path"? The most common "read path" in Harper for most application is long-lived iterators, used by replication. And the dominant iteration involves zero native/C++ calls, it is entirely in JavaScript. And this is intentional because it basically guarantees that the dominant read path not only does not use locks, but it can not use locks. And that's why we get many millions of iterations per second. Iteration is faster than is even possible with native calls, much less mutexes, I believe.
So what read path are we are talking about? Is this more for random access queries on the transaction log (this is more common with MQTT applications)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Explore using atomics to reduce mutex locks

3 participants