fix: use unique partition keys in QueryReturnTypesIT to avoid LWT contention #882

Merged
dkropachev merged 1 commit into scylladb:scylla-4.x from nikagra:fix/lwt-write-timeout-retry
May 8, 2026

@nikagra commented May 6, 2026

Problem

QueryReturnTypesIT is annotated @Category(ParallelizableTests.class) and all test methods were using the same hardcoded partition key (id=1). On Scylla with tablets, initial LWT queries to the same partition key can be routed to random nodes, causing Paxos contention across parallel test threads and resulting in WriteTimeoutException.

Fix

Assign each test method instance a unique partition key via a static AtomicInteger counter incremented in @Before, so no two concurrently running tests contend on the same partition. The "not found" probe previously using id=2 now uses -(testId + 1), which is always negative and therefore guaranteed to never be assigned by the counter.
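
The pattern described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from the PR: the class `UniqueKeySketch`, the field `ID_COUNTER`, and the `notFoundId()` helper are hypothetical names, and `setUp()` stands in for the method that would carry JUnit's `@Before` annotation. It also assumes the counter starts at 0, which the description does not state.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a JUnit 4 test class; JUnit creates a fresh instance per test method.
class UniqueKeySketch {
    // Shared by all instances; assumed to start at 0, so the first key handed out is 1.
    private static final AtomicInteger ID_COUNTER = new AtomicInteger();

    private int testId;

    // In a real JUnit 4 test this method would be annotated with @Before.
    void setUp() {
        testId = ID_COUNTER.incrementAndGet();
    }

    int testId() {
        return testId;
    }

    // Key for "not found" probes: negative whenever testId >= 0,
    // so it can never equal a key handed out by the counter.
    int notFoundId() {
        return -(testId + 1);
    }

    public static void main(String[] args) {
        UniqueKeySketch first = new UniqueKeySketch();
        UniqueKeySketch second = new UniqueKeySketch();
        first.setUp();
        second.setUp();
        System.out.println(first.testId() + " / " + first.notFoundId());
        System.out.println(second.testId() + " / " + second.notFoundId());
    }
}
```

Because the counter only moves upward from a non-negative start, the positive keys and the negative probe keys occupy disjoint ranges, which is what lets the "not found" assertions stay deterministic.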

@Lorak-mmk

Two questions:

  • Why does this error even happen? In other words, why does Scylla fail to achieve quorum in a single-node scenario?
  • Can't this be solved with a retry policy instead of a separate function? Also, should the default retry policy maybe handle this case?

@dkropachev

Two questions:

  • Why does this error even happen? In other words, why does Scylla fail to achieve quorum in a single-node scenario?
  • Can't this be solved with a retry policy instead of a separate function? Also, should the default retry policy maybe handle this case?

This is actually funny: it happened because the queries run in parallel against the same PK, so it is regular LWT contention.

@Lorak-mmk

Contention and timeout from just 2 queries? Wow. I did not expect LWT to be THAT slow

@nikagra force-pushed the fix/lwt-write-timeout-retry branch from 0704c1f to 6c725b1 May 7, 2026 12:55
@nikagra changed the title from "fix: retry LWT operations on WriteTimeoutException in QueryReturnTypesIT" to "fix: use unique partition keys in QueryReturnTypesIT to avoid LWT contention" May 7, 2026
@nikagra commented May 7, 2026

@Lorak-mmk, I've reworked the fix following @dkropachev's feedback.

Copilot AI left a comment

Pull request overview

This PR updates the QueryReturnTypesIT integration test to avoid LWT (Paxos) contention when tests are executed in parallel, by ensuring each test instance operates on a distinct partition key.

Changes:

  • Introduced a per-test unique partition key generated from a static AtomicInteger.
  • Updated all DAO calls/assertions to use the per-test testId instead of hardcoded IDs.
  • Updated “not found” probes to use a guaranteed-unassigned negative ID derived from testId.

@dkropachev

Contention and timeout from just 2 queries? Wow. I did not expect LWT to be THAT slow

Why only two? integration-tests/pom.xml:41 sets test.parallel.threads=16, so the whole suite runs at once, targeting the same PK.

@dkropachev

Contention and timeout from just 2 queries? Wow. I did not expect LWT to be THAT slow

Why only two? integration-tests/pom.xml:41 sets test.parallel.threads=16, so the whole suite runs at once, targeting the same PK.

Actually, that is not what is happening: tests within a test suite run one by one, but different test suites run in parallel.
Either way, splitting them across different PKs removes the extra LWT synchronization, so this PR should still work.
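
To illustrate why a shared counter keeps keys distinct even under parallel execution, here is a minimal sketch. It assumes 16 worker threads, matching the test.parallel.threads=16 value quoted above; the class `ParallelKeySketch` and the method `drawKeys` are hypothetical, and nothing in the thread confirms that the real test draws keys from more than one thread at a time.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: many threads draw partition keys from one shared counter.
class ParallelKeySketch {
    private static final AtomicInteger ID_COUNTER = new AtomicInteger();

    // Submits `draws` key requests to a pool of `threads` workers
    // and returns the set of distinct keys that were handed out.
    static Set<Integer> drawKeys(int threads, int draws) {
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < draws; i++) {
            // incrementAndGet is atomic, so no two tasks can observe the same value.
            pool.execute(() -> seen.add(ID_COUNTER.incrementAndGet()));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<Integer> keys = drawKeys(16, 1_000);
        System.out.println("distinct keys: " + keys.size());
    }
}
```

If every draw yields a distinct key, no two concurrent LWT statements issued with those keys target the same partition, so they do not compete in the same Paxos round.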

@dkropachev merged commit bd9b26a into scylladb:scylla-4.x May 8, 2026
28 checks passed
@nikagra deleted the fix/lwt-write-timeout-retry branch May 8, 2026 10:26