Fix manul failover with CLIENT PAUSE/UNPAUSE#389
Conversation
|
In actual testing with the Controller and Kvrocks deployed in the same IDC, the system achieved a single-node write QPS of 10k/s with 20MB/s throughput. The write-stop duration (stall time) remained consistently under 10ms. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## unstable #389 +/- ##
============================================
+ Coverage 43.38% 49.67% +6.28%
============================================
Files 37 45 +8
Lines 2971 4103 +1132
============================================
+ Hits 1289 2038 +749
- Misses 1544 1839 +295
- Partials 138 226 +88
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@jihuayu All issues are fixed. Ready for another round of review. Thanks! |
|
@jihuayu @git-hulk By the way,I initially planned to use slave lags for sync checks, but found it adds a 1s delay because the master only updates these stats every second. To get real-time offsets, I had to query both master and slaves, which unfortunately complicates the logic. Why does Kvrocks cap the slave lag update frequency at 1s rather than providing real-time updates? Is it a performance trade-off to avoid overhead? |
Motivation
Implement Controller modifications to ensure master-replica data consistency during failover, aligning with server-side changes in apache/kvrocks#3377
Solution
Step 1 (Pause): Send CLIENT PAUSE WRITE from the controller to the current master.
Step 2 (Wait): Monitor the master-replica sequence gap until it hits zero, ensuring no data loss.
Step 3 (Metadata): Update the global topology metadata for the switchover.
Step 4 (Switch & Unpause): Promote the target and demote the old master; then explicitly call CLIENT UNPAUSE on the old master to restore its status.
Step 5 (Replicate): Reconfigure all other followers to sync from the new master.
Configuration Options
To prevent excessive blocking durations during periods of high write traffic, a maximum pause timeout has been introduced; the failover process will fail if the synchronization times out. The following parameters are added to the Controller failover configuration:
"force_on_timeout": false,
"sync_timeout_ms": 100,
"pause_timeout_ms": 500
Related Issues
Fixes #384