Skip to content

MDEV-38486: Semi-sync must not await ACK for @@skip_replication events#4978

Open
bodyhedia44 wants to merge 1 commit intoMariaDB:10.11from
bodyhedia44:MDEV-38486-semisync-skip-replication
Open

MDEV-38486: Semi-sync must not await ACK for @@skip_replication events#4978
bodyhedia44 wants to merge 1 commit intoMariaDB:10.11from
bodyhedia44:MDEV-38486-semisync-skip-replication

Conversation

@bodyhedia44
Copy link
Copy Markdown
Contributor

Description

Root Cause

Semi-sync replication works by registering every committed transaction with the semi-sync plugin before the commit returns to the client. The plugin then blocks the thread until a replica ACKs receipt of that transaction's binlog position.

The @@skip_replication session variable (combined with replicate_events_marked_for_skip=FILTER_ON_MASTER on the slave) is a mechanism to prevent specific transactions from being replicated — the events are written to the binlog with the LOG_EVENT_SKIP_REPLICATION_F flag, and the dump thread uses that flag to silently skip sending them to the replica.

The bug is that the semi-sync layer has no awareness of this flag. It registers every committed transaction as needing an ACK, regardless of whether the replica will ever receive it. When a transaction is skipped, the replica never receives it, never sends an ACK, and the primary hangs until rpl_semi_sync_master_timeout expires — at which point semi-sync degrades to async mode.

The Fix

Two places in log.cc call repl_semisync_master.report_binlog_update(), which is what registers a transaction as needing an ACK:

  • Direct write path — MYSQL_BIN_LOG::write(): Used when a single event is written directly (e.g., DDL statements like CREATE DATABASE). Guard the call with a check on event_info->flags & LOG_EVENT_SKIP_REPLICATION_F.
  • Group commit leader path — MYSQL_BIN_LOG::trx_group_commit_leader(): Used for the common transactional case where multiple threads' commits are batched. For each entry in the commit queue, guard the call with a check on current->thd->variables.option_bits & OPTION_SKIP_REPLICATION.

In both cases, if the transaction is marked for skip, we simply do not call report_binlog_update(), so the semi-sync plugin never registers it as needing an ACK.

How to Test

The test performs the following sequence:
./mtr --suite=rpl rpl_semi_sync_skip_replication

  1. Starts master and slave with semi-sync enabled and rpl_semi_sync_master_timeout=2000.
  2. Sets replicate_events_marked_for_skip=FILTER_ON_MASTER on the slave (so the dump thread will filter skip_replication events).
  3. Commits a normal transaction — confirmed to be ACKed.
  4. Commits a CREATE DATABASE with @@skip_replication=1.
  5. Asserts Rpl_semi_sync_master_status is still ON — without the fix, it becomes OFF because the 2s timeout fires.

@bodyhedia44 bodyhedia44 changed the base branch from main to 10.11 April 22, 2026 22:04
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Apr 23, 2026
Copy link
Copy Markdown
Member

@gkodinov gkodinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! This is a preliminary review.

Your test fails on buildbot as follows, please fix that. IMHO this should not be happening and is indicative of of some sort of a protocol state mismatch:

rpl.rpl_semi_sync_skip_replication 'row' w25 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2026-04-22 22:13:26
line
2026-04-22 22:13:25 8 [ERROR] mariadbd: Got an error writing communication packets
2026-04-22 22:13:25 12 [ERROR] mariadbd: Got an error writing communication packets
2026-04-22 22:13:26 14 [ERROR] mariadbd: Got an error writing communication packets
^ Found warnings in /home/buildbot/amd64-msan-clang-20/build/mysql-test/var/25/log/mysqld.1.err
ok
 - saving '/home/buildbot/amd64-msan-clang-20/build/mysql-test/var/25/log/rpl.rpl_semi_sync_skip_replication-row/' to '/home/buildbot/amd64-msan-clang-20/build/mysql-test/var/log/rpl.rpl_semi_sync_skip_replication-row/'

@gkodinov gkodinov self-assigned this Apr 23, 2026
When a transaction is committed with @@skip_replication=1, its binlog
events are written with LOG_EVENT_SKIP_REPLICATION_F. The dump thread
respects this flag (when the slave has replicate_events_marked_for_skip=
FILTER_ON_MASTER) and never sends the events to the replica. Because the
replica never receives them, it never sends an ACK, causing the primary
to hang in semi-sync wait until rpl_semi_sync_master_timeout expires and
replication degrades from semi-sync to async.

Fix: In both the direct binlog write path and the group commit leader
path, skip calling repl_semisync_master.report_binlog_update() when the
event carries LOG_EVENT_SKIP_REPLICATION_F (direct path) or when the
committing THD has OPTION_SKIP_REPLICATION set (group commit path).

Test: rpl.rpl_semi_sync_skip_replication verifies that
Rpl_semi_sync_master_status remains ON after committing a transaction
with skip_replication=1 while FILTER_ON_MASTER is active.
@bodyhedia44 bodyhedia44 force-pushed the MDEV-38486-semisync-skip-replication branch from f9bd14f to cbdae24 Compare April 23, 2026 09:43
@bodyhedia44 bodyhedia44 requested a review from gkodinov April 23, 2026 10:32
Copy link
Copy Markdown
Member

@gkodinov gkodinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except I do not completely like the decrease of the log_warning level.

Stay tuned for the final review please.

@@ -0,0 +1,7 @@
!include ../my.cnf

[mysqld.1]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turning these warnings off (by not setting log_warnings vs setting it to 9 as it used to be) is basically killing the messenger.

But, this is something for the final reviewer to consider I guess.

@gkodinov gkodinov assigned bnestere and unassigned gkodinov Apr 23, 2026
@gkodinov gkodinov requested a review from bnestere April 23, 2026 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.

Development

Successfully merging this pull request may close these issues.

3 participants