MDEV-36832 UPDATE...IN(SELECT) acquires spurious gap locks on FK index by arcivanov · Pull Request #4682 · MariaDB/server

arcivanov · 2026-02-23T00:02:57Z

UPDATE ticket SET is_valid=0 WHERE ticket_id IN (SELECT ticket_id FROM ticket WHERE booking_id=1) acquires S next-key and gap locks on the FK secondary index fk_ticket_2_booking, causing deadlocks between concurrent transactions operating on non-overlapping booking_id values. The expected behavior — matching UPDATE ... WHERE ticket_id IN (1,2) (explicit PK list) — is to only lock the PRIMARY key records being updated.

Root cause: the optimizer converts the IN(SELECT) to a semi-join, which creates two table references for the same table. The SQL layer then processes the statement as SQLCOM_UPDATE_MULTI. In ha_innobase::store_lock(), the read-only (subquery) table handle receives TL_READ, but the existing LOCK_NONE optimization for DML read tables was gated on isolation_level <= READ_COMMITTED and did not include SQLCOM_UPDATE_MULTI at all. In REPEATABLE READ (the default), the subquery's secondary index scan therefore acquired LOCK_S next-key locks, including gap locks spanning into adjacent key ranges.

Fix: add SQLCOM_UPDATE_MULTI to the LOCK_NONE (consistent read) condition in store_lock():

For TL_READ (binlog off or ROW format): use consistent read at all isolation levels. This is safe because:
- The MVCC snapshot is stable within a transaction in REPEATABLE READ
- Rows being modified still acquire X locks via the write-side table handle (F_WRLCK / LOCK_X)
- SERIALIZABLE is handled by ::external_lock() upgrading LOCK_NONE back to LOCK_S
For TL_READ_NO_INSERT (statement-based binlog): extend the existing READ COMMITTED optimization to also cover SQLCOM_UPDATE_MULTI

The fix is deliberately limited to SQLCOM_UPDATE_MULTI (not SQLCOM_UPDATE) because single-table UPDATE with scalar subqueries on other tables traditionally uses S locks to block concurrent reads of the subquery table, and broadening the change there would alter observable behavior for existing applications.

Lock count before/after (semi-join UPDATE on booking_id=1):

Before: 5 lock structs, 5 row locks (IS + S×3 on FK index + IX + X×2 on PK)
After: 2 lock structs, 2 row locks (IX + X×2 on PK) — identical to explicit WHERE ticket_id IN (1,2)

arcivanov · 2026-02-23T01:41:12Z

@dr-m

grooverdan · 2026-02-23T02:46:39Z

@arcivanov , while waiting on a review from the the InnoDB team on this, do you have the time/interest in looking at MDEV-38817? I'm suspecting the test failure is actually the correct result, and should have been always occurring since 10.6 when MDEV-13115 was implemented.

arcivanov · 2026-02-23T04:23:54Z

@arcivanov , while waiting on a review from the the InnoDB team on this, do you have the time/interest in looking at MDEV-38817? I'm suspecting the test failure is actually the correct result, and should have been always occurring since 10.6 when MDEV-13115 was implemented.

I was unable to reproduce it locally but the early send_eof seems to not be appropriate for non-read-only transactions due to cleanup occurring after the packet is sent resulting in a potential race. Please see #4684.

gkodinov

This is preliminary review. Thank you for your contribution. Please stand by for the final review.

dr-m · 2026-02-23T12:07:38Z

 		if (sql_command == SQLCOM_CHECKSUM
 		    || sql_command == SQLCOM_CREATE_SEQUENCE
 		    || (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ)
+		    || (lock_type == TL_READ
+			&& sql_command == SQLCOM_UPDATE_MULTI)
 		    || (trx->isolation_level <= TRX_ISO_READ_COMMITTED
 			&& (lock_type == TL_READ
 			    || lock_type == TL_READ_NO_INSERT)
 			&& (sql_command == SQLCOM_INSERT_SELECT
 			    || sql_command == SQLCOM_REPLACE_SELECT
 			    || sql_command == SQLCOM_UPDATE
+			    || sql_command == SQLCOM_UPDATE_MULTI
 			    || sql_command == SQLCOM_CREATE_SEQUENCE
 			    || sql_command == SQLCOM_CREATE_TABLE))) {


As far as I can tell, this business of downgrading the requested lock type is a hack that should have been replaced years ago. A better solution would be to introduce a lock_type value TL_READ_NO_LOCK, which would be set at the appropriate places in the SQL query execution, and have InnoDB never downgrade any TL_READ or TL_READ_NO_INSERT.

My concern is that it could be unsafe to downgrade some locks in some fragments of a complex SQL statement. But there is only a single thd_sql_command(thd) value returned for the entire statement.

So what would be a satisfactory outcome for you as far as implementation/testing/validation etc that would allow this fix to move forward (the overall historic state of this code notwithstanding)? While I understand the frustration with code quality your comment does not seem actionable 🥲

The actionable part of my comment was that instead of revising the complex conditions for downgrading the requested lock, such as choosing

m_prebuilt->select_lock_type = LOCK_NONE; m_prebuilt->stored_select_lock_type = LOCK_NONE;

over

m_prebuilt->select_lock_type = LOCK_S; m_prebuilt->stored_select_lock_type = LOCK_S;

for specific values of thd_sql_command(thd), a new lock_type=TL_READ_NO_LOCK needs to be introduced that would be requested when appropriate. At the same time, most or all of the sql_command checks should be removed from ha_innobase::store_lock().

I understand that implementing this is far from trivial. But I do not think that extending the current hack is in any way sustainable or acceptable.

for specific values of thd_sql_command(thd), a new lock_type=TL_READ_NO_LOCK needs to be introduced that would be requested when appropriate. At the same time, most or all of the sql_command checks should be removed from ha_innobase::store_lock().\n\nI understand that implementing this is far from trivial. But I do not think that extending the current hack is in any way sustainable or acceptable.

Fair enough, this is actionable. I have a feeling though that this type of a change wouldn't be allowed into the stable and wait time on fixing this in cloud prod would be in years (since it'll go into 13.1+). I understand this is a hack but it's a hack pattern that is already successfully employed in stable and I'm merely extending it.

That said I'm willing to take a stab at the proper implementation as well.

If @vuvova thinks that your original change is OK, then we could perhaps go with it. However, I think that in that case we would have to fix the MDEV-37365 family of MDL bugs first, because currently our ability to run stress tests on FOREIGN KEY is being seriously hurt by the insufficient meta-data locking during DDL operations.

It would be really great if you could work on the proper implementation, targeting the main branch, in a separate pull request.

The change kind of makes sense to me. Old code handled and old comment explicitly mentioned "UPDATE ... = (SELECT ...)" case. It is difficult to explain to the user why sometimes there's an S lock and sometimes there is none (depending on how optimizer decided to execute the subquery)

the new comment shouldn't mention a subquery, the fix has nothing to do with subqueries, mutli-update updates some tables and only read others — these others don't need an S lock. It doesn't matter whether multi-update used a join or the join was implicitly created by the optimizer from a subquery. Also, don't mention MDEV in the comment

what about multi-delete? does it have the same issue?

Multi-delete in this scenario won't work as delete takes write locks on PK and FK no matter what.

Comment updated per your specification.

vuvova · 2026-04-19T08:16:39Z

 		    || sql_command == SQLCOM_CREATE_SEQUENCE
 		    || (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ)
+		    || (lock_type == TL_READ
+			&& sql_command == SQLCOM_UPDATE_MULTI)


why do you need that? the change below should be enough

Because, as the comment above states, the one below is only for READ_COMMITTED, and we need to cover REPEATABLE_READ (without binlog statement replication) as well:

|| (trx->isolation_level <= TRX_ISO_READ_COMMITTED ... && ( ... || sql_command == SQLCOM_UPDATE_MULTI )

vuvova · 2026-04-19T08:16:52Z

 		if (sql_command == SQLCOM_CHECKSUM
 		    || sql_command == SQLCOM_CREATE_SEQUENCE
 		    || (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ)
+		    || (lock_type == TL_READ
+			&& sql_command == SQLCOM_UPDATE_MULTI)
 		    || (trx->isolation_level <= TRX_ISO_READ_COMMITTED
 			&& (lock_type == TL_READ
 			    || lock_type == TL_READ_NO_INSERT)
 			&& (sql_command == SQLCOM_INSERT_SELECT
 			    || sql_command == SQLCOM_REPLACE_SELECT
 			    || sql_command == SQLCOM_UPDATE
+			    || sql_command == SQLCOM_UPDATE_MULTI
 			    || sql_command == SQLCOM_CREATE_SEQUENCE
 			    || sql_command == SQLCOM_CREATE_TABLE))) {


The change kind of makes sense to me. Old code handled and old comment explicitly mentioned "UPDATE ... = (SELECT ...)" case. It is difficult to explain to the user why sometimes there's an S lock and sometimes there is none (depending on how optimizer decided to execute the subquery)

the new comment shouldn't mention a subquery, the fix has nothing to do with subqueries, mutli-update updates some tables and only read others — these others don't need an S lock. It doesn't matter whether multi-update used a join or the join was implicitly created by the optimizer from a subquery. Also, don't mention MDEV in the comment

what about multi-delete? does it have the same issue?

`UPDATE ticket SET is_valid=0 WHERE ticket_id IN (SELECT ticket_id FROM ticket WHERE booking_id=1)` acquires S next-key and gap locks on the FK secondary index `fk_ticket_2_booking`, causing deadlocks between concurrent transactions operating on non-overlapping `booking_id` values. The expected behavior — matching `UPDATE ... WHERE ticket_id IN (1,2)` (explicit PK list) — is to only lock the PRIMARY key records being updated. **Root cause**: the optimizer converts the `IN(SELECT)` to a semi-join, which creates two table references for the same table. The SQL layer then processes the statement as `SQLCOM_UPDATE_MULTI`. In `ha_innobase::store_lock()`, the read-only (subquery) table handle receives `TL_READ`, but the existing `LOCK_NONE` optimization for DML read tables was gated on `isolation_level <= READ_COMMITTED` and did not include `SQLCOM_UPDATE_MULTI` at all. In `REPEATABLE READ` (the default), the subquery's secondary index scan therefore acquired `LOCK_S` next-key locks, including gap locks spanning into adjacent key ranges. **Fix**: add `SQLCOM_UPDATE_MULTI` to the `LOCK_NONE` (consistent read) condition in `store_lock()`: 1. For `TL_READ` (binlog off or ROW format): use consistent read at **all** isolation levels. This is safe because: - The MVCC snapshot is stable within a transaction in `REPEATABLE READ` - Rows being modified still acquire X locks via the write-side table handle (`F_WRLCK` / `LOCK_X`) - `SERIALIZABLE` is handled by `::external_lock()` upgrading `LOCK_NONE` back to `LOCK_S` 2. For `TL_READ_NO_INSERT` (statement-based binlog): extend the existing `READ COMMITTED` optimization to also cover `SQLCOM_UPDATE_MULTI` The fix is deliberately limited to `SQLCOM_UPDATE_MULTI` (not `SQLCOM_UPDATE`) because single-table UPDATE with scalar subqueries on other tables traditionally uses S locks to block concurrent reads of the subquery table, and broadening the change there would alter observable behavior for existing applications. **Lock count before/after** (semi-join UPDATE on `booking_id=1`): - Before: 5 lock structs, 5 row locks (IS + S×3 on FK index + IX + X×2 on PK) - After: 2 lock structs, 2 row locks (IX + X×2 on PK) — identical to explicit `WHERE ticket_id IN (1,2)` **Tests**: `update_subquery_fk_gap_lock.test` — verifies that concurrent `UPDATE...IN(SELECT)` operations on non-overlapping `booking_id` values do not block each other. Two connections each update their own `booking_id`'s tickets via semi-join subquery, then insert new tickets for their `booking_id`. Both operations must succeed without lock wait timeout, confirming that the spurious FK index gap locks are eliminated. `update_subquery_fk_gap_lock_isolation.test` — ACID isolation regression tests (requires ROW binlog format) verifying that replacing S locks with consistent read (`LOCK_NONE`) on the semi-join read table does not cause isolation violations at `REPEATABLE READ`: 1. **Phantom insert**: T1 updates `booking_id=1` tickets via semi-join, T2 concurrently inserts a new ticket with `booking_id=1`. T2 must not block (no S gap locks). T1's MVCC snapshot does not see the new row, so T1 updates only the 2 original tickets. Correct under snapshot isolation. 2. **Concurrent UPDATE on filter column**: T1 updates `booking_id=1` tickets, T2 changes `ticket_id=2`'s `booking_id` from 1 to 2. Both need X lock on `ticket_id=2`'s PK record — serialized via PK X lock alone, proving FK index S locks are unnecessary. 3. **Concurrent DELETE**: T1 updates `booking_id=1` tickets, T2 deletes `ticket_id=1`. Both need PK X lock — serialized, no isolation violation. 4. **SERIALIZABLE upgrade**: At `SERIALIZABLE`, `::external_lock()` upgrades `LOCK_NONE` to `LOCK_S`, so T2's INSERT into the subquery range must block with `ER_LOCK_WAIT_TIMEOUT`. Confirms the fix does not weaken `SERIALIZABLE` guarantees. 5. **Multi-table UPDATE on different tables**: T1 does `UPDATE t_write JOIN t_read ... SET t_write.val = t_read.val`, T2 concurrently modifies `t_read`. T2 must not block (T1 holds no S locks on `t_read`). T1's snapshot values are used. Correct under snapshot isolation. 6. **MVCC snapshot consistency**: T1 establishes snapshot, T2 modifies `t_read`, T1 does multi-table UPDATE using snapshot then re-reads. Re-read must return the same values the UPDATE used, confirming MVCC snapshot stability at `REPEATABLE READ`.

arcivanov force-pushed the MDEV-36832 branch from f5dceb2 to fc840a3 Compare February 23, 2026 00:25

gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Feb 23, 2026

gkodinov approved these changes Feb 23, 2026

View reviewed changes

dr-m reviewed Feb 23, 2026

View reviewed changes

arcivanov force-pushed the MDEV-36832 branch 2 times, most recently from 2439ca0 to 7e57958 Compare March 1, 2026 01:54

arcivanov force-pushed the MDEV-36832 branch from 7e57958 to 95a2c94 Compare March 3, 2026 05:52

gkodinov requested a review from dr-m March 24, 2026 12:49

dr-m requested review from vuvova and removed request for dr-m March 26, 2026 08:15

vuvova requested changes Apr 19, 2026

View reviewed changes

arcivanov force-pushed the MDEV-36832 branch from 95a2c94 to ea27a66 Compare April 24, 2026 02:25

Uh oh!

Conversation

arcivanov commented Feb 23, 2026

Uh oh!

arcivanov commented Feb 23, 2026

Uh oh!

grooverdan commented Feb 23, 2026

Uh oh!

arcivanov commented Feb 23, 2026

Uh oh!

gkodinov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arcivanov Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

arcivanov Feb 23, 2026 •

edited

Loading