Skip to content

MDEV-36832 UPDATE...IN(SELECT) acquires spurious gap locks on FK index#4682

Open
arcivanov wants to merge 1 commit intoMariaDB:10.11from
arcivanov:MDEV-36832
Open

MDEV-36832 UPDATE...IN(SELECT) acquires spurious gap locks on FK index#4682
arcivanov wants to merge 1 commit intoMariaDB:10.11from
arcivanov:MDEV-36832

Conversation

@arcivanov
Copy link
Copy Markdown
Contributor

UPDATE ticket SET is_valid=0 WHERE ticket_id IN (SELECT ticket_id FROM ticket WHERE booking_id=1) acquires S next-key and gap locks on the FK secondary index fk_ticket_2_booking, causing deadlocks between concurrent transactions operating on non-overlapping booking_id values. The expected behavior — matching UPDATE ... WHERE ticket_id IN (1,2) (explicit PK list) — is to only lock the PRIMARY key records being updated.

Root cause: the optimizer converts the IN(SELECT) to a semi-join, which creates two table references for the same table. The SQL layer then processes the statement as SQLCOM_UPDATE_MULTI. In ha_innobase::store_lock(), the read-only (subquery) table handle receives TL_READ, but the existing LOCK_NONE optimization for DML read tables was gated on isolation_level <= READ_COMMITTED and did not include SQLCOM_UPDATE_MULTI at all. In REPEATABLE READ (the default), the subquery's secondary index scan therefore acquired LOCK_S next-key locks, including gap locks spanning into adjacent key ranges.

Fix: add SQLCOM_UPDATE_MULTI to the LOCK_NONE (consistent read) condition in store_lock():

  1. For TL_READ (binlog off or ROW format): use consistent read at all isolation levels. This is safe because:
    • The MVCC snapshot is stable within a transaction in REPEATABLE READ
    • Rows being modified still acquire X locks via the write-side table handle (F_WRLCK / LOCK_X)
    • SERIALIZABLE is handled by ::external_lock() upgrading LOCK_NONE back to LOCK_S
  2. For TL_READ_NO_INSERT (statement-based binlog): extend the existing READ COMMITTED optimization to also cover SQLCOM_UPDATE_MULTI

The fix is deliberately limited to SQLCOM_UPDATE_MULTI (not SQLCOM_UPDATE) because single-table UPDATE with scalar subqueries on other tables traditionally uses S locks to block concurrent reads of the subquery table, and broadening the change there would alter observable behavior for existing applications.

Lock count before/after (semi-join UPDATE on booking_id=1):

  • Before: 5 lock structs, 5 row locks (IS + S×3 on FK index + IX + X×2 on PK)
  • After: 2 lock structs, 2 row locks (IX + X×2 on PK) — identical to explicit WHERE ticket_id IN (1,2)

@arcivanov
Copy link
Copy Markdown
Contributor Author

@dr-m

@grooverdan
Copy link
Copy Markdown
Member

@arcivanov , while waiting on a review from the the InnoDB team on this, do you have the time/interest in looking at MDEV-38817? I'm suspecting the test failure is actually the correct result, and should have been always occurring since 10.6 when MDEV-13115 was implemented.

@arcivanov
Copy link
Copy Markdown
Contributor Author

@arcivanov , while waiting on a review from the the InnoDB team on this, do you have the time/interest in looking at MDEV-38817? I'm suspecting the test failure is actually the correct result, and should have been always occurring since 10.6 when MDEV-13115 was implemented.

I was unable to reproduce it locally but the early send_eof seems to not be appropriate for non-read-only transactions due to cleanup occurring after the packet is sent resulting in a potential race. Please see #4684.

@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Feb 23, 2026
Copy link
Copy Markdown
Member

@gkodinov gkodinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is preliminary review. Thank you for your contribution. Please stand by for the final review.

Comment thread storage/innobase/handler/ha_innodb.cc
Comment on lines 16728 to 16741
if (sql_command == SQLCOM_CHECKSUM
|| sql_command == SQLCOM_CREATE_SEQUENCE
|| (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ)
|| (lock_type == TL_READ
&& sql_command == SQLCOM_UPDATE_MULTI)
|| (trx->isolation_level <= TRX_ISO_READ_COMMITTED
&& (lock_type == TL_READ
|| lock_type == TL_READ_NO_INSERT)
&& (sql_command == SQLCOM_INSERT_SELECT
|| sql_command == SQLCOM_REPLACE_SELECT
|| sql_command == SQLCOM_UPDATE
|| sql_command == SQLCOM_UPDATE_MULTI
|| sql_command == SQLCOM_CREATE_SEQUENCE
|| sql_command == SQLCOM_CREATE_TABLE))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this business of downgrading the requested lock type is a hack that should have been replaced years ago. A better solution would be to introduce a lock_type value TL_READ_NO_LOCK, which would be set at the appropriate places in the SQL query execution, and have InnoDB never downgrade any TL_READ or TL_READ_NO_INSERT.

My concern is that it could be unsafe to downgrade some locks in some fragments of a complex SQL statement. But there is only a single thd_sql_command(thd) value returned for the entire statement.

Copy link
Copy Markdown
Contributor Author

@arcivanov arcivanov Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what would be a satisfactory outcome for you as far as implementation/testing/validation etc that would allow this fix to move forward (the overall historic state of this code notwithstanding)? While I understand the frustration with code quality your comment does not seem actionable 🥲

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actionable part of my comment was that instead of revising the complex conditions for downgrading the requested lock, such as choosing

			m_prebuilt->select_lock_type = LOCK_NONE;
			m_prebuilt->stored_select_lock_type = LOCK_NONE;

over

			m_prebuilt->select_lock_type = LOCK_S;
			m_prebuilt->stored_select_lock_type = LOCK_S;

for specific values of thd_sql_command(thd), a new lock_type=TL_READ_NO_LOCK needs to be introduced that would be requested when appropriate. At the same time, most or all of the sql_command checks should be removed from ha_innobase::store_lock().

I understand that implementing this is far from trivial. But I do not think that extending the current hack is in any way sustainable or acceptable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for specific values of thd_sql_command(thd), a new lock_type=TL_READ_NO_LOCK needs to be introduced that would be requested when appropriate. At the same time, most or all of the sql_command checks should be removed from ha_innobase::store_lock().\n\nI understand that implementing this is far from trivial. But I do not think that extending the current hack is in any way sustainable or acceptable.

Fair enough, this is actionable. I have a feeling though that this type of a change wouldn't be allowed into the stable and wait time on fixing this in cloud prod would be in years (since it'll go into 13.1+). I understand this is a hack but it's a hack pattern that is already successfully employed in stable and I'm merely extending it.

That said I'm willing to take a stab at the proper implementation as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @vuvova thinks that your original change is OK, then we could perhaps go with it. However, I think that in that case we would have to fix the MDEV-37365 family of MDL bugs first, because currently our ability to run stress tests on FOREIGN KEY is being seriously hurt by the insufficient meta-data locking during DDL operations.

It would be really great if you could work on the proper implementation, targeting the main branch, in a separate pull request.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The change kind of makes sense to me. Old code handled and old comment explicitly mentioned "UPDATE ... = (SELECT ...)" case. It is difficult to explain to the user why sometimes there's an S lock and sometimes there is none (depending on how optimizer decided to execute the subquery)
  • the new comment shouldn't mention a subquery, the fix has nothing to do with subqueries, mutli-update updates some tables and only read others — these others don't need an S lock. It doesn't matter whether multi-update used a join or the join was implicitly created by the optimizer from a subquery. Also, don't mention MDEV in the comment
  • what about multi-delete? does it have the same issue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-delete in this scenario won't work as delete takes write locks on PK and FK no matter what.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment updated per your specification.

@arcivanov arcivanov force-pushed the MDEV-36832 branch 2 times, most recently from 2439ca0 to 7e57958 Compare March 1, 2026 01:54
@gkodinov gkodinov requested a review from dr-m March 24, 2026 12:49
@dr-m dr-m requested review from vuvova and removed request for dr-m March 26, 2026 08:15
|| sql_command == SQLCOM_CREATE_SEQUENCE
|| (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ)
|| (lock_type == TL_READ
&& sql_command == SQLCOM_UPDATE_MULTI)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need that? the change below should be enough

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because, as the comment above states, the one below is only for READ_COMMITTED, and we need to cover REPEATABLE_READ (without binlog statement replication) as well:

|| (trx->isolation_level <= TRX_ISO_READ_COMMITTED
...
&& (
...
|| sql_command == SQLCOM_UPDATE_MULTI
)

Comment on lines 16728 to 16741
if (sql_command == SQLCOM_CHECKSUM
|| sql_command == SQLCOM_CREATE_SEQUENCE
|| (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ)
|| (lock_type == TL_READ
&& sql_command == SQLCOM_UPDATE_MULTI)
|| (trx->isolation_level <= TRX_ISO_READ_COMMITTED
&& (lock_type == TL_READ
|| lock_type == TL_READ_NO_INSERT)
&& (sql_command == SQLCOM_INSERT_SELECT
|| sql_command == SQLCOM_REPLACE_SELECT
|| sql_command == SQLCOM_UPDATE
|| sql_command == SQLCOM_UPDATE_MULTI
|| sql_command == SQLCOM_CREATE_SEQUENCE
|| sql_command == SQLCOM_CREATE_TABLE))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The change kind of makes sense to me. Old code handled and old comment explicitly mentioned "UPDATE ... = (SELECT ...)" case. It is difficult to explain to the user why sometimes there's an S lock and sometimes there is none (depending on how optimizer decided to execute the subquery)
  • the new comment shouldn't mention a subquery, the fix has nothing to do with subqueries, mutli-update updates some tables and only read others — these others don't need an S lock. It doesn't matter whether multi-update used a join or the join was implicitly created by the optimizer from a subquery. Also, don't mention MDEV in the comment
  • what about multi-delete? does it have the same issue?

`UPDATE ticket SET is_valid=0 WHERE ticket_id IN
(SELECT ticket_id FROM ticket WHERE booking_id=1)` acquires S next-key
and gap locks on the FK secondary index `fk_ticket_2_booking`, causing
deadlocks between concurrent transactions operating on non-overlapping
`booking_id` values. The expected behavior — matching
`UPDATE ... WHERE ticket_id IN (1,2)` (explicit PK list) — is to only
lock the PRIMARY key records being updated.

**Root cause**: the optimizer converts the `IN(SELECT)` to a semi-join,
which creates two table references for the same table. The SQL layer
then processes the statement as `SQLCOM_UPDATE_MULTI`. In
`ha_innobase::store_lock()`, the read-only (subquery) table handle
receives `TL_READ`, but the existing `LOCK_NONE` optimization for DML
read tables was gated on `isolation_level <= READ_COMMITTED` and did
not include `SQLCOM_UPDATE_MULTI` at all. In `REPEATABLE READ` (the
default), the subquery's secondary index scan therefore acquired
`LOCK_S` next-key locks, including gap locks spanning into adjacent key
ranges.

**Fix**: add `SQLCOM_UPDATE_MULTI` to the `LOCK_NONE` (consistent read)
condition in `store_lock()`:

1. For `TL_READ` (binlog off or ROW format): use consistent read at
   **all** isolation levels. This is safe because:
   - The MVCC snapshot is stable within a transaction in
     `REPEATABLE READ`
   - Rows being modified still acquire X locks via the write-side
     table handle (`F_WRLCK` / `LOCK_X`)
   - `SERIALIZABLE` is handled by `::external_lock()` upgrading
     `LOCK_NONE` back to `LOCK_S`
2. For `TL_READ_NO_INSERT` (statement-based binlog): extend the
   existing `READ COMMITTED` optimization to also cover
   `SQLCOM_UPDATE_MULTI`

The fix is deliberately limited to `SQLCOM_UPDATE_MULTI` (not
`SQLCOM_UPDATE`) because single-table UPDATE with scalar subqueries on
other tables traditionally uses S locks to block concurrent reads of the
subquery table, and broadening the change there would alter observable
behavior for existing applications.

**Lock count before/after** (semi-join UPDATE on `booking_id=1`):
- Before: 5 lock structs, 5 row locks (IS + S×3 on FK index + IX +
  X×2 on PK)
- After: 2 lock structs, 2 row locks (IX + X×2 on PK) — identical to
  explicit `WHERE ticket_id IN (1,2)`

**Tests**:

`update_subquery_fk_gap_lock.test` — verifies that concurrent
`UPDATE...IN(SELECT)` operations on non-overlapping `booking_id` values
do not block each other. Two connections each update their own
`booking_id`'s tickets via semi-join subquery, then insert new tickets
for their `booking_id`. Both operations must succeed without lock wait
timeout, confirming that the spurious FK index gap locks are eliminated.

`update_subquery_fk_gap_lock_isolation.test` — ACID isolation regression
tests (requires ROW binlog format) verifying that replacing S locks with
consistent read (`LOCK_NONE`) on the semi-join read table does not cause
isolation violations at `REPEATABLE READ`:

1. **Phantom insert**: T1 updates `booking_id=1` tickets via semi-join,
   T2 concurrently inserts a new ticket with `booking_id=1`. T2 must
   not block (no S gap locks). T1's MVCC snapshot does not see the new
   row, so T1 updates only the 2 original tickets. Correct under
   snapshot isolation.
2. **Concurrent UPDATE on filter column**: T1 updates `booking_id=1`
   tickets, T2 changes `ticket_id=2`'s `booking_id` from 1 to 2. Both
   need X lock on `ticket_id=2`'s PK record — serialized via PK X
   lock alone, proving FK index S locks are unnecessary.
3. **Concurrent DELETE**: T1 updates `booking_id=1` tickets, T2 deletes
   `ticket_id=1`. Both need PK X lock — serialized, no isolation
   violation.
4. **SERIALIZABLE upgrade**: At `SERIALIZABLE`, `::external_lock()`
   upgrades `LOCK_NONE` to `LOCK_S`, so T2's INSERT into the subquery
   range must block with `ER_LOCK_WAIT_TIMEOUT`. Confirms the fix
   does not weaken `SERIALIZABLE` guarantees.
5. **Multi-table UPDATE on different tables**: T1 does
   `UPDATE t_write JOIN t_read ... SET t_write.val = t_read.val`, T2
   concurrently modifies `t_read`. T2 must not block (T1 holds no S
   locks on `t_read`). T1's snapshot values are used. Correct under
   snapshot isolation.
6. **MVCC snapshot consistency**: T1 establishes snapshot, T2 modifies
   `t_read`, T1 does multi-table UPDATE using snapshot then re-reads.
   Re-read must return the same values the UPDATE used, confirming
   MVCC snapshot stability at `REPEATABLE READ`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.

Development

Successfully merging this pull request may close these issues.

5 participants