MDEV-36832 UPDATE...IN(SELECT) acquires spurious gap locks on FK index#4682
MDEV-36832 UPDATE...IN(SELECT) acquires spurious gap locks on FK index#4682arcivanov wants to merge 1 commit intoMariaDB:10.11from
Conversation
f5dceb2 to
fc840a3
Compare
|
@arcivanov , while waiting on a review from the the InnoDB team on this, do you have the time/interest in looking at MDEV-38817? I'm suspecting the test failure is actually the correct result, and should have been always occurring since 10.6 when MDEV-13115 was implemented. |
I was unable to reproduce it locally but the early send_eof seems to not be appropriate for non-read-only transactions due to cleanup occurring after the packet is sent resulting in a potential race. Please see #4684. |
gkodinov
left a comment
There was a problem hiding this comment.
This is preliminary review. Thank you for your contribution. Please stand by for the final review.
| if (sql_command == SQLCOM_CHECKSUM | ||
| || sql_command == SQLCOM_CREATE_SEQUENCE | ||
| || (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ) | ||
| || (lock_type == TL_READ | ||
| && sql_command == SQLCOM_UPDATE_MULTI) | ||
| || (trx->isolation_level <= TRX_ISO_READ_COMMITTED | ||
| && (lock_type == TL_READ | ||
| || lock_type == TL_READ_NO_INSERT) | ||
| && (sql_command == SQLCOM_INSERT_SELECT | ||
| || sql_command == SQLCOM_REPLACE_SELECT | ||
| || sql_command == SQLCOM_UPDATE | ||
| || sql_command == SQLCOM_UPDATE_MULTI | ||
| || sql_command == SQLCOM_CREATE_SEQUENCE | ||
| || sql_command == SQLCOM_CREATE_TABLE))) { |
There was a problem hiding this comment.
As far as I can tell, this business of downgrading the requested lock type is a hack that should have been replaced years ago. A better solution would be to introduce a lock_type value TL_READ_NO_LOCK, which would be set at the appropriate places in the SQL query execution, and have InnoDB never downgrade any TL_READ or TL_READ_NO_INSERT.
My concern is that it could be unsafe to downgrade some locks in some fragments of a complex SQL statement. But there is only a single thd_sql_command(thd) value returned for the entire statement.
There was a problem hiding this comment.
So what would be a satisfactory outcome for you as far as implementation/testing/validation etc that would allow this fix to move forward (the overall historic state of this code notwithstanding)? While I understand the frustration with code quality your comment does not seem actionable 🥲
There was a problem hiding this comment.
The actionable part of my comment was that instead of revising the complex conditions for downgrading the requested lock, such as choosing
m_prebuilt->select_lock_type = LOCK_NONE;
m_prebuilt->stored_select_lock_type = LOCK_NONE;over
m_prebuilt->select_lock_type = LOCK_S;
m_prebuilt->stored_select_lock_type = LOCK_S;for specific values of thd_sql_command(thd), a new lock_type=TL_READ_NO_LOCK needs to be introduced that would be requested when appropriate. At the same time, most or all of the sql_command checks should be removed from ha_innobase::store_lock().
I understand that implementing this is far from trivial. But I do not think that extending the current hack is in any way sustainable or acceptable.
There was a problem hiding this comment.
for specific values of thd_sql_command(thd), a new lock_type=TL_READ_NO_LOCK needs to be introduced that would be requested when appropriate. At the same time, most or all of the sql_command checks should be removed from ha_innobase::store_lock().\n\nI understand that implementing this is far from trivial. But I do not think that extending the current hack is in any way sustainable or acceptable.
Fair enough, this is actionable. I have a feeling though that this type of a change wouldn't be allowed into the stable and wait time on fixing this in cloud prod would be in years (since it'll go into 13.1+). I understand this is a hack but it's a hack pattern that is already successfully employed in stable and I'm merely extending it.
That said I'm willing to take a stab at the proper implementation as well.
There was a problem hiding this comment.
If @vuvova thinks that your original change is OK, then we could perhaps go with it. However, I think that in that case we would have to fix the MDEV-37365 family of MDL bugs first, because currently our ability to run stress tests on FOREIGN KEY is being seriously hurt by the insufficient meta-data locking during DDL operations.
It would be really great if you could work on the proper implementation, targeting the main branch, in a separate pull request.
There was a problem hiding this comment.
- The change kind of makes sense to me. Old code handled and old comment explicitly mentioned "UPDATE ... = (SELECT ...)" case. It is difficult to explain to the user why sometimes there's an S lock and sometimes there is none (depending on how optimizer decided to execute the subquery)
- the new comment shouldn't mention a subquery, the fix has nothing to do with subqueries, mutli-update updates some tables and only read others — these others don't need an S lock. It doesn't matter whether multi-update used a join or the join was implicitly created by the optimizer from a subquery. Also, don't mention MDEV in the comment
- what about multi-delete? does it have the same issue?
There was a problem hiding this comment.
Multi-delete in this scenario won't work as delete takes write locks on PK and FK no matter what.
There was a problem hiding this comment.
Comment updated per your specification.
2439ca0 to
7e57958
Compare
| || sql_command == SQLCOM_CREATE_SEQUENCE | ||
| || (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ) | ||
| || (lock_type == TL_READ | ||
| && sql_command == SQLCOM_UPDATE_MULTI) |
There was a problem hiding this comment.
why do you need that? the change below should be enough
There was a problem hiding this comment.
Because, as the comment above states, the one below is only for READ_COMMITTED, and we need to cover REPEATABLE_READ (without binlog statement replication) as well:
|| (trx->isolation_level <= TRX_ISO_READ_COMMITTED
...
&& (
...
|| sql_command == SQLCOM_UPDATE_MULTI
)
| if (sql_command == SQLCOM_CHECKSUM | ||
| || sql_command == SQLCOM_CREATE_SEQUENCE | ||
| || (sql_command == SQLCOM_ANALYZE && lock_type == TL_READ) | ||
| || (lock_type == TL_READ | ||
| && sql_command == SQLCOM_UPDATE_MULTI) | ||
| || (trx->isolation_level <= TRX_ISO_READ_COMMITTED | ||
| && (lock_type == TL_READ | ||
| || lock_type == TL_READ_NO_INSERT) | ||
| && (sql_command == SQLCOM_INSERT_SELECT | ||
| || sql_command == SQLCOM_REPLACE_SELECT | ||
| || sql_command == SQLCOM_UPDATE | ||
| || sql_command == SQLCOM_UPDATE_MULTI | ||
| || sql_command == SQLCOM_CREATE_SEQUENCE | ||
| || sql_command == SQLCOM_CREATE_TABLE))) { |
There was a problem hiding this comment.
- The change kind of makes sense to me. Old code handled and old comment explicitly mentioned "UPDATE ... = (SELECT ...)" case. It is difficult to explain to the user why sometimes there's an S lock and sometimes there is none (depending on how optimizer decided to execute the subquery)
- the new comment shouldn't mention a subquery, the fix has nothing to do with subqueries, mutli-update updates some tables and only read others — these others don't need an S lock. It doesn't matter whether multi-update used a join or the join was implicitly created by the optimizer from a subquery. Also, don't mention MDEV in the comment
- what about multi-delete? does it have the same issue?
`UPDATE ticket SET is_valid=0 WHERE ticket_id IN
(SELECT ticket_id FROM ticket WHERE booking_id=1)` acquires S next-key
and gap locks on the FK secondary index `fk_ticket_2_booking`, causing
deadlocks between concurrent transactions operating on non-overlapping
`booking_id` values. The expected behavior — matching
`UPDATE ... WHERE ticket_id IN (1,2)` (explicit PK list) — is to only
lock the PRIMARY key records being updated.
**Root cause**: the optimizer converts the `IN(SELECT)` to a semi-join,
which creates two table references for the same table. The SQL layer
then processes the statement as `SQLCOM_UPDATE_MULTI`. In
`ha_innobase::store_lock()`, the read-only (subquery) table handle
receives `TL_READ`, but the existing `LOCK_NONE` optimization for DML
read tables was gated on `isolation_level <= READ_COMMITTED` and did
not include `SQLCOM_UPDATE_MULTI` at all. In `REPEATABLE READ` (the
default), the subquery's secondary index scan therefore acquired
`LOCK_S` next-key locks, including gap locks spanning into adjacent key
ranges.
**Fix**: add `SQLCOM_UPDATE_MULTI` to the `LOCK_NONE` (consistent read)
condition in `store_lock()`:
1. For `TL_READ` (binlog off or ROW format): use consistent read at
**all** isolation levels. This is safe because:
- The MVCC snapshot is stable within a transaction in
`REPEATABLE READ`
- Rows being modified still acquire X locks via the write-side
table handle (`F_WRLCK` / `LOCK_X`)
- `SERIALIZABLE` is handled by `::external_lock()` upgrading
`LOCK_NONE` back to `LOCK_S`
2. For `TL_READ_NO_INSERT` (statement-based binlog): extend the
existing `READ COMMITTED` optimization to also cover
`SQLCOM_UPDATE_MULTI`
The fix is deliberately limited to `SQLCOM_UPDATE_MULTI` (not
`SQLCOM_UPDATE`) because single-table UPDATE with scalar subqueries on
other tables traditionally uses S locks to block concurrent reads of the
subquery table, and broadening the change there would alter observable
behavior for existing applications.
**Lock count before/after** (semi-join UPDATE on `booking_id=1`):
- Before: 5 lock structs, 5 row locks (IS + S×3 on FK index + IX +
X×2 on PK)
- After: 2 lock structs, 2 row locks (IX + X×2 on PK) — identical to
explicit `WHERE ticket_id IN (1,2)`
**Tests**:
`update_subquery_fk_gap_lock.test` — verifies that concurrent
`UPDATE...IN(SELECT)` operations on non-overlapping `booking_id` values
do not block each other. Two connections each update their own
`booking_id`'s tickets via semi-join subquery, then insert new tickets
for their `booking_id`. Both operations must succeed without lock wait
timeout, confirming that the spurious FK index gap locks are eliminated.
`update_subquery_fk_gap_lock_isolation.test` — ACID isolation regression
tests (requires ROW binlog format) verifying that replacing S locks with
consistent read (`LOCK_NONE`) on the semi-join read table does not cause
isolation violations at `REPEATABLE READ`:
1. **Phantom insert**: T1 updates `booking_id=1` tickets via semi-join,
T2 concurrently inserts a new ticket with `booking_id=1`. T2 must
not block (no S gap locks). T1's MVCC snapshot does not see the new
row, so T1 updates only the 2 original tickets. Correct under
snapshot isolation.
2. **Concurrent UPDATE on filter column**: T1 updates `booking_id=1`
tickets, T2 changes `ticket_id=2`'s `booking_id` from 1 to 2. Both
need X lock on `ticket_id=2`'s PK record — serialized via PK X
lock alone, proving FK index S locks are unnecessary.
3. **Concurrent DELETE**: T1 updates `booking_id=1` tickets, T2 deletes
`ticket_id=1`. Both need PK X lock — serialized, no isolation
violation.
4. **SERIALIZABLE upgrade**: At `SERIALIZABLE`, `::external_lock()`
upgrades `LOCK_NONE` to `LOCK_S`, so T2's INSERT into the subquery
range must block with `ER_LOCK_WAIT_TIMEOUT`. Confirms the fix
does not weaken `SERIALIZABLE` guarantees.
5. **Multi-table UPDATE on different tables**: T1 does
`UPDATE t_write JOIN t_read ... SET t_write.val = t_read.val`, T2
concurrently modifies `t_read`. T2 must not block (T1 holds no S
locks on `t_read`). T1's snapshot values are used. Correct under
snapshot isolation.
6. **MVCC snapshot consistency**: T1 establishes snapshot, T2 modifies
`t_read`, T1 does multi-table UPDATE using snapshot then re-reads.
Re-read must return the same values the UPDATE used, confirming
MVCC snapshot stability at `REPEATABLE READ`.
UPDATE ticket SET is_valid=0 WHERE ticket_id IN (SELECT ticket_id FROM ticket WHERE booking_id=1)acquires S next-key and gap locks on the FK secondary indexfk_ticket_2_booking, causing deadlocks between concurrent transactions operating on non-overlappingbooking_idvalues. The expected behavior — matchingUPDATE ... WHERE ticket_id IN (1,2)(explicit PK list) — is to only lock the PRIMARY key records being updated.Root cause: the optimizer converts the
IN(SELECT)to a semi-join, which creates two table references for the same table. The SQL layer then processes the statement asSQLCOM_UPDATE_MULTI. Inha_innobase::store_lock(), the read-only (subquery) table handle receivesTL_READ, but the existingLOCK_NONEoptimization for DML read tables was gated onisolation_level <= READ_COMMITTEDand did not includeSQLCOM_UPDATE_MULTIat all. InREPEATABLE READ(the default), the subquery's secondary index scan therefore acquiredLOCK_Snext-key locks, including gap locks spanning into adjacent key ranges.Fix: add
SQLCOM_UPDATE_MULTIto theLOCK_NONE(consistent read) condition instore_lock():TL_READ(binlog off or ROW format): use consistent read at all isolation levels. This is safe because:REPEATABLE READF_WRLCK/LOCK_X)SERIALIZABLEis handled by::external_lock()upgradingLOCK_NONEback toLOCK_STL_READ_NO_INSERT(statement-based binlog): extend the existingREAD COMMITTEDoptimization to also coverSQLCOM_UPDATE_MULTIThe fix is deliberately limited to
SQLCOM_UPDATE_MULTI(notSQLCOM_UPDATE) because single-table UPDATE with scalar subqueries on other tables traditionally uses S locks to block concurrent reads of the subquery table, and broadening the change there would alter observable behavior for existing applications.Lock count before/after (semi-join UPDATE on
booking_id=1):WHERE ticket_id IN (1,2)