Skip to content

MDEV-38412 System tablespace fails to shrink due to legacy tables#4884

Merged
Thirunarayanan merged 1 commit into11.4from
MDEV-38412
Apr 28, 2026
Merged

MDEV-38412 System tablespace fails to shrink due to legacy tables#4884
Thirunarayanan merged 1 commit into11.4from
MDEV-38412

Conversation

@Thirunarayanan
Copy link
Copy Markdown
Member

@Thirunarayanan Thirunarayanan commented Mar 31, 2026

MDEV-38412 System tablespace fails to shrink due to legacy tables
Problem:

  • InnoDB system tablespace fails to autoshrink when it contains
    legacy internal tables. These are non-user tables and internal
    table exist from older version. Because the current shrink logic
    does recognize these entries as user table, they block the
    defragmentation process required to reduce the tablespace size.

Solution:

To enable successful shrinking, InnoDB has been updated to
identify and remove these legacy entries during the startup:

drop_all_orphaned_tables(): A new function that scans the InnoDB
system tables for entries lacking the / naming convention.
It triggers the removal of these table objects from InnoDB system
tables and ensures the purge thread subsequently drops any
associated index trees in SYS_INDEXES.

scan_system_tablespace_tables(): Scan the records in
SYS_TABLES and invoke callback(orphaned record, user table exist check)
on non-system tables in system tablespace

dict_drop_table_metadata(): Function is to remove specific
table IDs from internal system tables.

fsp_system_tablespace_truncate(): If legacy tables are detected,
InnoDB prioritizes their removal. To ensure data integrity and
complete the shrink, a two-restart sequence may be required:

  1. purge the legacy table
  2. Defragment the system tablespace and shrink the
    system tablespace further

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Comment thread mysql-test/suite/innodb/t/sys_truncate_debug.test Outdated
Comment thread storage/innobase/handler/ha_innodb.cc Outdated
DBUG_RETURN(2);
}

static bool should_drop_garbage_table(const rec_t* rec, ulint len)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing noexcept, and it’d be better to use size_t instead of the alias ulint. There is no comment explaining the return value and the parameters.

Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment on lines +1390 to +1393
@param[in] rec SYS_TABLES record containing the table name
@param[in] len length of the table name
@return true if the table should be dropped, false otherwise */
static bool should_drop_unknown_table(const rec_t* rec, ulint len)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing noexcept. A space around the * is misplaced. Please, no [in] in new code. The @return comment should not mention individual return values; we have @retval for that.

Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment on lines +1399 to +1401
const byte *field= rec_get_nth_field_old(rec, DICT_FLD__SYS_TABLES__ID, &id_len);
if (dict_sys.is_sys_table(mach_read_from_8(field)))
return false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if expression is missing id_len != 8 ||.

Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment on lines +1301 to +1303
if (dict_table_t *table= dict_sys.load_table(
{reinterpret_cast<const char*>(pcur.old_rec), len},
DICT_ERR_IGNORE_DROP))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have some entries in SYS_INDEXES but the table cannot be loaded because the entries in SYS_TABLES, SYS_COLUMNS, SYS_FIELDS are incorrect? In that case, we would fail to drop the table.

Do we really have to load a table definition into the cache? Would it suffice to unconditionally execute some simpler SQL to delete the corresponding entries from SYS_TABLES, SYS_INDEXES, SYS_FIELDS, SYS_COLUMNS, during a slow shutdown when :autoshrinkis enabled? What really matters here is a call todict_drop_index_tree(). That could be executed by row_purge_remove_clust_if_poss_low()` as part of the slow shutdown.

The basic algorithm would be like this:

  1. Collect the table ID from SYS_TABLES and report the table names. (Check for SYS_TABLES.SPACE=0 directly from each record.)
  2. For each table ID, execute the SQL to DELETE FROM SYS_… WHERE SPACE=0 AND TABLE_ID=:id.
  3. Let the purge run into completion. It will take care of invoking dict_drop_index_tree(), and it’s already checking for SYS_INDEXES.PAGE=FIL_NULL there.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment on lines +1326 to +1329
if (err == DB_SUCCESS)
{
trx->commit();
ut_ad(deleted.empty());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted.empty() should hold irrespective of the err value.

Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment on lines +1365 to +1366
for (pfs_os_file_t d : deleted)
os_file_close(d);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no handles of deleted files to close, because these tables are located in the system tablespace (fil_system.sys_space), which will not be deleted.

Comment thread storage/innobase/srv/srv0start.cc Outdated
Comment on lines +1403 to +1404
sql_print_information("InnoDB: Dropping the unknown table %.*s",
static_cast<int>(len), rec);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int(len) is shorter and equivalent to static_cast<int>(len).

Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc
Comment thread storage/innobase/dict/dict0crea.cc Outdated
Comment thread mysql-test/suite/innodb/t/sys_truncate_debug.test
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment on lines +5699 to +5701
using sys_callback_t = std::function<bool(table_id_t table_id,
const byte* table_name,
ulint name_len)>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s not make the MDEV-25861 situation worse by adding even more declarations ending in _t.

I would suggest a cleaner API:

using SysCallback= dberr_t(table_id_t, st_::span<const char>);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially used a raw function pointer (sysCallback*) to avoid the overhead
of std::function. However, this approach has a significant limitation like function
pointers cannot accept lambdas with captures (e.g., [&orphaned_tbl]), which are needed
in both call sites to collect results. I had to make stateless lambdas with static std::vector
orphaned_tbl.

Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment on lines +5706 to +5708
static dberr_t scan_system_tablespace_tables(sys_callback_t callback) noexcept
{
ut_ad(callback);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not __attribute__((nonnull))?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using sysCallback= std::function<dberr_t(table_id_t, st_::span<const char>)>;
static dberr_t scan_system_tablespace_tables(sysCallback callback) noexcept
{
  if (UNIV_UNLIKELY(!callback))
    return DB_CORRUPTION;
}

std::function provides operator bool() to check if it contains a valid callable target
so avoided _attribute(nonnull)

Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc
not contain '/' (indicating it's not a proper database/table name).
@retval DB_SUCCESS_LOCKED_REC if legacy table exists
@return error code or DB_SUCCESS */
static dberr_t drop_all_orphaned_tables()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exporting the low-level function dict_drop_table_metadata(), I think that it would be more appropriate to define this function and some of its static dependencies in dict/drop.cc. This should be the non-static entry point that would be invoked from fsp_system_tablespace_truncate().

Maybe you considered this, but it got too complex.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it makes it complex. I stick with simple approach.

Copy link
Copy Markdown
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we must also cover a scenario where a SYS_TABLES record is missing but we have some garbage associated with SYS_INDEXES records.

Comment thread storage/innobase/fsp/fsp0fsp.cc
Comment thread mysql-test/suite/innodb/t/sys_truncate_debug.test Outdated
Comment thread storage/innobase/dict/dict0crea.cc
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/dict/drop.cc
Comment on lines -5729 to +5747
field= rec_get_nth_field_old(rec, DICT_FLD__SYS_TABLES__ID, &len);

field_no= scan_indexes ? ulint(DICT_FLD__SYS_INDEXES__TABLE_ID)
: ulint(DICT_FLD__SYS_TABLES__ID);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to change the generated code. We only need the following:

static_assert(DICT_FLD__SYS_INDEXES__TABLE_ID == DICT_FLD__SYS_TABLES__ID, "");

Both are the first column.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How both are same?

enum dict_fld_sys_tables_enum { 
        DICT_FLD__SYS_TABLES__NAME              = 0,
        DICT_FLD__SYS_TABLES__DB_TRX_ID         = 1,
        DICT_FLD__SYS_TABLES__DB_ROLL_PTR       = 2, 
        DICT_FLD__SYS_TABLES__ID                = 3,

/* The field numbers in the SYS_INDEXES clustered index */
enum dict_fld_sys_indexes_enum {
        DICT_FLD__SYS_INDEXES__TABLE_ID         = 0,
        DICT_FLD__SYS_INDEXES__ID               = 1,
        DICT_FLD__SYS_INDEXES__DB_TRX_ID        = 2,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, my mistake. I forgot that unfortunately, SYS_TABLES.ID is not the primary key, but a unique secondary index. Lots of things (including the recovery and rollback of RENAME TABLE) would have been simpler in that case.

Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment on lines +5790 to +5797
std::unordered_set<table_id_t> orphaned_tbl;
std::unordered_set<table_id_t> valid_sys_tbl;

/* Step 1: Scan SYS_TABLES for legacy tables */
dberr_t err= scan_system_tablespace_metadata(dict_sys.sys_tables,
[&](table_id_t table_id, st_::span<const char> name) -> dberr_t
{
valid_sys_tbl.insert(table_id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having two hash tables, could we have just one std::unordered_map<const table_id_t,bool> where the value indicates whether the table needs to be dropped?

Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment on lines +5811 to +5815
err= scan_system_tablespace_metadata(dict_sys.sys_indexes,
[&](table_id_t table_id, st_::span<const char>) -> dberr_t
{
if (orphaned_tbl.count(table_id) || valid_sys_tbl.count(table_id))
return DB_SUCCESS;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my idea of std::unordered_map<const table_id_t,bool> found_tables this could become a lot simpler:

if (!found_tables.emplace({table_id, true}).second)
  return DB_SUCCESS;

However, is there any way that would avoid all tables to be registered? There could be billions of entries. Should the SYS_INDEXES scan be enabled by an additional parameter, so that we can avoid the excessive memory allocation?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I realized that we only store the table_id for all tables that are defined in the system tablespace. We should be fine. The number of such tables should be less than 10. That number won’t include the 4 hard-coded tables SYS_TABLES, SYS_INDEXES, SYS_FIELDS, SYS_COLUMNS, because their metadata is not stored in SYS_TABLES or SYS_INDEXES.

Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Copy link
Copy Markdown
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, this branch was accidentally replaced with a fix of MDEV-39261.

Copy link
Copy Markdown
Contributor

@dr-m dr-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the idea that I suggested yesterday.

diff --git a/storage/innobase/fsp/fsp0fsp.cc b/storage/innobase/fsp/fsp0fsp.cc
index 670794244fe..111528c6fb8 100644
--- a/storage/innobase/fsp/fsp0fsp.cc
+++ b/storage/innobase/fsp/fsp0fsp.cc
@@ -5787,44 +5787,44 @@ tablespace.
 @return error code or DB_SUCCESS */
 static dberr_t drop_all_orphaned_tables()
 {
-  std::unordered_set<table_id_t> orphaned_tbl;
-  std::unordered_set<table_id_t> valid_sys_tbl;
+  /* SELECT table_id,to_drop FROM SYS_TABLES WHERE SPACE=0 AND ... */
+  std::unordered_map<table_id_t,bool> system_tables;
 
   /* Step 1: Scan SYS_TABLES for legacy tables */
   dberr_t err= scan_system_tablespace_metadata(dict_sys.sys_tables,
     [&](table_id_t table_id, st_::span<const char> name) -> dberr_t
     {
-      valid_sys_tbl.insert(table_id);
-      if (!memchr(name.data(), '/', name.size()))
-      {
+      const bool to_drop{!memchr(name.data(), '/', name.size())};
+      if (to_drop)
         sql_print_information("InnoDB: Found an unknown table %.*s",
-                              (int)name.size(), name.data());
-        orphaned_tbl.insert(table_id);
-      }
+                              int(name.size()), name.data());
+      system_tables.emplace(table_id, to_drop);
       return DB_SUCCESS;
     });
 
-  if (err)
+  if (err != DB_SUCCESS)
     return err;
 
   /* Step 2: Scan SYS_INDEXES for orphaned indexes not in orphaned_tbl */
   err= scan_system_tablespace_metadata(dict_sys.sys_indexes,
     [&](table_id_t table_id, st_::span<const char>) -> dberr_t
     {
-      if (orphaned_tbl.count(table_id) || valid_sys_tbl.count(table_id))
-        return DB_SUCCESS;
-
-      sql_print_information("InnoDB: Found orphaned index for table_id "
-                            UINT64PF, table_id);
-      orphaned_tbl.insert(table_id);
+      if (!system_tables.emplace(table_id, true).second)
+        sql_print_information("InnoDB: Found orphaned index for table_id "
+                              UINT64PF, table_id);
       return DB_SUCCESS;
     });
 
-  if (err || orphaned_tbl.empty())
+  if (err != DB_SUCCESS)
     return err;
 
-  sql_print_information("InnoDB: Dropping %zu orphaned table(s)",
-                        orphaned_tbl.size());
+  size_t count{0};
+  for (const auto &td : system_tables)
+    count+= td.second;
+  if (!count)
+    return DB_SUCCESS;
+
+  sql_print_information("InnoDB: Dropping %zu orphaned table(s)", count);
 
   trx_t *trx= trx_create();
   trx_start_for_ddl(trx);
@@ -5834,12 +5834,10 @@ static dberr_t drop_all_orphaned_tables()
 
   dict_sys.lock(SRW_LOCK_CALL);
 
-  for (table_id_t table_id : orphaned_tbl)
-  {
-    err= dict_drop_table_metadata(table_id, trx);
-    if (err)
+  for (auto &td : system_tables)
+    if (td.second &&
+        (err= dict_drop_table_metadata(td.first, trx)) != DB_SUCCESS)
       break;
-  }
 
   dict_sys.unlock();
 

The test innodb.sys_truncate_debug passes with this patch. Please include it.

Problem:
=======
 - InnoDB system tablespace fails to autoshrink when it contains
legacy internal tables. These are non-user tables and internal
table exist from older version. Because the current shrink logic
does recognize these entries as user table, they block the
defragmentation process required to reduce the tablespace size.

Solution:
=========
To enable successful shrinking, InnoDB has been updated to
identify and remove these legacy entries during the startup:

drop_all_orphaned_tables(): A new function that scans the InnoDB
system tables for entries lacking the / naming convention.
It triggers the removal of these table objects from InnoDB system
tables and ensures the purge thread subsequently drops any
associated index trees in SYS_INDEXES.

scan_system_tablespace_metadata(): Scan SYS_TABLES or SYS_INDEXES
and invoke callback for non-system entries in system tablespace.
This function scans either SYS_TABLES or SYS_INDEXES (based on the
table parameter) and invokes the provided callback for each non-system
table or index found in the system tablespace (SPACE=0).

dict_drop_table_metadata(): Function is to remove specific
table IDs from internal system tables.

fsp_system_tablespace_truncate(): If legacy tables are detected,
InnoDB prioritizes their removal. To ensure data integrity and
complete the shrink, a two-restart sequence may be required:
1) purge the legacy table
2) Defragment the system tablespace and shrink the
system tablespace further

Thanks Marko Mäkelä for the contribution of your patch in
drop_orphaned_tables()
@Thirunarayanan Thirunarayanan merged commit e43e0ba into 11.4 Apr 28, 2026
17 of 19 checks passed
@Thirunarayanan Thirunarayanan deleted the MDEV-38412 branch April 28, 2026 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants