It is great to be able to build small utilities on top of an excellent RDBMS. Thank you MySQL.
This is a small but complex utility to perform online schema change for MySQL. We call it OSC and the source is here.
Some ALTER TABLE statements take too long form the perspective of some MySQL users. The fast index create feature for the InnoDB plugin in MySQL 5.1 makes this less of an issue but this can still take minutes to hours for a large table and for some MySQL deployments that is too long.
A workaround is to perform the change on a slave first and then promote the slave to be the new master. But this requires a slave located near the master. MySQL 5.0 added support for triggers and some replication systems have been built using triggers to capture row changes. Why not use triggers for this? The openarkkit toolkit did just that with oak-online-alter-table. We have published our version of an online schema change utility (OnlineSchemaChange.php aka OSC).
The remainder of this document is copied from the internal documents that were written for this project. Note that this project was done by Vamsi and he did an amazing job with it. In addition to writing the tool, writing the docs and doing a lot of testing he also found and fixed or avoided a few bugs in MySQL to make sure OSC would be reliable.
Overview
If the row format of database allows addition of a new column (possibly positioned at the end of existing columns with some default value) without modifying every row of the table, addition of a column could simply be just a metadata change which can be done very fast. In such databases, an exclusive lock is needed only for a very short time for the metadata change. Our understanding is that InnoDB row format does not allow this, and changing row format would be a significant project. Hence we do not consider this approach. Also, note that this approach would not work for operations like defragmentation.
OSC algorithms typically have several phases:
Note that the above operations can be done within the storage engine itself, or using an external (PHP) script. We followed the latter approach as it can be implemented much faster. An advantage of doing within storage engine is some optimizations can be done that are not available while doing these operations externally.
Copy Phase
When the binlog is enabled Innodb gets read locks on table S during a statement such as "insert into T select * from S". To avoid this and to reduce the load on MySQL we select the data into an outfile and load from the outfile.
Rather than generating one big outfile, we do multiple scans of the table, where each scan covers about 500000 rows (batchsize is a parameter to the OnlineSchemaChange and default value is set to 500000). The first scan scans from start of the table using LIMIT 500000. Subsequent scans start at the posistion where the previous scan left off. For example, for a 2-column PK, if the 1st select reads till [x, y] the where clause of the 2nd select will look like ((col1 = x and col2 > y) OR (col1 > x)). We patched InnoDB to not get the read locks and expect to get rid of these outfiles in the future. However, we will continue to do multiple scans of table with each scan going after different rows than previous scans.
For efficiency, in innodb _plugin 5.1 we drop all non-clustered indexes before loading data into copytable, and recreate them after load. As a future optimization there may be some cases where it is useful to drop and recreate C index as well. We do not drop NC indexes in which 1st column is AUTO_INCREMENT column. Also, in innodb 5.0 we do not drop non-clustered indexes as recreating them is very slow.
Capturing changes
Some of the approaches for capturing changes for replay are as follows:
The change capture table is referred to as the deltas table. It has has all columns as original table plus two additional columns: an integer autoincrement column to track order of changes AND an integer column to track dml type (insert, update or delete).
Replay phase
It is useful to do most of the replay work without blocking changes to the original table. Mutliple replay passes are used and only the final replay is done under WRITE lock on the table. Because there are multiple passes we need to avoid replaying the same change multiple times. The following approaches are available to do this:
We choose to save IDs in a temporary table.
Another consideration is how to propagate changes from deltas table to the copytable. There are at least two approaches:
We use the to only feetch the ID columns.
There are three phases for replaying changes: after all outfiles are loaded, after indexes are recreated and during the swap phase.
Cut-over phase
Mysql offers two ways of renaming a table foo to bar.
Our cut-over phase looks like
Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a 'table not found' error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.
Error handling
There are two basic cases of errors:
Here are the various entities created by OSC:
As we create an entity, we use a variable to track its cleanup. For example, when we create deltas, we set a variable $this->cleanup_deltastable to indicate that deltas needs to be cleaned up. This is not necessary for temp tables as they are automatically nuked when the script ends. A cleanup() method does the cleanup based on these cleanup variables. The cleanup() method is used during both successful termination of the script as well as failure termination.
However if mysql server crashes, cleanup steps would also fail. The plan to handle mysql failures is to have a mode 'force_cleanup' for the OSC script, which would cleanup all the triggers, non-temporary tables, and outfiles that would have been created by OSC. One caution while using the force_cleanup mode is if the names of triggers/outfiles/tables that OSC would have created coincide with an existing trigger/outfile/table that has nothing to do with OSC, that entity may get dropped. The chances of coincidence are very slim though as we use prefixes like __osc_ for entities that OSC creates. This issue does not arise during regular cleanup (i.e non-forced) because cleanup is done based on cleanup variables in that case.
Note that normally the failures during OSC don't have to be acted on urgently, as the existence of stray tables/outfiles/triggers is not a critical problem. However, an exception is if failure happens after the original table is renamed to a backup table but before copy table is renamed as original table. In that case there should be two tables - backup table and copytable with identical contents except for the schema change. Applications would get 'table not found' errors until the issue is fixed. During force_cleanup, if it detects that both backup table and copytable exist, it renames backup table to original table.
Replication
OSC is is not really making any changes on its own, but only propagating the changes done by other transactions (which are replicated). So we set sql_log_bin = 0 for OSC. For schema changes like adding a column, this puts a requirement that the schema change must be done on slaves first.
Assumptions that are validated in the code
Assumptions that are NOT validated in the code
Steps in detail (listed in the order of execution)
They are described in more detail below.
Slight difference in the sequence of steps in 5.0 and 5.1
Note that (unfortunately) we need to use slightly different sequences for 5.0 and 5.1 – and that is not good. This must be done to compensate for different behavior in those versions.
This order works in 5.1 but not 5.0 (I am only showing the relevant part of the sequence):
Since the scan done in step 6 should already see the changes captured in step 5, we exclude them from replay.
The above order does not work for 5.0 because creating trigger after locking table hangs in 5.0. Seebug 46780.
This order works in 5.0 but not in 5.1
Same as above except that 1 and 2 are reversed i.e create triggers before locking.
Note that the table lock is for ensuring that transactions that changed the table before triggers were created are all committed. Any changes done after snapshot transaction began in step 4 should be captured by triggers. So even if we get table lock after creating triggers, the purpose of waiting for all prior transactions would still be achieved. So it should work in theory.
However, this sequence does not work in 5.1 in my automated unit tests as it causes the scan in step 5 to exclude some changes from replay that are not captured in scan in step 6. (For example, if a concurrent xact updates row R, the snapshot xact step 5 is seeing the row in deltas table inserted by the update, but step 6 is seeing old image of row instead of new image.)
MySQL docs state
For transactional tables, failure of a statement should cause rollback of all changes performed by the statement. Failure of a trigger causes the statement to fail, so trigger failure also causes rollback.
So that means trigger is executed as part of same transaction as the DML that activated the trigger, right? We don't know why the snapshot xact in OSC is seeing the affect of trigger but not the affect of original DML and are not sure if this is a bug.
Code Vocabulary/Glossary
Initialization
This query retrieves column names.
$query = "select column_name, column_key from ".
"information_schema.columns ".
"where table_name ='%s' and table_schema='%s'";
$query = sprintf($query, $this->tablename,
$this->dbname);
if column_key is not 'PRI', we infer that it is NOT part of primary key.
// for PK columns we need them to be in correct order as well.
$query = "select * from information_schema.statistics ".
"where table_name = '%s' and TABLE_SCHEMA = '%s' ".
" and INDEX_NAME = 'PRIMARY' ".
"order by INDEX_NAME, SEQ_IN_INDEX";
$query = sprintf($query, $this->tablename, $this->dbname);
create_copy_table
copy table is named as concatenate( '__osc_new_', originaltablename) truncated to 64 characters (maxlen). This is done by 'create table <copytable> LIKE <originaltable>'.</originaltable></copytable>
alter_copy_table
DDL command to alter original table is given as input. We replace original table name by copy table name by doing:
$altercopy = preg_replace('/alter\s+table\s+/i',
'ALTER TABLE ', $this->altercmd,
-1, $count);
$altercopy = preg_replace('/ALTER\s+TABLE\s+
'.$this->tablename.'/',
'ALTER TABLE '.
$this->newtablename,
$altercopy, -1, $count);
The command is then run to alter copytable in the same way as we want original table to look like after doing alter. If we have < 1 or > 1 matches in either of preg_replace mentioned above, exception is raised.
Now we also retrieve index info using the following query so that we can drop and recreate NC indexes. (QUESTION : what happens if a concurrent alter table adds or drops index while this is running? For now we assume that operations is aware of alter table commands being run and won't run two in parallel.
$query = "select * from information_schema.statistics ".
"where table_name = '%s' and ".
"TABLE_SCHEMA = '%s' ".
"order by INDEX_NAME, SEQ_IN_INDEX";
$query = sprintf($query, $this->newtablename,
$this->dbname);
The following columns in select list are used:
We also check if old PK (available in $this->pkcolumnarry) is a prefix of atleast one index after the alter table. Note that if old PK is (a, b) and after alter table there is an index on (b, a), that is OK as it supports efficient lookups if values of both a and b are provided. This check is done because replay would be very inefficient if lookup based on old PK columns is inefficient after the alter table.
create_deltas_table
creates change capture table. It is named as concatenate('__osc_deltas_', originaltablename) truncated to 64 characters (maxlen). created using:
$createtable = 'create table %s'. '(%s INT AUTO_INCREMENT, '.
'%s INT, primary key(%s)) '.
'as (select %s from %s LIMIT 0)';
$createtable = sprintf($createtable, $this->deltastable,
IDCOLNAME, DMLCOLNAME,
IDCOLNAME, $this->columns,
$this->tablename);
create_triggers
As mentioned before, in 5.1 we lock table and create triggers and then unlock table, but in 5.0, we create the triggers and then lock the table and unlock it.
Insert trigger is created as:
$trigger = 'create trigger %s AFTER INSERT ON %s'.
'FOR EACH ROW '.
'insert into %s(%s, %s) '. 'values (%d, %s)';
$trigger = sprintf($trigger, $this->insert_trigger,
$this->tablename,
$this->deltastable, DMLCOLNAME,
$this->columns, DMLTYPE_INSERT,
$this->newcolumns);
Delete trigger is created as
$trigger = 'create trigger %s AFTER DELETE ON'.
'%s FOR EACH ROW '.
'insert into %s(%s, %s) '. 'values (%d, %s)';
$trigger = sprintf($trigger, $this->delete_trigger,
$this->tablename,
$this->deltastable, DMLCOLNAME,
$this->pkcolumns, DMLTYPE_DELETE,
$this->oldpkcolumns);
Update trigger is created as
// if primary key is updated, map the update
// to delete followed by insert
$trigger = 'create trigger %s AFTER UPDATE ON'.
'%s FOR EACH ROW '.
'IF (%s) THEN '. ' insert into %s(%s, %s) '.
' values(%d, %s); '.
'ELSE '. ' insert into %s(%s, %s) '.
' values(%d, %s), '. ' (%d, %s); '. 'END IF';
$trigger = sprintf($trigger, $this->update_trigger,
$this->tablename,
$this->get_pkmatch_condition('NEW', 'OLD'),
$this->deltastable, DMLCOLNAME,
$this->columns,
DMLTYPE_UPDATE, $this->newcolumns,
$this->deltastable, DMLCOLNAME,
$this->columns, DMLTYPE_DELETE,
$this->oldcolumns,
DMLTYPE_INSERT, $this->newcolumns);
start snapshot xact
Here we 'start transaction with consistent snapshot'.
At this point the deltas table may already have some changes done by transactions that have committed before out snapshot began. Since such changes are already reflected in our snapshot, we don't want to replay those changes again during replay phase. So we also create a temp table named __osc_temp_ids_to_exclude to save the IDs of records that already exist in deltas table.
$createtemp = 'create temporary table %s(%s INT, %s'.
'INT, primary key(%s))';
$createtemp = sprintf($createtemp, $temptable,
IDCOLNAME,
DMLCOLNAME, IDCOLNAME);
Since innodb gets read locks during "insert into T1 select * from T2" state ments, we select out into outfile and load from that. Outfile is created in 'secure-file-priv' folder with name concatenate('__osc_ex_', $this->tablename).
$selectinto = "select %s, %s from %s ".
"order by %s into outfile '%s' ";
$selectinto = sprintf($selectinto, IDCOLNAME,
DMLCOLNAME, $this->deltastable,
IDCOLNAME, $outfile);
// read from outfile above into the temp table
$loadsql = sprintf("LOAD DATA INFILE '%s' INTO'.
'TABLE %s(%s, %s)",
$outfile, $temptable,
IDCOLNAME, DMLCOLNAME);
select_table_into_outfile
If an outfile folder is passed in, we use that. Otherwise, if @@secure_file_priv is non-NULL, we use it as outfile folder. Otherwise we use @@datadir/dbname as outfile folder. We assume @@datadir is non-NULL.
Outfile is named as concatenate('__osc_tbl_', originaltablename'); Since we use multiple outfiles, they are suffixed .1,.2,.3 etc.
We also commit snapshot xact here.
drop NC indexes
In 5.1 we iterate over the index info gathered in previous step and drop all indexes whose name is NOT 'PRIMARY'. We also don't drop indexes in which first column is AUTO_INCREMENT column. We use this command to drop index:
$drop = sprintf('drop index %s on %s', $this->indexname, $this->newtablename);
Indexes are not dropped in 5.0 as mentioned before.
load copy table
We use this command to load each outfile:
$loadsql = sprintf("LOAD DATA INFILE '%s' INTO".
"TABLE %s(%s)",
$this->outfile_table,
$this->newtablename,
$this->columns);
recreate NC indexes
We iterate over the index info gathered in 'alter_copy_table' step and recreate all indexes whose name is NOT 'PRIMARY'.
We use one alter table command to create all NC indexes.
If the 'SUB_PART' column value in information_schema.statistics is not-null we use it while building columnlist. For example, if SUB_PART value for column 'comment' is 767, we use 'comment(767)' in the columnlist passed to create index command.
Replay changes
As mentioned before replay changes could be done multiple times. We maintain a temp table called TEMP_TABLE_IDS_TO_EXCLUDE to track those IDs that have been processed already. The set of IDs to process is obtained by taking the IDs from deltas table and excluding those that are in TEMP_TABLE_IDS_TO_EXCLUDE and is saved in TEMP_TABLE_IDS_TO_INCLUDE.
// Select from deltastable that are not in
// TEMP_TABLE_IDS_TO_EXCLUDE.
// Use left outer join rather than 'in' subquery for better perf.
$idcol = $this->deltastable.'.'.self::IDCOLNAME;
$dmlcol = $this->deltastable.'.'.self::DMLCOLNAME;
$idcol2 = self::TEMP_TABLE_IDS_TO_EXCLUDE.'.'.self::IDCOLNAME;
$selectinto = "select %s, %s ". "from %s LEFT JOIN %s ON %s = %s ".
"where %s is null order by %s into outfile '%s' ";
$selectinto = sprintf($selectinto, $idcol, $dmlcol,
$this->deltastable,
self::TEMP_TABLE_IDS_TO_EXCLUDE,
$idcol,
$idcol2, $idcol2, $idcol, $outfile);
We process about 500 rows in a transaction (except for the final replay which happens while holding WRITE lock on table, which is done without starting any new transaction).
Here is the query to retrieve IDs and dml type from TEMP_TABLE_IDS_TO_INCLUDE.
$query = sprintf('select %s, %s from %s order by %s',
IDCOLNAME, DMLCOLNAME,
TEMP_TABLE_IDS_TO_INCLUDE,
IDCOLNAME);
DMLCOLNAME column tells if it is insert, delete or update.
Here is how insert is replayed:
$insert = sprintf('insert into %s(%s) select %s'.
'from %s where %s.%s = %d',
$this->newtablename,
$this->columns,
$this->columns,
$this->deltastable,
$this->deltastable,
IDCOLNAME,
$row[IDCOLNAME]);
Here is how delete is replayed:
$delete = sprintf('delete %s from %s, %s '.
'where %s.%s = %d AND %s',
$newtable, $newtable,
$deltas, $deltas,
IDCOLNAME,
$row[IDCOLNAME],
$this->get_pkmatch_condition($newtable,
$deltas));
Here is how update is replayed:
$update = sprintf('update %s, %s SET %s where '.
'%s.%s = %d AND %s ',
$newtable, $deltas,
$assignment, $deltas,
IDCOLNAME,
$row[IDCOLNAME],
$this->get_pkmatch_condition($newtable,
$deltas));
Swap tables
Here are the steps as mentioned in cut-over phase:
$lock = sprintf('lock table %s WRITE, %s WRITE, %s WRITE', $this->tablename,
$this->newtablename, $this->deltastable);
$rename_original = sprintf('alter table %s rename %s',
$this->tablename, $this->renametable);
$rename_new = sprintf('alter table %s rename %s',
$this->newtablename, $this->tablename);
Cleanup
In the force cleanup mode we will pretend as though all cleanup variables are set, and use 'drop if exists'.