A proposal for some features of Drizzle async replication

(or at least some ideas)

I really enjoyed David Shrewsbury's presentation about Drizzle replication today at the MySQL conf 2011. It's not that Drizzle replication is fully fleshed out, but it sure seems like it's got all the necessary components to make it dramatically better than stock MySQL replication, at least from the HA perspective.

Here's some things that Drizzle replication (or MySQL replication, for that matter) needs for me to be able to improve on the Master HA technique that is used at Yahoo:

Unique Transaction Ids:

Drizzle already contains something called a commit id that is a sequentially increasing counter of commits in the local drizzle engine. From what I can tell, it does *not* already contain the equivalent of a server-id, but it sounds like that's coming. If you put whatever becomes the unique server-id together with that commit id, you've got a globally unique transaction id. Once you have an easy way to identify where you are in the replication stream, a lot of other things fall into place (see below).

To be fair, we do effectively have a way to uniquely identify a location in the MySQL replication stream with server-id + binlog-file + binlog-pos. This tends to be unwieldy, however.

Mapping the transaction id between master and slave.

This might get gnarley with multi-master setups, but in the normal single-master case we may want to plan that a given slave could be promoted to the master in the future. In such cases, we would want to enable the drizzle equivalent of log-slave-updates so the slave is logging replication transactions itself that it gets from its own master. In that case, the slave would be writing its own copy of the replicated data to its replication log with its own commit id. While we're doing that, we would want to keep some kind of map of what the master's commit id lines up to for the slave's commit id. Since these are sequentially growing numbers (AFAICT), it should be able to easily keep an efficient map in a small table with rows like this:

master_id | master_commit_id | my_commit_id | started_at

We only need to enter a row in this table when we start replicating from a new master. If there is no other information in the table, we should be able to calculate our commit_id that lines up to any master commit id (after where we started from).

This is important because it would allow us to repoint *other* slaves of our master to our slave *without* needing know the mapping of commit ids between the servers. The algorithm would basically be:

other slave> change master to master_host='our slave', [credentials, if different]; (note, no commit_id here) slave start; (or equivalent)
other slave reports its last executed master id and commit id to 'our slave'.
our slave looks at our table and calculates the starting commit id for 'other slave' based on its position on its previous master
our slave starts sending replication events from that position to the other slave and replication continues as normal

If we could not determine the commit_id mapping, the 'other slave' should throw an error accordingly and require we manually specify a commit_id to start from. If it's determined that 'other slave' is actually further ahead in replication (i.e., has a higher commit_id from the original master) than 'our slave', then something intelligent (or just an error) could happen.

This would allow a CHANGE MASTER away from a failed master to be as deterministic as possible (and throw a proper error otherwise).

Preventing replication from missing commits on the master

Since our commit_id on the master is sequential, it should be easy to detect missing transactions from the master when we receive a non-sequential id. This could throw an error, it could simply store the mis-ordered commit until the earlier commits are received. Some kind of separate message could be sent upstream to the master to resend the missing transactions. If the replication stream is shared across slaves (like the rabbitMQ example) other slaves should be able to easily ignore (and de-dup) re-sent replication transactions. This all assumes, of course, there is no reason for commit_ids to ever get skipped. Even if they did, replication could send some kind of null entry to let the slaves know to ignore that particular commit_id.

Allowing the master to hand off writes to another server automatically

Potentially a signal could be written in the replication stream on an active master at the point of master failover which could indicate all slaves should repoint themselves at another master. If that new master is also a slave, it could be smart enough to not change its replication stream if it sees its own hostname/host_id in that record.

Automatic slave cloning

This would take some doing, but it seems really conceivable that a slave could gather enough information from a master and other slaves in the same cluster to build itself from scratch. Obviously this has to be done carefully, preventing too much load on any production hosts. For this to work, there would need to be an API in drizzle to get a stream of existing live tables in a consistent state, basically an online backup api, maybe this already exists. Ideally this would be at the SQL layer, not innodb specific.

Automatic slave consistency checking with auto fix

Using table checksum techniques, it's now reasonable to do some slow crawling of live tables and compare checksums of table chunks with the same chunk on a slave to detect differences. (Mikey Dickerson talked about this yesterday). I see no reason why this couldn't be a direct drizzle plugin. On the slave size, if it detects a difference, it should be able to send a message back to the master. In theory, the master should be able to simply re-replicate the full set of those rows. Since drizzle replication is RBR, this should not have any effect on other slaves.

Plugin-based semi- and full-sync replication.

According to David, this should work "in theory". In the innodb-based replication plugin, the final commit on the transaction depends on the replication plugin agreeing to commit. This should mean that the plugin can decide whether or not to commit (and block accordingly) if it can verify the data was written elsewhere. This, of course, is a catch-22, since the data can't replicate until it commits in the innodb transaction log. It seems to me that there would need to be some synchronous slaves defined that did not receive replication data in the conventional way (i.e., from the innodb replication log), so this obviously isn't perfect (yet).

This is just some quick ideas that have run through my head. There definitely may be nicer ways to implement them. This is obviously territory I'm interested in, so if nothing else, hopefully this will lend some ideas to the drizzle replication discussion actively happening. In the meantime, 'bzr branch lp:drizzle' is pretty slow from my hotel room, but no promises here: my C++ is pretty rusty. :)

Trackback URL for this post:

http://mysqlguy.net/trackback/36

I'm not s DBA so my question

Submitted by Pavel (not verified) on April 14, 2011 - 5:52am.

I'm not s DBA so my question might be stupid. Having recently encountered a replication failure on mysql, would the automatic slave consistency checking with auto fix mean that replication setups might actually self-heal?

Â That would be the use-case,

Submitted by jay on April 14, 2011 - 11:28am.

That would be the use-case, yes. It would not come for free, but it seems conceivable that it wouldn't need to be a glob of perl scripts in the future.

Hi Jay, Â Thanks for all the

Submitted by Joe Daly (not verified) on April 14, 2011 - 9:04am.

Hi Jay,

Thanks for all the comments. There are plans to have the server_id added to the internal replication table. I have a short right up here on it here http://www.8bitsofbytes.com/?p=28. The design has changed slightly since I wrote it up but basically a slave will know where a transaction came by having what is called a original_commit_id, and original_server_id field in the log and in the slave plugin tables. It would allow a easy sync to a new master with no ambiquity.

Thanks for great feedback,

Submitted by LinuxJedi (not verified) on April 14, 2011 - 9:06am.

Thanks for great feedback, I'll make sure Shrews gets to read it. I love it when people come up with things we haven't thought of yet, and the framework is there to do a lot of what you said.

I really wouldn't recommend doing a bzr branch of drizzle on the hotel wifi though, it is probably about a gig of data for the initial branching (if you used init-repo it will be really fast every subsequent usage). Our last source release of the trunk was earlier this week though so you are welcome to download that until you can get access to a faster network.

Â I'm a n00b when it comes to

Submitted by jay on April 14, 2011 - 11:29am.

I'm a n00b when it comes to bzr. Hopefully I'll get a chance to learn more about it.

Glad you enjoyed the

Submitted by David Shrewsbury (not verified) on April 14, 2011 - 11:13am.

Glad you enjoyed the talk!

This is exactly the type of feedback we want from the community. We want the community to be driving the features of Drizzle. We could always take guesses, but it's nice to have clearly defined target to shoot for. Thanks so much for this post!

-Dave

Â Exactly why I wrote this.

Submitted by jay on April 14, 2011 - 11:32am.

Exactly why I wrote this. Even if I don't get a chance to write some of my own code, hopefully you guys can keep some of these use cases in mind so the apis you guys have exposed will make these possible for future implementation. Keep up the great work!

Excellent working list! Â I

Submitted by Henrik Ingo (not verified) on April 15, 2011 - 6:56pm.

Excellent working list!

I suggest learning a trick from Tungsten: The global transaction id should be <server-id, epoch, transaction-counter>. The epoch is just a constant set by the server at first transaction. (I don't know how Tungsten sets it, but afaiu it could be a random int and transparent to the user.) The epoch is used as a safety check: If it doesn't match what the slave has as its last received epoch, then replication must fail. The interpretation for this is just that "something is not right here". For instance that we might be receiving data from a master that has nothing to do with the data we have in the slave.

Notice that syncronous replication doesn't really need a binary log. The data is committed immediately at commit time, you don't need to write it anywhere on the master. So the fact that current binary log is stored into InnoDB is irrelevant, syncronous replication won't be using it.

mysqlguy.net