(or at least some ideas)
I really enjoyed David Shrewsbury's presentation about Drizzle replication today at the MySQL conf 2011. It's not that Drizzle replication is fully fleshed out, but it sure seems like it's got all the necessary components to make it dramatically better than stock MySQL replication, at least from the HA perspective.
Here's some things that Drizzle replication (or MySQL replication, for that matter) needs for me to be able to improve on the Master HA technique that is used at Yahoo:
Unique Transaction Ids:
Drizzle already contains something called a commit id that is a sequentially increasing counter of commits in the local drizzle engine. From what I can tell, it does *not* already contain the equivalent of a server-id, but it sounds like that's coming. If you put whatever becomes the unique server-id together with that commit id, you've got a globally unique transaction id. Once you have an easy way to identify where you are in the replication stream, a lot of other things fall into place (see below).
To be fair, we do effectively have a way to uniquely identify a location in the MySQL replication stream with server-id + binlog-file + binlog-pos. This tends to be unwieldy, however.
Mapping the transaction id between master and slave.
This might get gnarley with multi-master setups, but in the normal single-master case we may want to plan that a given slave could be promoted to the master in the future. In such cases, we would want to enable the drizzle equivalent of log-slave-updates so the slave is logging replication transactions itself that it gets from its own master. In that case, the slave would be writing its own copy of the replicated data to its replication log with its own commit id. While we're doing that, we would want to keep some kind of map of what the master's commit id lines up to for the slave's commit id. Since these are sequentially growing numbers (AFAICT), it should be able to easily keep an efficient map in a small table with rows like this:
master_id | master_commit_id | my_commit_id | started_at
We only need to enter a row in this table when we start replicating from a new master. If there is no other information in the table, we should be able to calculate our commit_id that lines up to any master commit id (after where we started from).
This is important because it would allow us to repoint *other* slaves of our master to our slave *without* needing know the mapping of commit ids between the servers. The algorithm would basically be:
- other slave> change master to master_host='our slave', [credentials, if different]; (note, no commit_id here) slave start; (or equivalent)
- other slave reports its last executed master id and commit id to 'our slave'.
- our slave looks at our table and calculates the starting commit id for 'other slave' based on its position on its previous master
- our slave starts sending replication events from that position to the other slave and replication continues as normal
If we could not determine the commit_id mapping, the 'other slave' should throw an error accordingly and require we manually specify a commit_id to start from. If it's determined that 'other slave' is actually further ahead in replication (i.e., has a higher commit_id from the original master) than 'our slave', then something intelligent (or just an error) could happen.
This would allow a CHANGE MASTER away from a failed master to be as deterministic as possible (and throw a proper error otherwise).
Preventing replication from missing commits on the master
Since our commit_id on the master is sequential, it should be easy to detect missing transactions from the master when we receive a non-sequential id. This could throw an error, it could simply store the mis-ordered commit until the earlier commits are received. Some kind of separate message could be sent upstream to the master to resend the missing transactions. If the replication stream is shared across slaves (like the rabbitMQ example) other slaves should be able to easily ignore (and de-dup) re-sent replication transactions. This all assumes, of course, there is no reason for commit_ids to ever get skipped. Even if they did, replication could send some kind of null entry to let the slaves know to ignore that particular commit_id.
Allowing the master to hand off writes to another server automatically
Potentially a signal could be written in the replication stream on an active master at the point of master failover which could indicate all slaves should repoint themselves at another master. If that new master is also a slave, it could be smart enough to not change its replication stream if it sees its own hostname/host_id in that record.
Automatic slave cloning
This would take some doing, but it seems really conceivable that a slave could gather enough information from a master and other slaves in the same cluster to build itself from scratch. Obviously this has to be done carefully, preventing too much load on any production hosts. For this to work, there would need to be an API in drizzle to get a stream of existing live tables in a consistent state, basically an online backup api, maybe this already exists. Ideally this would be at the SQL layer, not innodb specific.
Automatic slave consistency checking with auto fix
Using table checksum techniques, it's now reasonable to do some slow crawling of live tables and compare checksums of table chunks with the same chunk on a slave to detect differences. (Mikey Dickerson talked about this yesterday). I see no reason why this couldn't be a direct drizzle plugin. On the slave size, if it detects a difference, it should be able to send a message back to the master. In theory, the master should be able to simply re-replicate the full set of those rows. Since drizzle replication is RBR, this should not have any effect on other slaves.
Plugin-based semi- and full-sync replication.
According to David, this should work "in theory". In the innodb-based replication plugin, the final commit on the transaction depends on the replication plugin agreeing to commit. This should mean that the plugin can decide whether or not to commit (and block accordingly) if it can verify the data was written elsewhere. This, of course, is a catch-22, since the data can't replicate until it commits in the innodb transaction log. It seems to me that there would need to be some synchronous slaves defined that did not receive replication data in the conventional way (i.e., from the innodb replication log), so this obviously isn't perfect (yet).
This is just some quick ideas that have run through my head. There definitely may be nicer ways to implement them. This is obviously territory I'm interested in, so if nothing else, hopefully this will lend some ideas to the drizzle replication discussion actively happening. In the meantime, 'bzr branch lp:drizzle' is pretty slow from my hotel room, but no promises here: my C++ is pretty rusty. :)