MySQL Master HA at Yahoo

I was asked to write a blog post about MySQL High Availability at Yahoo, particularly for writes. Our standard practice is not particularly high-tech, but we've been using it for over 4 years now and it has become a company-wide standard with a few exceptions.

Let me start by saying that at Yahoo! we consider a datacenter as a Single Point of Failure (SPoF). We build and manage many of our own datacenters, and we still don't assume they are invulnerable. How many people can attest to the fact that however to configure your racks, how many redundant switches, power supplies, drives, etc. you buy, if your leased datacenter has power or network issues, you are at their mercy.

We are assuming out of the gate that we have to have a presence in at least 2 datacenters. This concept is something we call BCP or Business Continuity Planning: we want our business to continue regardless of some event that would completely destroy a datacenter. So, we require that any and all components within Yahoo are run redundantly from at least 2 datacenters that are not in the same geographical region. That way if an earthquake, hurricane, or some other natural (or otherwise) disaster affects one datacenter, the other one should be far enough away to not have been affected.

One big advantage of this way of thinking is that this level of redundancy also nicely satisfies High Availability requirements. We can take downtime in one datacenter for software upgrades or other maintenance, and still happily serve out of the other because we've already planned ahead and already built our BCP systems.

Because of the BCP philosophy, we immediately discount most "High availability" solutions for MySQL that people tend to gravitate towards. MySQL Cluster has its uses, but you cannot form a cluster across a WAN link [that performs well]. DRDB cannot handle WAN replication [that performs well, and is synchronous].

That isn't to say that WAN replication can't work with these technologies, it's just that they aren't built to make that problem any better than standard MySQL replication does. Ultimately the problem to solve is that writing a piece of data to two redundant locations simultaneously comes at the expense of latency. Until someone comes up with a solution to write data faster than the speed of light, we are stuck with this problem, regardless of the technology solution we try.

So, we stick with good old MySQL replication. It's still asynchronous, but lots of people already understand it, and it does the job.

Now, what makes our solution different? Not much. The layout is this: two master databases, one in each of our two colocations. These masters replicate from each other, but we would never have more than two masters in this replication loop for the same reason we don't use token ring networks today: one master outage would break replication in a chain of size > 2. Our slaves replicate from one of the two masters, often half of the slaves in a given colocation replicate from one of the masters, and half from the other master.

Two masters by themselves don't present a solid solution. If our application can write to either master, we open ourselves up to autoincrement/unique key collisions. Auto_increment_offset does not fix this problem, it just band-aids it; it assumes the auto_increment key order doesn't matter, that it's ok for the tables on each master to not be identical, and it doesn't address any other primary/unique key collisions.

But, key collisions (that are quite visible when they happen because replication breaks) are only the tip of the iceberg. Writing to multiple masters simultaneously opens you up race conditions of the invisible and nasty types. If you have master A and master B in circular replication and the application makes a change to record X on both A and B before the replication for that change makes it to the other, there is an issue. Imagine that the application changes X to 1 on master A and X to 2 on master B. After replication catches up, A has X=2 and B has X=1. Whoops. Replication hasn't broken, you don't know it happened, but your two masters (and presumably your slaves) are now inconsistent. Can it be designed/coded around? Sure, but we can't assume that it will work for a generic solution.

So, because of this problem, we strongly discourage the use of dual-writers. Can they be useful and not a problem in some cases? Yes, but that is the exception, not the rule.

So, we are left with dual-master, single-writer. We only want our application to ever write to one of our two dual-masters at a given time; we set all our databases read_only except the active master. By doing this we gain consistency: we know if a given record is locked on the active master, no other part of the application can change it at the same time. Most programmers assume consistency when writing code for any RDBMS, so most code can't cope without it.

The trade off, of course, is latency. If our application is similarly distributed like our databases, some of our application servers will have to cross a WAN link to write to our database. This is a problem, of course, but one that is acceptable for many web applications; Many web applications serve many more reads than writes. If those writes are slightly slower, not the biggest deal. There are also a variety of handy web tricks to hide write latency, including ajax progress bars, asynchronous/batched writes, etc. Backend daemons that need to connect to the active master typically fail over with the master; so when the active master changes colo, that daemon needs to as well.

If we need to change masters in a clean way (say, for maintenance), the process is clean and can be done with no data loss and with a brief writer outage:

Set read_only=ON on the active master. This stops writes briefly.
Wait for replication to catch up to our secondary master
Set read_only=OFF on the secondary master -- writes resume
Flush connections from the old master: the application is redirected to the new master by a system we use internally.

As long as our application can handle the WAN latency, it shouldn't matter which of the two colos have the active master.

On a failure of the current active master and if we want writes to be available immediately, we must take the risk of a data loss. We are running standard MySQL replication that is asynchronous. So, if we need to make the secondary master active while the primary is down, we cannot know if replication was caught up before the primary went down. We do assume, however, that it was close (we accomplish this by monitoring our replication). At this point our engineers have a choice: either take the (potential) data loss and bring the other master up (set read_only=OFF), or take the outage (which is only for writing to the DB in this scenario) and hope the primary comes back soon. The policy for deciding this is left up to the individual teams and their business requirements.

Once we pull the failover trigger, we are accepting a possible data loss. If the downed master never recovers, our course is clear: we simply rebuild from the remaining master and its slaves. However, if the downed master does come back online (as it usually does) and we allow it to resume replicating, we are accepting that we may have introduced the race conditions or unique key collisions (described above) into our system; we've effectively had dual-writers running for a brief window. The safest course of action would be to never allow the downed master back into replication, but in practice, that tends not to be practical because of the expense of recloning MySQL slaves, particularly across WAN links.

Now, what about automatic failover? This is a requirement for some systems: availability over consistency, but there are a few challenges with automated failover:

An automated failover solution that does not use some unbiased system to determine which master should be active will be susceptible to split brain issues. Imagine our two colocations each fighting to become the active master because they simply cannot communicate with each other; without some outside authority guaranteeing a given colo can make itself active, we would end up with two active colos (and two independent databases!).
An automated failover solution would need to consider a system policy for taking availability over data consistency and vice versa.
An automated failover solution might be inclined to cause failover flapping, which would be problematic.

My experience has been that this system is the most adaptable to existing mysql databases, using existing storage engines and an existing technology to handle replication. It's likely unfashionably boring, conservative, and stable; but it has worked and continues to work for us.

In the future, I hope to be able to use technologies that improve on the standard MySQL replication by being able to repoint slaves automatically, to find which slave has the most up to date data from a downed master, and even mitigate a lot of data-loss with support for synchronous writes at least within a given colo. There are some candidates out there that are close.

Trackback URL for this post:

http://mysqlguy.net/trackback/31

Excellent practical

Submitted by Richard John Jensen (not verified) on August 3, 2010 - 1:30pm.

Excellent practical post.

Thanks-

Rick

Great post! But I have to say

Submitted by Guillaume (not verified) on August 4, 2010 - 2:02am.

Great post!

But I have to say that MMM really helped me to manage master-master replication without any headaches and without any major issue. Did you evaluate it?

Most of our stuff pre-dates

Submitted by jay on August 4, 2010 - 7:52am.

Most of our stuff pre-dates MMM, and also it looks like MMM hands around virtual IPs which wouldn't work in a multi-colo setup.

Good post. An important

Submitted by Sean C. (not verified) on August 4, 2010 - 4:10pm.

Good post.  An important piece as you highlighted is #4 "the application is redirected to the new master by a system we use internally".  I interpret this as (potentially) two actions:
A) Failover to another datacenter, in the event a dc is lost and/or want to perform non-disruptive maintenance
B) Failover to the secondary master (capturing binlog file and pos)

Could you elaborate on the process for both of these?

A) In some cases this means

Submitted by jay on August 4, 2010 - 6:42pm.

A) In some cases this means failing over an entire application, but often it is simply redirecting connections from the application to the master in the other colo. We have internal tools which can redirect something akin to a DNS lookup which accomplishes this in most cases.

B) As long as the primary master is recoverable, failing over to the secondary master requires no knowledge of the binary log file or position, nor does it require any CHANGE MASTER statements. Even if a primary master fully dies, we simply repoint writes at the secondary (setting it read-write first), then rebuild the down master and its slaves at our leisure. Executing a CHANGE MASTER during an outage situation is never something you can do deterministically (with standard MySQL replication).

Interesting post -

Submitted by Andrew Morgan (not verified) on August 5, 2010 - 8:15am.

Interesting post - thanks.

You mentioned that using MySQL Cluster wouldn't add anything to the geo-replication story that isn't already there for regular MySQL. There's one significant exception to this - MySQL Cluster can provide conflict detection or resolution when running active-active asynchronous replication between 2 sites (obviously this is in addition to the synchronous replication within each Cluster). It does require some effort from the application as you need to maintain an integer timestamp in each of the tables where you need this implementing).

See MySQL Cluster documentation on conflict resolution.

You are right. In a

Submitted by jay on August 5, 2010 - 8:33am.

You are right. In a specialized install, I could definitely see this being valuable (especially when already using Cluster to scale out writes). However, this configuration will be very application specific, and probably not a lot of fun to configure and troubleshoot.

We're just getting into MySQL Cluster internally at Yahoo!. I really hope it can prove itself, but one quote from a DBA setting it up: "It's almost as complicated as RAC". Not the best sign...

I really liked Flickr's

Submitted by Adam Fields (not verified) on August 5, 2010 - 1:26pm.

I really liked Flickr's ticket server idea for generating unique keys for use in a multiple master setup (http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique...). They didn't go into any detail about how they solved the inconsistency problem you outlined above, but I assume they're keeping track somewhere of which server they got a record from, and their writes are sufficiently sandboxed so that's not a general problem.

The biggest problem I've been facing recently with single writer architectures is what to do when the database gets so big that you can't really sling it around anymore (i.e.: such that making backups, slave syncs, separate copies for testing, and smaller isolated copies for development becomes a huge issue). I like the approach offered by Spock Proxy (http://spockproxy.sourceforge.net/), but I haven't had a chance to try it out yet.

It's also important to note that in a distributed scenario like this where you're probably using the slaves for the majority of your reads, you need a way to force some reads to go to the write master in case you can't wait for the replication sync.

If you're interested in sharing them, I have a really nice set of diagrams I did for the Kickstart project illustrating this architecture and various failover scenarios.

(Also, in paragraph 8, I think you meant "two master databases, one in each of our two colocations".)

Hi, Just came across this

Submitted by maggie pong (not verified) on December 12, 2010 - 1:30pm.

Hi,

Just came across this post. I don't work at flickr but I know enu about how their database works. The multi-master scenario for flickr is more like sharding. We don't store all data in a single mysql setup (master, slaves). You hash the data by key and distribute them among multiple mysql setup. Each setup (shard) runs the same dual-master-single-writer setup as Jay mentioned. This should help in your case where you have a mysql database that gets too big and is pushing the limit of a single server.

Hope this helps.

Maggie

mysqlguy.net