
So you want to talk about Single points of failure, eh?

Submitted by jay on September 22, 2008 - 8:52am

In reply to Arjen's post about Single points of failure:

Arjen, you are absolutely right.  It doesn't matter how over-engineered a storage solution is (I'm thinking of a giant dual-headed NetApp with redundant everything).  After you've paid a few hundred K for that, you still have a single point of failure.  Is it a highly-unlikely point of failure?  Sure, but it's still a point of failure. 

Let's take it a step further: at Yahoo we're beyond thinking about how to make a single node redundant (be it for storage, networking, or even a simple webserver); we consider entire datacenters to be single points of failure.  What does that mean?  

That means whether you own the datacenter or someone else does, it can and will go down.  

That kind of dwarfs concerns about a single SAN solution, doesn't it?  But I defy anyone to tell me that it isn't true.  We've all had colos go down, and there hasn't been a thing we could do to prevent it from happening again.

So from an application and database perspective, it means redundancy and (at least) fast failover, if not automatic failover.  It means running hot/hot wherever possible (hot/cold tends to lead to failover sites that don't work).  It usually means dual-master for MySQL, but with single-writer (only one master is active at a time) to maintain consistency.  
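The single-writer rule above can be sketched as a tiny model (the class and function names here are mine for illustration, not Yahoo's actual tooling): both masters replicate from each other, but only one is ever writable, and a failover demotes the old writer before promoting the new one.

```python
class Master:
    """A MySQL master in a dual-master pair (simplified model)."""
    def __init__(self, name, read_only=True):
        self.name = name
        self.read_only = read_only  # stands in for MySQL's global read_only flag

def promote(new_active, old_active):
    """Switch the single writer.  Demote the old active first so that two
    writable masters never coexist, which would risk conflicting writes."""
    old_active.read_only = True   # think: SET GLOBAL read_only = ON
    new_active.read_only = False  # think: SET GLOBAL read_only = OFF

# Hot/hot pair, single writer:
dc_a = Master("colo-a", read_only=False)  # currently the active writer
dc_b = Master("colo-b")                   # hot standby, replicating

promote(dc_b, dc_a)  # failover: colo-b becomes the sole writer
```

The demote-before-promote ordering is the whole point: a brief window with zero writers is annoying; a window with two writers breaks consistency.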

But even having two datacenters isn't enough if they are both in the Bay Area: a single earthquake could theoretically take them out at the same time.  So even geographical regions are a single point of failure for us; our redundant colos have to be spread far enough apart that the same natural disaster is unlikely to take out both of them at once.  

That policy eliminates the ability (usually) to do any kind of synchronous data replication to both locations at the same time, so a good data retention policy is in order.  If you lose a datacenter, when do you fail over?  If you cut over to your secondary master immediately, you risk losing the down master's unreplicated data because of consistency concerns.  If you wait, you are trading downtime against potential data loss: if that master comes back up and starts replicating, you have a chance of not losing anything.  
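That cut-over-now versus wait tradeoff can be made explicit as a policy.  Here's a minimal sketch (the function, parameter names, and thresholds are all illustrative assumptions, not a real runbook):

```python
def failover_decision(replication_lag_s, max_data_loss_s,
                      max_downtime_s, downtime_so_far_s):
    """Decide whether to cut over to the secondary master now, or keep
    waiting for the downed master to come back and drain its binlogs.

    replication_lag_s: how far behind the secondary was at failure time;
    cutting over now loses roughly this much worth of writes.
    """
    if replication_lag_s <= max_data_loss_s:
        return "failover"  # the unreplicated window is acceptable; cut over
    if downtime_so_far_s >= max_downtime_s:
        return "failover"  # we've waited as long as the business allows
    return "wait"  # keep waiting: the master may return and replicate the gap
```

For example, with 30 s of lag, a 5 s loss budget, and a 10-minute downtime budget, you'd wait at minute two and cut over (eating the loss) at minute ten.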

All the new whiz-bang "high availability" cluster solutions out there are great, but they assume big pipes between their nodes with low latency.  If you're honest with yourself, they may solve some of your problems, but they can't protect you from data loss in the event of a colo failure (or, admittedly very unlikely: complete colo destruction).  
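That latency assumption is easy to quantify: a synchronous commit can't complete faster than the round trip to the remote node, so a long-haul link hard-caps serialized commit throughput.  A back-of-the-envelope sketch (the 70 ms figure is an assumed coast-to-coast RTT, not a measurement):

```python
def max_serialized_commits_per_sec(rtt_ms):
    """Upper bound on serialized synchronous commits per second when every
    commit must be acknowledged by a node rtt_ms away."""
    return 1000.0 / rtt_ms

same_rack = max_serialized_commits_per_sec(0.5)      # sub-millisecond RTT: thousands/s
cross_country = max_serialized_commits_per_sec(70)   # ~70 ms RTT: on the order of 14/s
```

That's why these clusters work great within a colo and fall apart when you try to stretch them between geographically separated ones.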

That's where "high-availability" really gets interesting.  



Yeah.. even Earth is a single point of failure. There is no failover if we get hit by an asteroid. Let's start building datacenters on the moon or on Mars!



Wasn't Google working on that? :)


What about the U.S.? Can we consider the U.S. a "single point"? What can you do if the whole U.S. fails?



Obviously you have to draw the line somewhere.  The least likely scenario we decided to consider was a natural disaster that destroys a single geographical region.  

Much like the moon example above, if the whole earth, or if the whole U.S. failed, I'm taking the day off.  


As much as this may seem like a silly rat-hole, it's actually really important to bring up to your higher-ups when they start talking about uptime.  For example, during the September 11th attacks, Turner Broadcasting Systems was running at half capacity (due to an upgrade in progress), and diverted all its resources from all properties to two:  CNN and Cartoon Network.  It was completely forgivable for other websites that Turner Broadcasting owned to be down, but they wanted to keep CNN online (for news) and Cartoon Network online (for kids that were home from school).

So the real question is, when talking about high availability, what are the acceptable risks?  When your higher-ups say "none", ask them if a disastrous earthquake is a good reason for being down.  Take them down this rat-hole and figure out WHERE the points of failure are, because they exist.  Most companies have a very low "forgivable" downtime point -- a natural disaster in the area is usually enough (tornado, ice storm, whatever).

And of course when choosing a data center you probably don't want to choose one in an area known for natural disasters...





About Me

Jay Janssen
Yahoo!, Inc.
jayj at yahoo dash inc dot com

High Availability
Global Load Balancing

