
OVH Strasbourg (SBG) Outage - #OVHgate

posted Nov 19, 2017, 2:33 AM by Sami Lehtinen   [ updated Nov 19, 2017, 2:35 AM ]

Power outage, worst thing that could happen?

I don't think their disaster recovery (DR) planning is really up to the task. Why? Because they claimed that a "power outage" is "the worst scenario that can happen to us".

No, it isn't. Consider scenarios like these instead:

  1. Overvoltage frying all hardware - solar flare / EMP - electronics fried over a wide area - some data centers are hardened against this
  2. Flood washing all hardware away, or a massive fire
  3. Direct air cooling plus a nice amount of volcanic ash, or something like a corrosive chemical leak
  4. A fertilizer ship / train passing the DC explodes / leaks
  5. A nation state (or any other competent party) gets pissed off at them and wipes their systems, totally hijacking control after monitoring operations for months - and they really know what they're doing
  6. Internal sabotage, where key systems are targeted either by software or physical attack
In these scenarios the whole site is more or less physically wiped out or seriously damaged. Recovering from that is far more demanding than the seemingly easy and trivial job of restoring power.

This is why I always keep full off-site remote backups, just in case. You never know; anything could happen. It doesn't matter who the hosting provider is; you can never trust an external party enough. These procedures are used for all data, no matter the project or system. Always keep data secure.
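To make the off-site backup habit concrete, here's a minimal Python sketch of the idea: archive the data, copy it off-site, and verify the remote copy with a checksum before trusting it. The paths and the "upload" step are hypothetical placeholders (a local copy stands in for rsync / S3 / whatever transfer you actually use).

```python
# Sketch: create a backup archive and verify the off-site copy's integrity.
# The upload here is simulated with a local copy; swap in your real transfer.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def create_backup(source_dir: str, dest_dir: str) -> Path:
    """Pack source_dir into a tar.gz archive under dest_dir."""
    archive = Path(dest_dir) / "backup.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=Path(source_dir).name)
    return archive

def sha256_of(path: Path) -> str:
    """Checksum used to confirm the off-site copy matches the original."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data"
    src.mkdir()
    (src / "important.txt").write_text("critical records")

    archive = create_backup(str(src), tmp)
    # "Upload": a local copy standing in for the real remote transfer.
    offsite = Path(tmp) / "offsite.tar.gz"
    offsite.write_bytes(archive.read_bytes())

    # Never assume the transfer worked - verify before relying on it.
    backup_ok = sha256_of(archive) == sha256_of(offsite)
```

The verification step matters as much as the backup itself: an off-site backup you have never test-restored or checksummed is a hope, not a plan.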

About costs of downtime

Well, costs can be indirect. It's hard to even estimate the costs of downtime and recovery. Downtime alone might not be that expensive. But if the situation had been worse and systems had needed to be restored from off-site backups, it would have cost several thousand euros immediately and directly. And even more indirectly, when users demand compensation, data is lost, extra data synchronization is needed, and data lost to restoring a potentially day-old backup has to be recreated, and so on.

Time to recover

In that situation we're probably talking tens of thousands, easily. It would have meant redirecting basically all resources to system restoration (probably at another service provider), and lots of work before everything is working again. Probably one week before the most important stuff is working, and restoring everything would have taken around one month. Also, if a big provider goes down, it can cause a sudden spike in resource demand at alternative providers, which would probably run out of capacity as soon as people realize the outage might take a very long time to resolve.

Yes, of course everyone has considered these things when making their Disaster Recovery Plan (DRP). The good thing: stuff can be restored. The bad thing: it would take a lot of time, cost a lot, and probably cause indirect costs in lost customers, tons of bad will, and so on.

There's also some data that is considered not worth backing up daily to an off-site location, because it isn't "critical" - yet it would still essentially have to be recreated in case of total loss of the DC and storage. This can be covered by server snapshots, taken weekly or monthly depending on the setup, or not at all. If not at all, then the configuration and related files are usually still backed up daily.

From some of the posts complaining about the situation, it seems that the users / clients hadn't made proper DR preparations. Providers like UpCloud clearly state that clients are required to keep off-site backups of all critical data.

High Availability

Also, if uptime is that important, then in these kinds of situations a restore to another provider / location should be launched immediately, as soon as the issue is detected. Better yet, there should already be alternate replicated sites where your systems can fail over automatically. - These are the discussions that pop up every time there's an issue with Amazon. If the service is important, you shouldn't trust only one Availability Zone or Region, nor should you trust even one cloud provider. - These are the topics I always bring up when someone says they need a system with high availability.
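The automatic fail-over idea above can be sketched as a simple monitor: fail over only after several consecutive failed health checks, so a single transient glitch doesn't trigger an expensive site switch. The class and threshold below are illustrative assumptions, not any provider's actual API; the real switch-over (repointing DNS, promoting a replica) would go where the comment indicates.

```python
# Sketch: trigger failover only after N consecutive failed health checks,
# to avoid flapping on one transient error. Hypothetical, minimal logic.
class FailoverMonitor:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold   # consecutive failures before failover
        self.failures = 0
        self.failed_over = False

    def record_check(self, healthy: bool) -> bool:
        """Feed in one health-check result; returns True when failover fires."""
        if healthy:
            self.failures = 0        # any success resets the counter
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.failed_over:
            self.failed_over = True  # here: repoint DNS / promote the replica
            return True
        return False

monitor = FailoverMonitor(threshold=3)
results = [monitor.record_check(h) for h in [True, False, False, False]]
# Failover fires on the third consecutive failure.
```

In practice the health checks themselves should run from more than one location, so a network problem near the monitor isn't mistaken for a dead site.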

That's why there are solutions like Google Spanner, if the data is actually that important: it can be replicated to multiple locations in real time.

But as we all well know: when absolute high availability isn't strictly required, cost is usually the reason why such solutions aren't implemented or used.