OVH SBG2 Data Center Fire
It's so much fun to wake up in the middle of the night when the IoT alarm device and my phone go crazy with service alerts. In the middle of the night I just figured out that something big and bad was happening, since things were going down en masse. I decided it was serious enough that they must already know about it, and went back to sleep.
In the morning I started digging through the news, and it turned out to be a fire in SBG2 (OVH Strasbourg). Ouch.
All services located there were down. Yes, of course there was multi-location distribution, but SBG2 was the primary location. Some servers were in SBG1; the initial announcement was later revised to say that roughly 4/12 of SBG1 had burned down as well. Waiting for more information.
I can't disclose publicly on my blog how many services were affected, but it was more than a few. So many that recovery from the disaster recovery backups would be painful. I started the mass restores immediately (I've automated the process) and got all the backups restored while waiting for more news.
Also, one of the DNS providers I was personally using failed at a very early stage; I got several warnings immediately. Investigating, I found that the DNS provider had three servers for redundancy, but all of them were in the same SBG data center. -> Fail... Since then the provider has made small changes, and now one of the servers is in Palo Alto, CA, US and another in Frankfurt, DE.
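The "three redundant servers, one failure domain" problem above is easy to catch with a crude automated check. A minimal sketch (the IP addresses are made up for illustration): flag a name server set whose resolved addresses all fall into the same network prefix, which is a rough hint that they may share a location. A real check would also look at ASNs and announced geography.

```python
import ipaddress

def shares_network(ips, prefix=16):
    """Return True if all name server IPs fall into the same
    /prefix network -- a crude hint that the 'redundant' servers
    may share a data center (and a failure domain)."""
    networks = {
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for ip in ips
    }
    return len(networks) == 1

# Hypothetical example data: three name servers in one network
# versus a geographically spread set.
same_dc = ["141.95.10.1", "141.95.10.2", "141.95.11.3"]
spread = ["141.95.10.1", "8.8.8.8", "198.51.100.7"]
```

Running `shares_network(same_dc)` flags the first set, while the spread set passes; the /16 heuristic is deliberately coarse and only meant as a first-line sanity alarm.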
Reading the old OVH tickets was, ah, kind of painful. One of those tickets said (in 2017) that a power outage is probably the worst thing that can happen to a data center. Oh well, how wrong they were. I remember all the people whining about that power outage back then. To be honest, it's not nearly the worst kind of disaster that can happen. Whenever I've seen news about floods, I've pictured a huge flood carrying the whole OVH SBG data center away, with the data center containers floating in the sea and the river delta.
From a data management point of view, which is worse: a temporary power outage, fire, or flood at the data center (physical destruction), or an APT threat which gains full access to all management and backup planes, steals all your data, encrypts it, and then demands a ransom? A data center accident at least still allows you to recover from off-site backups.
I'm also quite glad that I had automated mass-restore and test scripts in store, which allowed me to quickly recover a large number of backups using backup-specific encryption keys. - Of course, the backup private keys are normally kept offline. Normally those scripts are used for periodic backup restore and integrity tests on separate restore-testing systems.
One drawback of efficiently de-duplicated backups is that restoring individual backup sets from the data set can be quite slow, because reassembling the de-dupe blocks requires lots of random read/write access due to data fragmentation in the backup. Well, at least this process was immediately launched as multiple parallel instances when I got the news (34 minutes after the official OVHcloud tweet) and a few hours after my monitoring systems first indicated the SBG failure.
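Launching the restores as multiple parallel instances could look something like the following sketch (`restore_fn` stands in for whatever per-backup restore command is actually used): since each job is I/O-bound on random reads from the de-duplicated store, running several at once keeps the disks and network busy, and one failed restore must not stop the rest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def restore_all(backup_ids, restore_fn, workers=8):
    """Run many independent restore jobs in parallel.
    Returns {backup_id: result_or_exception} so failures can be
    retried later without blocking the successful restores."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(restore_fn, b): b for b in backup_ids}
        for fut in as_completed(futures):
            backup_id = futures[fut]
            try:
                results[backup_id] = fut.result()
            except Exception as exc:
                # Record the failure and keep the other jobs running.
                results[backup_id] = exc
    return results
```

Collecting exceptions per backup instead of raising is the important design choice in a mass-restore: in a disaster you want everything recoverable back first, and the stragglers triaged afterwards.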
At first it was decided that it's better to see how large the damage was. The off-site backups aren't, of course, fully real-time backups, but backups specifically designed for total-loss-of-data-center disaster recovery situations. This means the restore process isn't as quick as it could be, because this is a situation which basically should never happen. - Eh, right?
After receiving the information a few days later that the cluster sbg-1-XXX systems were totally destroyed, except potentially sbg-1-012, it was clear that the restoration needed to continue under the total-loss-of-data-center protocol, and that work resumed immediately with enough resources. It was decided that all the servers would be migrated to another hosting provider nearby. All the servers were quite old anyway, and there was a window of at most two years before they would have been migrated to a new platform for other reasons.
In hindsight, it was easy to say that the total-loss-of-data-center copies could have contained more data; this time only the essential data and configuration were included, and anything considered "trivial" to recreate wasn't. - Yes, it would have been better to have hourly full disk copies of every system going back years, but in the real world that's not going to happen. The cost of maintaining such data for an event which is highly unlikely to happen is just too great. - Usually in these kinds of situations people ask why something like real-time replication or multi-master systems weren't in place. Well, of course we're happy to do that, if you're willing to pay N times the monthly fee. But honestly, that question often comes from the people who keep continuously nagging about the current monthly fees, so it's kind of easy to almost laugh at it. - None of the systems being recovered were classified as critical or even business-critical systems.
Even if extreme backing up and mirroring is great for disaster recovery and later analysis, it's also bad in some ways. How do I explain, in light of GDPR and other privacy regulation, that information which was deleted is being retained for 10 years? Sure, it was deleted when you asked for it, but it's still sitting in the backups for 10 years or so? - Hmm, that doesn't look too good either. Of course the remedy is that the backups are encrypted (of course!) and the encryption keys are held by a very small team, which means access to the backups needs to be properly authorized - using Shamir's secret sharing, so that no single person can restore the backups alone. But I still think that's not what the lawmakers were asking for, even if they can't prove it. Nope, sorry, everyone seems to have forgotten the encryption keys to the petabytes of backups. Would that sound believable? Even when someone keeps paying the bills for all of that storage? Would you believe it?
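For readers unfamiliar with Shamir's secret sharing, the idea is that a secret (here, a backup encryption key, treated as an integer) is split into n shares such that any k of them reconstruct it, but k-1 shares reveal nothing. A toy sketch over a prime field follows; a real deployment would use a vetted implementation (e.g. `ssss` or HashiCorp Vault's unseal keys), not hand-rolled code like this.

```python
import random

# All arithmetic is done modulo a prime larger than the secret.
_PRIME = 2**127 - 1  # a Mersenne prime, fits a 15-byte key chunk

def split(secret, k, n, prime=_PRIME):
    """Create n shares with threshold k: the secret is the constant
    term of a random degree (k-1) polynomial, each share is a point
    on that polynomial."""
    coeffs = [secret] + [random.randrange(prime) for _ in range(k - 1)]
    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % prime
        return acc
    return [(x, poly(x)) for x in range(1, n + 1)]

def combine(shares, prime=_PRIME):
    """Lagrange interpolation at x=0 recovers the constant term,
    i.e. the secret, from any k valid shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret
```

With e.g. `split(key, 3, 5)`, shares can be handed to five team members, any three of whom can authorize a restore together - which is exactly the "no single person can restore the backups" property described above.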
Still waiting for the servers in SBG to come online. Some of the recoverable servers should be online during early next week. A vague time reference, but this means around two weeks after the fire. - Update: After writing this initial post, and during the publishing process, it turned out that only a single server survived the fire. The information OVH initially provided about the cloud server locations and clusters was highly inaccurate and misleading. It seemed that even they didn't know where the stuff was, and what was left and what was gone. It was a huge mess on their side as well.
Now the servers in SBG which didn't burn are also online. Awesome. OVH SBG disaster recovery: I have to say that it went better and faster than I expected, once the process got fully launched. I'm not referring to OVH's part, which basically ended when they asked for customer-side disaster recovery plan (DRP) activation.
Much later, still waiting for credits & refunds. OVH has usually been really slow and difficult with these cases. I'm kind of expecting that even if they promise to pay, they'll keep charging and not pay. That would be the classic approach many companies use to scam people. I'll be positively surprised if they actually keep their promise and do what they said, instead of trying to scam their customers. - Update: The credits and refunds finally came, but it took a really long time. They even kept charging for some of the services for two months, even though the servers were physically gone along with the data.
To sum it up, things went really bad, but could have been a lot worse. I'm pretty happy with how we handled the case, and there's not much to say about how OVH handled it; it's obvious.
Btw. I'm still wishing for a TV series about ICT and data center disasters. No, not a CSI kind of show, but a good documentary series - something similar to the Engineering / Aviation Disasters series.
Posted: 2022-05-22 with updates - For reference the initial writing day is around: 2021-03-05