posted May 18, 2014, 1:54 AM by Sami Lehtinen
updated May 18, 2014, 1:55 AM
- Massive IT fail, my own thoughts and a personal confession. I'm so glad that I haven't done anything like that. Well, I have had one very close call, but I recovered from it so that the customer didn't even notice.
Once I screwed up production data in one database table. But that was because I was sick at home, feverish and almost like drunk, and the customer applied horrible pressure to make immediate changes straight into production. Technically the screwup was really small, just one missing newline. But it got replicated to tons of computers, and the data being collected was affected by it. Of course it was possible to clean it up afterwards, and it didn't stop production, but the cleanup was painful as usual.

The most annoying thing was that I actually noticed my mistake during the run, just by checking the data being processed, and I tried to stop the update at that point, but it was already partially done and had started to get replicated. I just wish I had reversed the order: first check the data, then process it, instead of first starting the process and then checking the data while waiting for it to finish. A clear fail. There's no reason why I couldn't have done that before actually updating the tables, because I was of course able to run the process in steps. That's the thing that really bugged me personally: a clear failure to verify a key thing.

But there's one more key point: the failure I experienced didn't actually have anything to do with the change I made for the customer. It was a change which was done earlier and just wasn't in production yet. So I did check the things which I expected to be worth checking, meaning the changes I made and the things affected by them. But I walked into a secondary trap, laid in the code base several weeks earlier, which clearly wasn't properly checked at the time that change was made. After all, I could have avoided the problem very easily by checking all data in the processing steps and verifying it before running the final update to production. So this error is very human: hurry, pressure, not feeling well, so let's just get it done quickly and that's it.

- This could be a perfect example from the Mayday / Air Crash Investigation TV show of how to make things fail catastrophically.

- Fixing the data issue in the database on the primary server took only about 15 minutes, but I'm still quite sure there were hidden ripple effects from this event which probably cost about two days of work indirectly. Having a database backup would have been one solution, or using a test environment, but neither was available due to the time pressure and me being at home. And because the production system was live, a backup would have been worthless anyway: restoring it would have 'rolled back' way too many transactions.
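The lesson above, validate first, only then touch the production tables, and keep the write inside a transaction so a partial run can be rolled back, can be sketched roughly like this. This is a minimal illustration with SQLite and a made-up table and validation rule, not the actual system described:

```python
import sqlite3

def validate(rows):
    # Hypothetical rule for illustration: the name field must be a
    # non-empty string. In reality this would be whatever checks the
    # processing steps allow you to run before the final update.
    return all(isinstance(r[1], str) and r[1] for r in rows)

def apply_update(conn, rows):
    # Step 1: verify the data BEFORE touching production tables.
    if not validate(rows):
        raise ValueError("bad input, refusing to update production")
    # Step 2: write inside a transaction; "with conn" commits on
    # success and rolls back automatically if an exception is raised,
    # so a failed run never leaves a half-replicated state behind.
    with conn:
        conn.executemany("INSERT INTO items(id, name) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items(id INTEGER, name TEXT)")
apply_update(conn, [(1, "ok"), (2, "also ok")])
```

The point is simply the ordering: the check runs to completion before the first row is written, instead of while the update is already replicating.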
Yet another really dangerous way of doing things is to remote-connect to a workstation, open database management software there, and connect it back to the server. In that situation it's so easy to accidentally give commands to the server while thinking you're commanding the workstation. Luckily I have never failed with that, but I have often recognized the risk of major failure, and so have my colleagues.
- Cloud DR and RTO:
In many cases having the data isn't the problem. If it's in some application-specific format, accessing it can be the real problem when the primary system isn't working. Let's say you're using accounting system XYZ. They provide you an off-line backup option where you get all the data. Something very bad happens and the company / their systems disappear. Great, now you have the data, but accessing and using it is a whole other story. Let's say they used something semi-common, like MSSQL Server or PostgreSQL, and you got gigabytes of schema dump. Nothing is lost, but basically it's totally inaccessible to everyone. If you have an escrow agreement, great. Then starts the very slow and painful process of rebuilding a system which can utilize that data. Of course if you've got competent IT staff, they probably can hand-pick the "most important vital records" from that data, but that's nowhere near the level needed for normal operations. So RTO can be very long, like I said earlier. I'm sure most small customers don't have their own data at all, nor do they have escrow to gain access to the application in case of a major fail.
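To make the "hand-picking vital records" idea concrete: if the dump is plain-text SQL (as from pg_dump), even a trivial script can fish out the rows of one critical table without the original application. A minimal sketch, with an invented dump and a hypothetical table name "customers":

```python
import re

# Invented miniature stand-in for a gigabytes-sized plain-text SQL dump.
DUMP = """\
CREATE TABLE customers (id integer, name text);
INSERT INTO customers VALUES (1, 'Acme Oy');
INSERT INTO orders VALUES (10, 1, '2014-05-01');
INSERT INTO customers VALUES (2, 'Example Ltd');
"""

def extract_inserts(dump_text, table):
    # Keep only the INSERT statements targeting one vital table.
    pattern = re.compile(r"^INSERT INTO %s\b.*;$" % re.escape(table), re.M)
    return pattern.findall(dump_text)

vital = extract_inserts(DUMP, "customers")
```

This recovers the raw records, but as said, it's nowhere near a working system: no application logic, no reports, no integrations, which is exactly why the RTO stays long.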
Let's just all hope that something bad like that won't happen, because it'll be painful even if you're well prepared. I have several systems where I do have the data and escrow, or even the source. But I assume setting up the system will take at least several days even in the cases where I do have the source code for the project(s). In some cases the situation could be much worse. Let's say that the service provider was using PaaS and it failed and caused the problem. Now you have software based on AWS, App Engine, Heroku or something similar, but the primary platform to run the system isn't available anymore. Yet again, you can expect a very long RTO. But competent staff will get it going at some point, assuming that you have the code and data.
- Checked out services like: Pandoo "Web operating system", Digital Shadows "Digital attack protection & detection", Wallarm "Threat and attack detection & protection", ThetaRay "Hyper-dimensional Big Data Threat Detection", Divide "BYOD", CyberGhost "VPN", and Lavaboom "private email".
- Studied a few more protocols: PCP and NAT-PMP. Yet IPv6 should make all these workaround protocols unnecessary. I hope nobody really is going to use NPTv6.
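NAT-PMP is a good example of how simple these workaround protocols are. Its port-mapping request (RFC 6886, section 3.3) is a fixed 12-byte UDP packet sent to the default gateway on port 5351. A sketch of building one; the helper name is mine, and actually sending it would of course need a real NAT-PMP gateway:

```python
import struct

NATPMP_PORT = 5351  # UDP port on the default gateway (RFC 6886)

def natpmp_map_request(internal_port, external_port, lifetime, udp=True):
    """Build a NAT-PMP port-mapping request packet.

    Layout (all fields big-endian): version=0 (1 byte),
    opcode 1=UDP / 2=TCP (1 byte), reserved zeros (2 bytes),
    internal port (2), suggested external port (2),
    requested mapping lifetime in seconds (4).
    """
    opcode = 1 if udp else 2
    return struct.pack("!BBHHHI", 0, opcode, 0,
                       internal_port, external_port, lifetime)

pkt = natpmp_map_request(8080, 8080, 3600)
# 12 bytes, ready to be sent over UDP to the gateway on port 5351
```

With IPv6 and end-to-end addressing there is simply no NAT box to ask, which is why these protocols should become unnecessary.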