Admin post - Massive (communication) DNS fail, Serious server availability problems
Post date: Nov 20, 2016 6:03:50 AM
In one request we asked the provider to update DNS and remove a listed bunch of A records, along with all information about them. First of all, it took them about a week to process this simple request. As always when doing something important, I had paused for a while and rechecked the request before submitting it. I thought about it for a while and was happy with it. Several days later, at 15:28, they acknowledged the ticket and closed it without mentioning what they had actually done.
At about 16:30 I got the first alerts that systems were shutting down and things were going wrong. I was like, what is this? Is it a DDoS? I had seen one during the same day on a few servers. I started looking at network metrics data and basic stuff like that. Checked the hosting provider's status boards, nothing. Checked network latencies, traced a few routes, checked if DNS works, checked network performance logs and charts. But then I noticed that some DNS queries were taking a bit longer than expected. Logs also showed that only one domain was affected. I wondered for a while whether they had mishandled the DNS update request. But things still partially worked, so I wasn't too worried. Then I got a few DNS timeouts. I thought that was a bit strange. Are the DNS servers being attacked?
Then I started to look up stuff with dig, and there it was. They had replaced our domain's name server (NS) entries with the names of the servers that were supposed to be removed from the DNS zone file. Sure, a slight misunderstanding. There's a double language barrier between the provider and us. Plus the servers being removed were named slightly confusingly: clearly not NS or DNS, but something quite similar. Darn. They had set the new DNS servers with a TTL of 7200 seconds. At this point it had been just a bit over two hours since the DNS update. I started frantically calling contacts at the DNS / domain administration to get this fixed. Finally I got the right guy on the line, and it took just about 5 minutes to get the problem fixed.
Luckily it was something like 18:30 at that point and basically nobody noticed this major mishap. Phew. It would have been just so interesting to wait until 9 am before doing anything about this, and then wait two more hours for the information to expire from caches and get refreshed. That wouldn't have been fun at all. So, a small misunderstanding. But automated monitoring, alerts and quick analysis of the situation saved the next day.
One thing which delayed me quite a bit was that I wasn't at a proper computer when the first failures got logged and the alerts were sent. I just did very basic checks on mobile, to see if all the key things were working. As mentioned, things were working, because the data was still cached by the DNS servers being used. That delayed the actual start of the analysis by more than an hour. After that I reached a suitable place to really dig in. He.net Network Tools is just an awesome mobile app, that's what I used for the first checks. Am I happy with the result? Of course not, it shouldn't have happened in the first place. Yes, I got it sorted out reasonably quickly once I noticed it. It's good to remember that I'm only a hobbyist with these things.
This is like an air crash investigation case. There were several overlapping things which made the response much slower than expected. Also, if things were done manually, I would have expected them to check that the new DNS servers respond to DNS queries, which those of course didn't. Those weren't DNS servers at all. If they thought the request wasn't clear enough, why didn't they call me or confirm its true intent? Without proper confirmations anyone could easily spoof that kind of request on purpose. After the issue got fixed and there was a postmortem of the case, all of their guys wondered how it had happened, because the request was quite clear. Probably the reason was that I sent the request in Finnish and the guys handling it spoke Swedish as their native language, even though it's a Finnish company. If I had sent it in English, I guess we wouldn't have this story.
kw: Reason For Outage (RFO), Incident Post-mortem, Downtime, DNS fail.
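The check that was missing in this story (do the new name servers actually answer DNS queries for the zone?) can be sketched with nothing but the Python standard library. This is only a rough sketch under assumptions, not what the provider uses: the function names `build_query`, `is_authoritative_answer` and `nameserver_answers` are made up for illustration, the server is given as an IPv4 address, and only a plain UDP SOA query is tried. A host that isn't a DNS server at all simply fails the check.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary id, only used to match the reply to our query


def build_query(zone: str, qtype: int = 6, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query packet (QTYPE 6 = SOA, class IN)."""
    # Header: id, flags (standard query, no recursion), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_authoritative_answer(response: bytes, query_id: int = QUERY_ID) -> bool:
    """True if the reply matches our id, is authoritative, and has an answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)    # QR bit
    authoritative = bool(flags & 0x0400)  # AA bit
    rcode = flags & 0x000F                # 0 = no error
    return (rid == query_id and is_response and authoritative
            and rcode == 0 and ancount > 0)


def nameserver_answers(server_ip: str, zone: str, timeout: float = 3.0) -> bool:
    """Ask server_ip for the zone's SOA; False on timeout or any socket error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(zone), (server_ip, 53))
            data, _addr = sock.recvfrom(512)
        except OSError:  # includes timeouts
            return False
    return is_authoritative_answer(data)
```

Running `nameserver_answers()` against every server listed in the new NS set, before the change goes live, would have flagged this kind of mix-up: servers that are not authoritative for the zone either never reply or reply without the AA flag.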
Actually, I'm only a hobbyist in practically everything I do. Why? I do so many things that I can't really dig deeply into anything. More like a tech generalist. Sometimes it's greatly beneficial, sometimes it's really bad, because I'm practically incompetent in everything I do.
More complex and painful server issues. Servers just end up being dead in the water. No reason whatsoever. Just so wonderful. Then the people who have access to the platform refuse to give any details. This is just the usual case. I've encountered similar situations over and over again. There's a suspected source, or it might be related to something else, but the people responsible for that part won't do anything. So the next step is basically trying to circumvent the problem somehow, or to change that component / platform / code / person / organization / supplier, etc. out of the chain. Pretty frustrating, especially in cases where the problem is quite small but still bad. Like a server crashing rarely but corrupting lots of data when it does, or a delivery company losing shipments completely, and so on. That can't be tolerated in general. Totally horrendous day: server issues, DDoS issues, RDP issues, Windows being just dead, nothing in logs. But that's nothing. Also read the next bullet; it's basically the same problem I've been dealing with lately.
*Black Screen of Death*, Windows ends up being dead in the water. Does anyone know what's causing this with VPS servers? I've been trying to analyze the problem, but it's really hard, because there's really little to analyze. Nothing in logs etc. Often even the processes' listening sockets technically still work up to a point, but the logs don't show anything. I'm personally thinking it could have something to do with disk I/O being slow, or being so slow that writes / reads time out. But I don't know. It's super hard to get anything to analyze about this problem. Maybe I'll need to write a separate debug program which monitors all these aspects and reports disk parameters etc. out over a socket. But what if it also dies without any sign? Any pro-tips? I can make some poor nerdy guesses, but it's really hard to come up with proper facts backing up those guesses. I've already put basic monitoring in place which checks CPU, RAM and disk I/O. So far the monitored systems haven't had any of the problems. The most annoying part of all this is of course the fact that it happens rarely, which makes debugging much slower and harder. Naturally, the systems where I'm running the extended system monitoring haven't had any problems. (sigh) Sometimes even RDP works, but shows a black screen for several minutes and then disconnects. This should be a pretty clear indication of what's wrong for someone who knows the leading causes of such a problem on a VPS server. Nagios Core also rocks.
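The "separate debug program" idea above could look roughly like the sketch below. It is only an illustration under assumptions: the function names, the JSON fields and the UDP transport are all made up here, and the slow disk I/O theory is still just a guess. Each round it times a small write plus fsync and sends the result to a collector on another machine. If the disk hangs, the write blocks and the heartbeats stop, so the collector side treats silence as the signal.

```python
import json
import os
import socket
import tempfile
import time


def measure_write_latency(directory: str, size: int = 4096) -> float:
    """Seconds taken to write `size` bytes to a temp file and fsync it."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        os.write(fd, b"\0" * size)
        os.fsync(fd)  # force the data to disk, not just into the OS cache
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)


def build_heartbeat(seq: int, latency: float, now: float) -> bytes:
    """Encode one measurement as a small JSON datagram."""
    message = {"seq": seq, "ts": now, "write_latency_s": round(latency, 6)}
    return json.dumps(message).encode("utf-8")


def is_stale(last_ts: float, now: float, interval: float, missed: int = 3) -> bool:
    """Collector side: True once more than `missed` intervals passed silently."""
    return now - last_ts > interval * missed


def run(host: str, port: int, directory: str, interval: float = 10.0) -> None:
    """Measure and report forever; meant to run as its own process."""
    seq = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        while True:
            latency = measure_write_latency(directory)
            sock.sendto(build_heartbeat(seq, latency, time.time()), (host, port))
            seq += 1
            time.sleep(interval)
```

Something like `run("192.0.2.10", 9999, "C:\\Temp")` would start the reporter (the address, port and directory are placeholders). UDP is chosen so that the reporter never blocks on the network; the last few latency values received before the heartbeats stop are the interesting part.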