Unique ID, Sanity Checks, I/O Latency (Ceph), Yuan/SDR, Databases and locking

posted Nov 19, 2016, 10:12 PM by Sami Lehtinen   [ updated Nov 19, 2016, 10:13 PM ]
  • It's wonderful how people refuse to accept the facts. Once again I had a very long call about identifying transactions without a unique identifier. I think it's totally insane to try to process data in global systems with some random crap identifiers. The only way to make it work is to use solid unique identifiers. To be honest, this topic has now been discussed for several years. A secondary issue is the user experience. Current operators in the field, with their legacy software and bad processes, usually cause delays of hours to days in data processing. They're totally unprepared for real-time processing, and everything is based on decades-old technology. This is a perfect example of why it would sometimes be much better to clear the desk and throw out the old junk. Building a new system from scratch would be much easier than trying to add flashy new features to extremely old technology which nobody wants to maintain or modify.
  • Tutanota 2.11 finally allows viewing of HTTP headers, that's awesome. The lack of this was one of the reasons why I really didn't like their service at all.
  • N+2 redundancy won't help at all if the systems are seriously misconfigured. There are just so many different failure vectors. Is there automatic protection against misconfiguration? Something like unit tests that check whether the system will still be functional with the new configuration? That's a good question. - This is related to the previous Admin Post / DNS issues, but it's also a generic question. Why does a system accept a configuration which is clearly insane and would crash everything? How about doing some sanity checks?
  • Just wondering why the DNS server is shown as Server: 127.0.1.1 until I run sudo dhclient, and after that it shows the entries from dhclient.conf. Interesting. I guess that's also something that's obvious if you just know what you're doing. I clearly don't.
  • Worked with HD Tune, bonnie++ and fio to check out some disk system performance issues. The results are still a bit unclear, but I'll try to get more analytics. The problem is the usual case: users claim the system is slow, I can see that disk I/O is maxed out all the time, but system administration claims that there's nothing wrong with the system or its performance. Sigh. I hate being in this position, but this is all too common. It's almost always like this. Nobody ever says oh yeah, it sucks, I'll fix it. Instead they'll spend two weeks working on reasons why not to do a day's work. If I can collect some data that I can publish suitably, so that it won't annoy anyone too much, I'll do it. It seems that most tests write and read the same data repeatedly, which is very bad if your performance problem is caused by random cold reads on tiered storage. Naturally the problem never exists when you run the tests, because the tests do not test for that particular case. With HD Tune you can get results if you run it on a freshly booted server, without any other tasks causing data to be cached, and let the backend systems 'cool down' for long enough. I think I'll get some results if I simply dd the whole drive to null, which also forces reads of the disk areas that aren't commonly read. In some cases the I/O latency for these cold data hits is in the tens of seconds. - Some tests reveal that empty areas on the disk are read much faster than the areas containing data. - Disk system jitter is also huge at times: lat (msec) : 100=2.75%, 250=0.96%, 500=0.13%, 750=0.02%, 1000=0.01%, 2000=0.05%, >=2000=0.07% - If you happen to hit a few of those over 2k in a row, it's going to be a bad day. This is exactly the problem I was complaining about. Disk I/O can suddenly stall for several seconds or even tens of seconds. Storage is backed by Ceph Block Storage.
  • Yuan to be included in the SDR, interesting. China is clearly making important progress on the global power scale.
  • Even more discussions with database engineers about extremely simple issues like counters. They were wondering why incrementing the counter sometimes fails. This should be such basic CS stuff that I don't even know what there is to discuss. If you require a sequential counter, then only one update can run at a time, and that's it. Yes, it causes a rate limit. There's nothing to discuss about this matter. Yet they're still wondering why this is happening. - I thought CS people would be fairly logical guys with stuff like this. Another option is of course pessimistic locking, which prevents any parallel access to the data, but that's even worse than optimistic locking, which fails on commit when there's a conflict. - Optimistic locking is what I prefer in all of my projects, for a good reason. It can be problematic if the database doesn't provide snapshot isolation from the beginning of the transaction. But if you know in detail how the database works, you can build this as an additional layer on top of it. Compare and Swap is the easiest way to implement "optimistic locking". Yet in some cases, if there are latencies involved, the retries can make things go from bad to worse. Always form an overall picture of the situation and use it to make the decisions. There's no silver bullet solution out there which would make everything work magically.
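
To put the fio latency line from the disk I/O item into perspective, here is a minimal Python sketch that parses that exact line and works out how many multi-second stalls to expect for a given number of I/Os. The function names and the figure of 100,000 I/Os are my own assumptions for illustration. I'm also assuming each label is the upper bound of its latency bucket, which is how I read fio's output.

```python
import re

# The latency histogram line quoted in the disk I/O item above.
FIO_LINE = ("lat (msec) : 100=2.75%, 250=0.96%, 500=0.13%, 750=0.02%, "
            "1000=0.01%, 2000=0.05%, >=2000=0.07%")


def parse_buckets(line):
    """Return (label, percent) pairs, e.g. ('>=2000', 0.07)."""
    pairs = re.findall(r"(>=\d+|\d+)=([\d.]+)%", line)
    return [(label, float(pct)) for label, pct in pairs]


def expected_hits(percent, total_ios):
    """Expected number of I/Os landing in a bucket holding `percent` of them."""
    return percent / 100.0 * total_ios


if __name__ == "__main__":
    buckets = dict(parse_buckets(FIO_LINE))
    total_ios = 100_000  # hypothetical workload size, not from the test run
    worst = buckets[">=2000"]
    print(f"{worst}% of I/Os took 2 s or more: roughly "
          f"{expected_hits(worst, total_ios):.0f} stalls per {total_ios} I/Os")
```

A share of 0.07% sounds tiny, but a busy database does a lot of I/O, and in practice these stalls tend to arrive in bursts rather than evenly spread out, which is what makes them hurt.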
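
On the sanity check question from the N+2 item: a rough sketch of what "refuse an insane configuration" could look like. Everything here is hypothetical, including the config keys (dns_servers, replicas) and the function names. It isn't taken from any real system. The point is only that the candidate configuration gets validated before it replaces the running one.

```python
import ipaddress


def check_config(cfg):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    servers = cfg.get("dns_servers", [])
    if not servers:
        problems.append("no DNS servers configured")
    for server in servers:
        try:
            ipaddress.ip_address(server)
        except ValueError:
            problems.append(f"invalid DNS server address: {server!r}")
    if len(set(servers)) != len(servers):
        problems.append("duplicate DNS servers")
    replicas = cfg.get("replicas", 0)
    if not isinstance(replicas, int) or replicas < 1:
        problems.append("replicas must be an integer >= 1")
    return problems


def apply_config(current, candidate):
    """Switch to `candidate` only if it passes the checks.

    Returns (active_config, problems). On failure the current
    configuration stays active and the problems are reported.
    """
    problems = check_config(candidate)
    if problems:
        return current, problems
    return candidate, []
```

Checks like these only catch the obviously broken cases. They don't prove that the system will actually work with the new configuration, so they complement testing in a staging environment rather than replace it.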