
Database developer fails, I/O stack, Big Data, IaaS & Cloud, SaaS pricing

posted Apr 5, 2014, 7:32 AM by Sami Lehtinen   [ updated Apr 5, 2014, 7:59 AM ]
  • Total database performance killer from one great developer. He decided to use where key like '%input%'. Of course, this is a solution which should never be used in production. It's fine if you have a small database and you're testing or looking for something, but in production it's a real killer. Now that query is run against a table containing a few million records, on a field which contains about 100 B - 100 kB of text per record. Because the pattern starts with a wildcard, an ordinary B-tree index can't narrow the search, so every request is a full scan. Even with a modest estimate, each request requires scanning through at least 2 gigabytes of data. It's just great. It basically triggers a massive disk I/O operation every time and also consumes a ton of CPU cycles. Anyone using the web application can trivially launch a denial of service attack by just sending a few queries to that specific URL. But of course there is a rate limit or a request limit per IP? Well, of course there isn't, that wouldn't be any fun at all. Actually, doing a little load testing would be interesting, just to see what happens. Let's just push 100 requests / second to a query which takes minutes to complete.
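    A minimal sketch of why the leading wildcard hurts, using Python's sqlite3 module and EXPLAIN QUERY PLAN. The table docs, its columns and the index name are made up for illustration, and the real system may well run on a different database engine, but the same principle applies to ordinary B-tree indexes in general.

```python
import sqlite3

# Hypothetical schema for illustration only; names are not from the real system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, key TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_docs_key ON docs (key)")
conn.executemany(
    "INSERT INTO docs (key, body) VALUES (?, ?)",
    [(f"record {i}", "some text") for i in range(1000)],
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail text is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


# Leading wildcard: the index on key cannot narrow anything down,
# so the planner has to walk through every row.
wildcard_plan = plan("SELECT id, body FROM docs WHERE key LIKE '%input%'")

# Exact match on the same column: the planner can seek into the index.
exact_plan = plan("SELECT id, body FROM docs WHERE key = 'record 42'")

print("wildcard:", wildcard_plan)
print("exact:   ", exact_plan)
```

    The exact wording of the plan varies between SQLite versions, but the wildcard query is reported as a SCAN and the exact match as a SEARCH using the index. If substring search is really required, a full-text or trigram index is the usual direction to look at.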
  • Yet another performance killer and design flaw. One huge table with hundreds of millions of records has a column named status. Great. That makes sense. Records stored into the table initially get status=0, and records which have been processed further get status=1. But now there are a few record types which do not require any processing. Guess what, the process which handles those records naturally skips the ones that do not require processing. Well, does it update the processed flag? Ha, of course not. It re-checks these records over and over again, skipping them every time. Another great question is why status is a fully indexed field. I would naturally have indexed it partially, only for status=0. If things were done smartly, there would be only a very small number of records with status=0, and the index would be small too. Now the index contains all records. Of course, the inefficient status flag handling I described above also grows the index and makes processing slower.
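    A rough sketch of how I would have done it, again with Python's sqlite3 module (SQLite has supported partial indexes since 3.8.0). The table records, the column rtype, the type names and the index name are all made up for illustration. Marking skipped records with status=1 is also just my assumption here; a separate "skipped" value would work equally well.

```python
import sqlite3

# Hypothetical schema for illustration only; names are not from the real system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " id INTEGER PRIMARY KEY,"
    " rtype TEXT NOT NULL,"
    " status INTEGER NOT NULL DEFAULT 0)"
)
# Partial index: only rows still waiting for processing are indexed,
# so the index stays small no matter how large the table grows.
conn.execute(
    "CREATE INDEX idx_records_pending ON records (status) WHERE status = 0"
)
conn.executemany(
    "INSERT INTO records (rtype) VALUES (?)",
    [("needs_work" if i % 10 == 0 else "no_work",) for i in range(1000)],
)


def process_pending():
    """Handle every pending record once and return how many were examined."""
    pending = conn.execute(
        "SELECT id, rtype FROM records WHERE status = 0"
    ).fetchall()
    for record_id, rtype in pending:
        if rtype == "needs_work":
            pass  # the real processing would happen here
        # Flag the record in both cases, so skipped types are never re-checked.
        conn.execute("UPDATE records SET status = 1 WHERE id = ?", (record_id,))
    conn.commit()
    return len(pending)


first_pass = process_pending()   # examines every record once
second_pass = process_pending()  # finds nothing left to look at
remaining = conn.execute(
    "SELECT COUNT(*) FROM records WHERE status = 0"
).fetchone()[0]
print(first_pass, second_pass, remaining)
```

    The point is that after the first pass nothing is left with status=0, so both the pending set and the partial index shrink to almost nothing instead of growing with the table.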
  • I/O Stack Optimizations - Absolutely excellent article. It also covers many different file systems, EXT4, Btrfs, XFS, F2FS and of course compares those with different SQLite3 journaling modes (Delete, Truncate, Persistent, Write Ahead Logging). I loved it. Perfect reading if you want to know how these fundamentals affect your application / system performance and why journaling a journal isn't always a good idea.
  • FT: Big Data: Are we making a big mistake? - Actually this article was quite good. But another article, which was about data analytics, was even better. The FT one only covers what unconfirmed data can lead to, while the other article actually goes into many of the details. And here's the other article: Data analysis hard parts. What they say about data analysis is nothing new at all. I'm used to building many kinds of integrations, like invoicing and bookkeeping, and basically those are data analytics. At times it requires going through massive amounts of data, hundreds of millions of records, and processing all of it precisely. If your analytics is wrong, nobody might notice. But if your bookkeeping is off by 100k, I'm quite sure that someone will notice at some point.
    Unfortunately most customers think that when they order something, it'll be perfect and fulfill whatever their needs are without any input from them. When I'm building an integration or analytics from data, any reasonable-looking numbers will seem fine to me. I can't tell whether their profit should be 30% lower or higher. This is the key reason why I always want any analysis to be confirmed by the customer's own people, who know the data and work with it daily. They can tell immediately if something is seriously wrong.
  • Why IaaS rules? Because multi-cloud comparison is nearly impossible: there are so many factors you can't reasonably affect. The simplest way of doing a multi-cloud comparison is to use the pure IaaS concept with Linux boxes. Those are much more easily comparable. If you take per-vendor PaaS features into account, the comparison gets very hard. It might also turn out that all your developers know technology X while the 'perfect solution' is built on technology Y, and therefore you can't really use it. This is a very complex topic.
    A definite and absolute pro of a pure IaaS/Linux solution is that you can easily move your servers to almost any place: your home box, a local data center, dedicated servers at the customer, the customer's private cloud, nearly any major public cloud, some small national hosting company, etc. Even a Raspberry Pi will do, if that's what's required.
    There's also the clear benefit that you don't need to pay the 'Microsoft tax' for expensive licensing.
  • Studied mod_wsgi (Apache2) in detail, read through all the configuration parameters etc. I had to, because I had some issues when configuring Apache2 to run Python 3 WSGI scripts. I still failed, but that doesn't matter for now.
  • A very nice post about SaaS pricing in the form of Fantasy Tarsnap by Patrick (patio11 / Kalzumeus). If you sell cheaply, it's hard to make a profit. But there are also risks if you have rip-off pricing, because someone might just build a competitor, especially if the software is quite simple and doesn't require any very special knowledge, like this Tarsnap backup solution.
  • Something totally different again, except this is still engineering: Kingdom Tower, The Illinois.