Integration, Tests, uWSGI, CF Stats, Duplicati, Trees

  • More and more integration code. As said, the format doesn't matter, as long as it's specified. This new project uses a format which is quite similar to JSON. Well, not really. But it's nice: every row starts with a row type identifier, and the data follows after the identifier. Very clean and simple. Also, all fields in the file are either strings or integers, which is also very nice. A simple and nice format to read and produce. (A minimal parser sketch follows after this list.)
  • Python programming & lots of unit tests + runtime sanity checks. I've seen so many times how people claim they've got perfect unit tests, yet that doesn't mean anything, because the tests don't reflect reality. That's why I've often got a bunch of sanity check code, which makes sure that the data is getting properly processed. It usually stops data which falls "between the rules" from getting through in a more or less undefined state. (A small example follows after this list.)
  • Updated one project which fetches hourly Cloudflare access statistics using the Cloudflare API / Zone Analytics feature and saves those into a shared PostgreSQL database. (A rough sketch follows after this list.)
  • Installing uWSGI from scratch with the latest Python was a chore. It's done, now with Python 3.6.5 support. This time I didn't compile the whole uWSGI from source, but getting some of the plugins compiled still required me to download the source, install build-essential and so on. After that I was able to compile the python35_plugin.so and move it to the package-delivered uWSGI. Job done, even if it was just as unpleasant as I remembered it to be. Sure, HTTPS is still broken. Some posts seem to indicate that it would require recompiling uWSGI with the pcre (internal request router) feature, which I don't actually need, but maybe HTTPS does need it? The HTTPS option doesn't work because "socket hasn't been defined", duh. Dunno why; shared-socket isn't enough. (A config sketch follows after this list.)
  • Had to run the first major restore using Duplicati 2.0 (several terabytes). It seems that the I/O is quite inefficient. Of course that's the result of the block structure optimization. The source drive is flooded with small random reads (a relative term, nothing compared to database use). But anyway, source drive load is a constant 100% and destination drive load averages around < 10%, even though the drives are similar. This is of course the price to pay for efficient de-duplicated block data structures: constructing the output files requires lots of random reads. A great example of things which are kind of easy to get working, but which could be highly optimized, if anyone bothers or cares to do that. That's one consideration I'm always making when coding: I know how I could improve this a lot, but does it matter and does anyone care? Often it's just not worth it. - But this time, the slow restore process could extend the service downtime by 24 hours, because the restore can't be done in one workday. I don't know what the exact restoration process being used is, but to me it seems highly inefficient. - How would I have done it? Just a short example, without going into details. I would have read from the database the file versions to be restored and the source blocks required for that job, and then just sequentially read the source blocks while creating the destination files as sparse files and dumping the content in. Sure, this would lead to random writes on the destination drive, but the source would need exactly one linear read of all (required) data. Also the source read could skip data which isn't required. (See the restore sketch after this list.) - Yes, this is technically possible, because there's a block database available for the restore. In my example case this should lead to a minimum of 5x (500%) throughput improvement, cutting the restore time by at least around 80%. But it could also be a 10x improvement, cutting the time by 90%. - That's it, let's just wait it out. Yawn. Anyway, both drives are fast (200 MB/s+ sequential), and the CPU is an i7. - Speed is totally killed by random reads, which isn't surprising at all. And when I say "random read", it's a very relative term, because the backup set is using 64 MB de-duplication blocks instead of the 100 kB default block size, exactly to improve performance. - The restore rate was only around 100 GB / hour, sigh.
  • Nice computer & data article about B-Trees, LSM Trees, SSTables and so on: Algorithms Behind Modern Storage Systems. Nothing new yet.
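
About the integration format: here's a minimal parser sketch for the kind of "row type identifier + data" file described above. The identifiers, the ';' separator and the field layout are made up for illustration; only the general shape (one type tag per row, string or integer fields) comes from the item itself.

    # Hypothetical sample of a row-type-prefixed file; ';' as separator is an assumption.
    SAMPLE = """\
    HDR;20190915;ACME
    ROW;12345;Widget;3
    ROW;12346;Gadget;7
    TRL;2
    """

    def parse_rows(text):
        """Yield (row_type, fields) tuples; fields are kept as str or int."""
        for line in text.splitlines():
            if not line.strip():
                continue
            row_type, *fields = line.strip().split(';')
            yield row_type, [int(f) if f.isdigit() else f for f in fields]

    for row_type, fields in parse_rows(SAMPLE):
        print(row_type, fields)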
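
On the unit tests vs. runtime sanity checks point: a tiny sketch of the kind of check that runs in production and refuses to pass half-valid data onward. The record fields and rules here are hypothetical.

    class SanityError(ValueError):
        """Raised when data that 'passed' earlier stages doesn't add up."""
        pass

    def check_record(rec):
        # Runtime checks which complement unit tests: they see the real data.
        if not rec.get('id'):
            raise SanityError('missing id: %r' % (rec,))
        if rec.get('quantity', 0) < 0:
            raise SanityError('negative quantity: %r' % (rec,))
        if rec.get('total') != rec.get('quantity', 0) * rec.get('unit_price', 0):
            raise SanityError('total != quantity * unit_price: %r' % (rec,))
        return rec

    # Usage: wrap the processing loop; log and quarantine whatever fails.
    check_record({'id': 'A1', 'quantity': 3, 'unit_price': 2, 'total': 6})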
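
For the Cloudflare statistics job: a rough sketch of pulling hourly zone analytics and storing them in PostgreSQL. The endpoint and JSON fields follow the old Zone Analytics "dashboard" API (since superseded by GraphQL analytics), so verify them against current Cloudflare docs; the token, zone id, table and column names are placeholders.

    # Hedged sketch: fetch hourly Cloudflare zone analytics and insert into PostgreSQL.
    # Endpoint/fields assume the old Zone Analytics dashboard API; credentials,
    # table name and the unique constraint on period_start are placeholders.
    import requests
    import psycopg2

    ZONE_ID = 'your-zone-id'
    HEADERS = {'Authorization': 'Bearer YOUR_API_TOKEN'}

    resp = requests.get(
        'https://api.cloudflare.com/client/v4/zones/%s/analytics/dashboard' % ZONE_ID,
        headers=HEADERS, params={'since': '-1440', 'continuous': 'true'})
    resp.raise_for_status()
    timeseries = resp.json()['result']['timeseries']

    conn = psycopg2.connect('dbname=stats')
    with conn, conn.cursor() as cur:
        for point in timeseries:
            cur.execute(
                """INSERT INTO cf_hourly_stats (period_start, period_end, requests, bandwidth)
                   VALUES (%s, %s, %s, %s)
                   ON CONFLICT (period_start) DO NOTHING""",
                (point['since'], point['until'],
                 point['requests']['all'], point['bandwidth']['all']))
    conn.close()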
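
For the uWSGI HTTPS problem: this is roughly the shared-socket + https combination the uWSGI documentation describes for grabbing port 443 before dropping privileges. Paths, port, app module and plugin name are placeholders, and I haven't verified that the package-delivered binary has the SSL support this needs, so treat it as a starting point rather than a fix.

    [uwsgi]
    plugins = python36
    ; grab the privileged port first, then reference it as shared socket 0
    shared-socket = 0.0.0.0:443
    https = =0,/etc/ssl/example.crt,/etc/ssl/example.key
    uid = www-data
    gid = www-data
    module = myapp:application
    master = true
    processes = 4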
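
And the Duplicati restore idea sketched in code: plan which blocks each restored file needs, pre-create the destination files as sparse files, then read the required source blocks in source order in one pass and scatter them into place. This is not Duplicati's actual implementation; the block_db and file_versions structures are hypothetical stand-ins for its block database.

    from collections import defaultdict

    def restore(source_path, block_db, file_versions):
        """Hypothetical one-pass restore.
        block_db: block_id -> (offset_in_source, length)
        file_versions: destination path -> ordered list of block_ids"""
        # 1. Build a scatter plan and pre-create destinations as sparse files.
        targets = defaultdict(list)        # block_id -> [(dest_path, dest_offset)]
        for path, block_ids in file_versions.items():
            offset = 0
            for bid in block_ids:
                targets[bid].append((path, offset))
                offset += block_db[bid][1]
            with open(path, 'wb') as f:
                f.truncate(offset)         # sparse file of the final size
        # 2. Read only the required blocks, in source order (one linear pass),
        #    and write each block into every destination that needs it.
        with open(source_path, 'rb') as src:
            for bid, destinations in sorted(targets.items(),
                                            key=lambda kv: block_db[kv[0]][0]):
                src_offset, length = block_db[bid]
                src.seek(src_offset)
                data = src.read(length)
                for dest_path, dest_offset in destinations:
                    with open(dest_path, 'r+b') as dst:
                        dst.seek(dest_offset)
                        dst.write(data)

The random I/O moves to the destination side, which is exactly the trade-off described in the restore item above.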

2019-09-15