Compression, Burn-In, De-duplication, Software

  • When to use Dictionary Compression - While reading this article, it came to my mind that it didn't mention predefined dictionaries, which can also radically improve compression. After checking the Python documentation I noticed that since version 3.3 zlib has supported zdict, which lets you provide a predefined shared dictionary to compressobj to further improve compression. Other than that, it's not surprising at all that using a larger compression block allows a higher compression ratio. Yet there's nothing new about preset or shared dictionaries. I think the term dictionary compression is kind of confusing, because it sounds like it would mean compressing the dictionary used to compress the data, when they actually mean using a preset / shared dictionary. As I've blogged earlier, it's also possible to get similar results by compressing / decompressing a fixed set of bytes and ignoring the output, which primes the internal dictionary used for compression. Of course that's a horrible hack and wastes CPU time, but if high compression is required and the library doesn't let you set a preset dictionary, it works and reduces the required storage capacity just as dictionary compression does. There's a small zlib sketch after this list.
  • Put a new 12 TB storage drive through a week-long burn-in test. Nowadays I always test drives before accepting them into production, because I've had so many bad experiences. Of course all drives should be good, but that's not true. A one-week burn-in test will reveal whether a drive was already broken when delivered. Last time I installed a workstation, as soon as I was done I noticed that the drive was broken. Duh. It was a good thing, because I still remembered everything and it was a quick reinstall on a new drive. But that's still wasted time and effort. Testing before install is a good approach; a rough sketch of the routine follows after this list.
  • Had some discussion about backup de-duplication block size. By default Duplicati uses 100 KiB blocks. I would personally prefer 128 KiB or 64 KiB blocks, but the block size can't be changed once the backup set has been initiated, without resetting the whole backup. Oh well, I guess I'll set it to 128 KiB in the future. Anyway, the difference is small, and after all 100 KiB still divides evenly into 25 pieces of 4 KiB. Most of the data being backed up is database data, so it makes sense to check that the backup process is tuned to match the database's on-disk data structures. Yet where SQL Server is used, the page size is 8 KB, but the backup file of the database could be laid out completely differently; I haven't had time to investigate what that file actually contains. I would personally prefer a snapshot of the actual database in a consistent state, instead of a smaller copy being made of it. Why? Because it's highly probable that such a copy totally destroys any benefit of trying to de-duplicate the data. I'll have to study that more when I've got time for it. A toy chunking sketch follows after this list.
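
Here's a minimal sketch of the zdict feature mentioned above, assuming Python 3.3+. The dictionary contents and the sample record are made up for illustration; strings most likely to occur in the data should be placed towards the end of the dictionary.

    import zlib

    # Hypothetical shared dictionary: substrings expected to appear in the data.
    # The most likely matches should go towards the end of the dictionary.
    zdict = b'"error": "message": "id": "status": "timestamp": '

    sample = b'{"id": 123, "status": "ok", "message": "", "timestamp": "2019-06-18"}'

    # Plain one-shot compression vs. a compressor primed with the preset dictionary.
    plain = zlib.compress(sample, 9)

    co = zlib.compressobj(level=9, zdict=zdict)
    primed = co.compress(sample) + co.flush()

    # The decompressor must be given exactly the same dictionary.
    do = zlib.decompressobj(zdict=zdict)
    assert do.decompress(primed) == sample

    print(len(plain), len(primed))  # primed output is typically smaller for short inputs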
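
As for the burn-in routine, this is roughly what I run; just a sketch assuming the new drive shows up as /dev/sdX (hypothetical device name), is not mounted, and may be wiped. The badblocks write test is destructive and takes days on a 12 TB drive, which is the point.

    import subprocess

    DEVICE = "/dev/sdX"  # hypothetical; double-check before running, this wipes the drive

    # Four-pass destructive write + read-back test over the whole surface.
    subprocess.run(["badblocks", "-wsv", "-b", "4096", DEVICE], check=True)

    # Kick off a SMART extended self-test; it runs in the background, so check
    # the attributes (reallocated / pending sectors) again once it has finished,
    # before accepting the drive into production.
    subprocess.run(["smartctl", "-t", "long", DEVICE], check=True)
    subprocess.run(["smartctl", "-a", DEVICE], check=True)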
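
And on the de-duplication side, a toy sketch (not Duplicati's actual implementation) of fixed-block chunking, just to show how the chosen block size interacts with page-aligned data such as 8 KiB database pages.

    import hashlib

    KIB = 1024

    def dedup_ratio(data: bytes, block_size: int) -> float:
        """Split data into fixed-size blocks, hash them, and return the
        fraction of blocks that are duplicates of an earlier block."""
        seen = set()
        total = 0
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            seen.add(hashlib.sha256(block).digest())
            total += 1
        return 1 - len(seen) / total if total else 0.0

    # Toy data: 8 KiB "pages" where every other page is identical,
    # loosely mimicking repeated database pages on disk.
    page = bytes(8 * KIB)
    data = (page + b"\x01" * (8 * KIB)) * 512  # 8 MiB total

    for bs in (64 * KIB, 100 * KIB, 128 * KIB):
        print(bs // KIB, "KiB blocks ->", round(dedup_ratio(data, bs), 3))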

Software as usual

Is it a Simple, Full Featured, Custom, Complex, or Prepackaged cheap solution?

The classic question. All of the options above are bad, because:

A custom solution is bad, because it's hard to design, implement and maintain, and it's also expensive.

A simple solution is bad, because it might require extending in the future.

A complex / full featured solution is bad, because it's so complex that even figuring out what it does costs a fortune.

A prepackaged solution is bad, because it doesn't do exactly what I want it to do.

Now we've reached the usual conclusion: whatever you do, it's going to be a bad solution if you ask the customer.

Usually they want it to be cheap, simple, full featured, custom and prepackaged.

I just want the 20€ prepackaged software which does exactly what I want it to do, simply, and without too much configuration or options. It should be fully customizable for our future needs, very simple to use, and fully automatic.

This pretty much applies to any software.

2019-06-18