Log deduplication and compression = 73.62% saving vs 7-zip ultra compression

Post date: May 22, 2015 3:38:47 PM

I had a problem storing my long-term log data. Earlier it was simply LZMA compressed using 7-Zip ultra compression, but I got sick and tired of how large those logs were. (Yes, that's quite a relative term.) Yet logs are so important that they have to be stored for seven years in at least two physically separated locations, and there are secondary off-site backups for even these primary storages. So what to do? I decided to write a slightly better solution for handling this kind of log data.

I created three database tables:

1) Timeline: timestamp, message_id

2) Messages: id, hash, block_id

3) Blocks: id, datablob

The timeline table contains only timestamps and a reference to the messages table. Each message row stores the hash of the data record being stored, plus a reference to the block that actually contains the message.
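For illustration, here's a minimal schema sketch of those three tables, assuming SQLite; the column names come from the list above, everything else (file name, column types) is my own guesswork:

```python
import sqlite3

conn = sqlite3.connect("logstore.db")   # hypothetical database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS blocks (
    id       INTEGER PRIMARY KEY,
    datablob BLOB NOT NULL               -- one LZMA-compressed block of message bodies
);
CREATE TABLE IF NOT EXISTS messages (
    id       INTEGER PRIMARY KEY,
    hash     BLOB NOT NULL UNIQUE,       -- hash of the message body, used for dedup
    block_id INTEGER REFERENCES blocks(id)  -- NULL until its block is flushed
);
CREATE TABLE IF NOT EXISTS timeline (
    timestamp  TEXT NOT NULL,            -- ISO timestamps sort lexicographically
    message_id INTEGER NOT NULL REFERENCES messages(id)
);
""")
conn.commit()
```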

I collect messages until I have 32MB of NEW data. That data is then LZMA compressed into one message block, and the message hashes are written and linked to the timeline.
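A sketch of how that ingestion could look against the schema above; the 32MB threshold is from the post, while the hash choice, newline framing, and function names are my assumptions:

```python
import hashlib
import lzma

BLOCK_LIMIT = 32 * 1024 * 1024   # flush after 32 MB of NEW (previously unseen) data
pending = []                     # (message_id, body) pairs waiting for the next block
pending_size = 0

def ingest(conn, timestamp, body):
    """Record one log message; body is bytes with the timestamp already split off."""
    global pending_size
    h = hashlib.sha256(body).digest()
    row = conn.execute("SELECT id FROM messages WHERE hash = ?", (h,)).fetchone()
    if row is None:
        # New body: create the message row now, assign its block when we flush.
        msg_id = conn.execute("INSERT INTO messages (hash, block_id) VALUES (?, NULL)",
                              (h,)).lastrowid
        pending.append((msg_id, body))
        pending_size += len(body)
        if pending_size >= BLOCK_LIMIT:
            flush_block(conn)
    else:
        msg_id = row[0]          # duplicate body: only a timeline entry is needed
    conn.execute("INSERT INTO timeline VALUES (?, ?)", (timestamp, msg_id))

def flush_block(conn):
    """LZMA-compress the pending bodies into one block and link their message rows to it."""
    global pending, pending_size
    # Bodies are concatenated in message-id order, so playback can split on the
    # newline and zip against the sorted ids (assumes one-line message bodies).
    blob = lzma.compress(b"\n".join(body for _, body in pending))
    block_id = conn.execute("INSERT INTO blocks (datablob) VALUES (?)", (blob,)).lastrowid
    conn.executemany("UPDATE messages SET block_id = ? WHERE id = ?",
                     [(block_id, msg_id) for msg_id, _ in pending])
    conn.commit()
    pending, pending_size = [], 0
```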

During playback I read the required blocks, decompress the messages from them, and then cache the decompressed messages by message id using CLOCK-Pro caching (PyClockPro). That way I utilize memory efficiently and don't need to decompress blocks all the time. Efficient caching like CLOCK-Pro also means that unneeded parts of decompressed blocks are quickly discarded from memory.
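A rough playback sketch under the same assumptions. I won't guess PyClockPro's exact API here, so functools.lru_cache stands in for the CLOCK-Pro cache, and for simplicity it caches whole decompressed blocks rather than individual messages as the post describes:

```python
import functools
import lzma
import sqlite3

conn = sqlite3.connect("logstore.db")    # same hypothetical database as above

@functools.lru_cache(maxsize=4)          # stand-in for PyClockPro's CLOCK-Pro cache
def load_block(block_id):
    """Decompress one block and return {message_id: body} for every message in it."""
    blob = conn.execute("SELECT datablob FROM blocks WHERE id = ?",
                        (block_id,)).fetchone()[0]
    bodies = lzma.decompress(blob).split(b"\n")
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM messages WHERE block_id = ? ORDER BY id", (block_id,))]
    return dict(zip(ids, bodies))

def replay(start, end):
    """Yield (timestamp, body) for timeline entries with start <= timestamp < end."""
    rows = conn.execute(
        "SELECT t.timestamp, m.id, m.block_id FROM timeline t"
        " JOIN messages m ON m.id = t.message_id"
        " WHERE t.timestamp >= ? AND t.timestamp < ? ORDER BY t.timestamp",
        (start, end))
    for ts, msg_id, block_id in rows:
        yield ts, load_block(block_id)[msg_id]
```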

Why is this so effective? Because most log messages are repeated a lot, especially once you take the timestamp part out of the message and store it separately. Let's look at the statistics and see how much more efficient this method is compared to plain old, yet 'very good', 7-zip LZMA ultra compression.
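To make the timestamp point concrete, here's a toy example with a made-up log line; the real split of course depends on each log's format:

```python
import hashlib

# Hypothetical log line; here the timestamp is a fixed-width 19-byte prefix.
line = b"2015-05-22 15:38:47 Connection from 10.0.0.5 closed: timeout"
timestamp, body = line[:19], line[20:]

# Identical bodies hash identically, so every repeat of this message
# dedups to a single stored copy plus one small timeline row.
print(timestamp, hashlib.sha256(body).hexdigest())
```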

Log entries for 2014 compressed using this method:

413GB old 7-zip data

61GB My format

Year 2015 so far (about four months' worth of data):

124GB old 7-zip data

28GB My format

Writing the code required to make all this happen took only about three hours, though processing the old data took a bit longer. In the future, data processing will be faster (less data needs to be read and stored) thanks to this efficient compression and deduplication.

Space savings: 86GB vs 326GB = 73.62% saving compared to 7-zip ultra compression. At this point it's easy to forget that the 7-zip ultra compressed data is already reduced over 80% in size.
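For clarity, the headline percentage is just the relative reduction against the 7-zip baseline:

```python
saving = 1 - 86 / 326     # remaining size / baseline size
print(f"{saving:.2%}")    # -> 73.62%
```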

Let's just say: a few hours of work and these results? Yes, I'm really happy with this. You might ask why the logs compress only about 80%; isn't the usual ratio more like 95%? Yes, that's right, the usual ratio is something like 95%. But these log messages include some rather long encrypted entries, which won't compress like text but do make deduplication more effective. Also, storing such messages repeatedly whenever they are missed by the compression window is quite expensive, even if the 7-zip ultra window is formidable in size compared to almost all other compression methods. As expected, context-aware compression can do a better job than generic compression.

I know there are existing solutions, but I was a bit bored during my summer vacation, and this looked like a nice test to do.