
SaaS, Clock pro caching, Disk encryption, I2P, PyLoris, CaS, Hybrid drives, Performance tips

posted Oct 28, 2012, 11:26 AM by Sami Lehtinen   [ updated Nov 24, 2012, 11:00 PM ]
  • Nice post about SaaS metrics: what to measure and what really matters. For me there was nothing new; everything was covered by the Finnish SaaS business book I mentioned earlier. But if you're interested, I really recommend reading the article.
  • Nice post about caching: Caching in theory and practice @ Dropbox.
  • Based on that caching post and my experiences with different database approaches and problems, I looked for a very good cache to use in my projects. I think Clock-Pro is a near-optimal caching strategy. I didn't find a pre-existing Python library with the required features. Solution? I think I'll write my own library. Designed features: super fast read / write cache, timed caching, thread safety, dictionary and function decorator interfaces and, finally, full write-back functionality. I really think many projects would find this beneficial compared to traditional LRU / CLOCK caching. Adding a write-back eviction callback will also allow efficient caching of writes. I think this is something that every developer dealing with databases / files should use. I'll start by writing the Clock-Pro logic module and implementing the dictionary interface, then I'll add the decorator. Timed expiry and write-back are a bonus to be added last. My goal is to be faster than a standard LRU implementation at a 50% hit rate. With Clock-Pro, cache hits are very fast (much faster than with LRU), but cache misses, aka page faults, might be a bit slower than with LRU. Let's see how this turns out. The Git repository is already initialized. The best part of all this is that Clock-Pro isn't patented. Results should be better than with LIRS, CAR, ARC or LRU. Only the OPT algorithm achieves a better cache hit rate, but it requires knowing future accesses, so it is only a theoretical baseline. - Which programming language? - Python 3, naturally.
  • Checked out the Intel documentation on Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), including the XACQUIRE and XRELEASE prefixes as well as the XBEGIN, XEND and XABORT instructions.
  • Tested FreeOTFE for Windows just out of curiosity. So far I have been using TrueCrypt in Linux and Windows environments. When I replace my computer, I will probably start to use LUKS for full disk encryption. FreeOTFE can be used to read LUKS/dm-crypt volumes on Windows, including external hard drives and USB memory sticks.
  • Played a bit with alternate DNS roots. The domain s-l.42 is now approved and working. DNS is provided by the free mybind service. Of course, installing my own DNS servers would have been another simple solution. I have already played with tunneling network traffic over DNS, and that included all this stuff. But I don't know if I want to run the service in the long term.
  • The Pirate Bay moved to the cloud? I would really have expected to see a fully distributed and really secure solution. The current solution doesn't protect users or service providers very well. Their solution is a joke compared to Freenet and GNUnet. Yes, Tor and I2P aren't that secure due to their low latency requirements, which make statistical traffic analysis much easier.
  • MegaUpload starts to encrypt user data? News? Why is this news? Every service provider should encrypt user data so that they can't read it unless the user provides the keys. I'd actually say that they had a major security flaw, and now they're fixing it. The same rule applies to all cloud storage, backup and data sharing service providers. If they can tell what I'm storing in their service, they're actually doing it really wrong. But I don't care, I wish them good luck with my .gpg packets. As I mentioned earlier, I also apply Reed-Solomon encoding to my backups using par2, so even if they slightly corrupted those packets, they would still be recoverable.
  • Used about half a day to test out the I2P network. I have studied it earlier, but now I had time to try it. It seems to work as expected. It also has quite nice statistics, configuration and information pages built in.
  • Studied the PyLoris DoS attack tool. It works by tying up as many TCP sockets as possible and uses Tor to avoid per-IP connection limits, basically creating a DDoS even when the attacker actually uses only one host.
  • I've been working so much with newer lock-free databases that it felt strange to lock things. Compare-and-swap is quite a nice solution for many tasks. Using traditional locking can easily cause serious performance problems. One of our partners running heavy BI systems told me that everything works very well with their database until they run BI scripts written by external consultants. As you might have guessed, those consultants use a wide locking scope and their transactions last too long, which causes serious performance degradation in production. For that kind of bad access pattern you should have a separate database replica that can be used for these reads.
  • Studied hybrid drives (SSD + rotational disk) and found out that many drives unfortunately do not use the SSD in write-back mode. This means that writes are usually as slow as they would be on a plain rotational disk. Only read speeds for blocks that are re-read often can be improved. The one benefit of skipping write-back is that if the SSD part dies, data won't be lost or corrupted, because the "master data" is always on the rotational disk. The SSD part is only used as a read cache when it happens to contain valid data for a read request. This applies at least to Seagate Momentus XT drives.
  • Studied Wayland architecture compared to 
  • Added the book Python for Data Analysis to my Kindle.
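
The write-back cache idea from the list above can be sketched roughly like this. This is only a minimal illustration under several assumptions: it uses plain CLOCK (second chance) eviction rather than Clock-Pro, it is not thread safe, it has no timed expiry or decorator interface, and the class name `WriteBackClockCache` and the `on_evict` callback are hypothetical names chosen for this example, not an existing library API.

```python
class WriteBackClockCache:
    """Fixed-size cache with CLOCK (second-chance) eviction and write-back.

    Hypothetical sketch: plain CLOCK, NOT Clock-Pro, and not thread safe.
    """

    def __init__(self, capacity, on_evict):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._on_evict = on_evict  # called as on_evict(key, value) for dirty entries
        self._slots = []           # each slot is [key, value, referenced, dirty]
        self._index = {}           # key -> position in self._slots
        self._hand = 0             # clock hand: next slot to inspect

    def __len__(self):
        return len(self._slots)

    def __contains__(self, key):
        return key in self._index

    def __getitem__(self, key):
        slot = self._slots[self._index[key]]  # raises KeyError on a miss
        slot[2] = True                        # a hit only sets the reference bit
        return slot[1]

    def __setitem__(self, key, value):
        self._store(key, value, dirty=True)   # writes must reach the backing store later

    def put_clean(self, key, value):
        """Cache a value that was just read from the backing store."""
        self._store(key, value, dirty=False)

    def _store(self, key, value, dirty):
        pos = self._index.get(key)
        if pos is not None:                   # update an existing entry in place
            slot = self._slots[pos]
            slot[1], slot[2], slot[3] = value, True, slot[3] or dirty
            return
        if len(self._slots) < self._capacity:  # free room: just append
            self._index[key] = len(self._slots)
            self._slots.append([key, value, False, dirty])
            return
        pos = self._find_victim()
        victim = self._slots[pos]
        if victim[3]:                         # write back only if modified
            self._on_evict(victim[0], victim[1])
        del self._index[victim[0]]
        self._slots[pos] = [key, value, False, dirty]
        self._index[key] = pos

    def _find_victim(self):
        # Sweep the hand; referenced entries get a second chance.
        while True:
            pos = self._hand
            self._hand = (self._hand + 1) % self._capacity
            slot = self._slots[pos]
            if slot[2]:
                slot[2] = False
            else:
                return pos

    def flush(self):
        """Write all dirty entries to the backing store and mark them clean."""
        for slot in self._slots:
            if slot[3]:
                self._on_evict(slot[0], slot[1])
                slot[3] = False
```

With a plain dict standing in for the database, `WriteBackClockCache(2, backing.__setitem__)` only touches `backing` when a modified entry is evicted or when `flush()` is called, which is the point of caching writes.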

Finally, some performance-related comments from one of my test projects:

This is how much a little thinking can change a program's execution time.
Slow version, technically working code which isn't even especially bad (i.e., it doesn't request columns from the database that aren't needed)
Exec time: 525s

Smart query without caching: added a few extra statements to the query to be very strict about which items are requested from the database.
Exec time: 165s

Caching without smart query: cache data so the same data isn't requested from the db over and over again (which is unfortunately way too common!)
Exec time: 39s

Fast version (smart queries with caching)
Exec time: 16s

Actually the "slow version" wasn't nearly as bad as it could have been: I still didn't use select *, and I had some simple filtering that already removed 90% of the data from the table. By making a few obviously bad changes, I could have made it very much slower. The slow version is a completely working version which could have been written by any programmer who just didn't bother to think and check the data in the tables.

Simply changing four lines of source code led to a ~97% reduction in execution time.
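
To illustrate the two techniques compared above, here is a small sketch that combines a strict query with a lookup cache. It is not the code from the test project: the `product` and `sale` tables, their columns and the function names are all made up for this example, and it uses an in-memory SQLite database together with `functools.lru_cache` from the standard library.

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema and data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (
        id INTEGER PRIMARY KEY,
        product_id INTEGER,
        status TEXT,
        amount INTEGER
    );
""")
conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, "tea"), (2, "coffee")])
conn.executemany(
    "INSERT INTO sale (product_id, status, amount) VALUES (?, ?, ?)",
    [(1, "paid", 5), (2, "paid", 7), (1, "void", 9), (1, "paid", 3)],
)


@lru_cache(maxsize=None)
def product_name(product_id):
    """Caching: each product is fetched from the database at most once."""
    row = conn.execute(
        "SELECT name FROM product WHERE id = ?", (product_id,)
    ).fetchone()
    return row[0]


def paid_totals():
    """Smart query: select only the needed columns and filter rows in SQL."""
    totals = {}
    rows = conn.execute(
        "SELECT product_id, amount FROM sale WHERE status = ?", ("paid",)
    )
    for product_id, amount in rows:
        name = product_name(product_id)
        totals[name] = totals.get(name, 0) + amount
    return totals
```

Calling `paid_totals()` once and then `product_name.cache_info()` shows how many lookups were answered from the cache instead of the database. In a real schema a JOIN might remove the need for the lookup cache entirely, so treat this only as a demonstration of the two ideas side by side.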

More database, caching and performance-related stuff is coming. I have been working hard on a side test project, but I'll write more about it later.