
NoSQL, SQL, DHT, Access control, Hadoop, Cassandra, PIM, Solr, Nutch, RaimaSQL, LIRS

posted Nov 3, 2013, 3:24 AM by Sami Lehtinen   [ updated Nov 8, 2013, 11:15 PM ]
This is another huge assorted post from my backlog, so the items aren't in any particular order.
  • Played with PeerJS and thought about what kind of new in-browser applications will be possible when browsers can communicate directly with each other using P2P technology. I'm sure it'll be great for applications which already utilize P2P tech, like video / voice chat, file transfers, file synchronization between devices, etc. The only drawback is that constant P2P traffic on mobile devices isn't very battery friendly and consumes a lot of energy by waking the device's CPU, radio interface, etc. from sleep state.
  • This is a nice post about how to use math to distribute valuations correctly. In this case it's about comment upvotes, but similar methods are needed in many other places too. I personally found this kind of stuff hard when building and playing with an on-line (virtual) stock trading platform. How do you make the game scoring fair, so that users won't get an unreasonable advantage from starting early, nor from creating tons of new user accounts? What if some user creates a penny stock and then sells it in small batches, while simultaneously creating new players which all buy it with their initial starting cash at an insane price, etc.? This is also related to the multi-armed bandit and other stuff I've written about earlier regarding A/B testing: how you can test thousands of permuted variations simultaneously, start and terminate variations during testing, and genetically breed new variations from the best performing existing ones.
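The upvote-ranking math can be sketched in a few lines, assuming the post's method is the commonly used Wilson score lower bound (this is my own standalone sketch, not code from the post):

```python
from math import sqrt

def wilson_lower_bound(upvotes, total, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 ~ 95% confidence).

    Ranking comments by this value favors items with both a high upvote
    ratio and enough votes to back it up, so a 2-out-of-2 comment doesn't
    outrank a 95-out-of-100 one.
    """
    if total == 0:
        return 0.0
    p = upvotes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    spread = z * sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - spread) / denom
```

The same idea generalizes to any "score by ratio, but penalize small samples" problem, which is exactly the early-starter / sockpuppet issue above.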
  • Read and thought deeply about this NoSQL article, even though it didn't contain anything I wasn't already aware of. Btw, whether a database is SQL or NoSQL technically has no correlation with whether it's a key value or relational database. NoSQL databases can support a restricted set of SQL features, and relational databases can support all typical SQL features without using SQL. Also checked out key document storage, but it's basically key value storage, unless there are some additional features which support searching data inside the document somehow.
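That last distinction can be made concrete with a toy sketch (plain dicts, purely illustrative): a key value store only fetches by key, while a document store can additionally answer queries about fields inside the stored value.

```python
# Both stores map key -> value; the difference is whether the store
# can look inside the value.
kv_store = {"user:1": '{"name": "Sami", "city": "Helsinki"}'}  # opaque blob

doc_store = {"user:1": {"name": "Sami", "city": "Helsinki"}}   # structured

def find_docs(store, field, value):
    """Search inside documents -- the feature that separates a document
    store from a plain key value store."""
    return [k for k, doc in store.items() if doc.get(field) == value]
```

With `kv_store` the only possible operation is `kv_store["user:1"]`; `find_docs` only makes sense against `doc_store`.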
  • Checked out the Spamhaus DDoS case and related stories. That's how the world is now: all kinds of different attacks on different parts of the Internet infrastructure. It's just important to keep things distributed enough so that it's harder to cause any major outage. AFAIK cloud services are great for this; it's easy to distribute a few mirrors around the globe or use CDNs which can withstand and automatically block even huge DDoS attacks.
  • The internals of many DHT based peer-to-peer networks are very similar to distributed databases? What? Yes, both have to work in a large distributed environment with uncertainties like latency, node outages, data loss, etc. (Cassandra, Riak, Voldemort). Maintaining a list of available nodes and their status, locating data when requested, and dealing with conflicts are similar challenges in both worlds.
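One shared building block behind "locating data when requested" is consistent hashing, which both DHT networks and databases like Cassandra and Riak use to find a key's owner without a central directory. A minimal sketch (not any particular system's implementation, no virtual nodes or replication):

```python
import bisect
import hashlib

def _hash(value):
    # Stable hash into a large integer space (md5 chosen for determinism,
    # not security).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each node owns the arc of the hash space up to its position, so
    adding or removing one node only moves the keys on that arc."""

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._points = [h for h, _ in self._ring]

    def node_for(self, key):
        # First node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The same lookup logic serves a P2P overlay ("which peer stores this chunk?") and a database cluster ("which replica owns this row?").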
  • Replaced my existing MySQL installation with MariaDB without any problems.
  • Quickly tested and played with SPDY for Nginx (ngx_spdy_module). Yes, it seems to work and passes tests. It still doesn't support server push, but that would naturally also require modifications in the server side apps.
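For reference, enabling it is just a matter of adding the spdy flag to the listen directive, assuming nginx was built with the SPDY module; the certificate paths below are placeholders:

```
server {
    listen 443 ssl spdy;                     # spdy flag enables ngx_spdy_module
    ssl_certificate     /path/to/cert.pem;   # placeholder path
    ssl_certificate_key /path/to/key.pem;    # placeholder path
}
```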
  • The Hacker Shelf - an excellent online bookshelf for hackers.
  • Carefully read this wonderful post about common pitfalls in writing lock-free algorithms.
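The core pattern those pitfalls revolve around is the compare-and-swap retry loop. A Python simulation of it (real lock-free code needs a hardware CAS instruction, which CPython can't express, so the atomicity here is faked with a lock; this is my own illustration, not code from the post):

```python
import threading

class AtomicInt:
    """Simulated atomic integer: compare_and_swap mimics the hardware
    CAS instruction that real lock-free algorithms are built on."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # Succeeds only if nobody changed the value since we read it.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(atomic):
    # Classic retry loop: read, compute, CAS, and retry if another
    # thread won the race in between.
    while True:
        current = atomic.load()
        if atomic.compare_and_swap(current, current + 1):
            return
```

Many of the classic pitfalls (like ABA) live exactly in the gap between the `load` and the `compare_and_swap`.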
  • Witnessed a state-of-the-art electronic locking system failure at one customer. They have a really neat digital locking system with smart cards, PIN codes and biometric readers. But there was just a minor mishap: they misconfigured the access control system so that any person can open any door in the building. Yes, there are several companies in the building using the same security system. That's just ridiculous. First you spend a huge amount of money to get a great security system, and then you're totally incompetent and don't use it correctly. Oh boy did I laugh. Anyway, logging still shows who used what key to open which door and when, and there are security cameras too. But it's still a huge blunder; think about a situation where someone loses a key or something like that. Or you enter one company's premises to look for keys and then find keys which you can later use on other doors. Of course that leaves quite a short trail too, but it might not be totally obvious to everyone. We abused a few keys on purpose, on doors we shouldn't have been able to open, just to see if we would ever be questioned about accessing areas we shouldn't. You guessed it: nope, nobody ever asked anything. This misconfiguration also disabled the biometric reader requirement for certain secure areas.
  • Tried Hadoop and Cassandra with VirtualBox Ubuntu in a single node configuration. This swamp is deep; there are tons of things to learn if you're running a major cluster. So far I don't see any real reason why I would be using these tools, because the data sets I'm dealing with aren't that huge yet: smaller than 10^8 keys and usually around a few terabytes. I usually don't run deeper analysis against the production database; I'll only copy the required data or use a database restored from backup, so I don't cause excess load on the production db with my stuff. Also, with traditional relational databases my long reads could simply cause too many and too long-standing read locks, in the cases where I can't use read-only, read-committed mode without repeatable reads. Of course with MVCC databases those read locks wouldn't be such a problem; I'm just not using PostgreSQL for my stuff. I also played a little with MapReduce. Basically it's just a more complex version of what I'm doing with my current ETL code, where I run multiple processes which collect data from databases, map it to a processing function, and reduce it to the essential data which is then passed to a control process. Technically the major difference is that I'm running my processes on a single multi-core server instead of distributing the MapReduce tasks to multiple individual database servers. Sometimes I also run separate I/O processes alongside the actual processing processes due to I/O latency, so I have a database reader and a data finalizer running in parallel with the actual processing tasks. In some cases those processing tasks do independent database I/O, and in other cases they communicate with a separate 'extra data' database I/O process which is there to cache data. These arrangements are especially needed when processing a sales receipt journal etc. requires many relational lookups for data which doesn't change.
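The single-server variant I describe can be sketched with multiprocessing.Pool: map a processing function over chunks of rows in worker processes, then reduce the partial results in the control process. The per-key sum here is just a stand-in payload for the real ETL work:

```python
from functools import reduce
from multiprocessing import Pool

def process_chunk(rows):
    """Map step: runs in a worker process and boils a chunk of
    (key, amount) rows down to a partial per-key sum."""
    totals = {}
    for key, amount in rows:
        totals[key] = totals.get(key, 0) + amount
    return totals

def merge(a, b):
    """Reduce step: runs in the control process and combines partials."""
    for key, amount in b.items():
        a[key] = a.get(key, 0) + amount
    return a

if __name__ == "__main__":
    rows = [("milk", 2), ("bread", 1), ("milk", 3), ("eggs", 6)]
    chunks = [rows[:2], rows[2:]]     # pretend these came from the DB reader
    with Pool(2) as pool:
        partials = pool.map(process_chunk, chunks)
    print(reduce(merge, partials, {}))
```

Swapping `Pool` for a cluster scheduler is essentially what Hadoop does; the map/reduce functions themselves wouldn't need to change much.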
  • Had long discussions with my friends about PIM (Product Information Management), especially how badly it works in many organizations and what kind of solutions can be used to fix the daily problems in this sector. In many organizations there's no complete product master data, and that creates quite a nice mess. Also, because there are integration chains where data is moved from system to system, data is often inaccurate, and even if some data is available in the PIM system, it still might get dropped along the data processing chain. Of course this can be solved by making additional requests to the original data source to fetch the data by product key if it has been dropped, but I have seen incredibly many ways to fail at this process. And there are cases where things would technically work, but the system users and operators don't follow the agreed procedure, and then things get really messed up. In my case, an ERP or a wholesaler's information system is the most common product data source, but many ERP systems do not directly deliver information like product images, longer product descriptions for consumers, etc.
  • Studied two Python LIRS implementations: https://github.com/trauzti/cache/blob/master/LIRS.py https://github.com/barracel/pylirs/blob/master/src/pylirs.py
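LIRS itself is intricate (it tracks inter-reference recency with separate LIR/HIR sets), so for contrast, here's the plain LRU policy it improves on, in a few lines of Python (my own minimal sketch, not code from either linked implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Plain LRU: evict the least recently used key. LIRS improves on
    this by also considering how long ago a key was previously reused,
    which protects the cache against long sequential scans."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used
```

Reading the two LIRS implementations side by side against this baseline makes it much clearer what the extra stack and queue bookkeeping buys.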
  • Spent a few hours studying Apache Solr and Apache Nutch. Indexing tons of web pages is quite an interesting technical challenge: it requires fetching documents, analyzing them, splitting text into keywords, extracting metadata from different document types, updating huge databases, etc. Actually I didn't plan to read about this topic, because I've checked it out several times before, but I somehow got distracted into it while reading about the new Tor network search engine TorSearch. Many people talk about YaCy, but based on my tests (several years ago) it was more of an academic interest and play project than a really working search engine.
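Under all that machinery, the core data structure Solr maintains is an inverted index mapping terms to document ids. A toy version of the idea (naive whitespace tokenization, no stemming or ranking):

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: ids of documents containing every query term."""
    sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*sets) if sets else set()
```

The hard parts the real systems solve are everything around this: crawling politely, parsing dozens of document formats, and updating an index of this shape at web scale.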
  • Read a nice (but very basic) post about how to fail miserably with web HTTPS security. All of this is way too common.
  • Studied and thought deeply through the Raima SQL Database manual (RDM, RaimaSQL, Raima Database Manager). I've been using RaimaSQL quite a lot, and there are some issues. But it works out just fine when you know which things you really need to avoid. Like primary key range scans: they're a total performance killer. For some reason which I really can't understand, a range scan on the primary key always triggers a full table scan, and their engineers told me that's just the way it is. So don't ever ask for more than one primary key value in a query; there's some kind of fundamental flaw there. Using any other indexed column for such a query is of course fine.
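Since only single-value primary key lookups are fast on that engine, the workaround is to issue one query per key instead of one range query. A sketch of the pattern, with sqlite3 standing in for RDM (sqlite itself handles PK range scans fine; it's only used here to make the pattern runnable, and the table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO receipts VALUES (?, ?)",
                 [(1, 9.90), (2, 4.50), (3, 12.00)])

# Pattern to avoid on RDM: a primary key range scan like
#   SELECT id, total FROM receipts WHERE id BETWEEN 1 AND 2
# degrades into a full table scan there.

# Workaround: one indexed single-key lookup per wanted id.
wanted = [1, 2]
rows = [conn.execute("SELECT id, total FROM receipts WHERE id = ?",
                     (key,)).fetchone()
        for key in wanted]
print(rows)
```

For a long run of keys the per-query round trips add up, but on that engine they still beat scanning the whole table.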
Now I've only got about 536 items left in my blogging backlog. These were mostly entries from last spring, so about 6 months old stuff, but I'm trying to catch up weekly from now on. It's already dark and miserable outside, so it's better to sit inside and blog than go out and freeze in total darkness.