I just played with these databases just for fun. I also got bit upset about some queries being incredibly slow so I wanted to find out exactly what is happening and why queries take more time that I would expect. Therefore I thought that I have to do some simple tests which represent my data analyzation tasks pretty well. I use PyMongo 2.3 for MongoDB, Pythons internal SQLite3 for SQLite3, CodernityDB 0.3.60, MySQL Community Server 5.1, PostgreSQL 9.2.1 and finally pyodbc for RDM Server. For client parallelization I actually used multiprocessing instead of multithreading due possible Pythons GIL limitations. I often prefer multithreading with I/O bound tasks due simplicity, but in this case I wanted that nobody complains about GIL making difference and therefore used full multiprocessing. Based on test results I think that multithreading would have given actually bit better results, due multiprocessing requires "heavy weight" starting of process instead of light threading. That's one of the reasons why especially SQLite3 results seem to deteriorate with more processes.
Raima database contained the original data, I transferred data from it to alternate databases. In cases where it was reasonable to change schema I did it. So even if source database uses separate tables with join, my MongoDB contains only one document with all required data. All test queries were made using unique primary key. Test were run until I got at least 5 results with minimal deviation from median. I could get pretty accurate results, any results out of line were ignored. First run was naturally ignored because it caused data to be loaded from disk to memory and therefore was very slow compared with subsequent runs on all databases. But after that, really big performance differences started to pop up.
Test data set was about 6 gigabytes and contained ~880 000 header records with about 5 million data items. All tests did request exactly same 100 000 records because query list was pre-generated random pattern. This allowed me to easily repeat read tests without causing cache misses. Each query requested only one record from database. Test times are time required to complete all queries and in case of more than one clients time also included starting and finishing subprocesses. Actually random test set contained 5671 non-unique entires, meaning that only 94 329 unique records were read. Each test run was run with single, 2 and 4 parallel processes. For these test reads transaction isolation and repeated readability were turned off, if and where possible.
Time in seconds to complete whole test, less is better.
I decided to withheld exact resuts.
Well, I'm sure many people could ask why I did request only one record / query. In most of cases code is written so that individual information is being requested. Main loop might make first one larger select and then read records, but most subsequent internal logic fetches request one entry at a time. I know that it would speed up things if data would be smartly read in larger batches. (Just as I do with some ETL cases, where other solutions are simply unacceptably slow). Some databases also didn't handle range queries with primary keys too well, which also made it impossible to make fair comparison when using ranges.
I repeated same test without database and using only pure Python dictionary (Python 3.2 dict) item lists in it. Retrieving the same data set from dictionary took quite exactly one second. That's over 20 times faster than fastest database which was SQLite3. This is one thing to keep in mind. Even if your database does some caching, it's still slow. Using application level caching for static quite small datasets is still a must for performance reasons. We all know that Pythons dictionary isn't the fastest solution available for data processing either.
CodernityDB is still being developed. I was positively surprised by it's performance, even if it's pure Python database.
I'm aware about SQLite's write performance issues. Performance would be poor if I would try to write with multiple concurrent processes / clients. I'm also aware about MongoDB's ACID restrictions and single document atomicity. This is just one very limited test type, so there's no point to try to generalize results. All databases are very different and perform differently in different situations. As I said with BtrFS test I made, this is also only one simple test. This is not a generic test, which would cover all database aspects. I did this test only because I was curious about different databases and there were a lot of (opinion based, biased) discussion about performance differences, but nobody got any results.
Based on this test, I continue to use SQLite3 in most cases when processing data locally, acknowledging it's limitations. MongoDB performed also very nicely with this test scenario.
Based on my current experiences with Google App Engines databases (M/S (deprecated) & High Replication Datastore (HRD)), I would estimate that doing this exact test would take about 1000 seconds with Master Slave database, which is deprecated and with current High Replication datastore test would require whopping 2500s. Of course in this case, you can request up to 1000 records at time, and it would make significant difference. Datastores reliability and parallel performance is also naturally on quite other level than when using local SQLite3 database. But these values are given as just pointlessly silly estimated comparison. Based on these values this would make App Engines database the slowest database, as expected.
Questions? - Mail me.
Alt keywords: Postgres