Blog

My personal blog about stuff I do, like and am interested in. If you have any questions, feel free to mail me!

SQLite3 performance testing

posted by Sami Lehtinen   [ updated ]

I'm trying to figure out what transaction rates can be satisfied with SQLite3, and when other database solutions are required.
Here's a simple test, running 10000 transactional selects & updates in 10 parallel threads. The update process does only what's required: it reads the data, updates it and commits. After several test runs I found out that only a few things really matter, namely the synchronous mode and the journaling mode. I have seen many people recommend a larger page size for better performance, but in this light test the default 1024-byte page size turned out to be optimal. Any larger page only requires more reading and writing and brings performance down, just as you would expect.

After some experimentation I figured out that in this case it doesn't really matter whether the transaction mode is deferred, immediate or exclusive. Of course, using deferred just required me to handle locking failures in my code, but immediate and exclusive gave pretty much the same results. Why? Because basically it doesn't matter how it's done: all that matters is that there's a read and write lock on the table. Even though multiple parallel reads are allowed, only one commit will succeed and the others will fail, because there's no row-level locking. Anyway, as we know, this isn't the highest-performing database out there, but let's estimate what it's good for.
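For reference, here's roughly the shape of the test harness described above. This is a simplified sketch, not the exact script I used; the file name, table and key are placeholders. Ten threads do read-modify-write transactions against one shared database, with the journal and synchronous modes as the knobs.

    import sqlite3
    import threading

    DB_FILE = 'perftest.db'   # hypothetical file name
    THREADS = 10
    OPS_PER_THREAD = 1000     # 10 threads * 1000 ops = 10000 transactions

    def setup():
        con = sqlite3.connect(DB_FILE)
        con.execute('PRAGMA journal_mode=WAL')     # or DELETE ("NORMAL" in the table below)
        con.execute('PRAGMA synchronous=NORMAL')   # or OFF / FULL
        con.execute('CREATE TABLE IF NOT EXISTS counters (key INTEGER PRIMARY KEY, value INTEGER)')
        con.execute('INSERT OR IGNORE INTO counters VALUES (1, 0)')
        con.commit()
        con.close()

    def worker():
        con = sqlite3.connect(DB_FILE)
        con.isolation_level = None                 # manage transactions explicitly
        for _ in range(OPS_PER_THREAD):
            while True:
                try:
                    con.execute('BEGIN')           # deferred; IMMEDIATE / EXCLUSIVE behaved much the same
                    value = con.execute('SELECT value FROM counters WHERE key = 1').fetchone()[0]
                    con.execute('UPDATE counters SET value = ? WHERE key = 1', (value + 1,))
                    con.execute('COMMIT')
                    break
                except sqlite3.OperationalError:   # locking failure, roll back and retry
                    try:
                        con.execute('ROLLBACK')
                    except sqlite3.OperationalError:
                        pass
        con.close()

    setup()
    workers = [threading.Thread(target=worker) for _ in range(THREADS)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()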

Results with the things that do matter:

Mode        Sync     Disk   Time (s)   Transactions/s
WAL         OFF      HDD    9.3        1075
WAL         NORMAL   HDD    36         277
NORMAL      OFF      HDD    13         769
NORMAL      NORMAL   HDD    99         101
NORMAL      FULL     HDD    128        78
WAL         OFF      SSD    1.8        5555
WAL         NORMAL   SSD    6.5        1538
NORMAL      OFF      SSD    3.0        3333
NORMAL      NORMAL   SSD    69         145
NORMAL      FULL     SSD    85         118
MEMORY *1   N/A      N/A    0.1        100000
PY DICT *2  N/A      N/A    0.01       1000000

Notes: NORMAL journaling mode here means DELETE; the SSD was using ext4 and the HDD was using NTFS, in case it matters. As we can see from the results, the file system or even the medium doesn't make that big a difference; it's all about the journaling mode and whether synchronization is required or not. Other factors are just minor nuances in this show.

From these results we can conclude that SQLite really is fast enough for 99.9% of web sites, as they say on their page. WAL mode without sync allows over 5k transactions per second on SSD and over 1k transactions per second on a traditional HDD. Even the default mode, which is NORMAL/FULL, still allows 78 transactions per second on a traditional HDD. Because the default locking mode for SQLite3 is deferred, multiple processes / threads can read the database simultaneously, and an error / exception / notification is only raised when the database is being written to. If you're writing efficient code, you'll do everything that is required first, and only at the very last moment start the transaction, update the data and commit it. The secondary factor after these modes is the locking model, which is ONE lock per database: if the section that modifies the database takes long, the whole process becomes slow. I also ran these tests with an artificial time.sleep(0.1) inside the "synchronized" block which reserves the database, and just as expected, it made things very much slower. So keep the sections that update the database short and quick, and don't commit too often. Btw, with the default deferred mode no journal file is created for read requests; the journal is only created when the database is being modified. Now someone could immediately point out that there are other journaling modes than the default, which is DELETE. That's absolutely true, but compared to the sync options and WAL mode the difference is really marginal, so I didn't bother to list those results separately.
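To illustrate the "start the transaction at the very last moment" point above, here's a tiny sketch with placeholder table and column names:

    import sqlite3
    import time

    def expensive_work(payload):
        time.sleep(0.1)            # stand-in for the slow part that must stay outside the write lock
        return payload.upper()

    def process(con, record_id, payload):
        result = expensive_work(payload)     # do everything slow first, outside any transaction
        con.execute('BEGIN IMMEDIATE')       # take the write lock only at the very last moment
        con.execute('UPDATE records SET result = ? WHERE id = ?', (result, record_id))
        con.execute('COMMIT')                # lock held only for this short section

    # Usage sketch against an in-memory database with a placeholder table.
    con = sqlite3.connect(':memory:')
    con.isolation_level = None
    con.execute('CREATE TABLE records (id INTEGER PRIMARY KEY, result TEXT)')
    con.execute('INSERT INTO records VALUES (1, NULL)')
    process(con, 1, 'hello')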

Based on this test, SQLite3 is actually the best database for my use in most cases. Usually I have long-running single-process tasks which commit at the very end of the process. SQLite3 is basically just a transactional data store.

As a point in SQLite3's favor, it's an "in-process" database, so it doesn't require the CPU and OS to do a context switch. This means that SQLite3 actually gives great performance in terms of latency when doing tons of light read transactions.

I personally wouldn't run any production system with journaling or sync disabled for data I don't want to lose. Turning journaling and sync off is great for a testing environment, but for production it's an absolute no-no. I recommend using WAL mode with synchronous=NORMAL, which is the default synchronous level when using WAL.

1) Just for fun, I also ran tests with a :memory: database, and as expected it's a bit faster than any of the versions using journaling or file system persistence. This also shows that the delays are mainly caused by I/O and not by SQL command processing or Python itself. Still, I had to try it without SQLite3 at all, so the final row in the table is a Python 3.2 dictionary with an int as key and an int as value: on a key exception the key is inserted with value 1.

2) All of these tests had a function which randomized the insertion sequence, so I tried it with 10000x1 and 1x10000, meaning 10000 different keys each created once versus one key updated 10000 times. Neither option made any meaningful difference, so 10000 different keys with inserts is pretty much as slow or fast as 10000 increments / updates to a single key. I also varied the number of threads, etc., with no practical difference, unless you do something silly like using 10000 threads to each update one individual dictionary entry per thread. But that doesn't make any sense anyway. If you want to go really crazy, use 10000 processes to each update one individual dictionary entry in memory. Btw, the time to create the thread pool and to close and join it is included in the timing.
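For completeness, the pure-Python baseline from note 1 boils down to roughly this (a sketch; the real test also ran it through the thread pool and included the pool setup in the timing):

    import random

    counts = {}
    # 10000 operations against randomized keys, roughly like the 10000x1 case in note 2.
    for key in (random.randrange(10000) for _ in range(10000)):
        try:
            counts[key] += 1   # increment on hit...
        except KeyError:
            counts[key] = 1    # ...insert the key with 1 on a key exception, as in note 1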

Details: I was using the sqlite3 module which comes with Python 3.2. The SSD was a "consumer grade" SSD running Ubuntu. The NTFS tests were run on Windows 7 Pro with a consumer grade 500 GB 7200 RPM SATA drive, in case that makes any meaningful difference.

PythonJS, Adaptive Design, User Identity, Android Studio

posted Apr 13, 2014, 7:06 PM by Sami Lehtinen   [ updated Apr 13, 2014, 7:08 PM ]

  • Perfect XKCD explanation of the Heartbleed bug.
  • Started a Google+ Brython community.
  • Played with PythonJS, a Python to JavaScript translator. It's a lot faster than Brython; the worst part of Brython was its performance, because its generated code isn't optimized.
  • Quickly tried Android Studio in a virtual machine. Even though it's an early access preview, it seemed to be much better than the Ubuntu HTML5 developer packet.
  • Played with a few adaptive web design CSS templates, but those were too complex to be utilized easily, so I stuck with a simpler design for dns-hosts. Responsive and fluid design were older names for this thinking, where pages adapt to the device being used. Unfortunately it's very common to see sites that are clearly designed for some kind of fixed end device. In the worst cases, of which I have encountered several this month, the site simply says get lost, you're using the wrong kind of device, without even allowing the user to select a different kind of layout. I often encounter this when forwarding links from desktop to mobile or from mobile to desktop. It seems that many Finnish news services are run by totally incompetent IT staff. "You've reached our mobile app, but not from a mobile browser." - What do I or you care about the browser I'm using? All I wanted to see was about 3 kilobytes of news article text. Does the browser really matter that much? I would say no, it doesn't matter at all. But that seems to be too complex a thing for the "web designers" to understand. Just give me the raw text, without any HTML, as pure ASCII, if you're even capable of doing that. Domains related to this whining are talentum.fi and tietoviikko.fi; they just fail so miserably.
    Anyway, adaptive design is a great feature for a CMS, where it's easy to produce multiple different designs from the same source data depending on the situation.
  • Wondered once again what the point of using an email address as user identity is. It's an inherently bad practice, yet many sites do so. An email address, telephone number, etc., can be anything and shouldn't be considered the ultimate 'unique identifier for a user'. I very much like the concept where the user id is just a plain, system-generated user id, like it is with Facebook. I'm saying this even though I hate Facebook, but that's something they got really right. Even SQRL uses the email address as the ultimate user identity, which is just silly. I can change my email address daily, as well as my burner phone number; it doesn't mean I'm no longer me. Likewise, password recovery using an email address is once again a very, very bad idea, but so widely used.
  • Short list of checked stuff: Amazon CloudFront + Wikipedia, Webhook (nothing new, just one way of triggering stuff), Internet in China, CNGI, IPv6 deployment in China, The Wolf Hunters of Wall Street, Riak, Redis, LevelDB, lmctfy.

Google App Engine now supports naked domains

posted Apr 13, 2014, 5:36 AM by Sami Lehtinen   [ updated Apr 13, 2014, 6:27 AM ]

Today it's live: App Engine now finally supports naked domains. To enable naked domain support for your service, simply go to your Google Developers Console, select the project, then App Engine, Settings, and there it is. You can now point a naked domain with DNS A and AAAA records, instead of using a sub-domain with a CNAME record. For more information see issue 777.

Google App Engine Naked Domains


E-Receipt, what's holding it back?

posted Apr 11, 2014, 11:43 PM by Sami Lehtinen   [ updated Apr 13, 2014, 12:25 AM ]

Hi. I started to map out the current situation of the e-receipt. As far as I can tell, there's no commonly usable standard for it. I'm now wondering if you know of an open, widely used or usable, free e-receipt standard? If you do, please let me know. If you're a receipt guru in your country, I might be interested to hear more from you. I naturally know the requirements for receipts in Finland and in the EU area, but outside that I haven't worked much on the topic. Of course, finding out the requirements for the US and its states should be quite trivial.

Let's skip the technical terms for now. What are the actual reasons holding the e-receipt back? I have seen several studies say that at least 85% of consumers would love to get e-receipts instead of collecting that annoying chemical paper trash in their wallets.

After a quick analysis I came up with two reasons.
1. Identifying the customer
2. Lack of standard

As an integration expert, I would say that fixing the latter is quite trivial in technical terms. We should form an expert group to create a versatile integration standard. I've done hundreds of integrations, yet each system handles receipt data differently. It's really annoying; there should be one widely used standard. All those discussions: is the total price the row item price times the quantity, and what if it's a return row? Should the quantity be negative, should the price also be negative, or is there some kind of separate return flag? Or maybe the whole receipt is a void receipt and whether a row is actually negative or positive isn't even directly available from any field of that individual record. How are discounts handled, especially subtotal discounts, etc.? I have very strong views on how those things should be done so that the result is efficient, simple, yet versatile enough.

Is there any interest in this kind of project? I can create a community and wiki for it, if there is. Even better, if there's already an existing (free & open) standard which I'm just not aware of, please let me know.

I already created a Google+ community group, e-receipt professionals SIG, for potentially interested people.

National situation in Finland:

In Finland, retailers are required by law to offer a receipt for a purchase transaction. An e-receipt is a viable option, but there's no standard for it. There has been some discussion that the e-receipt should be mandated by law, which would also make it possible for tax authorities to easily process receipts. But if there's no common standard for it, it's impossible to start using such a system.

Quotes from e-receipt SIG discussions (replies written by me, see the original questions in the SIG). I'm again explaining why the message format standard and data structure are so important:

Two problems mixed? Yes, that's why I labeled those separately as two different problems.

I'm aware that those are two separate problems and there will be two separate solutions. 1 is the harder part; I don't mean that it's technically hard, but it needs to be very quick and reliable, and nobody wants to type e-mail addresses at a coffee shop during rush hour. 2 is totally trivial. 3. One thing which limits adoption is the lack of standardization, which is why I'm especially addressing it in the SIG. If there's a commonly used standard, it's much easier to adopt. Closed POS vendor solutions just won't work. We've seen major, well-funded players like Nokia completely fail to deliver a simple service like this.

Without proper standards, it's just way too much for an individual shopkeeper or entrepreneur to pay something like 5-10k for simple integration stuff like this. But that's what it easily ends up costing when everyone uses their own custom data formats, requiring complex and slow (aka expensive) mapping, and different transports (sftp, ftp, http, email), etc.

Btw, I've done 100+ integrations and I've got 15+ years of experience in retail, POS software, solutions and system integration. I'm also very familiar with EDI and European e-invoice standards, which are just a bit too complex for easy adoption by most smaller players.

Why is the API standard and message format so important? Can't we just scan the paper receipts or grab the data going to the receipt printer?

Yes, or you could use a printer driver which forks the data stream. But those are all lame solutions, because the printed paper version no longer contains information which could be useful for automated processing.

Whenever I'm doing integrations, I very carefully consider when it's acceptable to discard information. Discarding information can make things look very easy and simple, until you find out that the process requires the data you happened to discard at the very beginning of the process.

As an example, you have a receipt with 100+ rows and tons of different items. Could you please tell me the tax class or tax rate for each individual row? No? Darn. That's life. That's exactly why a proper XML / JSON standard data structure is required. Electronic receipts in vendor-specific or "printout" formats simply aren't useful. There should also be a CSV format for legacy systems; many IT guys just won't believe how many systems there are out there which can't handle JSON or even XML.
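Purely as an illustration, not a proposed standard (all field names here are hypothetical), a receipt row carrying the kind of data a printout loses might look roughly like this:

    # Hypothetical field names, for illustration only; shows per-row tax data and an
    # explicit sign/return convention, which a scanned printout no longer carries.
    receipt_row = {
        "line_number": 1,
        "sku": "6417123456789",
        "description": "Coffee 500 g",
        "quantity": -1,           # negative quantity for a return row...
        "is_return": True,        # ...or an explicit flag; a standard has to pick one convention
        "unit_price": 4.95,
        "row_total": -4.95,
        "tax_rate": 14.0,         # tax class / rate per row, exactly what gets asked above
        "tax_amount": -0.61,
    }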

Think of the situation when you're filing your travel expenses. Great, you're again playing with paper receipts, maybe photographed, maybe scanned. But altogether, some vital key information is already missing and needs to be re-entered.

Even if the EDI formats and E-Invoice in Finland are pretty horrible standards, they are still a lot better than no standard at all.

That's why I'm asking for feedback and wish to form a SIG which would be able to create a standard that's freely vendor-extendable (!) yet not too complex for software developers to adopt.

The small accounting company I'm using wouldn't understand JSON / XML; they want everything by FAX.

Well, I completely agree with you. That's just the problem with current e-receipts. But I think I have quite a good view of this subject, because I've been working in this field for a long time. I especially always embrace smaller players and kind of loathe standards made by big international boards. First it takes forever for them to create the standard, and when it's done, it's messy, full of compromises and so complex that it's very hard for smaller players to implement correctly, if they bother at all. There has to be an easy-to-use user interface for the data. Raw data is there for techies, but there also has to be a human-readable standard representation of the same data. As examples I could bring up EDIFACT and Finvoice.

NFC application for a store chain.

I agree with you. But the main problem is that the field is so fragmented. Without an open standard there's vendor lock-in and you'd have to use 10-20 similar apps. Think about email: what if everyone were forced to use Gmail because of a lack of federation?
I know there are NFC mobile apps and similar stuff, but those are practically useless: they require customers to invest in the system and shopkeepers to invest in the system, and the system will very likely be proprietary, which is of course a bad thing.

I also had many more interesting discussions about this same topic from different viewpoints: spam issues, why anyone would want an e-receipt, and isn't it the retailer's responsibility to keep receipts? Well it is, but what's the use to you if those are kept by the retailer? That's the current norm, and usually they don't provide any way to access that data, etc.

A few keywords: ekuitti, sähköinen kuitti, e-kuitti, e-invoice, einvoice, electronic invoicing, electronic receipt, mobile receipt, email receipt, digital receipt, digital signature, service, site, API, JSON, XML, standard, schema, system integration, point of sale, retail sector, consumers, consumer, service, web site, web service, webservice, business, entrepreneur, european, special interest group, board of professionals, system specialist, European Union, legal discussion, requirements, fiscal, tax authorities, national legalization, laws, law, EU, Americal, global, international, internationalization, standardization, standards, stateful, stateless, adopt, adoption, national needs, extensions, extendable, extended, REST, RESTful, HTTPS, HTTP, flat files, SMTP, multi transport, transports, connection, connectivity, federation protocol, universal, distributed, open e-receipt standard, open e-receipt initiative.

Heartbleed, essential things or misinformation?

posted Apr 11, 2014, 10:04 PM by Sami Lehtinen   [ updated Apr 13, 2014, 8:34 PM ]

Are my comments regarding Heartbleed misinformation, or are they essential? I think many news stories miss the vital points and only focus on simple instructions which don't really cover what needs to be thought about.
  1. Why do they say that sites should renew their certificate? That doesn't make any sense at all. As far as I know, sites should renew their private key, not only the certificate.
    So what should be done instead of only renewing a certificate?
    Revoke the old certificate
    Generate a new key pair (which of course includes a new private and public key)
    Request a new certificate for the new key
    Install the new private key, public key and certificate on the server
  2. Because Heartbleed allowed reading arbitrary data, renewing all passwords on the server and all encryption keys is also required: SSH keys to the server, other account passwords, and, if there are any third-party integrations, API keys and other traditional login / password authentication information. Anything that might have been in the memory of the server could have been leaked. - As an example, if your site has Facebook API keys, it's possible that those have been leaking too, as well as cloud authentication keys/passwords for Amazon S3, Google Cloud Storage, etc.
  3. When people are told to replace their password, why aren't they also told to invalidate their authentication cookies? I have seen that many sites do not properly reset authentication cookies even when the password is changed. So changing only the password is like only renewing the certificate: it still leaves the possibility of abusing your account for anyone who has the cookies, even after the password has been changed. With some sites it's actually quite hard to get that authentication cookie properly replaced; even if you delete it, they might just set the same 'this user is logged in' cookie even though you're using a different password now, allowing continued access for the cookie holder.
I think it's pretty essential to fix all related deeper issues and not only focus on the very shallow surface layer of the issue.

Dash, IPv6 email, Cloud, Server outage, e-Receipt, Loyalty system, Loon, Drones, etc

posted Apr 11, 2014, 10:19 AM by Sami Lehtinen   [ updated Apr 11, 2014, 10:19 AM ]

  • Amazon Dash - Quite an interesting solution. But one thing hit me first and hard: why, oh why, a separate device? Why not a smartphone app? That hardware device doesn't bring any additional value, except a slightly better barcode reader. With an app, all family members could have it installed with a shared account, etc.
  • IPv6 Email - At first it would be easy to say that nothing is different, but as usual the topic runs deep. It's not the email itself; it's actually the abuse prevention which is complex.
  • For people who claim that the cloud is a bad solution, I just want to ask one question: do their companies have their own water and electricity plants? And if not, why? Do they really trust public utilities? Can they feed their workforce if public power, food and water supplies fail? Should they? Well, of course the military has its own solutions for these issues, but most businesses, even larger ones, do not. So what's the fuss about having your own data centers if you can't power them or keep your workforce alive?
  • Enjoyed one major server outage due to networking issues. It was interesting to follow how things escalated. Luckily nobody dies if these systems fail a bit. This is also exactly the reason why I run a secondary monitoring system in a different data center than the primary systems. This outage was so serious that it took down both the servers and the monitoring system. Only the secondary monitoring system was able to alert me that the primary system wasn't working. The primary system of course noticed that connectivity was lost, but it couldn't send out any alerts due to that very connectivity loss.
  • Studied e-receipt and bonus/loyalty system solutions: why some of those have been successful and why most have failed, what the key ingredients for success are, and what it takes to be successful in this sector. Currently there aren't any widely used solutions. I'm still wondering why this stuff is taking off so slowly. It's clear that this has been coming for several years, and yet even many major chains haven't adopted it, even though they already have everything it needs to become reality. They have all the receipts, they have loyalty card information, they have customer information and most importantly contact information, web service portals and everything. They also have receipt-line-level data for additional analysis, so allowing customers to access their own data shouldn't be that hard after all.
  • Checked out Project Loon & Facebook's drones; it's hard to say which one is better. But generally I would vote for Facebook's drones, because I assume those are way easier to control, and built correctly with durable electric engines and batteries, the stuff shouldn't require much more servicing than the balloons.
  • Read: MariaDB 10, Pyston (JIT Python), an HFT guide, a nice post about REST, upselling, cross-selling.

Database developer fails, I/O stack, Big Data, IaaS & Cloud, SaaS pricing

posted Apr 5, 2014, 7:32 AM by Sami Lehtinen   [ updated Apr 5, 2014, 7:59 AM ]

  • A total database performance killer from one great developer. He decided to use where key like '%input%'. Of course this is a solution which should never be used in production. It's fine if you have a small database and you're testing or looking for something, but in production it's a real killer. Now that query is run on a table containing a few million records, against a field which contains about 100 B - 100 kB of text per record. So even with a modest estimate, each request requires scanning through at least 2 gigabytes of data. It's just great: it basically triggers a massive disk I/O operation every time and also consumes a ton of CPU cycles. Anyone using the web application can trivially launch a denial of service attack by just sending a few queries to that specific URL. But of course there is a rate limit or a per-IP request limit? Well, of course there isn't, that wouldn't be any fun at all. Actually doing a little load testing would be just fine and interesting, to see what happens. Let's just push 100 requests / second to a query which takes minutes to complete.
  • Yet another performance killer and design flaw. One huge table with hundreds of millions of records has one column named status. Great, makes sense. Records stored into the table initially get status=0, and records which have been processed further get status=1. But there are a few record types which do not require any processing. Guess what: the process which handles records further naturally skips the records that do not require processing. Well, does it update the processed flag? Ha, of course not. It re-checks those records over and over again, skipping them every time. Another great question is why status is a fully indexed field; I would naturally have indexed it partially, only for status=0 (see the sketch after this list). If things were done smartly, there would only be a very small number of records with status=0 and the index would stay small. Now the index contains all the records. Of course the inefficiency with the status flag handling I described above also grows the index size and makes processing slower.
  • I/O Stack Optimizations - Absolutely excellent article. It also covers many different file systems, ext4, Btrfs, XFS, F2FS, and of course compares those with different SQLite3 journaling modes (DELETE, TRUNCATE, PERSIST, Write-Ahead Logging). I loved it. Perfect reading if you want to know how these fundamentals affect your application / system performance and why journaling a journal isn't always a good idea.
  • FT: Big Data: Are we making a big mistake? - Actually this article was quite good, but another article, about data analytics, was even better. The FT piece only covers the conclusions that unconfirmed data can lead to, while the other article actually goes into the details: Data analysis hard parts. What they say about data analysis isn't anything new at all. I'm used to making many kinds of integrations, like invoicing and bookkeeping, which are basically data analytics. At times it requires going through massive amounts of data, hundreds of millions of records, and processing all of it precisely. If your analytics are wrong, nobody might notice, but if your bookkeeping is off by 100k, I'm quite sure someone will notice at some point.
    Unfortunately most customers think that when they order something, it'll be perfect and fulfill whatever their needs are without any input from them. When I'm building an integration or analytics from data, I can be quite happy with any reasonable numbers; it doesn't matter to me if their profit comes out 30% lower or higher. This is the key reason why I always want any analysis to be confirmed by the customer's own people who know the data and work with it daily. They can immediately tell if something is seriously wrong.
  • Why does IaaS rule? Because multi-cloud comparison is otherwise nearly impossible; there are so many factors you can't reasonably affect. The simplest way of doing a multi-cloud comparison is using a pure IaaS concept with Linux boxes, which is much more easily comparable. If you take per-vendor PaaS features into account, comparison becomes very hard. Also, it might turn out that all your developers know technology X while the 'perfect solution' is built on technology Y, and therefore you can't really use it. This is a very complex topic.
    A definite and absolute pro of a pure IaaS/Linux solution is that you can easily transfer your servers almost anywhere: your home box, a local data center, dedicated servers at the customer, the customer's private cloud, nearly any major public cloud, or some small national hosting company, etc. Even a Raspberry Pi will do, if that's what's required.
    There's also the clear benefit that you don't need to pay the 'Microsoft tax' of expensive licensing.
  • Studied mod_wsgi (Apache2) in detail, read all the configuration parameters, etc. I had to; I had some issues when configuring Apache2 to run Python 3 WSGI scripts. I still failed, but that doesn't matter for now.
  • A very nice post about SaaS pricing in the form of Fantasy Tarsnap by Patrick (patio11 / Kalzumeus). If you sell cheaply, it's hard to make a profit, but there are also risks in rip-off pricing, because someone might just build a competitor, especially if the software is quite simple and doesn't require very special knowledge, like this Tarsnap backup solution.
  • Something totally different again, except this is still engineering: Kingdom Tower, The Illinois.
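Here's the partial-index sketch promised in the status-column bullet above, assuming a database that supports partial indexes (SQLite 3.8+ syntax below; PostgreSQL is similar). Table and column names are made up:

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE records (id INTEGER PRIMARY KEY, rectype INTEGER, status INTEGER)')
    # Index only the unprocessed rows; if the flag handling is done right, almost
    # everything has status=1 and the index stays tiny.
    con.execute('CREATE INDEX idx_unprocessed ON records(status) WHERE status = 0')
    # The processing pass should flip the flag even for record types that need no
    # further work, so they drop out of the partial index instead of being re-checked forever.
    con.execute('UPDATE records SET status = 1 WHERE status = 0 AND rectype IN (2, 3)')
    con.commit()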

System integration, kind of RESTful(ish) API, but just much more versatile

posted Apr 5, 2014, 6:04 AM by Sami Lehtinen   [ updated Apr 12, 2014, 11:53 PM ]

I had to write one RESTful(ish) system integration API.

First of all, I wanted to make it fully RESTful and according to REST principles. But because the data structures and the other systems being integrated unfortunately aren't designed with REST in mind at all, I had to break that plan at a very early stage. So yes, it's possible to update some limited data sets in a fully RESTful fashion, but for most of the data it's not possible, for efficiency reasons.

Here are the short notes:
  • Generic data structure support for all database tables in the system.
  • Using user rights it's possible to restrict access, but there's no technical limitation forcing us to support only a very small set of features.
  • Support for an unlimited number of remote systems. Some other APIs are fundamentally designed to support only one integration point, which is a very limited approach by today's standards. As far as I have seen, many systems need to be integrated with several other systems in parallel, so supporting only one integration point just isn't going to get the job done.
  • Full support for multi-tenant configurations. Some other interfaces have been designed so that they do not support multi-tenancy at all, or do it in an insecure fashion.
  • Fetch data updates from a remote system (PULL), send data updates to the other side (PUSH). In the smaller customer segment, getting something simple like a static IP can be problematic, so I immediately had to rule out designs where data is always pushed from the data source to the recipient in near real time. In some cases that works great and is the optimal solution, but in some cases it just doesn't. My design allows data to be pushed or pulled on both ends, whatever is the best solution in each case. All methods can be freely mixed and used simultaneously: the same data can be sent to system A in near real time, while system B receives exactly the same data in daily batches, in a different format and using a different transport.
  • Data selection by timestamp, record-id or last-poll-id. This was one of the reasons to abandon pure REST at a very early stage. In the source database there are tons of records that can be changed, and there's absolutely nothing that indicates that the data has changed. Basically this means there's no other option than to get the full list of records (possibly millions of them) and then check each record remotely, one at a time, if using pure REST. As we all know, that's not going to work. Therefore I introduced change monitoring based on the client id, the last poll id and hashes of the last records sent for that data. When the list of changed records is being fetched, it's filtered locally so that only changed records are sent. I know this isn't the most efficient way to do it, but unfortunately with the current data source design there's no other way to get it done. Some tables are clearly append-only, and there it is possible to just fetch records N+1 until hitting the current tail. Some tables also contain timestamps, so it's possible to get data 'since something'. Of course the most commonly polled, updated and processed tables do have timestamps and/or sequential record ids. But there are a lot of tables which contain no indexed indication of potentially changed data, nor is there a log table from which this information could be derived indirectly. Hash-based change checking is usually done during the night, because it is a heavy operation.
  • The last poll id can be maintained remotely or locally. This was an important design factor for methods which do not actually receive any kind of request, but are triggered locally using a timer or some other criterion.
  • Support for all data formats: JSON / XML / CSV. It's nice to say that we support JSON, but unfortunately there are legacy systems that don't support anything other than CSV. Don't get me wrong, I think CSV is one of the most useful formats ever; even if Excel's CSV support is horrible, CSV is still great for so many things. My base design goal was that I can provide the same data in any format. If you want it as a props file, I can arrange that too; it's just reformatting the same stuff into a different package. My goal is to make the system as versatile as possible, so that we can choose the right data format and transport, whatever suits the other side. I've seen way too many APIs which are really complex and almost impossible to implement with some legacy ERP / bookkeeping systems. And just as I said, HTTP / JSON - RESTful is a perfect example of that: I'd love to see how your 15-year-old software product easily digests this data and makes the required calls.
  • Support for several transfer modes: FTP, SFTP, FTPS, EMAIL (IMAP / SMTP), HTTP, HTTPS, local files, UNC paths. Yes, that's a basic part of the versatility required; I'll transfer the data using whatever transport suits the recipient. This also makes it clear why local state of what has been processed simply had to be stored, even if that's not right according to REST principles.
  • Scheduled data transfers to multiple destinations. That's obvious: if no request is made to send data, the transfer has to be triggered some other way, most often a simple scheduler. Usual run intervals are anything from 5 minutes to once per month. Yes, you wouldn't believe it, but some integrators still prefer monthly data transfers; even daily transfers are way too frequent for them and give them too much data. They want it aggregated at a monthly level.
  • Simple and easy-to-use user interface: a simple web UI for configuring what is being sent, when, to where, and in which format, as well as setting credentials for remote requests.
  • Parallel, transactional writes using optimistic concurrency control, allowing maximized parallel write performance without any 'database reserved' errors.
  • Support for parallel multi-record (different tables) write transactions. Often it's better to have a single standard data structure, but in this case multi-table updates were required for versatility. As an example, you want to update three separate records in different tables with different data structures, transactionally. How can you be sure that your change won't get overwritten by other concurrent writes? This is exactly how it's done. First you read all three records, and with each record you get its etag (see: Entity Tag), which is basically a hash of the whole record content. As a side note, this same hash function is also used locally to monitor changes. After you have received all three records, made your own modifications locally and want to push them back, you send a special multi-record transaction message: you combine all records into one larger message, including only the modified fields, and each record being updated must contain the original etag. When you then send the command to commit this multi-write transaction, you get either a success or a failure response. When the message is received, a transaction is started internally and the records are checked for changes using the etags; if all records match, the records are updated and committed. If anything goes wrong with this local transaction, you get a failure message. If the etags didn't match and record(s) were changed in the meantime, you get information about which records were changed, so you can fetch the required records, re-apply your modifications and commit again with a new multi-write transaction. If you don't get any response, you may retry the same transaction later. (See the sketch after this list.)
  • The same transaction ID can be a success or failure exactly once. This is because I have unfortunately seen so many systems which are badly broken at the transaction level, leading to situations where transactions are lost, applied multiple times, committed partially or corrupted in very imaginative ways, leading to very interesting situations. This time I wanted to be absolutely sure to avoid those problems. So it's impossible to overwrite changes by accident: if you modified an old version of a record, you can't write it back. It's also impossible for transactions to get lost (you can freely retry until success), and it's impossible for them to be partially committed, because each data segment of the transaction is separately confirmed and the final commit is also explicitly confirmed. I just wish many other systems worked similarly.
  • Support for field-based deltas. This is very un-RESTful, but it's just so common that many systems require delta data. Maintaining the information needed to deliver delta data at this scale uses a lot of disk space and resources. It's possible to request deltas for certain columns and tables. The fact is that there are just so many systems which can't compute the delta locally and expect updates to arrive in delta format: if the value was 1000 yesterday and is 1100 now, they don't want to receive 1100, they want to get only the 100. This means the change logging database has to store the last value transmitted to the recipient for the selected fields, keys and tables. This delta checking is subject to the etag / hash checking when using HTTP; when using stateless methods like files, all of the data of course has to be stored locally.
  • Had to be designed for efficient operation with over 700 different tables and millions of records per table. This is also one of the reasons why pure REST with pagination, key-id fetches and table scans wasn't a feasible option.
  • The integration package contains a standalone HTTP server; the scripts can also be run using WSGI or CGI, which is preferred if HTTPS is required.
  • The integration package doesn't contain integrated FTP, SFTP, FTPS, MTA (IMAP / SMTP) or SAMBA server features in the application itself, but I do have client code for all of these methods, so I can fetch data using any of them on a schedule. Thanks to local file / UNC path support, it's possible to run the required servers separately. And just out of curiosity, yes, there are also integrations which use SMTP transport; it's just as good a way to transfer data as any other.
  • Final note: receiving data and storing it into tables is seriously restricted on purpose. That's because many of the tables have a very intricate internal structure, and therefore writing isn't allowed for 3rd parties unless specifically agreed and tested. Building proper data validation for this kind of extensive dataset was absolutely out of scope.
  • I personally love a can-do attitude and versatility. Overly complex, exclusive, over-engineered and limited data transfer solutions just aren't my way of doing things, because ignoring simplicity often means there's a lot of work to be done on the other side to get their system to meet the requirements you have set. It's a bit like renting a Ferrari to your customer, without wheels and fuel: it's much better than any other car you could find at other rental shops, yet it requires the customer to provide their own wheels and fuel.
  • I could write a lot about file naming, transferring files securely over different transports, etc., especially about avoiding situations where partially transferred files get processed while still in transit. But I have already written something about it here, even if it's a very compact section in a long post.
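And here's the sketch referenced in the multi-record transaction bullet above. It's not the actual API, just a minimal illustration of the etag / optimistic concurrency idea: the etag is a hash of the whole record content, and a multi-write commit is applied only if every etag still matches.

    import hashlib
    import json

    def etag(record):
        # Hash of the whole record content; keys are sorted so the hash is stable.
        return hashlib.sha256(json.dumps(record, sort_keys=True).encode('utf-8')).hexdigest()

    def commit_multi_write(store, changes):
        """store: {(table, key): record dict}; changes: [(table, key, expected_etag, modified_fields)]."""
        # Phase 1: verify every etag before touching anything.
        stale = [(table, key) for table, key, expected, _ in changes
                 if etag(store[(table, key)]) != expected]
        if stale:
            # Caller re-reads the listed records, re-applies its changes and retries.
            return {'status': 'failure', 'changed_records': stale}
        # Phase 2: apply all updates together; only the modified fields are sent.
        for table, key, _, modified_fields in changes:
            store[(table, key)].update(modified_fields)
        return {'status': 'success'}

    # Usage sketch with an in-memory stand-in for the database.
    store = {('orders', 1): {'state': 'new', 'total': 100},
             ('order_rows', 7): {'qty': 2}}
    tag = etag(store[('orders', 1)])
    print(commit_multi_write(store, [('orders', 1, tag, {'state': 'processed'})]))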

This post is quite random and I wrote it in a hurry, but I hope it clarifies some of the choices I have made with integrations.

If you have any ideas, suggestions or questions, I'm really happy and willing to talk about them. Just for fun, check out these fails.

Added a bit later, my continued thoughts:

These are my additional comments based on some questions I have received about my post.
  1. Last-Poll-Id means that a list of changes since that poll ID needs to be kept on the server side, so it's against pure REST, where no client state is maintained on the server.
  2. About using CSV. Keep it simple, and it works.
  3. The same (as 1) also applies to the delta records. Basically it requires maintaining per-client data locally which can be used to derive the delta, making the system stateful per customer. Pure REST solutions shouldn't maintain state per client. But these are hard optimization questions: should I force the client to pull 5 million records just to see if one of them has changed, or could I just tell them that 5 of them have changed and send a list of those, if I maintain that data locally?
After all, things could be much cleaner and simpler if the original database had been designed with integrations in mind. Because that's not the case, complex measures can be required at this level. This is also the reason for points 1 and 3: I simply can't tell if something has changed since X without maintaining detailed per-client records. One option could be some kind of Merkle tree solution to find out what has changed, but that would also mean flooding the system with tons of queries and a huge number of round trips.

As a compromise between simplicity and generality, it might after all be necessary to build separate message formats for the most common transactions involving data from multiple sources, for two reasons: it makes things both simpler and more efficient on the integrator's side.

Test project: DNS-Hosts, Bottle.py, Apache2, PostgreSQL

posted Apr 4, 2014, 10:47 PM by Sami Lehtinen   [ updated Apr 12, 2014, 3:33 AM ]

DNS-Hosts is a very simple project which generates a hosts file from a precomputed list of DNS names. List maintenance is crowdsourced. There's also a project roadmap with the features I'm willing to add if some user base shows up after all. It was quite clear from the very beginning that this is a marginal requirement, so I just made it my first practice project with PostgreSQL and the Bottle micro web framework. Of course I was familiar with all the related things, but reading theory and actually doing something is a bit different.

The project contains 6 database tables and three indexes for the most commonly used data. There's automatic abuse detection and blocking, statistics tables for daily usage statistics, daily usage reports by email, logging tables, automated monthly database cleanup, etc. So if I want to, I could extend this platform to do whatever is required; I have the basic capability to build systems on demand. I've done similar practice projects on Google App Engine too, in case I need a solution which is able to scale, a lot. There are three templates, one CSS file, favicon.ico, etc.

As you can see from my previous blog post, everything went really easily and smoothly until I had to actually deploy the script to the server. That's where all the problems began, because I was trying to run it as a normal CGI script using ScriptAliasMatch; as a matter of fact, it required a different configuration using the RewriteEngine and DirectoryIndex options to get it completely working. I think that was the hardest part of the whole project.

The project naturally uses HTTP etags, cache-control and browser caches, as well as server-side internal caching of precomputed data, correctly and efficiently.
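A minimal sketch of what that looks like with Bottle (not the actual dns-hosts code; the hosts body and max-age are placeholders):

    import hashlib
    import bottle

    app = bottle.Bottle()
    HOSTS_BODY = '127.0.0.1 localhost\n'   # placeholder for the precomputed hosts list

    @app.route('/hosts')
    def hosts():
        tag = '"%s"' % hashlib.sha1(HOSTS_BODY.encode('utf-8')).hexdigest()
        if bottle.request.get_header('If-None-Match') == tag:
            return bottle.HTTPResponse(status=304)     # client cache is still valid
        bottle.response.set_header('ETag', tag)
        bottle.response.set_header('Cache-Control', 'max-age=3600')
        return HOSTS_BODY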

I also tried installing it in WSGI mode, but because it's written in Python 3, I quickly found myself in such a niche that it would have taken too much time. The latest versions of mod_wsgi should work with Python 3, but it was quite hard to find any sufficient information on how to configure it. Getting that to work would probably have taken more time than the whole project so far altogether.

I've written similar kinds of applications earlier using Google App Engine, and the most delightful difference was the speed of the local database compared to Google App Engine's Datastore, with its sharded counters, optimistic concurrency control, etc. So maybe the cloud isn't the perfect solution every time. Of course a local database has its limitations, but as long as those aren't a problem, it's a very nice and fast solution.

This was also a good test, because I have some ideas, and if those ideas need to be executed, I must have an existing working technology stack so I don't run into sudden problems like the ones described above.

I would also probably use Bottle if I needed to write an extensive RESTful integration package, but I'll blog more about that later. Anyway, utilizing the same backend for any kind of integration stuff with authentication, JSON / XML, etc., is simply trivial. As you can see from my sites, I'm not a web designer, but I'm sure that part can be arranged with some interesting side project if required.

Bottle.py, Apache 2, WSGI, CGI, full software development joy experience

posted Apr 4, 2014, 9:39 AM by Sami Lehtinen   [ updated Apr 4, 2014, 10:56 AM ]

Sometimes developing is fun. Sometimes it's just so frustrating. Here's one story.

I'm just curious if anyone has used Bottle with CGI. It was a breeze to create my app, but when installing it onto the production server, all the traditional developer hell broke loose.

1) Bottle CGI routing seems to be broken
2) mod_wsgi doesn't seem to work very well or easily with Python 3

It's just so wonderful when you know that you have alternatives. OK, this doesn't work, let's try the alternate approach, and it doesn't work either. Aaahh, the complete developer enjoyment experience. Bottle's code seems to be quite complex, but I'll try to get that routing fixed. Now I have the application working, but I'm using Bottle's root route with my own router on top, which doesn't make any sense at all. I've now spent about three hours on something that should be absolutely trivial, which is a bit frustrating. Bottle is clearly broken, and now there are nasty, sharp glass shards everywhere.
Btw, if I check os.environ when running Bottle via CGI, it looks just fine: {'SCRIPT_NAME': '/testing', 'REQUEST_URI': '/testing'}. But guess what, just like I said: whatever the URL says, it's always routed to / by Bottle.
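For context, this is roughly the shape of the setup in question (a minimal sketch with a hypothetical route, assuming bottle.py is importable and the script is run as plain CGI):

    import bottle

    app = bottle.Bottle()

    @app.route('/testing')
    def testing():
        return 'hello from /testing'

    if __name__ == '__main__':
        # Bottle's built-in CGI adapter; with the problem described above, a request
        # for /testing was still seen by Bottle as a request for /, until the Apache
        # rewrite workaround described in the update below was in place.
        bottle.run(app, server='cgi')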

I'm still looking for a solution. Potential discussion is in the Google+ Python Community / my post.

Updated about one hour after the post:

Well, I finally got it fixed. In the end I figured out that if I use a rewrite and append the path to the script name, it starts to work. So Bottle isn't using the request path, but the script parameters instead. Now it works for direct links and for directories, using pure CGI. Then I had some challenges getting the indexes (root path) working, but that only required additional configuration on the Apache side so that requests are routed correctly to the script's root path. So ScriptAliasMatch doesn't work with Bottle, and you have to use RewriteEngine and DirectoryIndex to get things really working.

Btw, the site I wrote just as Bottle practice is now open: DNS-Hosts.
