My personal blog about things I do, like and am interested in. If you have any questions, feel free to email me!
Full list of blog posts
- Checked out the MyData open personal data project. Also watched their YouTube videos and vision presentations. I noticed that there's no plan for a generic receipt standard. Here's also a forum discussing Finnish open data standards (Fin: AvoinData) and requirements.
- I also got information that the NRF has an e-receipt standard, but it seems to be a secret one. I sent them an email asking for the documentation and specifications. What's the use of a secret standard? I already complained that Finvoice is a complex standard, but a secret standard is much worse than a complex one. It's almost as good as no standard at all.
- Played with the Tesseract-OCR engine; unfortunately the source image quality wasn't good enough for the purpose I tried it for. It basically requires high quality scans, and images captured with a mobile phone camera just won't work with it. Maybe I'll try it again later, or try some other library that does OCR better on lower quality images. It's also worth noting that mobile phone cameras are developing fast. I'm not sure whether image pre-processing would have improved the situation. The poor results don't surprise me; mobile phone snaps sometimes have so much glare or blur that they are annoying, slow or even hard for humans to read. If it had worked well, I had an idea for a new website service concept that doesn't currently exist at all.
- Server hosting is getting cheaper and cheaper; these days the cost of servers is next to nothing compared to the cost of work. I guess this is bad for optimization and low level skills. Here are the new all-SSD hosting plans by Linode.
- Checked out PASM (Pyston), the world's first x86-64 assembler working in the web browser. That's a bit too low level geekery for me.
- Studied CloudFlare WebSocket & SSL support. It seems there's no WebSocket support.
- Spent a few hours playing with snakeMQ and Redis; both worked nicely between four virtual Ubuntu Servers in VirtualBox. These solutions allow much higher performance than the traditional "it works but isn't pretty" approach of using SQL tables as queues.
Often the reason for using SQL tables as queues is that it keeps the system stack simpler. For small projects I really prefer keeping the number of different technologies used to a minimum; it makes life easier for everyone. I kind of got assaulted in one group because I talked about SQLite3; they told me there are other, professional database solutions out there. Yes, but if I just want to run a simple script on a random low end system, I don't want to install a massive enterprise database for it. Installing and configuring such a database is 100x the work it takes to set up the whole project when using SQLite3. And what if someone wants to run the project on a Raspberry Pi? It should work without any problems from a 16G SD card, operating system included. Another example is the next entry.
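For reference, the "SQL table as a queue" pattern mentioned above can be sketched like this. This is my own minimal illustration with a made-up schema, not code from any project mentioned here:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so we can
# manage transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT)")

def enqueue(payload):
    conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))

def dequeue():
    # BEGIN IMMEDIATE takes the write lock up front, so two parallel
    # workers could never claim the same item.
    conn.execute("BEGIN IMMEDIATE")
    row = conn.execute(
        "SELECT id, payload FROM queue ORDER BY id LIMIT 1").fetchone()
    if row is not None:
        conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return None if row is None else row[1]

enqueue("job-1")
enqueue("job-2")
print(dequeue())  # job-1
```

It's nowhere near Redis in throughput, but the whole "broker" is one table in a file.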
- Studied Digital Ocean's comparison of Python based web applications. It's of course possible to set up Nginx and uWSGI, but since uWSGI already includes a perfectly capable web server, why install Nginx and complicate the setup as long as it's not required? I also liked these posts: Python Cloud Web-Applications and How To Structure Large Flask Applications. It seems to me that Flask is a pretty hot framework right now.
- Added a separate projects page. The future will tell if it gets maintained. I personally prefer simple and easy to maintain solutions; having the same stuff in different places and cross-linked can be hard to maintain.
- Started to study Android's own URI schemes and Android intents, including the data android:scheme and action android:name attributes inside a native intent-filter.
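The manifest part looks roughly like this; the scheme name "myapp" and the activity name are made-up placeholders:

```xml
<!-- Hypothetical sketch; "myapp" and ".MainActivity" are placeholders. -->
<activity android:name=".MainActivity">
    <intent-filter>
        <action android:name="android.intent.action.VIEW" />
        <category android:name="android.intent.category.DEFAULT" />
        <category android:name="android.intent.category.BROWSABLE" />
        <!-- Lets links like myapp://open/something launch this activity -->
        <data android:scheme="myapp" />
    </intent-filter>
</activity>
```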
- Checked out JulkICT, a Finnish open data project.
- Studied the basics and concepts of currently available Location Based Services (LBS) in Finland and Europe. It's clear that LBS data will be widely used, but many of the older web businesses don't seem to have picked it up.
- Studied Geographic Information Systems (GIS), PostGIS and spatial databases and the required indexes.
I'm trying to figure out what transaction rate can be satisfied with SQLite3, and when other database solutions are required.
Here's a simple test: running 10,000 transactional selects & updates in 10 parallel threads. The update process does only what's required: it reads data, updates it and commits. After several tests I found that only a few things really matter: the synchronous mode and the journaling mode. I have seen many people recommend a larger page size for better performance, but in this light test the default 1024 byte page size turned out to be optimal. Any larger page only requires more reading and writing and brings down performance, just as you would expect.
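The post doesn't include the test code, so here's a rough sketch of the kind of benchmark described above; the schema, counts and PRAGMA choices are my own assumptions (scale THREADS and OPS up to 10 x 1000 for the real measurement):

```python
import os
import sqlite3
import tempfile
import threading
import time

DB = os.path.join(tempfile.mkdtemp(), "bench.db")
THREADS, OPS = 4, 250  # reduced from 10 x 1000 to keep the demo quick

conn = sqlite3.connect(DB)
conn.execute("PRAGMA journal_mode=DELETE")  # the "NORMAL" mode in the table below
conn.execute("PRAGMA synchronous=OFF")      # try OFF / NORMAL / FULL
conn.execute("CREATE TABLE counters (k INTEGER PRIMARY KEY, v INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 0)")
conn.commit()
conn.close()

def worker():
    # isolation_level=None lets us issue BEGIN IMMEDIATE ourselves.
    c = sqlite3.connect(DB, timeout=30, isolation_level=None)
    for _ in range(OPS):
        c.execute("BEGIN IMMEDIATE")  # take the write lock up front
        v = c.execute("SELECT v FROM counters WHERE k=1").fetchone()[0]
        c.execute("UPDATE counters SET v=? WHERE k=1", (v + 1,))
        c.execute("COMMIT")
    c.close()

start = time.time()
threads = [threading.Thread(target=worker) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{THREADS * OPS / (time.time() - start):.0f} transactions/second")
```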
After some playing I figured out that in this case it doesn't really matter whether the transaction mode is deferred, immediate or exclusive. Using deferred just required me to handle 'locking failures' in my code, but immediate and exclusive gave pretty much the same results. Why? Because basically it doesn't matter how it's done; all that matters is that there's a read and write lock on the table. Even if multiple parallel reads on the table were allowed, only one commit will succeed and the others will fail, because there's no row level locking. Anyway, as we know, this isn't the highest performing database out there. But let's estimate what it is good for.
Results with the things that do matter:

| Mode       | Sync   | Disk | Time (s) | Transactions/second |
|------------|--------|------|----------|---------------------|
| NORMAL     | OFF    | HDD  | 13       | 769                 |
| NORMAL     | NORMAL | HDD  | 99       | 101                 |
| NORMAL     | FULL   | HDD  | 128      | 78                  |
| MEMORY *1  | N/A    | N/A  | 0.1      | 100000              |
| PY DICT *2 | N/A    | N/A  | 0.01     | 1000000             |
Notes: NORMAL journal mode in this case means DELETE; the SSD was using ext4 and the HDD was using NTFS, if that matters. As we can see from the results, the file system or even the medium doesn't make that big a difference; it's all about the mode and whether synchronization is required. Other factors are just minor nuances in this show.
From these results we can conclude that SQLite is really fast enough for 99.9% of web sites, as they say on their page. WAL mode without sync allows over 5k transactions per second on SSD and over 1k transactions per second on a traditional HDD. Even the default mode, which is NORMAL/FULL, still allows 78 transactions per second on a traditional HDD. Because the default locking mode for SQLite3 is deferred, multiple processes / threads can read the database simultaneously, and an error / exception / notification only occurs when the database is being written to. If you're writing efficient code, you'll do everything that is required first, and only at the very last moment start the transaction, update the data and commit it. That's because the secondary factor after these modes is the locking model, which is ONE lock per database: if the section which modifies the database takes long, the whole process becomes slow. I also ran these tests with an artificial time.sleep(0.1) inside the "synchronized" block which reserves the database, and just as expected, it made things very much slower. So keep the sections which update the database short and quick, and don't commit too often. Btw. in the default deferred mode, no journal file is created on read requests; a journal is only created when the database is being modified. Now someone could immediately point out that there are other journaling modes than the default, which is DELETE. That's absolutely true, but compared to the sync options and WAL mode the difference is really marginal, so I didn't bother listing those results separately.
Based on this test, SQLite3 is actually the best database for my use in most cases. Usually I have long running single process tasks, which commit at the very end of the process. SQLite3 is basically just a transactional data store.
As a pro-SQLite3 comment it can be stated that because it's an "in-process" database, it doesn't require the CPU and OS to do a context switch. This means that SQLite3 actually gives great performance in terms of latency when doing tons of light read transactions.
I personally wouldn't run any production system with non-default transaction handling for data I don't want to lose. Turning journaling and sync off for a testing environment is great, but for production it's an absolute no-no. I recommend using WAL mode with synchronous=NORMAL, which is the default when using WAL mode.
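That recommended setup takes two PRAGMAs; a minimal sketch (the file path is a placeholder):

```python
import os
import sqlite3
import tempfile

# WAL journaling with synchronous=NORMAL, the combination recommended
# above for production. WAL needs a real file, not :memory:.
path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("PRAGMA synchronous=NORMAL")
print(mode)  # wal
```

Note that journal_mode is persistent (stored in the database file), while synchronous must be set per connection.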
1) Just for fun, I also ran the tests with a :memory: database, and as expected it's a bit faster than any of the versions using journaling or file system persistence. This also proves that the performance delays are mainly caused by I/O, not by SQL command processing or Python itself. Yet I had to try it without SQLite3 at all, so the final row in the table is a Python 3.2 dictionary with int as key and int as value; on a key error, the key is inserted with value 1.
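My reading of that pure-Python baseline, as a sketch: a dict counter that inserts the key with 1 when it's missing.

```python
# Counter dict: increment on hit, insert with 1 on a missing key.
counts = {}

def increment(key):
    try:
        counts[key] += 1
    except KeyError:
        counts[key] = 1

for _ in range(3):
    increment(42)
print(counts)  # {42: 3}
```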
2) All of these tests had a function which randomized the insertion sequence, so I tried it with 10000x1 and 1x10000, meaning 10000 different keys with only creates. Neither option made any meaningful difference: 10000 different keys with inserts is pretty much as slow or fast as 10000 increments / updates to a single key. I also varied the number of threads etc., with no practical effect. Unless you do something silly like using 10000 threads to each update one individual dictionary entry, but that doesn't make any sense anyway. If you want to go really crazy, use 10000 processes to each update one individual dictionary entry in memory. Btw. the time for creating the thread pool and closing and joining it is included in the timing.
Details: I was using the sqlite3 module that comes with Python 3.2. The SSD I used was a "consumer grade" SSD with Ubuntu. The NTFS tests were run using Windows 7 Pro and a consumer grade 500 GB 7200 RPM SATA drive, if that makes any meaningful difference.
Added one comment from the discussion this post led to:
Actually, multiple programs, processes and threads can all access the same database at the same time perfectly well, but only one active write transaction is allowed at a time. When using WAL mode, slow open read transactions can run perfectly well in parallel with the (one) write transaction. If you need additional performance, you can work around that limitation by splitting the data so that there's a small table (database) which contains often-updated data, and a larger one for the rest of the stuff, which isn't updated multiple times per second. So in the case of a web forum, there can be a separate DDoS protection table which is updated on every call, and then the rest of the system data which is updated only when a logged-in user makes changes. In forums, you might like to split data like 'last read' flags into their own database, and not store them in the same database as the message / forum tables. But as said, those things need to be done only when necessary; for most sites, that time will be never. Of course, if you hit that kind of situation, it's a good idea to start looking for higher performance solutions, or replace the DDoS table with an in-memory database or something similar. As said, SQLite3 puts reliability before performance with its standard settings, and that's what most people really like. We all know the amount of whining MongoDB has received because of lost committed data; it doesn't guarantee durability by default.
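The split described above can be sketched with SQLite's ATTACH; file names and the schema here are made up for illustration:

```python
import os
import sqlite3
import tempfile

base = tempfile.mkdtemp()
main = sqlite3.connect(os.path.join(base, "forum.db"))
main.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")

# Frequently written data lives in its own file; writers using separate
# connections to hot.db don't hold the write lock of forum.db.
main.execute(f"ATTACH DATABASE '{os.path.join(base, 'hot.db')}' AS hot")
main.execute("CREATE TABLE hot.hits (ip TEXT PRIMARY KEY, count INTEGER)")

main.execute("INSERT INTO messages (body) VALUES ('hello')")
main.execute("INSERT INTO hot.hits VALUES ('203.0.113.1', 1)")
main.commit()
print(main.execute("SELECT count FROM hot.hits").fetchone()[0])  # 1
```

ATTACH is convenient for queries that span both files, but the locking benefit comes from processes that open only the hot database directly.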
- A perfect XKCD explanation of the Heartbleed bug.
- Started a Google+ Brython community.
- Quickly tried Android Studio on a virtual machine. Even though it's an early access preview, it seemed to be much better than the Ubuntu HTML5 developer package.
- Played with a few adaptive web design CSS templates, but those were too complex to be utilized easily, so I stuck with a simpler design for dns-hosts. Responsive and fluid designs were older names for this thinking, where pages adapt to the device being used. Unfortunately it's very common to see sites that are clearly designed for some kind of fixed end device. In the worst cases, of which I encountered several during this month, the site simply says get lost, you're using the wrong kind of device, without even allowing the user to select a different kind of layout. I often encounter this when forwarding links from desktop to mobile or from mobile to desktop. It seems that many Finnish news services are run by totally incompetent IT staff. "You've reached our mobile app, but not from a mobile browser." What do I or you care about the browser I'm using? All I wanted to see was about 3 kilobytes of news article text. Does the browser really matter that much? I would say no, it doesn't matter at all. But that seems to be too complex a thing for the "web designers" to understand. Just give me the raw text, without any HTML, as pure ASCII, if you're even capable of doing that. The domains related to this whining are talentum.fi and tietoviikko.fi; they just fail so miserably.
Anyway, adaptive design is a great feature for a CMS, where it's easy to produce multiple different designs from the same source data depending on the situation.
- Wondered once again what the point of using an email address as user identity is. It's an inherently bad malpractice, yet many sites do so. An email address, telephone number, etc. can be anything, and shouldn't be considered the ultimate 'unique identifier for a user'. I very much like the concept where the user id is just a plain user id, something the system generated, like it is with Facebook. I'm saying this even though I hate Facebook; that's something they got really right. Even SQRL uses the email address as the ultimate user identity. It's just silly. I can change my email address daily, as well as my burner phone number; it doesn't mean it wouldn't be me anymore. Password recovery using an email address is likewise one very, very bad idea, but so widely used.
- Short list of checked stuff: Amazon CloudFront + Wikipedia, Webhook (nothing new, just one way of triggering stuff), Internet in China, CNGI, IPv6 deployment in China, The Wolf Hunters of Wall Street, Riak, Redis, LevelDB, lmctfy.
Today it's live: App Engine now finally supports naked domains. To enable naked domain support for your service, simply go to your Google Developers Console, select your project, then App Engine, Settings, and there it is. You can now use a naked domain when specifying DNS A and AAAA records, instead of using a sub-domain with a CNAME record. For more information see issue 777.
Hi. I started to map out the current situation of the e-receipt. As far as I can tell, there's no commonly usable standard for it. I'm now wondering if you know of an open, widely used or usable, free e-receipt standard? If you do, please let me know. If you're a receipt guru in your country, I might be interested to hear more from you. I naturally know the requirements for receipts in Finland and in the EU area, but outside that I haven't been working much on the topic. Of course, finding out the requirements for the US and its states should be quite trivial.
Let's now skip the technical terms. What are the actual reasons holding the e-receipt back? I have seen several studies saying that at least 85% of consumers would love to get e-receipts instead of collecting that annoying chemical paper trash in their wallets.
After a quite quick analysis I came up with two reasons.
1. Identifying the customer
2. Lack of standard
As an integration expert, I would say that fixing the latter is quite trivial in technical terms. We should form an expert group to create a versatile integration standard. I've done hundreds of integrations, and each system handles receipt data differently. It's really annoying; there should be one widely used standard. All those discussions: is the total price of the row the item price times the quantity, and what if it's a return row? Should the quantity be negative, should the price also be negative, or is there some kind of separate return flag? Or maybe the whole receipt is a void receipt, and whether a row is actually negative or positive isn't even directly available from any field of that individual record. How are discounts handled, especially subtotal discounts, etc.? I have very strong views on how those things should be done, so that the result is efficient and simple, yet versatile enough.
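To make those questions concrete, here's a purely hypothetical sketch of one receipt row; none of these field names come from any existing standard, they just illustrate one possible convention:

```python
# Hypothetical receipt row; field names are made up for illustration.
row = {
    "description": "Coffee",
    "quantity": -1,          # one convention: returns use a negative quantity
    "unit_price": 2.50,      # unit price stays positive...
    "is_return": True,       # ...and an explicit flag removes any ambiguity
    "tax_rate": 14.0,        # row-level tax rate, kept with the row itself
}
# Row total derived from the fields, so every system computes it the same way:
row_total = row["quantity"] * row["unit_price"]
print(row_total)  # -2.5
```

The point of a standard would be to pick exactly one such convention and stick with it.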
Is there any interest in this kind of project? I can create a community and wiki for it if there is. Even better, if there's already an existing (free & open) standard which I'm just not aware of, please let me know.
I already created a Google+ community group, e-receipt professionals SIG, for potentially interested people. I also created a little e-receipt project page for this idea; I'll be crafting it whenever I have time and receive feedback from different sources.
National situation in Finland:
In Finland, retailers are required by law to offer a receipt for a purchase transaction. An e-receipt is a viable option, but there's no standard for it. There has been some discussion that the e-receipt should be mandated by law, also making it possible for tax authorities to easily process receipts. But if there's no common standard for it, it's impossible to start using such a system.
Quotes from e-receipt SIG discussions (replies written by me, see the original questions in the SIG). I'm again explaining why the message format standard and data structure are so important:
Two problems mixed? Yes, that's why I labeled those separately as two different problems.
I'm aware that those are two separate problems and there will be two separate solutions. 1 is the harder part. I don't mean that it would be technically hard, but it needs to be very quick and reliable; nobody wants to write e-mail addresses at a coffee shop during rush hour. 2 is totally trivial. 3. One thing which limits adoption is the lack of standardization. That's why I'm especially addressing it in the SIG. If there's a commonly used standard, it's much easier to adopt. Closed POS vendor solutions just won't work; we've seen major, well funded players like Nokia completely fail to deliver a simple service like this.
Without proper standards, it's just way too much for an individual shopkeeper or entrepreneur to pay something like 5-10k for simple integration stuff like this. But that's what it easily ends up at, with everyone using their own custom data formats, requiring complex and slow (aka expensive) mapping and different transports (SFTP, FTP, HTTP, email) etc.
Btw. I've done 100+ integrations and I've got 15+ years of experience in retail, POS software, solutions and system integration. I'm also very familiar with EDI and the European e-invoice standards, which are just a bit too complex for easy adoption by most of the smaller players.
Why is the API standard and message format so important? Can't we just scan the paper receipts or grab the data going to the receipt printer?
Yes, or you could use a printer driver which forks the data stream. But those are all lame solutions, because the printed paper version no longer contains information which could be useful for automated processing. Whenever I'm doing integrations, I very carefully consider the balance of when to deplete information. Depleting information can make things look very easy and simple, until you find out that the process requires the data you happened to deplete at the very beginning of the process.
As an example, you have a receipt with 100+ rows and tons of different items. Could you please tell me the tax class or tax rate for each individual row? No? Darn. That's life. That's exactly why a proper XML / JSON standard data structure is required. Electronic receipts in vendor specific or "printout" formats simply aren't useful. There should also be a CSV format for legacy systems; many IT guys just won't believe how many systems there are out there which can't handle JSON or even XML.
Think of the situation when you're filing your travel expenses. Great, you're again playing with paper receipts, maybe photographed, maybe scanned ones. All together, some vital key information is already missing and needs to be re-entered.
Even if the EDI formats and e-invoice in Finland are pretty horrible standards, those are still a lot better than no standard at all.
That's why I'm asking for feedback and wish to form a SIG which would be able to create a standard that's freely vendor-extendable (!) yet not too complex for software developers to adopt.
The small accounting company I'm using wouldn't understand JSON / XML; they want everything by FAX.
Well, I completely agree with you. That's just the problem with current e-receipts. But I think I have quite good views on this subject, because I've been working in this field for a long time. I especially always embrace smaller players and kind of loathe standards made by big international boards. First it takes forever for them to create the standard, and when it's done, it's messy, full of compromises and so complex that it's very hard for smaller players to implement correctly, if they bother at all. There has to be an easy to use user interface for the data. Raw data is there for techies, but there also has to be a human readable standard representation of the same data. As a good example I could bring up EDIFACT.
NFC application for a store chain.
I agree with you. But the main problem is that the field is so dispersed. Without an open standard there's vendor lock-in and you'll have to use 10-20 similar apps. Think about email: what if everyone were forced to use Gmail because of a lack of federation? I know there are NFC mobile apps and such stuff, but those are practically useless: they require customers to invest in the system, they require shopkeepers to invest in the system, and the system will very likely be proprietary. Which is of course a bad thing.
Also had many more interesting discussions about this same topic from different viewpoints: spam issues, why anyone would want an e-receipt, and whether it isn't the retailer's responsibility to keep receipts. Well it is, but what's the use to you if they are kept by the retailer? That's the current norm, and usually they don't provide any way to access that data.
A few keywords: ekuitti, sähköinen kuitti, e-kuitti, e-invoice, einvoice, electronic invoicing, electronic receipt, mobile receipt, email receipt, digital receipt, digital signature, service, site, API, JSON, XML, standard, schema, system integration, point of sale, retail sector, consumers, consumer, service, web site, web service, webservice, business, entrepreneur, european, special interest group, board of professionals, system specialist, European Union, legal discussion, requirements, fiscal, tax authorities, national legislation, laws, law, EU, American, global, international, internationalization, standardization, standards, stateful, stateless, adopt, adoption, national needs, extensions, extendable, extended, REST, RESTful, HTTPS, HTTP, flat files, SMTP, multi transport, transports, connection, connectivity, federation protocol, universal, distributed, open e-receipt standard, open e-receipt initiative.
Are my comments regarding Heartbleed misinformation, or are they essential? I think much of the news coverage misses the vital points and only focuses on simple instructions which don't really cover what needs to be thought about.
- Why do they say that sites should only renew their certificate? It doesn't make any sense at all. As far as I know, sites should renew their private key, not only the certificate.
So what should be done instead of only renewing the certificate?
- Revoke the old certificate.
- Generate a new key pair (which of course includes a new private and public key).
- Create a new certificate signing request for the new public key.
- Install the new private key, public key and certificate on the server.
- Because Heartbleed allowed reading arbitrary memory, renewing all passwords and encryption keys on the server is also required: SSH keys, other account passwords, third party integration API keys, and traditional login / password authentication information. Anything that might have been in the memory of the server could have leaked. As an example, if your site has Facebook API keys, it's possible those have been leaking too, as well as cloud authentication keys / passwords for Amazon S3, Google Cloud Storage etc.
People are told to replace their passwords; why aren't they also told to replace their authentication cookies? I have seen that many sites do not properly reset authentication cookies even when you change the password. So changing the password is like changing the public key: it still leaves the possibility to abuse your account if they have the cookies, even after the password has been changed. With some sites it's actually quite hard to get that authentication cookie properly replaced; even if it's deleted, they might just place the same 'this user is logged in' cookie, even though you're using a different password now, allowing continued access for the attacker.
I think it's pretty essential to fix all the related deeper issues, and not only focus on the very shallow surface layer of the problem.
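The cookie problem above boils down to server-side session handling. A minimal sketch of doing it right (my own illustration, not any specific framework's API): on password change, every existing session token for the user is invalidated, so old cookies stop working.

```python
import secrets

sessions = {}  # token -> user id; a server-side session store

def login(user_id):
    # A fresh random token per login; this value becomes the cookie.
    token = secrets.token_urlsafe(32)
    sessions[token] = user_id
    return token

def change_password(user_id):
    # Invalidate every existing session for this user; any stolen
    # cookie is now useless and the browser must log in again.
    for token in [t for t, uid in sessions.items() if uid == user_id]:
        del sessions[token]

old = login("alice")
change_password("alice")
print(old in sessions)  # False
```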
- Amazon Dash - quite an interesting solution. But one thing hit me first and hard: why, oh why a separate device? Why not a smartphone app? That hardware device doesn't bring any additional value, except a slightly better barcode reader. With an app, all family members could have it installed with a shared account etc.
- IPv6 email - at first it would be easy to say that nothing is different, but as usual, the topic is deep. It's not the email itself; it's actually the abuse prevention which is complex.
- For people who claim that the cloud is a bad solution, I just want to ask one question: do their companies have their own water and electricity plants? And if not, why? Do they really trust public utilities? Can they feed their workforce if public power, food and water supplies fail? Should they? Well, of course the military has its own solutions for these issues. But most businesses, even larger ones, do not. So what's the fuss about having your own data centers if you can't power them or keep your workforce alive?
- Enjoyed one major server outage due to networking issues. It was interesting to follow how things escalated. Luckily nobody dies if these systems fail a bit. This is also exactly why I run a secondary monitoring system in a different data center than the primary systems. This outage was so serious that it took down both the servers and the monitoring system. Only the secondary monitoring system was able to alert me that the primary system wasn't working. The primary system of course noticed that connectivity was lost, but it couldn't send out any alerts because of that very connectivity loss.
- Studied e-receipt and bonus / loyalty system solutions: why some of those have been successful and why most have failed. What are the key ingredients for success? What does it take to be successful in this sector? Currently there aren't any widely used solutions. I'm still wondering why this stuff is taking off so slowly. It's clear that this has been coming for several years, but even many major chains haven't adopted it, even though they already have everything needed to make it reality. They have all the receipts, they have loyalty card information, they have customer information and most importantly their contact information, web service portals and everything. They also have receipt line level data for additional analysis, so allowing customers to access their own data shouldn't be that hard after all.
- Checked out Project Loon & Facebook's drones; it's hard to say which one is better. Generally I would vote for Facebook's drones, because I assume they are much easier to control, and if built correctly with durable electric engines and batteries, the hardware shouldn't require much more service than the balloons.
- Read: MariaDB 10, Pyston (JIT Python), an HFT guide, a nice post about REST, upselling, cross-selling.
- Total database performance killer from one "great" developer. He decided to use WHERE key LIKE '%input%'. Of course this is a solution which should never be used in production. It's fine if you have a small database and you're testing or looking for something, but in production it's a real killer. Now that command is run on a table containing a few million records, on a field which contains about 100 B - 100 kB of text per record. Even with a modest estimate, each request requires scanning through at least 2 gigabytes of data. Just great. It triggers a massive disk I/O operation every time and also consumes a ton of CPU cycles. Anyone using the web application can trivially launch a denial of service attack just by sending a few queries to that specific URL. But of course there is a rate limit or request limit per IP? Well, of course there isn't; that wouldn't be any fun at all. Actually, doing a little load testing would be interesting, just to see what happens. Let's push 100 requests per second to a query which takes minutes to complete.
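The reason a leading wildcard is so deadly is that it defeats any index, forcing a full scan. A quick sketch with SQLite's EXPLAIN QUERY PLAN makes this visible (table and column names are made up; the post doesn't name the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (key TEXT)")
conn.execute("CREATE INDEX idx_key ON docs(key)")

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output is the plan detail.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Leading wildcard: the index can't narrow anything, every row is read.
print(plan("SELECT * FROM docs WHERE key LIKE '%input%'"))
# Something like: SCAN docs

# A literal prefix (with case_sensitive_like) becomes an index range scan.
conn.execute("PRAGMA case_sensitive_like=ON")
print(plan("SELECT * FROM docs WHERE key LIKE 'input%'"))
# Something like: SEARCH docs USING COVERING INDEX idx_key (key>? AND key<?)
```

For real substring search over large text, SQLite's FTS extension (or a dedicated search engine) is the appropriate tool, not LIKE.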
- Yet another performance killer and design flaw. One huge table with hundreds of millions of records has a column named status. Great, makes sense. Records stored in the table initially get status=0, and records which are processed further get status=1. But now there are a few record types which don't require any processing. Guess what: the process which handles those records naturally skips the ones that don't require processing. Well, does it update the processed flag? Ha, of course not. It re-checks those records over and over again, skipping them every time. Another great question is why status is a fully indexed field; I would naturally have indexed it partially, only for status=0. If things were done smartly, there would only be a very small number of records with status=0, and the index would be tiny. Now the index contains all records. Of course the inefficient status flag handling I described earlier also grows the index and makes processing slower.
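The partial index idea sketched in SQLite syntax (the schema is made up; partial indexes need SQLite 3.8.0+, and e.g. PostgreSQL supports the same construct):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, status INTEGER)")
# Index only the unprocessed rows; processed rows never enter the index.
conn.execute("CREATE INDEX idx_pending ON records(status) WHERE status = 0")
conn.executemany("INSERT INTO records (status) VALUES (?)",
                 [(1,)] * 1000 + [(0,)] * 3)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM records WHERE status = 0"
).fetchall()[0][-1]
print(plan)  # the planner picks the tiny partial index idx_pending
```

The index stays proportional to the backlog (here 3 rows) instead of the table (here 1003), which is exactly what the polling query needs.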
- I/O Stack Optimizations - an absolutely excellent article. It also covers many different file systems, ext4, Btrfs, XFS and F2FS, and of course compares those with the different SQLite3 journaling modes (Delete, Truncate, Persistent, Write Ahead Logging). I loved it. Perfect reading if you want to know how these fundamentals affect your application / system performance, and why journaling a journal isn't always a good idea.
- FT: Big Data: Are we making a big mistake? Actually this article was quite good. But another article, about data analytics, was even better. The FT piece only covers what unconfirmed data can lead to, while the other article actually told many details about the hard parts. Here it is: Data analysis hard parts. What they say about data analysis isn't anything new at all. I'm used to making many kinds of integrations, like invoicing and bookkeeping, which are basically data analytics. It at times requires going through massive amounts of data, hundreds of millions of records, and processing all of it precisely. If your analytics is wrong, nobody might notice. But if your bookkeeping is off by 100k, I'm quite sure someone will notice at some point.
Unfortunately most customers think that when they order something, it will be perfect and fulfill their needs (whatever those are) without any input from them. When I'm building an integration or analytics from data, I can be quite happy with any reasonable numbers; it doesn't matter to me if their profit is 30% lower or higher. This is the key reason why I always want any analysis to be confirmed by the customer's own people, who know the data and work with it daily. They can immediately tell if something is seriously wrong.
- Why does IaaS rule? Because multi-cloud comparison is nearly impossible otherwise: there are so many factors you can't reasonably control. The simplest way to do a multi-cloud comparison is to use a pure IaaS setup with Linux boxes, which is much more easily comparable. If you take per-vendor PaaS features into account, comparison becomes very hard. It might also turn out that all your developers know technology X while the 'perfect solution' is built on technology Y, so you can't really use it. This is a very complex topic.
A definite and absolute pro of a pure IaaS/Linux solution is that you can easily move your servers almost anywhere: your home box, a local data center, dedicated servers at the customer, the customer's private cloud, nearly any major public cloud, or some small national hosting company. Even a Raspberry Pi will do, if that's what's required.
There's also the clear benefit that you don't need to pay the 'Microsoft tax' of expensive licensing.
- Studied mod_wsgi (Apache2) in detail, read all the configuration parameters, etc. I had to, because I ran into issues configuring Apache2 to run Python 3 WSGI scripts. I still failed, but that doesn't matter for now.
- A very nice post about SaaS pricing in the form of Fantasy Tarsnap by Patrick (patio11 / Kalzumeus). If you sell cheaply, it's hard to make a profit. But there are also risks in rip-off pricing, because someone might just build a competitor, especially if the software is fairly simple and doesn't require very specialized knowledge, like the Tarsnap backup solution.
- Something totally different again, except this is still engineering: Kingdom Tower, The Illinois.
I had to write one RESTful(ish) system integration API.
First of all, I wanted to make it fully RESTful, according to REST principles. But because the data structures and the other systems being integrated unfortunately weren't designed with REST in mind at all, I had to abandon that plan at a very early stage. So yes, it's possible to update some limited data sets in a fully RESTful fashion, but for most of the data it's not possible, for efficiency reasons.
Here are the short notes:
- Generic data structure support for all database tables in the system.
- Access can be restricted using user rights. That we expose only a very small set of features is a policy choice, not a technical limitation.
- Support for an unlimited number of remote systems. Some other APIs are fundamentally designed to support only one integration point. That's a very limited approach by today's standards. As far as I have seen, many systems need to be integrated with several other systems in parallel, so supporting only one integration point just isn't going to get the job done.
- Full support for multi-tenant configurations. Some other interfaces are designed so that they either don't support multi-tenancy at all or do it in an insecure fashion.
- Fetch data updates from the remote system (PULL), or send data updates to the other side (PUSH). In the smaller customer segment, getting even something simple like a static IP can be problematic, so I had to rule out early on any design where data is always pushed from the source to the recipient in near real time. In some cases that works great and is the optimal solution, but in others it just doesn't. My design allows data to be pushed or pulled at either end, whichever is the best solution in each case. All methods can be used simultaneously and freely mixed: the same data can be sent to system A in near real time, while system B receives exactly the same data in daily batches, in a different format and over a different transport.
- Data selection by timestamp, record-id or last-poll-id. This was one of the reasons to abandon pure REST at a very early stage. In the source database there are tons of records that can change, with absolutely nothing indicating that they have changed. With pure REST this basically means there is no option but to fetch the full list of records (possibly millions) and then check each record remotely, one at a time. As we all know, that's not going to work. Therefore I introduced change monitoring based on client id, last poll id and hashes of the records last sent to that client. When the list of changed records is fetched, it is filtered locally so that only changed records are sent. I know this isn't the most efficient way to do it, but with the current data source design there's no other way to get it done. Some tables are clearly append-only, and there it's possible to just fetch records N+1 onward until hitting the current tail. Some tables contain timestamps, so it's possible to get data 'since' some point. Of course the most commonly polled, updated and processed tables have timestamps and/or sequential record-ids. But there are a lot of tables with no indexed indication of potentially changed data, and no log table from which you could fetch this information indirectly. Hash-based change checking is usually done during the night, because it is a heavy operation.
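The per-client hash comparison above can be sketched in a few lines. This is a hypothetical, simplified version (the function and field names are mine, not the real API): hash a stable serialization of each record, remember the hash last sent to the client, and emit only records whose hash differs.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Stable serialization so identical content always hashes the same.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_records(records, last_sent_hashes):
    """Yield (record, new_hash) for records that differ from what this
    client last received; last_sent_hashes maps record id -> hash."""
    for rec in records:
        h = record_hash(rec)
        if last_sent_hashes.get(rec["id"]) != h:
            yield rec, h

# Record 1 is unchanged since the last poll, record 2 is new/changed.
rows = [{"id": 1, "total": 100}, {"id": 2, "total": 250}]
seen = {1: record_hash({"id": 1, "total": 100})}
changed = [rec["id"] for rec, _ in changed_records(rows, seen)]
print(changed)  # [2]
```

In the real system the `last_sent_hashes` map would of course live in the change-logging database keyed by client id, which is exactly the per-client server-side state the later notes call un-RESTful.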
- The last poll id can be maintained remotely or locally. This was an important design factor for methods which don't actually receive any kind of request but are triggered locally by a timer or some other criterion.
- Support for all data formats: JSON / XML / CSV. It's nice to say that we support JSON, but unfortunately there are legacy systems that don't support anything other than CSV. Don't get me wrong: I think CSV is one of the most useful formats ever. Even if Excel's CSV support is horrible, it's still a great tool for so many things. My base design goal was to be able to provide the same data in any format. If you want it as a props file, I can arrange that too; it's just reformatting the same stuff in a different package. My goal is to make the system as versatile as possible, so we can choose whatever data format and transport suits the other side. I've seen way too many APIs which are really complex and almost impossible to implement with some legacy ERP / bookkeeping systems. And as I said, HTTP / JSON - RESTful is a perfect example of that: I'd like to see how your 15-year-old software product easily digests that data and makes the required calls.
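The "same data, any format" goal is straightforward when the format step is kept at the edge. A minimal sketch using only the standard library (the flat-record shape is an assumption for illustration):

```python
import csv
import io
import json
from xml.etree import ElementTree as ET

def to_json(rows):
    # JSON for the modern HTTP integrations.
    return json.dumps(rows)

def to_csv(rows):
    # CSV for the legacy systems that accept nothing else.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_xml(rows, root="records"):
    # XML for the middle ground.
    top = ET.Element(root)
    for row in rows:
        rec = ET.SubElement(top, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = str(value)
    return ET.tostring(top, encoding="unicode")

rows = [{"id": "1", "total": "100"}]
print(to_json(rows))
print(to_csv(rows))
print(to_xml(rows))
```

The point is that the selection and change-detection logic produces one neutral record structure, and the formatter (and transport) is chosen per recipient.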
- Support for several transfer modes: FTP, SFTP, FTPS, EMAIL (IMAP / SMTP), HTTP, HTTPS, local files and UNC paths. Yes, that's a basic part of the required versatility: I'll transfer the data using whatever transport suits the recipient. This also makes it clear why local state about what has been processed simply had to be stored, even if that's not right according to REST principles.
- Scheduled data transfers to multiple destinations. Obviously, if no incoming request triggers the data to be sent, it has to be triggered some other way, most often a simple scheduler. Usual run intervals are anything from 5 minutes to once per month. You wouldn't believe it, but some integrators still prefer monthly data transfers. Even daily transfers are way too frequent for them and deliver too much data; they want it aggregated at the monthly level.
- Simple and easy-to-use user interface. A simple web UI for configuring what is sent, when, where and in which format, as well as setting credentials for remote requests.
- Parallel, transactional writes using optimistic concurrency control, maximizing parallel write performance without hitting database-reserved errors.
- Support for parallel multi-record (different tables) write transactions. Often it's better to have a single standard data structure, but in this case multi-table updates were required for versatility. As an example, say you want to update three separate records in different tables, with different data structures, transactionally. How can you be sure that your change won't get overwritten by other concurrent writes? This is how it's done. First you read all three records; with each record you get its etag (see: Entity Tag), which is basically a hash of the whole record content. As a side note, the same hash function is also used locally to monitor changes. After you have received all three records and made your modifications locally, you push them back as a special multi-data-type transaction message: you combine all records into one larger message, including only the modified fields, and each record being updated must contain the original etag. When you then send the command to commit this multi-write transaction, you get either a success or a failure response. When the message is received, a transaction is started internally and the records are checked for changes using the etags; if all of them match, the records are updated and committed. If anything goes wrong with this local transaction, you get a failure message. If the etags didn't match because record(s) changed in the meantime, you get information about which records changed, so you can fetch those records, re-apply your modifications, and commit again with a new multi-write transaction. If you don't get any response at all, you can retry the same transaction later.
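The etag check-then-apply flow above can be sketched as follows. This is an illustrative toy (a dict stands in for the database, and the function names are mine): verify every record's etag first, and only if all of them still match, apply all updates.

```python
import hashlib
import json

# Toy "database": (table, key) -> record. Etag = hash of record content,
# the same kind of hash used for change monitoring.
db = {
    ("orders", 1): {"status": "open"},
    ("order_rows", 7): {"qty": 2},
}

def etag(record: dict) -> str:
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()

def commit(changes):
    """changes: list of (table, key, expected_etag, modified_fields).
    All-or-nothing: first verify every etag, then apply every update."""
    for table, key, expected, _ in changes:
        if etag(db[(table, key)]) != expected:
            # Conflict: this record changed since the client read it.
            return False, (table, key)
    for table, key, _, fields in changes:
        db[(table, key)].update(fields)  # only modified fields are sent
    return True, None

# Client read both records, got their etags, and now commits both at once.
tx = [
    ("orders", 1, etag(db[("orders", 1)]), {"status": "closed"}),
    ("order_rows", 7, etag(db[("order_rows", 7)]), {"qty": 3}),
]
ok, conflict = commit(tx)
print(ok, db[("orders", 1)]["status"])  # True closed
```

A real implementation would wrap the apply phase in an actual database transaction; the sketch only shows the etag logic that prevents lost updates.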
- The same transaction ID can be a success or a failure exactly once. This is because I have unfortunately seen so many systems which are badly broken at the transaction level, leading to situations where transactions are lost, committed multiple times, committed partially, or corrupted in imaginative ways, with very interesting consequences. This time I wanted to be absolutely sure to avoid that problem. It's impossible to overwrite changes by accident: if you modified an old version of a record, you can't write it back. It's also impossible for transactions to get lost (you can freely retry until success), and impossible for them to be partially committed, because each data segment of a transaction is separately confirmed and the final commit is also explicitly confirmed. I just wish more systems worked this way.
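The exactly-once rule for transaction IDs is essentially an idempotency store. A hedged sketch (my own naming, not the system's): remember each transaction ID's final outcome, and replay the stored outcome on retry instead of re-applying the write.

```python
# Outcome store: transaction id -> final result ("success"/"failure").
# In a real system this would be a durable table, not a dict.
outcomes = {}

def apply_once(tx_id, do_commit):
    if tx_id in outcomes:
        # Retry of an already-decided transaction: replay the stored
        # outcome, never re-apply the write.
        return outcomes[tx_id]
    result = do_commit()
    outcomes[tx_id] = result  # the outcome for this id is now fixed
    return result

calls = []

def work():
    calls.append(1)
    return "success"

# The client can safely retry after a lost response: the commit runs once.
print(apply_once("tx-42", work), apply_once("tx-42", work), len(calls))
# success success 1
```

This is what makes the "if you don't get any response, retry later" advice in the previous point safe.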
- Support for field-based deltas. This is very un-RESTful, but it's just so common that many systems require delta data. Maintaining the information needed to deliver delta data at this scale uses a lot of disk space and resources. It's possible to request deltas for selected columns and tables. The fact is that there are just so many systems which can't compute deltas locally and expect updates in delta format: if the value was 1000 yesterday and is 1100 now, they don't want to receive 1100, they want only the 100. That means the change-logging database has to store the last value transmitted to the recipient for the selected fields, keys and tables. When using HTTP, this delta checking is subject to the same etag / hash checking; when using stateless methods like files, all of the data of course has to be stored locally.
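The 1000 → 1100 example works out like this. A minimal sketch (the key structure is an assumption): the change log keeps the last value transmitted per (table, key, field), and each run sends only the difference.

```python
# Last value transmitted to this recipient, keyed by (table, key, field).
# In a real system this lives in the change-logging database, per client.
last_sent = {("balances", "acct-1", "total"): 1000}

def delta_update(table, key, field, current):
    """Return the delta since the last transmission and record the
    current value as the new baseline for this recipient."""
    prev = last_sent.get((table, key, field), 0)
    delta = current - prev
    last_sent[(table, key, field)] = current
    return delta

# Yesterday 1000, today 1100: the recipient gets 100, not 1100.
print(delta_update("balances", "acct-1", "total", 1100))  # 100
```

The cost is visible even in the sketch: one stored baseline per recipient, per field, per key, which is exactly why this eats disk space at scale.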
- The system had to be designed for efficient operation with over 700 different tables and millions of records per table. This is also one of the reasons why pure REST with pagination, key-id fetches and table scans wasn't a feasible option.
- The integration package contains a standalone HTTP server; the scripts can also be run using WSGI or CGI, which is preferred if HTTPS is required.
- The integration package doesn't contain integrated FTP, SFTP, FTPS, MTA (IMAP / SMTP) or SAMBA server features in the application itself, but I do have client code for all of these methods, so I can fetch data over any of them on a schedule. Thanks to local file / UNC path support, any required servers can be run separately. And just out of curiosity: yes, there are also integrations using SMTP as the transport. It's as good a way to transfer data as any other.
- A final note: receiving data and storing it into tables is seriously restricted on purpose. Many of the tables have a very intricate internal structure, so writing isn't allowed for 3rd parties unless specifically agreed and tested. Building proper data validation for such an extensive dataset was absolutely out of scope.
- I personally love a can-do attitude and versatility. Overly complex, exclusive, over-engineered and limited data transfer solutions just aren't my way of doing things, because ignoring simplicity often means a lot of work on the other side to meet the requirements you have set. It's a bit like renting a Ferrari to your customer, without wheels and fuel: much better than any other car you could find at other rental shops, yet it requires you to provide your own wheels and fuel.
- I could write a lot about file naming, transferring files securely over different transports, etc., especially about avoiding situations where partially transferred files get processed while still in transit. But I have already written something about that here, even if it's a very compact section in a long post.
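One common convention for the partial-transfer problem (an assumption on my part, the post doesn't specify its scheme) is to write under a temporary name and rename only when the file is complete, so the receiver never picks up a half-transferred file:

```python
import os
import tempfile

def publish(directory, name, data: bytes):
    """Write data under a temporary *.part name, then rename to the
    final name only once the content is fully on disk."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes hit the disk first
    final = os.path.join(directory, name)
    os.replace(tmp_path, final)  # atomic on the same filesystem
    return final

def ready_files(directory):
    # The receiver simply ignores anything still named *.part.
    return sorted(n for n in os.listdir(directory) if not n.endswith(".part"))

d = tempfile.mkdtemp()
publish(d, "batch.csv", b"id;total\n1;100\n")
print(ready_files(d))  # ['batch.csv']
```

Over FTP the same idea works with a rename after upload; the key point is that the final name only ever refers to a complete file.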
This post is quite random, I wrote it in a hurry, but I hope it clarifies some of the choices I have made with integrations.
If you have any ideas, suggestions or questions, I'm really happy and willing to talk about them. Just for fun, check out these fails
Added a bit later, my continued thoughts:
These are additional comments based on some questions I have received about this post.
- Last-Poll-Id means that a list of changes since that poll ID
needs to be kept on the server side. So it's against pure REST,
where no client state is maintained on the server.
- About using CSV: keep it simple, and it works.
- The same (as 1.) also applies to the delta records. Basically it
requires maintaining per-client data locally which can be used to
derive the delta, making the system stateful per customer. Most pure
REST solutions shouldn't maintain state per client. But these are hard
optimization questions: should I force the client to pull 5 million
records just to see if one of them has changed, or could I just say
that 5 of them have changed and send a list of those, because I
maintain that data locally? Things could be much cleaner and simpler
if the original database had been designed with integrations in mind.
Because that's not the case, complex measures are required at this
level. This is also the reason for 1. and 4.: I simply can't tell
whether something has changed since X without maintaining detailed
per-client records. One option could be some kind of Merkle tree
solution for finding out what has changed, but that would flood the
system with tons of queries and a huge number of round trips.
As a compromise between simplicity and generality, it might after all
be necessary to build separate message formats for the most common
transactions involving data from multiple sources, for two reasons:
making things both simpler and more efficient on the integrator's side.