posted Apr 5, 2014, 6:04 AM by Sami Lehtinen
updated Jun 25, 2014, 12:02 AM
I had to write a RESTful(ish) system integration API.
First of all, I wanted to make it fully RESTful, following REST principles. But because the data structures and the other systems being integrated unfortunately weren't designed with REST in mind at all, I had to abandon that plan at a very early stage. So yes, it's possible to update some limited data sets in a fully RESTful fashion, but for most of the data it's not possible, for efficiency reasons.
Here are the short notes:
- Generic data structure support for all database tables in the system.
- Access can be restricted using user rights, but that is a policy choice, not a technical limitation; it's not that we only support a very small set of features.
- Support for an unlimited number of remote systems. Some other APIs are fundamentally designed to support only one integration point. That's a very limited approach by today's standards. As far as I have seen, many systems need to be integrated with several other systems in parallel, so supporting only one integration point simply isn't going to get the job done.
- Full support for multi-tenant configurations. Some other interfaces have been designed so that they either don't support multi-tenancy at all or implement it in an insecure fashion.
- Fetch data updates from the remote system (PULL), or send data updates to the other side (PUSH). In the smaller customer segment, getting something as simple as a static IP can be problematic, so I immediately had to rule out designs where data is always pushed from the data source to the recipient in near real time. In some cases that works great and is the optimal solution; in others it just doesn't. My design allows data to be pushed or pulled on either end, whatever is the best solution in each case. All methods can be mixed simultaneously and freely: the same data can be sent to system A in near real time while system B receives exactly the same data in daily batches, in a different format and over a different transport.
- Data selection by timestamp, record-id or last-poll-id. This was one of the reasons to abandon REST at a very early stage. In the source database there are tons of records that can change, and absolutely nothing indicates that a record has changed. With pure REST this basically means there's no option other than fetching the full list of records (possibly millions) and then checking each record remotely, one at a time. As we all know, that's not going to work. Therefore I introduced change monitoring based on the client id, the last poll id, and hashes of the records last sent to that client. When the list of changed records is fetched, it is filtered locally so that only changed records are sent. I know this isn't the most efficient way to do it, but with the current data source design there's no other way to get it done. Some tables are clearly append-only, and there it's possible to simply fetch records N+1 onward until hitting the current tail. Some tables contain timestamps, so it's possible to fetch data 'since' a given point. Of course the most commonly polled, updated and processed tables have timestamps and/or sequential record-ids, but there are a lot of tables which contain no indexed indication of potentially changed data, nor is there a log table from which this information could be fetched indirectly. The hash-based change checking is usually done during the night, because it is a heavy operation.
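The hash-based change monitoring described above can be sketched roughly like this. This is a minimal Python illustration, not the actual implementation: the `record_hash` function, the record layout and the in-memory per-client state dict are all my own assumptions.

```python
import hashlib
import json

def record_hash(record):
    """Stable hash of the whole record content (field order normalized)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_records(records, last_sent_hashes):
    """Yield only records whose hash differs from the one stored for
    this client's previous poll, and remember the new hashes."""
    for rec in records:
        h = record_hash(rec)
        if last_sent_hashes.get(rec["id"]) != h:
            last_sent_hashes[rec["id"]] = h
            yield rec

# Hypothetical per-client state and example rows.
state = {}
rows = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
first = list(changed_records(rows, state))   # both records are new
rows[0]["qty"] = 6
second = list(changed_records(rows, state))  # only record 1 changed
```

Because nothing in the source tables marks a change, every candidate record has to be hashed on each scan, which is why this runs as a nightly batch.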
- The last poll id can be maintained remotely or locally. This was an important design factor for methods that don't actually receive any kind of request but are triggered locally by a timer or some other criterion.
- Support for all data formats: JSON / XML / CSV. It's nice to say that we support JSON, but unfortunately there are legacy systems that don't support anything other than CSV. Don't get me wrong: I think CSV is one of the most useful formats ever. Even if Excel's CSV support is horrible, CSV is still a great tool for so many things. My base design goal was to be able to provide the same data in any format. If you want it as a props file, I can arrange that too; it's just reformatting the same stuff into a different package. My goal is to make the system as versatile as possible, so that we can choose whatever data format and transport suits the other side. I've seen way too many APIs which are really complex and almost impossible to implement with some legacy ERP / bookkeeping systems. And as I said, HTTP / JSON RESTful is a perfect example of that: I'd love to see how your 15-year-old software product easily digests that data and makes the required calls.
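Rendering the same rows in different formats really is just repackaging, as the text says. Here is a minimal standard-library sketch; the row layout is hypothetical:

```python
import csv
import io
import json

def to_json(rows):
    """Render rows as a JSON array."""
    return json.dumps(rows, sort_keys=True)

def to_csv(rows):
    """Render the same rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [{"id": 1, "name": "widget"}, {"id": 2, "name": "gadget"}]
```

An XML or props renderer would follow the same pattern: one internal row structure, one small serializer per output format.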
- Support for several transfer modes: FTP, SFTP, FTPS, email (IMAP / SMTP), HTTP, HTTPS, local files and UNC paths. Yes, that's a basic part of the required versatility: I'll transfer the data using whatever transport suits the recipient. This also makes it clear why local state about what has been processed simply had to be stored, even if that's not right according to REST principles.
- Scheduled data transfers to multiple destinations. That's obvious: if no request is made to fetch the data, sending it has to be triggered some other way, most often by a simple scheduler. Usual run intervals are anything from 5 minutes to once per month. Yes, you wouldn't believe it, but some integrators still prefer monthly data transfers. Even daily transfers are way too frequent for them and deliver too much data; they want it aggregated at the monthly level.
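A scheduler of the kind described could look roughly like this; a simplified sketch in which the job names, intervals and the `run_scheduled` function are purely illustrative:

```python
import time

def run_scheduled(jobs, now=None):
    """Run every job whose interval has elapsed since its last run."""
    now = time.time() if now is None else now
    ran = []
    for job in jobs:
        if now - job["last_run"] >= job["interval_s"]:
            job["func"]()              # e.g. export and transfer a data set
            job["last_run"] = now
            ran.append(job["name"])
    return ran

# Hypothetical jobs; one 5-minute interval, one monthly.
jobs = [
    {"name": "orders_5min", "interval_s": 300, "last_run": 0, "func": lambda: None},
    {"name": "monthly_report", "interval_s": 30 * 86400, "last_run": 0, "func": lambda: None},
]
ran = run_scheduled(jobs, now=600)  # only the 5-minute job is due
```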
- Simple and easy-to-use user interface. A simple web UI for configuring what is sent, when, where and in which format, as well as for setting credentials for remote requests.
- Parallel writes handled transactionally using optimistic concurrency control, maximizing parallel write performance without any 'database reserved' errors.
- Support for parallel multi-record (different tables) write transactions. Often it's better to have a single standard data structure, but in this case multi-table updates were required for versatility. Say you want to update three separate records in different tables, with different data structures, transactionally. How can you be sure your change won't be overwritten by other concurrent writes? This is exactly how it's done. First you read all three records; with each record you get its etag (see: Entity Tag), which is basically a hash of the whole record content. (As a side note, this same hash function is also used locally to monitor changes.) After you have received all three records and made your modifications locally, you want to push them back. Now you send a special multi-record transaction message, combining all the records into one larger message; you only need to include the modified fields. Each record being updated must contain the original etag. When you then send the command to commit this multi-write transaction, you get either a success or a failure response. When the message is received, a transaction is started internally and the records are checked for changes using the etags; if all etags match, the records are updated and committed. If anything goes wrong with this local transaction, you get a failure message. If the etags didn't match because record(s) were changed in the meantime, you get information about which records changed, so you can fetch those records, re-apply your modifications and commit again with a new multi-write transaction. If you don't get any response, you may retry the same transaction later.
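The etag-checked multi-write commit described above might be sketched like this. This is a simplified in-memory illustration: the `db` dict, the update-message layout and the function names are my assumptions, not the real system.

```python
import hashlib
import json

def etag(record):
    """Entity tag: a hash of the whole record content."""
    return hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()

def commit_multi(db, updates):
    """Commit a multi-table write transaction atomically.
    Each update carries the etag the client originally read."""
    # Phase 1: verify every etag before touching anything.
    stale = [(u["table"], u["key"]) for u in updates
             if etag(db[u["table"]][u["key"]]) != u["etag"]]
    if stale:
        return {"status": "failure", "changed": stale}
    # Phase 2: all etags matched, apply only the modified fields.
    for u in updates:
        db[u["table"]][u["key"]].update(u["fields"])
    return {"status": "success"}

# Hypothetical two-table store; read, modify, then commit both records.
db = {"orders": {1: {"qty": 5}}, "stock": {1: {"level": 10}}}
t_order, t_stock = etag(db["orders"][1]), etag(db["stock"][1])
result = commit_multi(db, [
    {"table": "orders", "key": 1, "etag": t_order, "fields": {"qty": 4}},
    {"table": "stock", "key": 1, "etag": t_stock, "fields": {"level": 11}},
])
# Retrying with the now-stale order etag reports which record changed.
retry = commit_multi(db, [
    {"table": "orders", "key": 1, "etag": t_order, "fields": {"qty": 3}},
])
```

The failure response naming the stale records is what lets the client re-fetch just those records and retry, instead of restarting the whole exchange.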
- The same transaction ID can resolve to success or failure exactly once. This is because I have unfortunately seen so many systems that are badly broken at the transaction level, leading to situations where transactions are lost, applied multiple times, committed partially, or corrupted in very imaginative ways, producing very interesting situations. This time I wanted to be absolutely sure to avoid that problem. So it's impossible to overwrite changes by accident: if you modified an old version of a record, you can't write it back. It's also impossible for transactions to get lost (you can freely retry until success), and it's impossible for them to be partially committed, because each data segment of a transaction is separately confirmed and the final commit is also explicitly confirmed. I just wish many other systems worked similarly.
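The exactly-once outcome per transaction ID could be modelled along these lines; a minimal sketch in which the outcome store and the `apply_once` name are hypothetical:

```python
def apply_once(outcomes, tx_id, apply_func):
    """Resolve each transaction ID to success or failure exactly once;
    retries with the same ID replay the stored outcome instead of
    applying the change again."""
    if tx_id in outcomes:
        return outcomes[tx_id]
    try:
        result = {"status": "success", "value": apply_func()}
    except Exception as exc:
        result = {"status": "failure", "error": str(exc)}
    outcomes[tx_id] = result
    return result

outcomes = {}
applied = []  # side-effect target, to show the change happens only once

first = apply_once(outcomes, "tx-42", lambda: applied.append("write") or len(applied))
retry = apply_once(outcomes, "tx-42", lambda: applied.append("write") or len(applied))
```

This is what makes free retries safe: a duplicate commit can never apply the change a second time, it can only replay the original answer.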
- Support for field-based deltas. This is very un-RESTful, but it's just so common that many systems require delta data. Maintaining the information needed to deliver delta data at this scale uses a lot of disk space and resources. It's possible to request deltas for certain columns and tables. The fact is that there are just so many systems which can't compute a delta locally and expect updates to arrive in delta format: if the value was 1000 yesterday and is 1100 now, they don't want to receive 1100, they want to get only the 100. This means the change-logging database has to store the last value transmitted to each recipient for the selected fields, keys and tables. When using HTTP, this delta checking is subject to the same etag / hash checking; when using stateless methods like files, all of the data of course has to be stored locally.
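The 1000 → 1100 → "send 100" behaviour boils down to remembering the last value transmitted to each recipient; a minimal sketch, with hypothetical field names and state layout:

```python
def field_delta(last_sent, current, delta_fields):
    """Return per-field differences against the values last transmitted
    to this recipient, then remember the new values."""
    out = {}
    for field in delta_fields:
        out[field] = current[field] - last_sent.get(field, 0)
        last_sent[field] = current[field]
    return out

# Hypothetical per-recipient state: 1000 was sent yesterday, now it's 1100.
state = {}
first = field_delta(state, {"balance": 1000}, ["balance"])
second = field_delta(state, {"balance": 1100}, ["balance"])
```

The `last_sent` dict is exactly the per-recipient state the post complains about having to keep: multiply it by fields, keys, tables and recipients and the disk-space cost becomes clear.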
- Had to be designed for efficient operation, with over 700 different tables and millions of records per table. This is also one of the reasons why pure REST with pagination, per-key fetches and table scans wasn't a feasible option.
- The integration package contains a standalone HTTP server; the scripts can also be run under WSGI or CGI, which is preferred if HTTPS is required.
- The integration package doesn't include FTP, SFTP, FTPS, MTA (IMAP / SMTP) or Samba server features in the application itself, but I do have client code for all of those methods, so I can fetch data using any of them on a schedule. Thanks to local file / UNC path support, it's possible to run the required servers separately. And just as a curiosity: yes, there are also integrations using SMTP as transport. It's just as good a way to transfer data as any other.
- Final note. Receiving data and storing it into tables is seriously restricted on purpose. That's because many of the tables have a very intricate internal structure, and therefore writing isn't allowed for third parties unless specifically agreed and tested. Building proper data validation for such an extensive dataset was absolutely out of scope.
- I personally love a can-do attitude and versatility. Overly complex, exclusive, over-engineered and limited data transfer solutions just aren't my way of doing things. Ignoring simplicity often means there's a lot of work to be done on the other side to make their system meet the requirements you have set. It's a bit like renting a Ferrari to your customer without wheels and fuel: it may be much better than any other car you could find at other rental shops, yet it requires the customer to provide their own wheels and fuel.
- I could write a lot about file naming, transferring files securely over different transports, and especially avoiding situations where partially transferred files get processed while still in transit, etc. But I have already written something about that here, even if it's a very compact section in a long post.
This post is quite random, written in a hurry, but I hope it clarifies some of the choices I have made with integrations.
If you have any ideas, suggestions or questions, I'm really happy and willing to talk about them. Just for fun, check out these fails.
Added a bit later, my continued thoughts:
These are additional comments based on some questions I have received about my post.
- Last-Poll-Id means that the list of changes since that poll ID needs to be kept on the server side. So it's against pure REST, where no client state is maintained on the server.
- About using CSV. Keep it simple, and it works.
- The same (as 1.) also applies to the delta records. Delivering deltas basically requires maintaining per-client data locally from which the delta can be derived, making the system stateful per customer. A pure REST solution shouldn't maintain state per client, but these are hard optimization questions: should I force the client to pull 5 million records just to see whether one of them has changed, or could I just tell them that 5 of those have changed and send the list, if I maintain that data locally? After all, things could be much cleaner and simpler if the original database had been designed with integrations in mind. Because that's not the case, complex measures are required at this level. This is also the reason for 1. and 4.: I simply can't tell whether something has changed since X without maintaining detailed per-client records. One alternative could be some kind of Merkle tree solution for finding out what has changed, but that would flood the system with tons of queries and a huge number of roundtrips.
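For what it's worth, one level of such a Merkle-style scheme (a hash over buckets of per-record hashes, compared before drilling down) could look like this; a hypothetical sketch that also hints at why the roundtrips add up, since every differing bucket needs a further exchange:

```python
import hashlib

def bucket_hashes(record_hashes, bucket_size=1000):
    """One Merkle-style level: hash each bucket of per-record hashes.
    Comparing bucket hashes first lets both sides skip unchanged
    buckets, but each differing bucket still costs another roundtrip
    to drill into the individual records."""
    buckets = {}
    for rec_id in sorted(record_hashes):
        b = rec_id // bucket_size
        buckets.setdefault(b, hashlib.sha256())
        buckets[b].update(record_hashes[rec_id].encode("utf-8"))
    return {b: h.hexdigest() for b, h in buckets.items()}

# Hypothetical per-record hashes; one record out of 3000 has changed.
side_a = {i: "h%d" % i for i in range(3000)}
side_b = dict(side_a)
side_b[1500] = "changed"
ha, hb = bucket_hashes(side_a), bucket_hashes(side_b)
diff = [b for b in ha if ha[b] != hb[b]]  # only the changed bucket differs
```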
As a compromise between simplicity and generality, it might after all be necessary to build separate message formats for the most common transactions involving data from multiple sources, for two reasons: making things both simpler and more efficient on the integrator's side.