Web Bloat, DB Migration, Bug Bounty, Cafe Work, ERP, ETL, Data Quality, Multiprocessing
Post date: Mar 13, 2016 8:52:09 AM
- Lots of work with different Git repos to get everything synced, staged, deployed, etc. Webhooks and so on, using BitBucket and GitHub.
- Enjoyed the database migration work, and found a really annoying and strange issue. I was under time pressure, so I worked around it. But it would have been so much fun to know what the real issue is.
- So when I try to drop a column, I get an integrity error for a totally unrelated column. Most curious of all, the column that supposedly has the integrity error doesn't actually have it. It claims that column x is null. Well, it isn't; that's guaranteed. After getting annoyed about that, I did the exact same operation on another, lighter data set and it didn't cause the integrity error. So what's the problem, really? I don't know. I just copied the whole table to a temp table without the column and then renamed it back. Yeah, that's stupid, but it works. So much joy. Business as usual.
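The copy-and-rename workaround above can be sketched like this. A minimal sketch using SQLite in memory; the table and column names are hypothetical, and in a real migration you'd also have to recreate indexes and constraints, since a plain CREATE TABLE AS SELECT doesn't carry them over:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table with a column we want to get rid of.
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, legacy_code TEXT)")
cur.execute("INSERT INTO orders VALUES (1, 'acme', 'x1'), (2, 'beta', 'x2')")

# Workaround: copy everything except the unwanted column to a temp table,
# drop the original, then rename the temp table back to the original name.
cur.execute("CREATE TABLE orders_tmp AS SELECT id, customer FROM orders")
cur.execute("DROP TABLE orders")
cur.execute("ALTER TABLE orders_tmp RENAME TO orders")
conn.commit()

cols = [row[1] for row in cur.execute("PRAGMA table_info(orders)")]
print(cols)  # → ['id', 'customer']
```

Clumsy, but it sidesteps whatever the engine trips over when dropping the column directly.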
- I'm not sure if I mentioned this already, but I just claimed my first official bug bounty. That's great! I'm happy someone values issue reporting. Unfortunately, it's often just the opposite: when you try to report something serious, they do everything to downplay it and keep insisting, over and over, that there's no such issue. It kind of makes you wonder whether you should fully exploit the vulnerability and then tell them, well, I did this just because you claimed I couldn't. Ha, in your face.
- One article said how great it is to do remote work in a cafe. I don't really get it. There are several reasons not to do it. Why would anyone want to work in such a place? First of all, it's a mess and you can't get anything done; secondly, and even more importantly, there are the privacy and data security aspects. It's just insanity to work in co-working spaces or cafes. Well, the world is full of crazy people, so I don't wonder that it's happening. But I wouldn't recommend it, and I'd actually prefer that company rules strictly forbid such craziness.
- It's interesting to see the details of the new European (EU) / U.S. (USA) data transfer pact. I'm pretty sure it's mostly paperwork, 'official guidelines' which nobody actually follows. ;( - It replaces the expired Safe Harbour framework.
- Once again, finished a few more ERP integrations. Like I said, those are all very similar: the same stuff in a different package. Yep. Data in, data out. And some ETL tasks.
- Quickly played a bit with my B2 beta account. My current impression of B2 is that it's cheap and extremely slow. Ha, the common trade-off. Like I've said so many times: some cloud storages are cheap and others perform well. Some services deliver 1 Mbit/s and others deliver 1 Gbit/s without any problems. It seems that B2 delivers about 10 Mbit/s to Finland, which isn't great at all.
- There should be more sites like FossHub; it's a great way to find free open source software for any operating system and get high-speed downloads!
- Lots of data sanitization and 'trust' questions. Can we trust this input? How has it been validated? Are there any guarantees that the information is correct? - Actually these discussions are a constantly repeated topic. It's almost like error logging: fill the logs with useless, unactionable errors which don't actually do any harm, and nobody bothers to check them, even for the more serious ones. This is why I often choose to build daily batch jobs so that the whole batch is aborted if something unexpected happens. It forces someone to react to it sooner or later. If I just logged the errors, well, everything could easily stay skewed for months.
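The abort-the-whole-batch policy could look roughly like this. A minimal sketch; the helper names and the specific checks (`customer_id`, `amount`) are hypothetical stand-ins for whatever the real data contract says:

```python
def validate_row(row):
    # Hypothetical sanity checks; real ones would come from the data contract.
    if row.get("customer_id") is None:
        raise ValueError(f"customer_id missing: {row!r}")
    if row.get("amount", 0) < 0:
        raise ValueError(f"negative amount: {row!r}")
    return row

def run_batch(rows):
    # Validate everything up front: any bad row raises and aborts the batch,
    # so nothing half-broken ever gets loaded and silently skews the data.
    validated = [validate_row(r) for r in rows]
    return validated  # load only after the whole batch has passed

good = run_batch([{"customer_id": 1, "amount": 10}])
print(len(good))  # → 1

try:
    run_batch([{"customer_id": None, "amount": 10}])
except ValueError as exc:
    print("batch aborted:", exc)
```

A failed run is loud and blocks the pipeline, which is exactly the point: someone has to look at it, instead of the error scrolling past in a log nobody reads.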
- Subscribed to the Google Cloud Big Data Blog. I'm dealing with loads of data every day; not nearly Google scale, but tens of terabytes anyway. Actually I don't call it big data, because it's normal data: you can deal with those amounts without any 'special' big data hype stuff. It would be different if it were petabytes. Most of the data sets I'm dealing with daily are around tens of gigabytes. For that, even CPython has been enough. I've experimented with PyPy, but I don't have any production use for it; stuff runs quickly enough without it. Of course I'm using optimized libraries like Pandas and NumPy, but most ETL tasks don't even require that. Usually the most important thing is to fine-tune the queries and the content that pulls data in. One failure there could easily make the job 10-100x slower, or even impossible to execute in any reasonable time. When the amount of data coming in is properly thought out, dealing with the rest of it is usually pretty swift. Multiprocessing and Pool are my everyday way of dealing with the stuff.
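The multiprocessing-and-Pool habit boils down to something like this. A minimal sketch with a toy `transform` standing in for a real per-chunk ETL step (parse, clean, aggregate):

```python
from multiprocessing import Pool

def transform(chunk):
    # Stand-in for per-chunk ETL work; here it just sums the values.
    return sum(chunk)

if __name__ == "__main__":
    # Split the work into chunks and fan them out over worker processes.
    chunks = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]
    with Pool(processes=4) as pool:
        results = pool.map(transform, chunks)
    print(sum(results))  # → 49995000
```

The `__main__` guard matters on platforms that spawn rather than fork, and chunking the input yourself keeps the per-task overhead low compared to mapping over individual rows.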
- This just reminded me that the pyodbc driver hasn't been updated to work with the latest Python versions. Actually, Google's pyodbc page is returning a 404; it used to provide handy precompiled binaries. Well, let's see and ask around.