Blog

My personal blog about stuff I do, like and am interested in. If you have any questions, feel free to mail me!

[ Full list of blog posts ]

Note: I currently have a temporary priority shift, I'm doing a major home improvement project. That's not all, I have also been spending a considerable amount of time planning my summer trip. I hope it will be absolutely wonderful.

Data Protection and Security, PostgreSQL & other databases, etc.

posted Mar 17, 2013, 12:49 PM by Sami Lehtinen   [ updated Mar 17, 2013, 12:50 PM ]

Stuff I have done lately, in random order:
  • Browser cache stats: Hits: 70374 Bytes: 881510044. I think those are remarkable results! Even if I count the time using average roundtrip and maximum bandwidth, the cache is still saving a lot of time, all of which would have been wasted if I had disabled the browser cache.
  • Studied: Data Retention Directive, Data Protection Directive, ENISA, HIPAA, PCI SSC / DSS / PDA / Cloud and Mobile standards (boring stuff, all of these carry the same message): maintain secure systems and procedures, monitor and log everything. The documentation doesn't actually tell you how to do it, it simply says you have to do it so that it will pass audits. FIPS does go into much deeper detail about how things must be done, which cipher suites must be used etc. In general, these are good guides for waking people up to think about security issues and how bad security often is.
  • Linux 3.8 change log: F2FS was especially interesting, because now they finally got the point that current SSD drives (with static wear leveling) already do block mapping, so a log structure isn't required for that specific reason. NUMA improvements were good stuff too. They also have a sense of humor, because the old NUMA balancing system was called MORON. Removing support for 386 CPUs made me really sad, all the good times. Actually I never owned a 386, I had a 286 and a 486DX. SCTP and B.A.T.M.A.N. stuff was semi-interesting. I don't know if anyone is actually using mesh networking.
  • Wondered how bad the Windows 2008 R2 internal tools are, because you really can't clearly see which part of memory is being used by memory mapped files. It's confusing when the server says that all system memory is used, but you don't see what's using it. With rammap.exe from Sysinternals it becomes immediately clear: memory mapped files are using 'all' memory. Yet, because it's cache, it's not a problem, it can be downsized when required. It's just interesting that memory mapped files aren't shown as cache, as they used to be in older server versions.
  • Fraud profiling: Studied web shop and card payment / invoicing fraud detection theory. How to build a Bayes classifier for detecting fraudulent transactions.
  • Tested TS seamless RemoteApp support, did seem to work fine. It's just funny that it requires modifying .rdp files. I wonder why they haven't put checkbox for it in the mstsc.exe UI.
  • Wondered about this blog post which explained how Google's two-factor authentication can be bypassed. It seems that security is too hard even for large companies. I would have assumed that the application-specific password isn't an alternate password for the whole account, but is restricted to a very small subset of features, like just checking mail or updating Google App Engine applications. This is one of the reasons why I have a separate account with separate credentials and 2FA for every service, even if it makes things a bit slower and more complicated at times.
  • Nice post, how to hire a product manager.
  • Finished reading the PostgreSQL 9.2 manual. Yes, there was quite a lot of reading, but it's a quick task when you know the background and can just scan through it. There are some areas I haven't worked with, such as GiST indexes and the queries related to those, but as far as I know I won't be needing those skills anytime soon. I was most interested in indexing, transactions, data persistence (WAL), the query optimizer and database maintenance tasks. I have seen some very bad SQL query optimizers, and compared to those PostgreSQL was really awesome. The rest of the features unsurprisingly reminded me of SQLite (Vacuum) and MongoDB (Compact), which I have studied earlier. Google App Engine's database doesn't require any maintenance. I also studied Python's shelve library and tested it with about 5 million keys. I found that it becomes very slow after about two million keys, so it's not a replacement for SQLite3 in local apps. I decided that I have to play with Postgres a bit more, so I now have a test server with PostgreSQL 9.2, node.js, golang and python3 which I use to play with it. Something I haven't used so far is full text search indexes; that's one of the things I think I will need in the near future, so I'll continue studying it. After I'm done with this, I'll try Apache Cassandra. I'm not sure if I need Cassandra class databases (yet) for anything. Keywords: eventual consistency, Paxos algorithm, MVCC, PostgreSQL vacuum, vacuum full, and analyze, which updates the statistics used by the query optimizer.
  • Quickly studied LevelDB and found out that I don't have any use for it right now. SQLite3 and shelve can get the job done for me with sufficient speed without installing additional stuff on production servers.
  • Quickly studied the basics of CloudFlare's Railgun. It seems to be a pretty straightforward solution, nothing special after all.
  • Quickly studied IPv6 geolocation databases: IP2Location as well as MaxMind.
  • Studied the Linux kernel 3.9 Cache Target feature, which allows using an SSD drive as a block device cache for other block devices like spinning disks. This is the approach I like. I don't like higher level solutions where files have to be placed either on the SSD or on a traditional disk. What if I have a 30 terabyte file and only a 512 gigabyte SSD drive? It's highly likely that only parts of that 30 terabyte file are regularly accessed, as we all know. 80/20 rule.
  • Read load balancing without load balancers by CloudFlare and this interesting post about Linode network overhaul.
  • Quickly tried Chef and its recipes for mass Linux server management.
  • Tried virtual server at DigitalOcean, they seem to provide great bang for buck. But their web pages do not properly describe security and reliability matters. What kind of disaster recovery plans they have, are backups made to same data center etc.
  • Added OpenStack, Apache Cassandra and Apache Hadoop books to my Kindle
  • I think I'm studying right things, top three trends for 2013 were mobile devices, HTML5 (single page) applications and cloud solutions.
  • I really need to notch up my JavaScript skills, I have a few books on Kindle, but I still have had other tasks which have had higher priority.
  • Quickly checked out the Silent Circle and RedPhone secure phone applications, and refreshed my memory of Crypto.cat's current solutions.
  • Read about Carrier grade NAT (CGN) and its shared address space 100.64.0.0/10. RFC 6598.
  • Not to get overloaded with IT stuff, I did read about 9K720 Iskander missiles
  • I really did think that TLS would discard a few hundred bytes from the start of the RC4 stream, but it doesn't. So RC4 encryption can be broken when the same data is sent over and over again. Because this RC4 vulnerability has been known for ages, it's quite funny that it's still a problem with TLS. (article) Also see: Lucky13, BEAST and CRIME attacks.
  • Checked out the Salsa20 stream cipher as well as its ChaCha variant and curve25519 ECC. The Salsa design seems to be very simple; I guess one of the most important things is the values used in the different rounds. Are block ciphers using Output Feedback (OFB) mode safer(?) OFB mode allows block ciphers to be effectively used as stream ciphers.
  • DRY, Don't repeat yourself. I have seen tons of documentation, bad documentation, which doesn't answer essential questions. Why is documentation so crappy? Because integrators often answer the specific questions using private email. Afaik, the only correct method would be updating the documentation to answer all possible questions. Then they wouldn't need to answer the same questions over and over again. I personally hate it.
  • I have seen incredible server admins.. Does this sound like a plan to you? First make sure that you have a public IP. Then disable all possible firewalls. Then enable the guest account and make all printers and network shares accessible using the now enabled guest account. Perfect Windows server and networking security, right? Ehh.
  • The Internet is a surveillance state by Bruce Schneier - Are you surprised? You really shouldn't be, this has been known for a long time. I personally don't like any system which leaks any information which I didn't ask to be shared. It's hard to find applications which do not leak information. Almost all high level applications leak more or less. Modern operating systems can "leak" data at any time by storing it to swap etc., where it can later be discovered even if you used encryption when saving the file. The list goes on: browser headers leak information. My desktop computer has always been unique whenever I have run any browser fingerprinting tests. So that's also a serious leak, as are super cookies etc. The list is endless.
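
To put rough numbers on the browser cache stats at the top of this list, here is a small back-of-the-envelope sketch in Python. Only the hit and byte counts come from the stats above; the 50 ms round trip and the 10 Mbit/s link speed are assumed values picked for illustration, and the function name is made up for this example.

```python
# Browser cache stats quoted in the list above.
CACHE_HITS = 70374
CACHE_BYTES = 881510044


def time_saved_seconds(hits, total_bytes, rtt_s, bandwidth_bps):
    """Estimate time saved by the cache.

    Counts one network round trip per cache hit plus the time needed to
    transfer all cached bytes at full link speed.
    """
    latency_cost = hits * rtt_s
    transfer_cost = total_bytes * 8 / bandwidth_bps
    return latency_cost + transfer_cost


if __name__ == "__main__":
    # Assumed values: 50 ms average round trip, 10 Mbit/s maximum bandwidth.
    saved = time_saved_seconds(CACHE_HITS, CACHE_BYTES,
                               rtt_s=0.050, bandwidth_bps=10_000_000)
    print(f"Estimated time saved: {saved / 60:.1f} minutes")
```

With those assumed values the estimate comes to roughly 70 minutes. It ignores connection setup and server processing time, so the real saving is probably larger.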

Bandwidth Hog (quick concept draft)

posted Mar 17, 2013, 9:34 AM by Sami Lehtinen   [ updated Mar 17, 2013, 9:34 AM ]

Just some random thoughts about bandwidth hogging protocol. Actually this kind of app would have been most useful back in early days when international connections were totally jammed.

Hotel Wifi, great internet connectivity is something you keep as granted, until you don't have it. Sigh! Then it's absolutely horrible.
ProtoObfs with Bandwidth Hog UDP module. Large streaming window, fast retransmits, a completely custom quick protocol which aggressively resends data in case of communication failures. Main connection timeout long, data packet level timeout very short. Use fixed packet size for speed!
Can be used to tunnel TCP streams, but it would better suit file transfers, because chatty protocols which do not buffer enough data could still be slow.
We could also allow simple NAT hole punching etc. Then protocol could be used even in situations where both ends are behind NAT etc.
How to tune protocol so that packet loss isn't a problem, how to correctly set packet size, max rate, etc.
Skype / UDP worked surprisingly well, but TCP streams were extremely hopeless and kept dropping connectivity (due to sub protocol timeouts).
Sending each packet several times even before acknowledgement, base rate, max rate, loss estimation etc. Different, more aggressive modes. Auto rate based on loss etc.
Bandwidth Hog, that's the name. It doesn't add bandwidth, it doesn't work more efficiently, but it hogs bandwidth from other fair protocols by being way more aggressive than traditional TCP options.
Well, it seems that I don't have time to write an article about this, nor do I have time to write an implementation. But this would have been fun. I have been thinking about this option since the 90's, because back then network connectivity was usually much worse and using a protocol like this would have given proportionally bigger benefits.
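
Just to make the "fixed packet size" and "send every packet several times before any acknowledgement" ideas concrete, here is a minimal Python sketch. It is not an implementation of the protocol: it simulates a lossy link instead of using real UDP sockets, and the packet size, header layout and all function names are assumptions made up for this example.

```python
import random
import struct

PACKET_SIZE = 512                 # fixed datagram size (assumed value)
HEADER = struct.Struct("!IH")     # sequence number, payload length


def make_packet(seq, payload):
    """Frame a payload into a fixed-size packet, zero-padded at the end."""
    if len(payload) > PACKET_SIZE - HEADER.size:
        raise ValueError("payload too large for fixed packet size")
    return (HEADER.pack(seq, len(payload)) + payload).ljust(PACKET_SIZE, b"\0")


def parse_packet(packet):
    """Return (sequence number, payload) from a fixed-size packet."""
    seq, length = HEADER.unpack_from(packet)
    return seq, packet[HEADER.size:HEADER.size + length]


def delivery_probability(loss, copies):
    """Chance that at least one of `copies` independent sends gets through."""
    return 1.0 - loss ** copies


def send_redundant(chunks, copies, loss, rng):
    """Simulate a lossy link: send each packet `copies` times, no ACK wait.

    Returns a dict mapping sequence number to payload for everything the
    receiver got; duplicates are dropped by sequence number.
    """
    received = {}
    for seq, chunk in enumerate(chunks):
        packet = make_packet(seq, chunk)
        for _ in range(copies):
            if rng.random() >= loss:      # this copy survived the link
                s, payload = parse_packet(packet)
                received.setdefault(s, payload)
    return received


if __name__ == "__main__":
    rng = random.Random(2013)
    chunks = [bytes([i % 256]) * 100 for i in range(1000)]
    for copies in (1, 2, 3):
        got = send_redundant(chunks, copies, loss=0.3, rng=rng)
        print(copies, "copies:", len(got), "of", len(chunks), "delivered,",
              "expected share", round(delivery_probability(0.3, copies), 3))
```

The cost is obvious: three copies of everything means three times the traffic for the same data, which is exactly the hogging part. A real version would still need acknowledgements, rate control and loss estimation on top of this.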

PyClockPro (Python CLOCK-Pro) cache implementation published

posted Feb 23, 2013, 4:11 AM by Sami Lehtinen   [ updated Feb 26, 2013, 7:50 PM ]

Well, finally it's done. You'll find the project page from Bitbucket.

See: PyClockPro project page

Why did I bother? - First of all, it was fun, a nerd sudoku as they say. I also got nice results. Here are the result charts from the benchmarks I ran.




Other information:
1. Higher resolution PDF charts are here.
2. Sample dump of internal state during operation.

Any questions? Feel free to contact me. From the project page you'll find the source code and additional documentation.

Home Lab, UX design, Web Performance, Startup, Data Discovery, Cloud Security & Platforms

posted Feb 17, 2013, 5:59 AM by Sami Lehtinen   [ updated Mar 17, 2013, 12:01 PM ]

  • Why am I doing all this? I'm no different from other IT guys afaik. Here's an excellent story: "Why you need a home lab to keep your job".
  • UX design for startups. This is something that every developer should know. It's very important to give users a pleasurable UX. I have seen way too many bad user interfaces and way too complex user interface logic, even when trying to complete very simple basic tasks. Direct link to PDF.
  • Web performance crash course (PDF). This is also something that every web developer should know, even when suddenly woken up in the middle of the night. I'm actually quite happy that nothing except a few JS tricks was new to me.
  • After reading previous document, I modified my test JavaScripts to work with async loads. This clearly makes page rendering look faster to user, even if some content is filled in later.
  • Updated https://s.sami-lehtinen.net and 9ox.net HTML code with correct HTML5 DOCTYPE definition.
  • Finished reading (yet another) Finnish startup guide. Creating a company, from idea to business idea, business plan, SWOT analysis, risk assessment and protection against the risks, sales, marketing, marketing tools, funding, pricing, pricing models, start money (from government), types of business entities, official business registration, bookkeeping and financial statements, taxation, taxation with different business entities, insurances, business pensions, hiring employees, environmental responsibility, 10 steps to success, check lists for business owners. What is a startup, how to grow your business, key personnel, team, shareholders' agreement, competitive advantages, other important things to know, list of common business vocabulary. I have earlier read a whole book about this topic, but this guide was a compact one, only 60 pages.
  • Studied cloud provider CloudSigma and virtual server provider Digital Ocean, as well as front caching using aiCache.
  • Checked out Python Wheel Binary Package Format 1.0 (PEP-0427, PEP 427)
  • Announced that Off-The-Record service will be closed, I don't have time to maintain older projects. 9ox.net will probably die when I would need to upgrade it from Master Slave (MS) data storage to High Replication Data (HRD) storage. Time will show if I'll have time / motivation to upgrade it when that time comes.
  • As for privacy policies and security, studied Lavabit secure mail services Security and Privacy documentation. (Secure, Privacy Policy)
  • Quickly checked out Amazon RedShift.
  • Steve Gibson recommended using 1 second scrypt password hashing to store passwords securely. Depending on the site and the attacker, that might make the site vulnerable to a DDoS using a Layer 7 attack (Application Layer Attacks & Protection): a large number of users just try to log in simultaneously, overloading the system. Because the password check is so slow and ties up a lot of CPU and RAM, it could easily cause a denial of service situation. Of course there must be sane limits on how many and how often password checks can be triggered, but with a large botnet it could be possible to bring the site down at some important time, or at least prevent users from logging in, simply by overloading the slow login process / servers.
  • Tried Tableau and QlikView quickly with one large data set. I might like to try SpotFire, Catavolt, BellaDati and Talgraf too. At least the two tools I tried were quite awesome. Tableau seemed to be better (simpler / faster to use) based on really quick testing. I got nice results from the dataset I used with Tableau. I'm sure many people do not realize the difference between Business Intelligence, "Reports" and Data Discovery tools like these. Looking at the data from a different perspective is just so incredibly easy. I have just one tip: please make the data available in tables without too many joins, simply de-normalise it, and don't use cryptic integer references and flags in the data. Each column should be simple and straightforward to process. Too complex data structures kill the (easy) usability of these great tools. - Afaik, all software vendors should provide a free trial for competent people. Why would I buy something unless I have been able to play with the product first?
  • Studied BYOD best practices and mobile device management.
  • Thoroughly studied Check Point's DDoS whitepaper. DoS Attacks, Response Planning and Mitigation.
  • Wondered if sites protected by CloudFlare are vulnerable to a RUDY attack from a large botnet. Played with PyLoris and Python raw sockets. Fine-tuned sshd cipher and authentication parameters.
  • I didn't watch these, but if you're interested there are now video tutorials training you how to utilize Google App Engine.
  • Studied ENISA CIIP documentation for Cloud Services: Cloud computing is critical, Cloud computing and natural disasters, Cloud computing and DDoS attacks, Cyber attacks, Relevant Threats, Different scenarios, Infrastructure and platform as a Service the most critical, Administrative and legal disputes, Risk assessment, Security measures, Logical redundancy, best practices, Monitoring, Audits, tests, exercises, Incident reporting.
  • Read this nice writing about Heroku and how important routing requests to right handlers are. There have been followups for that, I really wouldn't believe that Heroku would use such a bad request routing.
  • Checked out Google's new App Engine (also SDK 1.7.5) instances with high memory options: F4_1G and B4_1G, which allocate 1 GB of memory for every instance. Previously only 512 MB options were available. Also mail bounces are finally getting processed properly. No more sending mail blindly.
  • DDoS @ Wikipedia: Well, back in the good old days when there weren't too many checks in many P2P protocols, it was really easy to feed false source information to the network, making all clients connect to one desired server at once. Many P2P networks are now much stricter about sources and often won't pass information forward without verifying it, effectively limiting the damage caused by false source / tracker / peer information. The first eDonkey2000 versions with DHT were really dangerous. You just gave an IP and port and then a lot of traffic was directed to that address. There was also some kind of flaw in how DHT looked up its peers. If you made sure that the client id was the lowest or highest in the network, it seemed that those IPs got tons of traffic when other clients updated their DHT peer information. When I played with that, I had to stop, because even back then it was able to saturate a 10 megabit/s connection, which was super fast, because most people were using 33.6k or 56k modems. Changing the client DHT address was trivial using a HEX editor and, what's even worse, the "outside" IP and port reported to the network were directly in the configuration file as plaintext. Yes, it was useful if you used NAT and the client couldn't figure out your public IP. But it also allowed targeting any other IP/port combination on the Internet.
  • Amazon Route 53 DNS supports now primary and secondary servers. So you can define your traffic to be routed to another server in case primary server(s) fail. Really easy way to add simple failover redundancy or simple pretty error pages.
  • Played with Bing Webmaster tools. It's really annoying that Bing considers lack of sitemap as an error and nags about it every week. Well, now I have a sitemap for every site.
  • A nice article about A/B testing. Should be pretty clear to every web developer too. I personally might prefer the multi-armed bandit approach instead of plain A/B testing. It would allow testing any number of combinations simultaneously. Of course that requires adequate traffic to be analyzed. The same methods mentioned in this post can be applied to that.
  • Studied Fatcache documentation. What is it? Well, it's "memcache" for SSD drives. No, not memcache in front of SSD drives, it's an SSD cache for any slower subsystem. I still think that I have to study memcache's eviction policies more carefully, I don't actually know exactly how memcache's cache eviction works. Otherwise Fatcache looks like a great idea, and it's just yet another cache tier in a multi-layered cache / data storage system.
  • Many people don't seem to get how caching works at all. They always claim that it's important to have blah blah large SSD drive etc. I personally think you don't need a large SSD: if you simply use the SSD as a write-back block cache and utilize SARC for reads, you'll get really effective block caching in front of your primary data storage. And you'll save a ton of money. For most normal desktops, a 128 GB SSD and a 3 TB storage disk is enough. Blocks which are regularly used are stored on the SSD and the rest on the storage disk. I'm 100% sure that automated caching does better work at this level of detail than any nerd who tries to manually optimize what data is on the SSD and what's on the secondary disk. It would actually be fun to see how great the difference between an "intelligent human" and automated caching would be. The only risk is huge reads which might completely flush the SSD cache if pure LRU is used, like repeatedly reading a 1 TB file from the storage disk.
  • TorBirdy, a TorButton for Thunderbird. - Nice.
  • Firefox OS, Tizen and Bada are coming? Who's going to develop native apps for every environment? Here's the Promise of FFOS. It remains to be seen if end users really care about new alternatives.
  • Article about PEN Testing and security drills. (Penetration testing, Zero-day exploits, backdoor code, drill, hackers, cyber security, CERT, Integrity testing, Stress testing) Basic stuff, systems can be attacked on multiple levels. Can anyone actually protect their systems from hackers anymore? Things are simply too complex for anything to be called secure.
  • Mega vulnerability reward program got quite nice list of different security severity classes:
    Severity class VI: Fundamental and generally exploitable cryptographic design flaws.
    Severity class V: Remote code execution on core MEGA servers (API/DB/root clusters) or major access control breaches.
    Severity class IV: Cryptographic design flaws that can be exploited only after compromising server infrastructure (live or post-mortem).
    Severity class III: Generally exploitable remote code execution on client browsers (cross-site scripting).
    Severity class II: Cross-site scripting that can be exploited only after compromising the API server cluster or successfully mounting a man-in-the-middle attack (e.g. by issuing a fake SSL certificate + DNS/BGP manipulation).
    Severity class I: All lower-impact or purely theoretical scenarios.
  • Confirmed that X-Frame-Options and X-XSS-Protection are both correctly configured with my sites.
  • There's a strange bug, or maybe it's a feature, in Dolphin Browser. I have selected that I want new tabs to be opened in the background. Earlier this meant that whenever a new tab was opened, it was opened in the background. But now something has changed and every link (without a request to open in a new tab) is opened in a new background tab, even if it was a regular link that should have been followed in the current tab. Let's see what other users think about that and if they'll change it back in future. It was super confusing, I clicked several links repeatedly and wondered why nothing was happening, until I found out that I had tons of new tabs open in the background.
  • Tested with one target domain of my own and about 50 clients running different request types and request bots. How fast and what kind of requests are blocked, and when? What are the criteria for blocking? What if requests come from a large set of individual IPs? How well does known good request filtering work if there is a large number of slow requests with some previously unknown characteristics? What if GET requests contain additional parameters to bypass the site's page / request caching, or if existing parameters are modified to cause a continuous stream of failed requests? With some platforms even these failed requests consume quite a lot of resources, as long as the same failing requests aren't repeated all the time. After some testing it seems that it's possible to generate permutations which change the requests over time so that they won't get easily caught. Traditional passive overloading, where you just flood similar requests from multiple clients to one target, doesn't work very well. It seems that if the origin host for the site does not have adequate resource reserves, it's easy to make the site slow or even cause a denial of service situation. With sites that have reserves it's considerably harder, and at the least you will need a larger pool of clients so that none of the clients trips the protection. For some sites, making legitimate requests to older, infrequently accessed data seems to work very well, because server caches can't satisfy these requests. In these cases the only difference between other requests and these requests is the fact that I know that the information isn't cached on the source server or in the CloudFlare network. It's quite hard to detect this kind of attack as an attack, because it's just a larger number of clients accessing non-cached data with relatively slow request speeds. Of course some sites can help CloudFlare by telling in the response headers whether the data was cached or not. This helps to identify clients which bombard the site with requests with non-typical data access patterns.
    X-Cache: MISS from sq63.wikimedia.org
    X-Cache-Lookup: HIT from sq63.wikimedia.org:3128
    Age: 14
    X-Cache: HIT from amssq38.esams.wikimedia.org
    X-Cache-Lookup: HIT from amssq38.esams.wikimedia.org:3128
    X-Cache: MISS from amssq40.esams.wikimedia.org
    X-Cache-Lookup: HIT from amssq40.esams.wikimedia.org:80
    Via: 1.1 sq63.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq38.esams.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq40.esams.wikimedia.org:80 (squid/2.7.STABLE9)
  • Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS), Advanced Evasion Techniques (AET), Intrusion detection system evasion techniques https://en.wikipedia.org/wiki/Intrusion_detection_system_evasion_techniques All seems to be simple and straight forward on theory level. Of course knowing all the insights and how each individual device and operating system works, is completely another story. But that is information which is available through searches and pure simple research, if you know what systems you're targeting.
  • Played with Python and raw sockets, allowing me to open tons of simple tcp sessions, without loading my own system with sockets. Because we all know that plain syn flood isn't going to work, then it's just required to handle the sessions bit further until suitable break point is found. Or simply asynchronously handle packets on event basis so that huge number of connections can be kept without loading local system too much.
  • Checked out the star wars route. I immediately wondered why they needed two routers to do that. Any hardware / software can be simulated using raw sockets, so a Raspberry Pi would have worked just as well as two Cisco firewalls.
  • Playing with reverse names (RDNS) is also fun, some badly designed services trust reverse names and it allows all kind of nice tricks to be done as well as naturally providing fake ident service (identd) data.
  • Studied Continuous integration (CI)
  • Found out that ZyNOS has issues with DNS caching and round robin. If I have configured 10 A records for one DNS name, the Zyxel will only store the first of those in its cache and return it to all clients, therefore breaking the round robin and whatever load balancing or redundancy it might have been used for.
  • Found out that some Telewell models got IPsec related memory leaks, renegotiating IPsec connection over and over again leads to out of memory situation.
  • I have watched many Air Crash (Mayday) shows. It's interesting to see how minor mistakes mount up and cause a major catastrophe. I'm also very curious about the usability issues, like how hard it is to read some meters or to see if a switch is on or off. So that TV show is important even for IT guys. Don't design programs which confuse users, make user interfaces clear, present all vital information correctly so that it is easy to understand, and don't flood users with useless or false information.
  • Google+ Platform: Especially Pages API seems to be interesting, otherwise read only APIs are quite useless. It seems that it requires whitelisting, especially granted access after an application has been submitted and hopefully accepted.
  • One guy lost his disk encryption key... This is exactly why I always keep paper backup of the master passkey. But, the paper backup is encrypted with light encryption. Why not to use strong one? It really doesn't matter, the master password is random string and 16 chars long. Then it's encrypted with simple phrase, using substitution, partitioning and transposition. After those steps, I'm confident that the password on paper is also utterly useless to anyone without knowledge how it is encrypted and what the simple pass phrase is. The backup key is also hidden outside any reasonable search area.
    You could also utilize very simple methods like reversing the case of a random password, swapping parts, or adding or removing something you know. For example, as a prefix to strengthen the password, you just always write passwordpassword (or something similar) and then add your real password. Without the attacker knowing it, your 6 chars long f8Snb3 random password is now 22 chars long. Don't use any of the schemes mentioned here, make up your own.
    The password container software is configured to run about 10 million strengthening iterations on the password before it's used. This means that it will take about two seconds to verify one password. (Yeah, of course depending on many factors.) - Password strengthening can be done using memory hard problems, like scrypt, which is way better than options which only consume pure processing power. (Read about memory hard problems)
    You should also be aware of corruption risk of encrypted data. Therefore it's better to always have a off-site backup set with different encryption key(s). I usually do not renew both keys simultaneously, so I can reasonably recover from the backup even if I would lose the master key.
    Of course you can also use indirect method, where you map to numbers and letters, pages, rows, char poses and therefore the password on the paper has absolutely nothing to do directly with your password. Do mapping so, that distribution is even and it's not clear that it's offset references. Then you just know, that when you pick (pdf/book/file,source code) X and start applying your code, you'll get your password.
    Generally my absolute minimum length for passwords is 12 random chars, and for master keys I prefer 20. For keys that I don't need to remember, I use 32 random chars. If you're using AES256 and prefer to have 256 bits of entropy in your password, use a fully random password of 40 characters (including a large set of special characters) or more.
    Giving password to lawyer is good idea if you want someone to have your password, in case something bad happens to you. Otherwise it's totally pointless. If I'm gone, my (private) data is gone, and that's it.
  • Finished reading Christensen - Innovator's Solution.
  • Google Cloud Platform: Studied BigQuery Best Practices, Tutorials. Read BigQuery Cookbook. Uncompressed vs Compressed data formats. Data de-normalization. Schema, data conversion, xml, json, csv, query basics. API V2 overview, Google Compute Engine, Overview, Main Concepts, Instances, Images, Networks and Firewalls, Disks.
  • Why are people upset about Facebook's Graph Search? Isn't it clear that whatever they send to Facebook should be considered public information anyway? So what's the news?
  • More security reading, Lucky Thirteen: Breaking the TLS and DTLS Record Protocols. The attack doesn't seem to be quite feasible, but who knows, maybe it can be improved.
  • Found out the performance limits of one quite important firewall. For some reason they promise it should provide 50 Mbit/s IPsec throughput. Actually, when using small packets it stalls way earlier (CPU maxed out). Didn't find any way to work around it. Seems that it's a good time to replace the firewall with a new one, which would also support OSPF, SSL tunneling and naturally IPv6 as well as intrusion detection features.
  • Studied: Ulteo Open Virtual Desktop, rdesktop, xrdp, Ericom, TS RemoteApp, RDP, Citrix, 2x, ThinStuff - Tested solutions with Windows 2008 R2 Data Center and Windows 7 Professional (64bit) - ThinStuff is excellent for light remote desktop virtualization. I'm just not 100% sure if it fully complies with Microsoft TS licenses.
  • Finished reading the Spanner: Google's Globally-Distributed Database whitepaper. (PDF) (TrueTime, NewSQL, Paxos, NoSQL, MapReduce)
  • From developer magazines I read the Cloud Services and Strong Identity article as well as Possibilities of Future Internet Payments. 4G, the next generation of wireless communication. A long BIG DATA article. Hacker's view: how to utilize open public data. From wage slave to millionaire: making your startup business successful.
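The password length rules of thumb in the password notes above can be checked with a little arithmetic. A minimal sketch, assuming every character is picked uniformly at random from the 95 printable ASCII characters (the alphabet size and the function names are my own assumptions for illustration, not taken from any tool mentioned here):

```python
import math

def entropy_bits(length, alphabet_size=95):
    """Entropy in bits of a password of uniformly random characters."""
    return length * math.log2(alphabet_size)

def min_length(target_bits, alphabet_size=95):
    """Smallest password length whose entropy reaches target_bits."""
    return math.ceil(target_bits / math.log2(alphabet_size))

# Lengths mentioned in the text: 12, 20, 32 and 40 random characters.
for n in (12, 20, 32, 40):
    print(n, "chars:", round(entropy_bits(n), 1), "bits")

print("chars needed for 256 bits:", min_length(256))
```

Under this assumption 40 characters land a little above 256 bits, so the 40 character rule leaves a small margin. With a smaller alphabet, such as only letters and digits, more characters are needed for the same entropy.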

I think this is a summary of a bit more than two weeks.
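The memory hard password strengthening mentioned in the password notes can be sketched with Python's standard library. This is only an illustration, assuming `hashlib.scrypt` is available (it needs a Python build linked against OpenSSL 1.1 or newer). The cost parameters and the function name `stretch` are assumptions for the example, not the settings of any particular password container.

```python
import hashlib
import os

def stretch(password, salt, n=2**14, r=8, p=1):
    """Derive a 32-byte key from a password with scrypt (memory hard).

    Memory use is roughly 128 * n * r bytes, about 16 MiB with these
    assumed parameters. Raising n makes every guess more expensive.
    """
    return hashlib.scrypt(password.encode("utf-8"), salt=salt,
                          n=n, r=r, p=p, dklen=32)

salt = os.urandom(16)  # random salt, stored next to the encrypted data
key = stretch("correct horse battery staple", salt)
print(len(key), "byte key derived")
```

The same password and salt always give the same key, so the salt has to be kept with the container. The cost parameters should be tuned on the actual hardware until one verification takes as long as you are willing to wait.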

DNSWL, DDG, Helpdesk & Customer service, Fast or Slow, Browser Cache, Data Security, Back button, Signing

posted Feb 3, 2013, 2:29 AM by Sami Lehtinen   [ updated Feb 3, 2013, 6:09 AM ]

  • Got my mail server listed on DNSWL. The internet is quite hostile nowadays: if you start running your own mail server, even some of the large webmail providers easily classify your mail as spam. Well, now that should be fixed. - s.sami-lehtinen.net @ DNSWL.
  • A few articles:
    How the US spies on data centers located in Europe (fictional, speculation)
    Personal Security Images (useful or not?). I have said that a site should authenticate itself to the user. This is one attempt to do it, but it's not foolproof at all.
    Big Data, Business Intelligence (BI), Data Discovery, Data Analysis, Probability and Statistics, Visualisation and Graphing by AppliedDataLabs. I have been playing with Qlikview, but Tableau was a new app to me.
    Reaching 200k events / second. A small optimization article.
    Tips for Legitimate Email Senders to Avoid False Positives. As simple as the tips are, it seems that many email server admins do not follow them. Like the services I have mentioned here, which lack a reverse DNS name for the outbound mail server.
    Secure Data Deduplication (PDF). Strange white paper, because as far as I know, this same method has been used for ages. Even when you encrypt data for multiple recipients using PGP/GPG/OpenPGP, the actual payload is encrypted only once, and the actual encryption key is simply encrypted using multiple public keys. So each recipient can decrypt the actual encryption key using their own private key. Also Freenet and GNUnet use a similar method, where the data is used to form the encryption key for the block. So you can decrypt the block only if you know the content, or you have the key for the block. As said, this is problematic just like it is in Mega's case, because when you get something you know how to decrypt, you can start hunting people who have the data, even if they don't have the decryption key and don't know what the data actually is. Freenet & GNUnet encrypted disk cache. It would be very interesting to see how courts handle this kind of situation.
    Microsoft starts supporting Git. Nice!
    Good and bad OAuth 2.0 implementations. Yet another nice article.
    DuckDuckGo handles over million deep queries / day, article by HighScalability. I have been using DDG for a long time, and it's now clear that
  • Computers & Robots get smarter. Will there be work left for the regular guy in the future? Actually this is no news, isn't all cloud & data automation about mostly reducing labor costs? At least in every case where I have talked about integration and automation, that's the primary goal. If system looks cool, but doesn't reduce costs to produce the service, then it's a fail.
  • Helpdesk & Customer service parts of any service should be as automated as possible. If possible, 100% self-service is preferable, and then multi-tiered help functions.
    A) Clear, simple, self-explanatory user interface with tips, to avoid raising questions about how the application / program should be used in the first place.
    B) Excellent user FAQ & guide.
    C) Automatic checks and suggestions for the user question being sent. If the user tries to ask something that is covered by the FAQ & guides, give a straight answer.
    D) When a real person sees a ticket for the first time, there should be an automated suggestion for a canned response, as well as an option to select from other canned responses.
    E) The last, expensive option is to answer the question individually. Note: the question should be analyzed, and there must be an option to store this response as a canned response and to store it in the FAQ & guide section / knowledge base too.
    F) Way too often answers aren't stored, and the same questions are answered by humans over and over again, which is really expensive.
    G) There should be regular reviews to update the FAQ & guide based on new canned responses. Not all of the canned responses should be visible in the public FAQ & guide section; there might be several reasons for that.
  • My favorite RSS to mail service was closed a month ago. Now I'm using Blogtrottr, which seems to be working very well. I also tried IFTTT but I didn't like it too much. I have also listed all of my favorite RSS feeds in the Ampparit service, so I can see all interesting feeds with other news collected from public (mostly Finnish) sources.
  • Other essential tools for web-masters are ChangeDetection and WatchThatPage, both allow monitoring listed web pages for changes. Especially good for sites which do not provide key RSS streams.
  • Following stuff and white papers released at Usenix is as essential as following Google I/O events.
  • Fast or Slow application? Annoying or good app? The user decides, based on many things:
    It's all about user perception. Smartly developed programs give the user an impression of speed even if the fundamental task didn't change at all.
    A perfect example is Outlook vs Evolution, and in this case it's Evolution which absolutely sucks. I write a message and click send. Outlook handles it in less than 100 ms. But with Evolution, it's taking forever! Come on, I've got 200 other emails to handle, what is taking so long? Still sending mail, omg! This is ridiculous!
    So what is the design failure causing all this annoyance? It's super bad user interface design! Instead of showing a big slab on screen (WAIT, sending email...) for several seconds, or in some cases even for tens of seconds, it should simply queue the task and get it done in the background without annoying users so much that they bother to write posts like this! Of course I can switch to another window and continue processing mails, but then I get many of those sending message windows in the background and those bother me. Also it's not ideal for user flow to always find the next task from a stack of windows.
    Just yesterday one guy asked me how a mobile client used with a cloud service can be made responsive enough. That's the trick: use whatever data you have, and make the app at least seem fast, even if the real task is handled in the background asynchronously. It makes such a big difference.
    Let's take a scenario where the user would like to check new mails, and there's one mail with a large attachment. In the bad case, he can't access any of the mails before all of them have been completely downloaded. In the good case, you get the list of messages and can process small messages which are already downloaded, while the large attachment is downloaded in the background without annoying the user for 5-15 minutes with a "downloading messages, please wait" notification.
    These are so simple things, that I thought everyone gets the point immediately, well, it seems that many engineers and developers simply do not.
    That's my rant for today about annoying user interfaces, bad design and poor user experience (UX).
    What's worst of all? If something happens before the message is fully sent, it's often completely lost. Yet another massive win for super bad design. If the message were first queued and then sent, it could be sent again if something happens during the transmission. But in this case the message is totally lost or only partially recovered, because it was not actually saved before the attempt to transmit it.
  • It seems that for many organizations even simple things like renewing an SSL certificate are way too complex to get done. First I laughed hard when the Finnish web forum & quite big web mail provider Suomi24 forgot to renew their certificate. But when Nordea Bank forgot to do it for their web bank, I really did start to wonder. What kind of guys have they really got working there? - Hire me, laugh. I know how to check the date from an SSL cert, and I'm totally qualified to use a calendar with recurring events. The worst part, which really makes me wonder, is that often they claim this to be a completely surprising thing. I have also heard that even more important money traffic related SSL certs have expired (can't mention exactly what) in similar fashion, and yet the same BS arguments were used. In some cases, admins say that the best way to fix this is to disable SSL cert checks completely. Guess what, when the cert is renewed, what percentage of users turn the check back on? 1 or maybe 2%, and I actually think that my estimate is way too high this time. For important systems I would still use something better than pure server SSL cert based authentication. Self-signed certs and client certs are better, but even with those I would like to use bi-directional challenge based authentication. So even if the SSL layer completely fails, attackers won't gain authentication information. Of course they could still set up a MitM attack and get the data, because the encryption layer is now peeled off.
  • One friend of mine suggested that it's a good idea to always disable all browser caching. I personally said that it's not such a good idea, because it adds network traffic and makes the browsing experience slower. Therefore I did some analysis of my own browser cache, and the results were as follows. During two weeks, the browser cache saved one gigabyte of browser traffic. Even more importantly, the cache served over 55k requests to me immediately. Based on this temporary 3G connection I'm using, it saved me about 110k seconds = 1843 minutes = 30 hours. Well, of course I usually open tabs beforehand while working with other tabs. Even if we completely ignore the round trip time, which is usually quite important when surfing, pure data transfer at the max rate would have taken 5.5 hours. Is caching pointless or not? You tell me. Like I said, my usage is quite light, because I'm now using 50% broad band connection.
  • One friend of mine was really upset when his boss told him that he can't download all of the customer's data to the dev / test environment and use it for testing. I have had similar discussions several times. Usually the argument is that it's faster to troubleshoot if you have all of the customer's data available for testing. But I personally think that's a security issue and it shouldn't be done. Therefore my own apps use excellent logging, with tracebacks, without any (meaningful / private) customer data. If something malfunctions, first I get an automated alert about the problem, and usually I also get everything I need for fixing the issue. In the worst case, all the customer's data is then left lying indefinitely in some hard disk corner and not properly destroyed from the system(s). You never know when you'll need it again, do you? It's better to keep it. A very bad policy, AFAIK.
  • Quickly tested Asana for project management and sharing tasks with distributed team and Freshdesk for helpdesk functions.
  • Reminded myself quickly about the usefulness of Bloom filters.
  • Duplicati backup system comments: I'm not running the duplicati.exe process all the time on all of the servers. I start it using the Windows scheduler, or from a larger backup batch for creating off-line backups, when it's required. Currently I'm using parameters: --run-backup="Backup" --trayless. I would just need an option to tell the Duplicati process to exit when it's done. Currently there's no option to do that. I have defined the process to be killed when a certain time has passed, but as we all know, this is a very dangerous and incorrect method. I would love to be able to add an --exit-when-done parameter. That's all: when the backup job(s) are done, exit. - Thanks. You might ask: Oh, why? There's a command-line version of the backup program. Yes, I know. I use it with Linux systems. But some Windows admins don't seem to understand the command-line version parameters at all. They're not used to command-line tools.
  • I'm so happy with 7-Zip. It's just so superior compared to other alternatives. Nobody should use other general archivers. Zip can be used for compatibility reasons when working with legacy stuff, but otherwise modern systems are able to run 7-Zip, and it's simply great compared to zip. Compresses much faster and even with better compression ratio.
  • Reminded myself about the benefits of memory hard problems, like scrypt, which require memory instead of pure math / CPU processing power. Scrypt is also on its way to becoming a standard.
  • Google Cloud Storage is offering Durability Reduced Buckets / Durability Reduced Storage, which offer a cheaper way to store data. But this isn't anything compared to Amazon Glacier.
  • Google is also offering PageSpeed as cloud service, quite a nice idea.
  • Checked out and tried HelloSign and SignOm online agreement / document signing services. HelloSign seems to be really nice to use. The most important part of a signing service is that it has to be a legally accepted way to make a contract. In Finland banks mostly provide authentication and signing services, because they are the only trusted online authorities which also have a wide user base. I just wonder when Facebook can provide legal agreements. Doesn't it have a wide user base and hopefully strong authentication? How strong authentication is strong enough? Using this method would also make sure that they (should) know if the user is using his right identity with the service. Many strong authentication service providers do not actually provide a strong real world identity; that's the weakness of many services. As mentioned earlier, I might also like to have a strong pseudonymous identity, so they know it's "me", even if they don't know who I actually am. Very useful in some situations.
  • It was really hard to get one web designer to understand the difference between the browser back button and going up one hierarchy level in a web store. If a user lands on one article on the site from Google and wants to see other articles in the group, that should be possible. She thought it's a good idea that when the user wants to go back up one level, it means going back to Google. No, going back / up in the article hierarchy doesn't mean returning to where you came from. It would be very annoying to then form another Google query and hope that you'll land back on the right page in the web store, because the web store navigation is crippled.
  • Android back button, is it broken or not? Kennu wrote that having a global back button is a really bad idea. I personally say it's not a bad idea at all. I really like the global back button. He also claimed that it's confusing to have several back buttons. No it isn't. One back button is the Android back button, which goes back one step, whatever it was. Another back button is the application layer back button, which goes back or up one level, just like in the web store case I wrote about earlier. Most important for the user experience (UX) is clearly defining the steps the back button returns to. As we know, this has not been simple in the past. Currently browser back buttons are totally broken: whenever you press the back button, you really do not know where you're going to land. When you click a back button on a web site, it's the same result. You don't know where you're going to end up when clicking it. But this is not a technical flaw; it's a flaw made by the application developers, not a systemic flaw. When you press cancel or undo, it's just the same question: it goes back to some stored point, but without trying, you really do not know how far back you'll end up with that click. So, the global back button is cool, as long as apps work correctly with it. I open a link from an email, read the page, press back, and I get back to the email. What is the problem with that? It's just like with any HTML5 app: you might end up leaving the whole application, or going back a smaller step in the app. You never know. As with the browser, you'll learn not to use the back button with broken apps, but is the source of the problem the badly behaving app or the broken(?) button concept?
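The queue-first idea from the mail client rant in this post (store the message, return to the user at once, deliver in the background, retry on failure) can be sketched roughly as follows. It assumes a hypothetical `Outbox` class backed by SQLite and a caller-supplied `send` function; none of these names come from Outlook or Evolution. The only point is that the message is saved before any transmission is attempted, so a failed send loses nothing.

```python
import sqlite3

class Outbox:
    """Store messages first, deliver them later (hypothetical sketch)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY, body TEXT NOT NULL, "
            "sent INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def enqueue(self, body):
        """Called when the user clicks send: persist and return at once."""
        cur = self.db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
        self.db.commit()
        return cur.lastrowid

    def flush(self, send):
        """Background step: try every unsent message, keep failures queued."""
        rows = self.db.execute(
            "SELECT id, body FROM outbox WHERE sent = 0 ORDER BY id"
        ).fetchall()
        delivered = 0
        for msg_id, body in rows:
            try:
                send(body)
            except OSError:
                continue  # network trouble: the message stays in the queue
            self.db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (msg_id,))
            self.db.commit()
            delivered += 1
        return delivered

    def pending(self):
        """Number of messages still waiting to be delivered."""
        return self.db.execute(
            "SELECT COUNT(*) FROM outbox WHERE sent = 0"
        ).fetchone()[0]


outbox = Outbox()
outbox.enqueue("Hello")

def broken_send(body):
    raise ConnectionError("network down")  # simulated failure

print(outbox.flush(broken_send), outbox.pending())       # nothing delivered yet
print(outbox.flush(lambda body: None), outbox.pending())  # delivered on retry
```

In a real client `flush` would run in a background thread or on a timer, and the database would live on disk instead of in memory, so queued messages also survive a crash.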

Cloud security, JavaScript Crypto, Megafail, Hard Disk Drives, Spinrite, Disconnect.me, exFAT, Graph Search

posted Jan 27, 2013, 1:19 AM by Sami Lehtinen   [ updated Feb 10, 2013, 4:04 AM ]

  • This is extremely pro-cloud writing, but often true: The delusions that companies have about the cloud.
    It's much worse to have some random server of your own somewhere, which has 100% uptime... Well, until it doesn't. There was this one guy working in the company three years ago who knew something about that server. Now the HD of the server is broken, it could easily take a week to get it online again, and restoring full functionality can take a week or a month. It's also quite possible that it doesn't happen at all, because of a lack of backups. There might not be automated backups at all, or backups might have been malfunctioning for an extended period. Or the even funnier situation where the backup is fresh, but nobody knows the encryption key. Or the backup tape has been unreadable for years. The backup was in the same room with the server, and is now burned. The backup was a USB stick attached to the stolen machine. Some guys also said that cloud companies care about privacy and security, because it's essential. Many other companies don't care about it at all. Even if alarms are raised, if nothing really bad happens, all issues are simply ignored.
    - Yes, I have personally seen all of these issues when working with customers.
  • People say that browser JavaScript crypto is inherently insecure and doomed. Well, this is one of the issues I have been raising from time to time about automated updates for programs. If they can modify web pages in real time, what makes you think they couldn't deliver fraudulent desktop software using automated updates? - Well, they can. So if you don't trust JavaScript crypto software downloaded from a website, why should you trust a native encryption program downloaded from a website? Unfortunately many program updates are still delivered using plain http and without any signatures etc.
  • Megafail? Mega.co.nz security analysis, which tells something about my previous bullet point. Would native locally installed client be any better with automated updates? No?
  • Checked out some information about h.265 / HEVC, I'm really happy that I don't need to write decoder or encoder for that. If something is really mind bendingly complex, that will be it.
  • Read interesting news article about Shingled Magnetic Recording. There are interesting times for HDD industry, when SSDs are mainstream, traditional disks need to really offer superior capacity. Currently manufacturers seem to be moving to Heat Assisted Magnetic Recording (HAMR). Patterned media is also one interesting future technology. Also see: full 9 page SMR PDF document. Perpendicular recording is already old stuff. Here's also one article about HAMR.
  • After listening to endless stories by Steve Gibson about his Spinrite level 1 and 2 runs for SSDs, I decided that these are very basic functions which don't really require any specialized software. So I ran badblocks -sv for all drives to confirm that all data is readable and that the disk controller is aware of the current "disk state", because the status of unused disk areas can be unknown to the controller.
    For my Windows workstations I used chkdsk /F /R, which also reads all files and non-allocated space from the partition. With badblocks I could have unmounted the drives and used the -n parameter (to do non-destructive write tests too), but as far as I know that's not required, because I assume that everything is ok. A read-only test is actually more than enough. I think badblocks is better than chkdsk, because it reads all blocks from the selected block device and doesn't care about partitions. Remember to run it for the whole block device, not only for a specific partition.
  • Disconnect.me is an excellent site about social network spying. They offer an excellent browser plugin to prevent continuous snooping. Facebook is absolutely the worst spyware of the year. Don't forget to check out their blog too. Well, there are also other spies watching over your shoulder, like Twitter, LinkedIn, Google Plus, etc. Social network buttons are the cancer of the internet, destroying privacy. Also, user identity can be stolen using the Widget Jacking method, due to insecure social network widgets which aren't using SSL.
  • Reminded myself about S.M.A.R.T. and AHCI as well as disk drive native wipe (overwrite / destroy) features. But how can you be sure that everything has been erased? Well, without a good lab, you simply can't. Is it worth it? No, just physically destroy the drives containing confidential data when you're done.
  • DARPA got interesting plans, they're planning to create payloads that fall upwards. Yes, from the bottom of the sea to the surface. See the project. "Distributed systems to hibernate in deep-sea capsules for years, wake up when commanded, and deploy to surface providing operational support and situational awareness"
  • Something fun for a while, a nice 404 page. Although I personally think that the animation isn't nearly as advanced as it was in the original version over ten years ago. So yes, it's a partial fail.
  • Checked out exFAT features. Well, it doesn't impress me much. It's just an extension to old FAT. Without a journal it's a really dangerous file system to use for anything meaningful. Even NTFS USB sticks get corrupted, but luckily it's only the data that gets corrupted, because journaling saves the file system. Without a journal, well, things can be much worse. I don't find the overhead of using NTFS or Ext4 for USB flash too bad, compared to the damage that a non-journaling file system could cause with multiple random disconnections.
  • Blog post about actual Facebook graph searches. Well, why is this news? They have made all that information public anyway. It shouldn't be news at all.
  • Updated to latest GAE developer SDK: release: "1.7.4", timestamp: 1352314290, api_versions: ['1']
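The read-only disk check described earlier in this post boils down to reading every block once and noting where the reads fail. A rough sketch of that idea, assuming a hypothetical helper `read_scan`, a 1 MiB chunk size and a Unix-like system (`os.pread` is not available on Windows). It is not a replacement for badblocks. It takes an ordinary path, which on Linux could also be a whole block device when run with sufficient privileges.

```python
import os

def read_scan(path, chunk_size=1024 * 1024):
    """Read `path` chunk by chunk; return (bytes_read, bad_offsets)."""
    bad_offsets = []
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and devices
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
                total += len(data)
            except OSError:
                bad_offsets.append(offset)  # unreadable chunk, keep going
            offset += want
    finally:
        os.close(fd)
    return total, bad_offsets
```

Calling `read_scan` on a healthy file should return the file size and an empty list. Note that reads done this way may be served from the operating system cache, so it is only a rough check of what the drive itself can read.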

PRP, MarkDown, PCP, Mega, LocalStorage, Blink Protocol, UnitTest, Passwords, Helpdesk, Integration Best Practices, etc.

posted Jan 20, 2013, 6:13 AM by Sami Lehtinen   [ updated Mar 17, 2013, 9:05 AM ]

 
  • I have been working on my own Personal Resource Planning (PRP) methodology. The most important skill seems to be not to overcommit yourself, because overcommitting certainly leads to unhappiness. The next important thing is clear prioritization. It also helps with not overcommitting yourself: you do what's important, and the rest is not important. It's important to be able to happily ignore what's not important, otherwise it'll bog you down. My own GTD system with multiple prioritized queues does the trick. What I still need to improve is being even more selective about what I insert into my task queues. I seem to be a bit too interested in way too many things. Focusing on a much narrower field would make life easier. Telling high energy times from low energy times is also great, because then you can pick tasks which are suitably challenging for your mood. Like watching a movie, purging email, doing investments, programming etc. Each step requires more concentration and is mentally more demanding. Learning my own limits makes turning things down much easier. I have received some very interesting offers lately, but I really don't want to overcommit myself. I have done it often in the past, learned from my mistakes, and I won't do it anymore. It's better to undercommit: then you have free time and you can still do something useful with it. Even the same things you could have committed to and then suffered with if you ran out of resources.
  • Entire nations intercepted online, key turned to totalitarian rule. - No wonder BGPSEC is being pushed forward. (Btw. There isn't BGPSEC topic in Wikipedia yet.)
  • Studied Markdown for my PyClockPro project. Actually quite many services use Markdown. I personally didn't like it too much. Writing technical documentation with it wasn't as straightforward as I would have preferred. Getting Markdown to be formatted correctly with BitBucket wasn't fun at all. It's not complex, it's just annoying. Luckily they provide many other alternatives. If I'm not happy with Markdown, I'll revert back to plain text.
  • Mega uses convergent encryption. Well, it's a fail. TL;DR: It is not secure, deduplication ruins privacy. Not an acceptable solution at all.
    I like Freenet's approach, because it's also complemented with anonymous routing. Without anonymous routing, content-based encryption is dangerous.
    The encryption key is based on the payload, so if you don't know what the payload is, you can't decrypt the packet. Of course decryption keys can be delivered using a different encrypted tree of keys, which is used when you deliver a download link.
    For that reason, whenever I'm sharing anything I usually encrypt the files with my recipients' public keys before sending them out. Just to make sure that the data is really private and the keys are known only to my selected peers. In some cases when I want to make stuff even more private, I encrypt the data separately with each recipient's public key, so you can't even see the list of public key IDs which are required to decrypt the data.
    I also have a 'secure workstation' which is hardened and not connected to the internet. That's the workstation I use to decrypt, handle and encrypt data. Only encrypted and signed data is allowed to come and go to that workstation.
    This is exactly the same problem as with Freenet. Because the same plaintext encrypts to the same ciphertext, there is a huge problem. If I really don't want anyone to know that I have this data, that's a failed scenario. It makes things easier for the service provider, because they don't want to know what they're storing. Just like Freenet's data cache. But if I know what I'm looking for, I can confirm whether my cache contains that data or not. Therefore this approach doesn't remove the need for pre-encrypting sensitive data. Otherwise it's easy to bust you for having the data.
    There's also one interesting solution, the OFF System, which is just plain gimmickry.
    Well, my point is that if security is provided, it's better to forget trying to deduplicate data. If data is deduplicated, security has already been weakened way too far. A secure service does not need any deduplication, and if it uses it, it means that the service is fundamentally flawed.
    See (PDF's): Secure data deduplication white paper, Efficient Sharing of Encrypted Data.
    When file sharing, use GnuPG and Tor, or Freenet, GNUnet, whatever. Anyway, using the first encryption layer is also critical. Then you can use a secondary service to make you pseudonymous, so you're reachable even if they and you don't know who you're communicating with.
  • Little browser local storage test using Brython and Google App Engine. Seems to be working well with a mobile browser, but my Firefox is configured to drop all data when I close it, so it doesn't work with my Firefox at all (if I close FF between page visits).
  • Year 2000 problem, year 2038 problem, etc. Do you know what year comes after 1999? - Don't you know? It's of course year 19100. Laugh, our invoicing system started to spew out invoices with due date in the beginning of 19100. Well, it's nice to get little interest free payment time. Let's see what happens when year 2038 gets closer.
  • Blink protocol: Well, I don't personally see a need for it in my case, because I'm already used to higher level languages and higher level data structures, which naturally are much less efficient than Blink or binary formats. But in some scenarios I really can see it being beneficial. I have to admit that it's ridiculous to have a 5 gigabyte XML file which is less than 2 megabytes after efficient LZ77 compression. It really tells everyone what the "information value and density" of that XML junk file is.
  • Read quite a short but informative article about Thorium reactors: I have read a lot of stuff about nuclear reactor design and security systems earlier, so this was quite a nice fill-in piece.
  • I'm still having issues with my PyClockPro project. I'm currently already doing the third refactoring round. That's the reason I haven't released it yet. I have been adding many optimizations to save CPU time and discarding higher level Python data types for some things. I think the code is now actually quite ugly, because it's not Pythonic at all. It's more like C or Java code, just written in Python. I'm still wondering whether I should use linked lists with internal processing logic, or external loops accessing data in lists. I benchmarked recursive linked lists and the performance was quite bad. Sigh. Currently I'm wondering whether I should use list.remove(something), or a loop that checks list entries (using an index) and removes the content when found. Why am I asking this? Because I already know the range in the list where the value being looked for is. But list's remove doesn't have the luxury of utilizing the known range when looking for the value. Well, after these questions, I know there is a need for this kind of library, because the implementation is non-trivial, and I really understand why simpler alternatives like LRU or CLOCK are so popular.
    I'm currently writing unittests and proper complete benchmarks for PyClockPro. I'm also aware that the class decorator wrapper might not be optimal. I assume someone who's actually experienced in writing class decorator wrappers could tell me exactly how it should be done correctly, and especially why. It's working well, but as we well know, that really doesn't mean it's done correctly. Some core functions are still pretty broken. I really hate refactoring core parts of a program, because it inevitably leads to breaking most of the code for a while. (Add the refactoring video clip of a cat trying hopelessly to jump out of a slippery bathtub.)
  • Studied pydoc3 for generating standardized class and function descriptions for Python.
  • Studied unit testing (with Python) and git hooks, preventing commits until unit tests pass. Read several long articles about Python unit testing, so I can include unittest for PyClockPro. I still think I'll release first version without unittests after I have done other tests and analyzed debug data so I'm sure it works ok.
  • This would be really interesting if I were younger and had time for it. Calling all coders: Hardcode, the secure coding contest for App Engine. "During the qualifying round, teams will be tasked with building an application and describing its security design."
  • Google declares war on the password - The problem is that if criminals can convince you that you’re visiting Gmail even when you’re not, they can trick you into entering that secret code. In fact, the bad guys can even turn two-step authentication against legitimate users. The site should first be authenticated to the user, and then a key tied to that authentication should be used to authenticate the user to the site, without ever revealing the shared secret or private key which was used as the basis of the authentication.
  • I just noticed this week that one high value target had a security flaw. Even though strong 2FA authentication was used, the session cookie they served had neither the Secure nor the HttpOnly flag set. Oops. It means that (Java)Scripts can access it, and it can be sent over a plain http connection. We all know that not all users add the https prefix, so fail, there it goes out to the internet without encryption. Well, I naturally reported the issue to their security team. I'm currently waiting for confirmation. I also suggest that sites should use HSTS, even if it's not nearly a perfect solution, and not provide any service at all on port 80. That might make some people understand that the site is https only. Because if there is a redirect, users will rely on it without understanding that it creates some risks. (Like the possibility of redirection to another 'fake' domain with a valid https certificate.)
  • Some funny stuff for a change: I told my colleagues that we receive at least 5000 "hack attempts", aka failed logins, daily on any of our public Internet-facing servers. One of my colleagues just said to me: "Well, you're having such a ****** password policy, that maybe those are actually failed login attempts and not hack attempts at all." - It really got me laughing. Yes, passwords, especially long, complex and random ones, are painful for users. Here's the password of the day (opening and closing quotes aren't included in the password):"^j'lb#K-€3,<_úgWJdXå(n_6=41Bµ%cj!" Btw. Good luck guessing the password or finding it out using SHA-1 hashes or so. I know it's possible, it just might take a while. ;) p.s. This password still has less than 256 bits of entropy.
  • I made a minor fail, because I didn't remember that Windows Server 2008 R2 Datacenter edition doesn't refresh the scheduled tasks or service lists automatically. Pressing F5 is required. So I restarted one service a few times, wondering why it was always shown as starting, not running. Oops, it happens.
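As a rough illustration of the cookie flags discussed above, here is a minimal sketch using Python's standard http.cookies module. The cookie name "session", the value and the one-year HSTS max-age are assumptions chosen for the example, not details of the site in question.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id):
    """Return a Set-Cookie header value with Secure and HttpOnly set."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["secure"] = True    # never sent over plain http
    cookie["session"]["httponly"] = True  # not readable from JavaScript
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


header_value = build_session_cookie("abc123")
print("Set-Cookie:", header_value)

# HSTS asks the browser to use https only for this host from now on.
# The max-age of one year is an assumed value for this sketch.
hsts_header = ("Strict-Transport-Security", "max-age=31536000")
print("%s: %s" % hsts_header)
```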
  • Helpdesk processes: Had a long discussion with a fellow IT manager about internal (intra) & external (public / web) wikis, information & knowledge management, canned helpdesk responses, an automated first response informing about ongoing issues, etc.
    1. Inform customers immediately (automatically) about current known issues.
    2. Offer them FAQ / help, but if they ignore it, allow them to enter a ticket subject.
    3. Scan the knowledge base for answers to that subject.
    4. Suggest a few best-matching articles to the user.
    5. Allow them to leave the actual ticket.
    6. Show it to the helpdesk guy, with pre-selected canned responses if those would be suitable for the question.
    7. Always keep the customer posted about the ticket status.
    8. Preferably, especially in the B2B sector, a ticket isn't closed until the customer says so. If a ticket is stale for a while, the system should automatically send a message to the customer asking whether the issue has been resolved. A ticket really can't be closed "by default" after giving some semi-random answer to the customer.
  • Checked out a few cloud / mobile pos providers: Vend HQ, Imonggo, AirPos, Kachng! (Torex), POS Lavu, PointOfSale.net, Posterita Cloud POS, SquashPOS, EffortlessE, MerchantOS, ShopKeepPos, LivePOS and wrote (private) study about those.
  • Added "An Introduction to Programming in Go" by Caleb Doxsey to my Kindle.
  • Some very old stuff just for fun. I decompiled one PCBoard script and made some funny modifications to it. The administrators never found out about those changes. ;) Maybe people should checksum files, so they know whether those files are still the files they think they are.
  • Game / Java reverse engineering (or more like reverse/de-compiling). Well, I liked that PCBoard reverse engineering. Another nice story was reverse engineering one Java applet game. I decompiled the code and added my own class which contained hooks to key parts of the game. After that I could simply call any function inside the game whenever I wanted to. After adding some additional control code and timers, I could make my player receive points at will. My speech bubbles also had scroller features. (I just sent the message with an offset about 5 times / second.) One of the funny parts of this was that the game protocol wasn't too optimal. All buffers of the game server, and especially of users with slower connections, got absolutely flooded. So what if I receive one point every ms? - Well, that was fun as long as it lasted. The game developer added some checks to the protocol to prevent this. Oops, natural fail. He had problems with some users who had reverse engineered the messaging protocol. But I didn't do that, because I had decompiled the whole program. One of those friends who wrote their own client needed something like 15 minutes to add the checksum code to their program, and off they went. I was even quicker, because my own class actually extended the game class. So I didn't need to make any changes, just download the new game class locally and launch the game again using it. Well, I clearly had too much free time back then.
  • Quora fail: I have been wondering about this trend to app that, app it and actually app everything. I really don't get the point of millions of (more or less useless) mobile apps.
    Today it really got slammed right into my face. Quora doesn't work with a mobile browser, they require you to install their freaking Quora app. This is simply getting ridiculous!
    I remember some sites that worked perfectly on a desktop computer, but if you used mobile they required you to use their SMS service or something similar. This is at least as bad a trend. All the benefits of using the browser as a platform are totally lost. What do you think about it? Do you love installing an app instead of using a web browser for every freaking ridiculous site you have ever visited? I don't like it, I really really hate it. I don't want to install any crappy spy/malware apps. I just simply want to use their website. Or in this case, I don't want to use their services anymore at all. Do you know any other sites which are as ridiculous as Quora? No, this is not +1 for Quora, this is -1 for Quora and +1 for StackExchange.
    When we think a bit further: It's simply bad for usability, but it's even worse when we start thinking about security! If every person using mobile learns to install whatever app is pushed by whatever website, it's going to be a bad security & privacy problem really soon.
    p.s. Yes, I do know if I change the browser signature to Linux or Windows desktop browser, I can perfectly well use the site. Which makes my point of using crappy (unnecessary) app simply even more valid.
  • Blog FTP/FTPS/SFTP/SCP/NFS/SMB based integration best practices. I try to be very compact:
    1. Use temporary files and paths if possible when generating a file.
    2. Move ready files to the target path / rename files to the final name.
    3. Always check the file size after any transfer. If reasonably possible, check a checksum too.
    4. Retry if required.
    5. Always process independent files / transactions, do not do "random batching".
    6. If you don't have the luxury of using temp paths and file names, use file locking properly. If even that isn't possible, have a proper EOF flag in the file, so incomplete files won't get processed for sure.
    7. If a file wasn't properly transferred (for any reason), resend the data.
    8. Think of this process as a transaction: what has to happen (with positive confirmation) before you can proceed to the next step.
    These rules are very simple, but you guys won't believe how often programmers fail even with these super simple rules.
    The same transactional basics can and need to be utilized with more modern methods. Do not mark something as completed before it's really done.
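The first three rules can be sketched for a local target directory roughly like this. It is only an illustration under assumptions: the function names publish_file and sha256_of are hypothetical, and with a real FTP/SFTP transfer the size and checksum would have to be compared against what the remote end reports.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def publish_file(data, target_path):
    """Write data to a temp file, verify it, then rename it into place."""
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Temp file lives in the target directory so the final rename
    # stays on the same file system.
    fd, temp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # Verify size and checksum before anyone can see the final name.
        if os.path.getsize(temp_path) != len(data):
            raise OSError("size mismatch after write")
        if sha256_of(temp_path) != hashlib.sha256(data).hexdigest():
            raise OSError("checksum mismatch after write")
        # Readers polling the directory only ever see a complete file.
        os.replace(temp_path, target_path)
    except Exception:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

A consumer should then ignore anything ending in ".tmp" and pick up only the final names.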
  • As an example of how to fail at what I just mentioned above: Generate a list of files to be processed, let's say "files*", then process all of those files. When done, generate the list of files "files*" again and delete those. Well, isn't it reasonable? We processed files* and now we delete files*. Well, fail. What if files were added to that path during your processing? Yes, I have seen that in live production. What a fail.
    It's funny that engineers fail that test, but small children do not. Say I put some gummy bears on the table and let a kid start eating them, and during the process I add 10 more gummy bears to the table. When he has eaten the original bears, I ask him if he has now eaten all the bears. Do you really think that the kid would say yes, I have? Nope.
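The fix is to take the file list once and delete only what was actually processed. A minimal sketch, where process_and_delete and the handler callback are hypothetical names:

```python
import glob
import os


def process_and_delete(pattern, handler):
    """Process the files matching pattern and delete exactly those files."""
    # Take the snapshot once; files arriving later are left for the next run.
    snapshot = sorted(glob.glob(pattern))
    processed = []
    for path in snapshot:
        handler(path)      # may raise, in which case the file is kept
        os.remove(path)    # delete only the file that was just processed
        processed.append(path)
    return processed
```

The glob pattern is never evaluated a second time, so a file that shows up during processing cannot be deleted unprocessed.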
  • I complained about Filezilla's poor FTPS performance a few months ago. Guess what, they have released fix for that. Excellent! Now everything is working as expected.
  • I did some security auditing for one company. They had installed most of their systems using the default username and password, and the servers were directly accessible from the internet. This is incredible, are we still living in the times of security through obscurity? I thought this was old news even in the year 2000, but it's still happening.
  • Other non-IT tech stuff: space launch using a ram accelerator and non-rocket spacelaunch. Afaik the ram accelerator is a super high tech and cool solution, still being viable when we have well-working scramjets.
  • Uber Taxi is an interesting technological development in the taxi sector. Instead of really expensive taxi systems, they replace everything with a modern mobile phone. This is technological revolution at its best.
  • A really nice post about Python dictionary basics.
  • When I read the news about thousands of SCADA devices being accessible directly over the internet, my first thought was: Well, they've got many honeypots. - I really hope I'm right with this thought.
  • Seeding data with fake entries, so it's easy to spot if information has leaked. It's nothing new, Canary Trap is just more advanced method than Mountweazel.
    As example I personally provide unique email address to every service I ever give email address to. It's very easy to see, if they leak it. Most disturbing case I have this far encountered is receiving spam to email address only given to one investment banking company. I'm 100% sure I haven't ever given it to any other site. This means that either my server was exploited, they leaked the addresses on purpose, or someone stole their customer base with email addresses.
    I can recommend studying the field of counterintelligence in general, there are lots of interesting methods and stories.
    See: False document
  • I know my signing key (1024D/274EF626 - 1024 bit DSA) is not up to the recommended strength. I have earlier said that I'm waiting for the ECC upgrade and will then start using it. Otherwise I recommend using 4096 bit RSA encryption and 4096 bit RSA signing keys.
  • How to enable HTTPS secure encrypted SSL connections for LinkedIn:
    Click your Own name -> Setting -> Account -> Manage security settings -> When possible, use a secure connection (https) to browse LinkedIn - Check -> Save changes.
    So if you haven't done it yet, do it now.
  • Well, I still hear people talking about IPv6 NAT. No no no, it will reduce the usability of features like power saving. Keepalive traffic is absolutely pointless. When a TCP/IP stream is open, it's open, it shouldn't require any keepalives. Sending keepalive packets every 30 seconds or so just consumes power on mobile devices. When data is sent over the connection and the remote end isn't there, the connection will die and that's it. Most systems default the keepalive time to two hours, most NAT devices default to much less. Also the benefits of global reachability are lost.
  • Brython (1.0.20130111-000752)
    print(int(0.5)) -> 0
    print(round(0.5)) -> 1.0
    Python 3.2 (Win64 bit r32:88445)
    print(int(0.5)) -> 0
    print(round(0.5)) -> 0
    Ouch! These kinds of differences can make apps behave in really surprising ways. Well, as we know, naturally the floating point implementation being used affects this issue too.
    Reminds me of: Write once, test everywhere - the Java approach.
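For background: in Python 3 the built-in round() rounds exact halves to the nearest even integer and returns an int when called with one argument, which explains the 0 above. When round-half-up behaviour is needed regardless of the runtime, one option is to state the rounding mode explicitly with the decimal module. A minimal sketch, where the helper name round_half_up is made up:

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to the nearest integer, with exact halves going away from zero."""
    # str() keeps the short decimal representation of the float.
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


for sample in (0.5, 1.5, 2.5):
    # Built-in round() uses round-half-to-even in Python 3.
    print(sample, round(sample), round_half_up(sample))
```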
  • Spent one day studying Windows PowerShell. Haven't been using it too much. Usually I have cmd file which calls my Python scripts (which might call for os.system for some sub functions), but PS can be really handy directly for many things.
  • Read an article about Dart.
  • Thought more about multi-core vs many-core thinking. Yes, with multi-core something like pipelining process steps could work well, but with many-core it's not a viable option anymore. Like I have said earlier, I often use process pools and simply chunk the data to be processed into slices. Naturally this won't work with all kinds of workloads, but for me it has worked well this far.
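The process pool plus slicing approach can be sketched like this with the standard multiprocessing module. The workload (sum of squares), the function names and the default chunk size are placeholders, not anything from my real jobs.

```python
from multiprocessing import Pool


def chunked(items, chunk_size):
    """Yield consecutive slices of items, each at most chunk_size long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def sum_of_squares(chunk):
    """Example per-slice work; runs inside a worker process."""
    return sum(value * value for value in chunk)


def parallel_sum_of_squares(items, chunk_size=1000, processes=None):
    """Split items into slices and process the slices in a process pool."""
    with Pool(processes=processes) as pool:
        partial_sums = pool.map(sum_of_squares, list(chunked(items, chunk_size)))
    return sum(partial_sums)
```

In a script, call parallel_sum_of_squares(list(range(10000))) from inside an `if __name__ == "__main__":` block, so that worker processes can import the module safely.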
  • Tempmail.eu temporary email forwarding service lacks reverse DNS name for their outgoing smtp mail server. - Fail. This is easy to fix, but they simply haven't done it.
  • I would have liked to write a longer article about national relations and the locations of national domains' DNS servers, but I don't have time for that analysis right now. I have been doing quite many checks and I know there are political ties. Just check out where the DNS servers are for .fr, .de, .nl, .ru, .jp, .ch, .tw etc. It's interesting. Why doesn't Taiwan have their top DNS servers in China? Why aren't the .ru DNS servers in the US? Why are all .eu DNS servers in the EU area? etc... It could be interesting to make a complete political world analysis based only on this information.
  • Quickly studied basics of GlusterFS.
  • CipherSaber is a very simple but working cipher implementation. Assuming you have pre-existing RC4 cipher code.
  • For a change studied water turbines, wind turbines, and ship propulsion systems designs for a while.
  • WebSockets will allow an easy way to offer VPN services over HTTPS. I guess this will end silly politics to think that using some services can be blocked based on IP or TCP port number.
  • There has been some news about new covert messaging applications. As mentioned earlier, any service which allows you to store key-value data can be used as a proxy to deliver data. So all kinds of DHT networks, DNS services etc, whatever, can be used as a data relay, as long as "arbitrary" values can be sent over the connection. Even if the values would be very limited, using a large number of keys or updating the value for a key often enough allows data transmission, just at a lower data rate. Some people still do not understand that allowing even plain DNS usage allows me to communicate exactly anything in and out of their network.
  • I started to write an article about Bandwidth Hog. It's a UDP protocol that is designed for high packet loss / high latency networks. Retransmission is simply super aggressive and there isn't any kind of window limiting data transmission. Actually the HS/Link protocol worked like this. It used an infinite window, and packets that got lost were sent again later. This allowed the protocol to keep sending data out at the defined data rate at all times. Some data got lost, some got through, so what? The lost parts were sent again later. This would allow the protocol to "steal" bandwidth from applications using TCP connections, which slow transmission down in these situations. For this protocol it doesn't make any difference if there is 30% packet loss and 30 s (yes, seconds, not milliseconds) latency on the link. I have been thinking about this for a long time, especially when using congested, slow shared connections. Basically the same kind of results could be achieved using a TCP connection splitter, where let's say 300-1000 TCP connections are used in parallel to transmit data over the link. - Well, I think the key point was in this post already, so I won't bother writing any longer story about this. Because some limits are always required, those could be things like packet loss or latency. Let's say that the protocol is tuned so that it transmits data faster until a 20% loss ratio is reached. This would make it way faster than parallel TCP connections using exactly the same link / route.
  • Well, this was mostly new stuff. I couldn't manage to shorten my backlog at all. Maybe some other week I'll get that done.

Google App Engine (GAE) Gotchas and How To Avoid Them

posted Jan 13, 2013, 12:49 AM by Sami Lehtinen   [ updated Feb 10, 2013, 4:03 AM ]

Here I cover a few common gotchas which snag new developers on the Google App Engine PaaS platform.

Limited processing time

Every request has a really limited processing time. I have seen many web apps run user requests lasting tens of minutes or even hours, which is a very bad approach. Return user queries quickly and, if required, poll for results later. Also see Task Queues and Database update frequency limitations.

Can't write to local file system

Even though files uploaded with the service are accessible, you can't write anything to the local file system. You have to use alternative storage methods like Google Cloud Storage, the Google App Engine Blobstore or something else. If you're used to writing local temp files or modifying static files, that won't work anymore.

No database locks

Google High Replication Datastore does not provide any kind of database locking, it provides only transactions. Lock-free operation is one way to make sure that no process is holding locks and blocking the application's global progress. It simply means that your transaction is either successful or not. You can't do traditional locking. This is also beneficial, because now one process can't lock the database for extended periods of time. Every entity has a version id, so in most cases I try to run most of the processing outside the transaction. Read data, process it, then start a transaction, check the data version and commit. If the data version is no longer what you started with, restart the process. To update data in multiple entity groups, use cross-group (XG) transactions.
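The read, process, check version, commit loop described above can be sketched with a small in-memory stand-in. This is not the App Engine API: VersionedStore, VersionConflict and update_with_retry are hypothetical names that only illustrate the pattern.

```python
class VersionConflict(Exception):
    """Raised when the stored version is not the one the writer started from."""


class VersionedStore:
    """In-memory stand-in for a datastore with per-entity versions."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        # Plays the role of the short transaction: check version, then write.
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)
        self._data[key] = (current_version + 1, new_value)


def update_with_retry(store, key, compute, max_attempts=5):
    """Do the slow work outside the transaction and retry on conflict."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        new_value = compute(value)  # slow processing, no lock held
        try:
            store.commit(key, version, new_value)
            return new_value
        except VersionConflict:
            continue  # somebody else updated the entity, start over
    raise VersionConflict("gave up on %r after %d attempts" % (key, max_attempts))


store = VersionedStore()
print(update_with_retry(store, "balance", lambda old: (old or 0) + 10))
```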

Database index latency

Database indexes (for separate entity groups / models) are not guaranteed to be up to date. This means that whenever you run a query, you have to check the values again when processing the data. So if you ask for records where x=1, you might get records in that query where x!=1. So check the records before actually processing them. This also means that you can't run cross-entity queries in transactions. To avoid this problem, you can store data in larger entity groups and then use ancestor queries. This strategy also has its own drawbacks, which are covered a bit later.

Ancestor queries

Ancestor queries are queries which are run within one entity group and therefore can be run in a transaction. These queries do not return stale data. The main problem with using larger entity groups is that the entity group is the atomic processing unit of the database.

Database update frequency limitations

Because the HRD datastore is distributed across several data centers, there are internal latencies within the datastore. In practice this means that each entity group can only be updated about once per second. If you have one entity group which contains only one item, like visitor_counter, and you try to increment it on every page load, it's going to fail for sure. All your processes get stuck in "run in transaction" mode, which by default is tried three times if the transaction fails. Because roughly only one of the parallel tasks can successfully complete that transaction per second, all the others are doomed to fail. To avoid this problem you need to use sharding. So if transactions fail, you simply add a new shard to the counter. You end up in practice with visitor_counter_#, where # is the shard number. When you need to update the visitor counter, you update a randomly chosen shard in a transaction. Whenever you need to read the visitor counter, you read all entities from the model and sum them. For better results you can cache that in memcache for one second, or a bit longer if your approach allows it. Whenever run_in_transaction fails, meaning three failed update attempts, just add a new entity group (shard) to the model, and then your app is once again able to handle more traffic. Not handling this case properly is a very common and sure way to fail.
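The sharded counter idea, reduced to a minimal in-memory sketch. On App Engine each shard would be its own entity group updated in a transaction; here a plain list stands in for the shards, and the class and method names are hypothetical.

```python
import random


class ShardedCounter:
    """Counter split into shards so that writers rarely hit the same one."""

    def __init__(self, shards=4):
        self._shards = [0] * shards

    def increment(self, amount=1):
        # Pick a random shard; on the datastore this would be one small
        # transaction touching a single entity group.
        index = random.randrange(len(self._shards))
        self._shards[index] += amount

    def add_shard(self):
        # Called when updates keep failing: more shards, more write capacity.
        self._shards.append(0)

    def shard_count(self):
        return len(self._shards)

    def total(self):
        # Reading means summing all shards; cache the result if possible.
        return sum(self._shards)


counter = ShardedCounter(shards=4)
for _ in range(1000):
    counter.increment()
print(counter.total())
```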

Query result limit

Any query will return a maximum of 1000 records. Therefore you might need to repeat the query several times. Using an offset is problematic because non-ancestor queries can't be run inside a transaction, so you might miss a few records or get the same records several times if entries are inserted or deleted during processing.
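One way around the offset problem is to page by key: remember the last key seen and ask for records after it. The datastore also offers query cursors for this; the sketch below only shows the idea with an in-memory list, and fetch_page and iterate_all are hypothetical helpers.

```python
def fetch_page(records, after_key, limit):
    """Stand-in for a query ordered by key: rows with key > after_key."""
    page = [row for row in records if after_key is None or row[0] > after_key]
    return page[:limit]


def iterate_all(fetch, limit=1000):
    """Yield every record by repeating the query, paging on the last key."""
    last_key = None
    while True:
        page = fetch(last_key, limit)
        if not page:
            return
        for row in page:
            yield row
        # Continue after the last key seen, not after a row count, so
        # inserts or deletes in earlier pages do not shift the window.
        last_key = page[-1][0]


records = [(number, "row %d" % number) for number in range(2500)]
rows = list(iterate_all(lambda key, limit: fetch_page(records, key, limit)))
print(len(rows))
```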

Inequality filter limitations

You can't have multiple inequality filter rules in one query. Queries like select * from mytable where property_a>1 and property_b<100 order by property_c; simply can't be done. Composite indexes can solve some of these problems, but usually it's not as simple as it is with most SQL databases.

Task Queues

If possible, do not update all data on client request. Just record the absolute required minimum to successfully execute the task later. Then spool rest of updates and processing to task queue. This allows user requests to return quickly.

Slow database access


High Replication Datastore is "really slow" compared to traditional local data stores. Therefore it's really important to cache data and avoid traditional database normalization. Whenever possible you should get all data from the cache, or in the worst case pick it up by running just one select and reading a few records. See caching.

Caching data & output

If possible, you should store the full output for the request in memcache, which means that no output processing is required at all. Simply check if the page URL is in the cache and return the result for it. This is a very beneficial method, especially for public pages, even if it is done only for a short time interval. Also don't forget to utilize the browser cache.

See my older post about Google App Engine and Caching.
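A minimal sketch of whole-page output caching with a short time-to-live. A plain dict stands in for memcache here, and OutputCache, get_or_render and the render callback are hypothetical names.

```python
import time


class OutputCache:
    """Cache rendered pages by URL for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # url -> (expires_at, body)

    def get_or_render(self, url, render):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, no output processing at all
        body = render(url)   # cache miss, do the expensive work once
        self._entries[url] = (now + self._ttl, body)
        return body


cache = OutputCache(ttl_seconds=1.0)
print(cache.get_or_render("/front", lambda url: "<html>%s</html>" % url))
```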

Vendor lock-in

Beware of the vendor lock-in trap. When you create something on Google App Engine, always use your own (or some other) abstraction layer (API) between your application and the actual Google API. This allows you to use alternative services to provide data and communication for your application if and when required. Without this abstraction layer, you need to modify almost all of your code if you're not running your services on the Google Cloud platform anymore. With this layer, you only need to modify the layer, and you can start using an SQL database instead of Google High Replication Datastore etc.
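What such an abstraction layer could look like, as a minimal sketch: application code talks only to a small interface, and the backend behind it can be swapped. KeyValueStore, MemoryStore, SqliteStore and record_visit are made-up names; a datastore-backed class would implement the same two methods.

```python
import sqlite3


class KeyValueStore:
    """The only storage interface the application code is allowed to use."""

    def get(self, key):
        raise NotImplementedError

    def put(self, key, value):
        raise NotImplementedError


class MemoryStore(KeyValueStore):
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class SqliteStore(KeyValueStore):
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

    def get(self, key):
        row = self._db.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def put(self, key, value):
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, value))


def record_visit(store, page):
    """Application logic: knows nothing about the backend in use."""
    count = int(store.get(page) or 0) + 1
    store.put(page, str(count))
    return count
```

Moving to another platform then means writing one new KeyValueStore subclass instead of touching every call site.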

Scalability, performance, reliability

There are also inherent pros to the Google App Engine solution: it allows virtually unlimited scalability when required, without any changes, provided that your app is designed and implemented correctly. This means that it must not create unnecessary bottlenecks, like updating the same record with every request, which is a 100% sure way to cause a failure. Another thing is that your servers and platform are run by excellent guys.

I personally would select Google App Engine especially for quite simple programs which require a reliable database, no data loss after commits and high scalability on demand. Workloads with continuous fixed capacity are cheaper to run elsewhere.


2013 assorted stuff

posted Jan 6, 2013, 7:56 AM by Sami Lehtinen   [ updated Jan 20, 2013, 5:49 AM ]

To celebrate the beginning of the year 2013 I have this little assorted post. This is not even nearly everything. I could have written several posts about each topic covered in this post. I'm just sorry I don't have time for that. This is just a web log of stuff I do, I really can't cover all the details.

  • Finished reading Innovator's Dilemma. I can see many similarities with the business I'm currently in. How things change, and how everything new is impossible or hard, and things should be done as they have always been done. Well, maybe it's time to change things, even for the better?
  • The Veikkaus.fi web site is finally allowing longer passwords, up to 32 chars including special characters. Earlier their policy only allowed 8 chars from a-z, A-Z, 0-9, which is a bit of an outdated policy afaik.
  • Reminded myself about HTTP header fields (RFC2616). Meanwhile, in the process, I found out that really many sites give out conflicting information in the Cache-Control header. Private means that data should be cached, but only for the user. While no-cache means that data shouldn't be cached. So why does a site then define private and no-cache simultaneously? Also some browsers require usage of must-revalidate. Why? If the no-cache field is given, why should cached data be revalidated when there is no cache in the first place? Of course we have the Expires field and the old Pragma and all the tricks involved with those. But basically Cache-Control: no-cache should be enough, and the Expires field can be set to 0 or point to a past point in time, but it's not required.
  • Read a hugely long discussion about CloudFlare. The point of the discussion was whether CloudFlare makes your site faster or not. Well, basically CloudFlare works as a caching proxy for resources which are cacheable, but for non-cacheable resources it just adds delay. Therefore it's really important to be smart about what is cacheable and what's not, and especially not to declare resources that are cacheable to be non-cacheable, like Google does.
  • Worked hard with one minor project to get it off the blacklists. Let's say that the abuse control policy wasn't best possible. Domain got blacklisted and now I had to fix first the app and then get it removed from the lists. It can be very tedious work.
  • Limited the number of parallel (concurrent) HTTP sessions in my browser to improve performance. I'm now living in a rented flat and using a ridiculous (compared to my normal fiber connection) 3G connection. After limiting the number of parallel connections everything seems to be working much better, because there aren't too many TCP connections competing for this bandwidth. I also might tune down my TCP stack settings, which are absolutely maxed out for the fiber connection. (Like initial TCP RWIN 64k etc.)
  • Read a few posts wondering when IPv6 will really kick off. Here's one. Well, I have to say that I have fully working native IPv6 connectivity, and I'm happy with it. I'm really not using NAT. It seems that many people do not understand network interfaces at all. They also do not understand the benefits of binding services to only selected interfaces.
  • Just found out that my 3G internet provider doesn't block outgoing TCP port 25 at all. That's quite interesting, I assumed that all consumer grade network connections block port 25 access, because even many business network connections block it unless you make it clear to the provider that you don't want it to be blocked.
  • Encountered an SQL gotcha when wondering why one app didn't behave like it should have. It seems that SELECT key,value FROM data WHERE value!=0 leads to interesting results. It returns only rows where the value is non-integer. So from 0 1 1.1 2 2.1 3 3.3 only the values 1.1, 2.1 and 3.3 get selected. If I want it to return all non-zero values, it requires adding a decimal to the query. So let's put 0.0 instead of 0 there, and then it behaves as expected. It's always good to know your systems thoroughly, otherwise you'll find yourself having these kinds of strange problems.
  • Reminded my self about WiFi (WLAN) stuff: WDS, roaming (controlled by client), SSID, frequency allocation, encryption modes, etc. Especially for situations when you have larger set of base stations. One of main points was not to use WDS unless it's absolutely required, it'll just make your network slow or superslow in case of several base stations.
  • We all did read news about fake google.com SSL certificates. As I said, users should first authenticate the site, before users authenticate to the site. There's no reason to trust broken SSL at all.
  • Read interesting article about Tux3 FS. They sure have some interesting ways of solving things out. Reminds me about database transaction journalling and SQLite3 WAL mode.
  • Wondered about SLIC BIOS, Sysprep and cloning Windows 7 and Windows 8 machines which use SLIC BIOS and special Windows licenses. Found one Russian hacker app which allows reading and decrypting all required data from the BIOS, allowing cloning of those machines and even re-entering the product key. Basically I don't like the basic concept of embedding licensing information in the BIOS. This is just like the secure boot, which could prevent loading any operating system other than MS(tm) or so.
  • Studied Windows server / workstation Sysprep for cloning in a little more detail.
  • Updated passwords and email addresses for most sites. It's quite a painful process. You never know which site will accept what kind of new password etc. Some sites allow only short passwords, some sites allow any kind of UTF-8 passwords without any problems etc. It's also hard to find where to change the password on some sites, and some sites simply do not properly confirm whether the new password was accepted or not. Overall quite a crappy user experience. One of the crappiest sites (a bank, btw) didn't accept a password which is non-numeric and longer than 6 digits. (Duh)
  • One nice article about bluetooth security. I just wonder what kind of fails there will be with the Near Field Communication (NFC) stuff.
  • Read a nice article, "How I run my own DNS". Nothing new really, it was just fun to read it. I just wonder why he's using Postfix to forward mail to Gmail. I run my own Postfix because I don't want my mail to be forwarded to Outlook or Gmail or any other huge data warehouse.
  • HTTP headers from my web-server:
    HTTP/1.1 200 OK
    Accept-Ranges: bytes
    Cache-Control: max-age=86400
    Content-Length: 79318
    Content-Type: image/png
    Date: Mon, 31 Dec 2012 20:35:59 GMT
    Etag: "1ff11-135d6-4bfd1068fed7c"
    Expires: Tue, 01 Jan 2013 20:35:59 GMT
    Last-Modified: Sat, 12 May 2012 06:33:06 GMT
    Server: Apache/2.2.22 (Ubuntu)
    Strict-Transport-Security: max-age=86400
    x-mod-spdy: 0.9.3.3-386
    X-Firefox-Spdy: 3
  • Closed a few (small traffic) web forums and replaced those with newly created Google+ Communities.
  • NitpickerTool.com finally a proper spellchecker. Maybe I should use it too? Just for lulz, they fail... The image on front page http://nitpickertool.com/resources/img/overview.png sends HTTP response header Cache-Control: no-cache. It's totally crazy and pointless. Usually I don't even notice these things, but I'm now using narrowband 3G instead of 1 Gigabit/s full duplex fiber connection. It's so very easy to spot crappy web-sites from the sites where admins know what they're doing. This is perfect example. (btw. Google Engineers do not know either what they're doing, I'll get back to this topic). - Btw. They now have fixed it.
  • Reminded myself about the current Secure Boot UEFI issues with Free Software, and how that's possibly going to affect several Linux distributions unless SB can be disabled by the user.
  • Studied and tested mod_pagespeed. Even so, I didn't leave it enabled.
  • Wondered why FB sends B*S* information about users to other users. I'm 99% sure that the user isn't following the pages FB claims her to follow. I also later confirmed my suspicion and I was right. The information which FB sent, was based on my profile, but they claimed it was based on her profile. I think that's just plain deception.
  • Checked out and played with Ninchat. Didn't like their Flash video chat, but otherwise it's a nice web chat.
  • Watched Indie Game: The Movie. When the stakes are high, it seems that people can get a bit messed up in their thoughts. Luckily all the businesses covered in that documentary did well after all.
  • Military stuff, read whole articles from Wikipedia: NASAMS 2, AIM-120 AMRAAM, MBDA Meteor, Iranian submarine force (Kilo (Russian) & Quadir/Ghadir (Iran) & Yono (North Korea) class), supercavitating torpedoes (Hoot & Shkval) and Iron Dome (Israel) antimissile systems.
  • Studied Missile Development project control goals. Strict requirements for features that have to be demonstrated before proceeding any further with the project. Great way to enforce a strict Proof of Concept.
  • Studied the pricing of one service provider. I got really upset, because their pricing was really confusing. It didn't cover edge cases at all, and the interactions of different mix and match solutions were also missing. People who make price lists should really think through the cases, instead of just listing prices. It's just like doing my daily ERP integrations: way too often I get a technical description of some API and then the question how much it would cost to do this. Well, it's great, we now know that we can REST, but all the details required to generate the actual payload data are still totally missing. "Hi guys, how much does a car cost?"
  • Laughed at the Jysk Xmas website. They are using Azure, but it doesn't help at all. They send daily emails asking quiz questions and then users rush to the site to answer those questions. Guess what, the site is absolutely jammed (over 30 seconds / page load) when those mails arrive. At least they were smart enough not to run this xmas campaign on their main web servers, because I assume the situation would have been even worse in that case. If Google App Engine had been used, I'm sure it would have performed well. (I'll be posting a Google App Engine gotchas post a bit later.)
  • Other stuff I have read about lately: Big Data, MapReduce, Hadoop. Well, isn't that just tabulating data? Makes me smile! Check out Tabulating Machine. Overall System Security, Data Erasure Procedures, Enterprise Licence Management, Startup Tech Companies. Hyper-convergence products: Nutanix Complete Cluster, Scale Computing HC^3 (Hyper-Converged Compute Cluster) and Simplivity Omni-Cube. Storage systems: Hitachi Data Systems, NetApp, EMC, Cisco Systems, VMware, VCE Vblock. Storage Networks (SAN). Storage Computing: Nimble Storage, Tintri and Astute Networks. Cognos, Sales, CRM, ERP, Data Analysis, ePOS, POS, mPOS, Point of Sale, Customer Loyalty Systems, Store Chain, Big Box Super Market. BigQuery, Data to Intelligence (D2I), "Volume, Variety, Velocity", Open Data, Stream Computing, A/B testing, Continuous Deployment, Software Defined Network (SDN), OpenFlow, European Public Sector Information ESPI Platform, Public Sector Information (PSI), Open Knowledge, PoS CRM sales data analysis on customer level.
  • I'm mentoring one really young but still promising nerd. He has done all the server stuff I have done (VPS, Linux installations, web servers, DNS, mail server (smtps, imaps, webmail), starttls with authentication, earlyssh, full disk encryption, SSL / HTTPS), some PHP code etc, and he's only 13 years old.
  • IPTV service TVkaista is under threat in Finland, they're facing charges.
  • Needed to check out an interesting Linux distribution called Mageia. I think Ubuntu is going in the wrong direction, which is also the reason why I gave up Windows years ago.
  • I visited The Next Web / ArcticStartup Meetup at Teatteri Club, Helsinki (13.12.2012).
  • Read article: 14 big trends for 2013: Preemptive health care, Predictive data analytics, Algorithmic censorship and algorithmic transparency, Social coding, Liquid data and Personal data ownership.
  • Well, one company I have been visiting works on exactly this "Choice Engines" sector, and they're currently having problems with system performance and with managing all the data they're collecting from users. Here's a nice article about the topic. The question remains, how can you avoid being included in these huge datasets?
  • I was in Egypt, where I learned what it is to have super narrow bandwidth. This current 3G connection is a luxury network connection compared to that. Very early in the morning it was possible to get 100 kbit/s (about 10 kbytes/s data rates), in the afternoon not even half of that. Usually packet loss was so high that it was totally impossible to use any services, because timeouts hit before requests got processed. This is a major dilemma. In many countries with good network connectivity, admins assume that these "slow connections" are some kind of DDoS attack, where the client just opens sockets and keeps them reserved without doing anything. But that's not true. Getting the SSL (HTTPS) negotiation done just might take 30 seconds or more. If the server hasn't received the request in 30 seconds, it doesn't mean there is anything wrong with the connection. On my servers I have set many timeouts relatively low, because I have full 1 Gbit/s connectivity between home and server and less than 1 ms latency. Then it's a huge surprise to find out that there is 500 ms round trip latency and the bandwidth is, well, a bit less than what I'm used to. Timeouts and limits are a really tricky issue.
  • Read a long and excellent posting about atomic transactions and what could go wrong. It was hilarious, getting things right is much harder than anyone could even imagine before trying it in reality. No, it's not ok to think only about your own layer. There are multiple layers involved: disk drives, controllers, drivers, operating system, several layers of caching etc. It's really hard to make sure that an atomic transaction is really committed successfully. In some cases it's even impossible, because the hardware, file system or operating system could hide from your app the fact that the data isn't yet permanently persisted. So you think your system is working? Just run high write transaction loads and then randomly yank the power cord. Did something get messed up, is the system state still consistent? If not, well, now you have failed, fix it.
  • Finished reading Blog Hypnosis for beginners book... Very important psychological stuff for sales guys. Yes attitude to lower resistance etc. It seems that many Egyptian sales guys follow these base rules.
  • Google Sites seems to use no-cache and no-store (as well as noarchive) headers for hosted images. It's absolutely painful to see every image loaded again on every page refresh. I sent them feedback and I'm now really hoping that they will fix this, because the way they're doing it now, it's totally crazy. I also keep wondering why they provide an ETag for content which is not allowed to be stored or cached. "no-cache, no-store, max-age=0, must-revalidate", ETag:"1355247003982", X-Robots-Tag:noarchive. Google also seems to be very worried about web robots archiving their content. Well, I guess they know how much content from other sites they cache.
    Full set of headers from my web page, which I wasn't too happy about:
    HTTP/1.1 200 OK
    Content-Type: image/png
    X-Robots-Tag: noarchive
    Cache-Control: no-cache, no-store, max-age=0, must-revalidate
    Pragma: no-cache
    Expires: Fri, 01 Jan 1990 00:00:00 GMT
    Date: Wed, 26 Dec 2012 11:09:25 GMT
    Last-Modified: Mon, 09 Aug 2010 16:50:23 GMT
    ETag: "1281372623009"
    Content-Length: 119297
    X-Content-Type-Options: nosniff
    X-XSS-Protection: 1; mode=block
    Server: GSE
  • Studied features of the Linux 3.7 kernel. ARM 64-bit support, nice. Server side TCP Fast Open, way cool. Btrfs hole punching is a really nice feature. Now you can "resparse" files by deallocating segments of a file when space inside the file is freed, without the slow and resource consuming copying of data to a new sparse file. The option to disable Copy-On-Write (COW) is also nice, even if it might weaken data durability a bit (?).
    Supervisor Mode Access Prevention (SMAP), nice. Yet another layer in the layered security approach. I just wonder how many layers we actually need. See Intel documentation. JFS also added TRIM support. Many changes to IPv6, NAT and Netfilter. I really hope that nobody wants to use NAT with IPv6.
  • Had some light vacation reading, finished: Getting Things Done For Hackers (GTD guide), The Trading Profits of High Frequency Traders (High Frequency Trading, HFT) - Highly Profitable and Secure business with Sharpe ratio of 9.2.
  • I'm using multiqueue GTD and it works very well. I never "forget" anything. Then it's just another question how to get the tasks on the list done, especially the tasks I hate.
  • Well, a long day: we left for Cairo at 4:20 and got back to the hotel at 21:34. Now I have seen the great Pyramids and the Sphinx, while getting sand blasted. I've been inside one pyramid too. We ate at the Blue Nile river barge restaurant. A friend of mine bought a painting of a Scarab (The Beetle) on Papyrus. Street sales guys are incredible pests. I'm not buying anything, I don't have money, and yet they continue bombarding you with offers that get more and more ridiculous. I guess you have to have a really dodgy car to drive in Cairo, traffic was quite funny compared to ours. I'm sure insurance companies are happy to provide full CDW cheaply, NOT. Even boarding procedures were completely strange. It just seems that they're not able to organize things efficiently there.
  • Studied more messaging, authentication and encryption(?). Is chaffing and winnowing encryption at all? No, I think it's a message authentication scheme, which allows you to just send tons of messages so that only the recipient knows which messages are fakes and which ones are real. So using this method, you can communicate confidentially, in a way, without using any kind of encryption. See: Null cipher, Steganography (Concealed messaging).
  • Added to Kindle: The Checklist Manifesto: How to Get Things Right by Atul Gawande, The Signal and the Noise: Why So Many Predictions Fail-but Some Don't by Nate Silver, Automate This: How Algorithms Came to Rule Our World by Christopher Steiner
  • Played with Snappy compression. It seems to be working quite well with the huge XML data files which I'm dealing with daily. See Google Snappy code page.
  • Quickly checked out Blake2 hashing and Blake2*p, which is the fully parallelized version.
  • ROR Sitemap, ROR (Resources of a Resource) is a rapidly growing independent XML format for describing any object of your content in a generic fashion, so any search engine can better understand that content. RORweb.com is the official ROR website. - I really dunno what I would do with this, a traditional sitemap or url list is just good enough.
  • In Egypt they honestly said 3.75G network, in Finland mobile operators claim DC is 4G even if it isn't.
  • Lightly studied CDN networks: CloudFront, ChinaCache, CacheFly, and of course we're all familiar with old big ones like Akamai, LimeLight, Level3 and EdgeCast. For some reason CacheFly often seems to be pretty slow to Finland.
  • Studied efficient email server design: maildir, mbox, procmail, and dovecot caching and data storage solutions in detail. Btw. Afaik dovecot is doing a good job with caching and storage. Sdbox, mdbox, dbox, imap cache, automatically updating caching options per mailbox based on client requirements etc. Afaik this is stuff done right. Utilizing memory cache, cache file, two tier storage, automatic cache optimization. Locking minimization, fsync options, disk io minimization, efficient storage file size (dbox), not too small, not too large etc.
  • Checked: Sieve script mail filtering with Dovecot
  • Read a nice crypto article, 7 codes you'll never ever break. Afaik, it's bad that there are mistakes in the Kryptos statue, but of course in the real world it's also possible that when something is encoded, it's not getting encoded (encrypted) correctly. I just obfuscated my master key passwords using light paper & pen crypto and I have to say, I had to verify encryption & decryption three times / passwd, so I'm sure it's correctly encrypted. I do not ever store plain text passwords anywhere. The level of obfuscation / encryption depends on the security level requirement for that password. Because passwords are random nonsense to begin with, they won't give you easy hints about when they're correctly deciphered, so the encryption doesn't usually need to be very strong. Basic ECB works quite well, even if that's not what I'm using (maybe).
  • These Egyptian guys at the hotel didn't know how to use a Sauna properly. They didn't throw water on the hot stones at all. The sauna was a totally incorrect and unbearably dry hot room, until we fixed a few things.
  • A friend of mine made an interesting project using MongoDB. This is yet another temporary email service, but it was mostly a study project. See: my10minutemail.com I just wish he would blog more about the stuff he does. He has really thought about the details: how to prevent message loss, how to handle message bounces correctly, how to store data to the database etc. I have seen so many services which are in production but haven't really paid enough attention to technical details. One of these services which fail is boun.cr. Their outgoing mail server doesn't have a proper reverse DNS record. That's a major fail and leads to a situation where many receiving mail servers reject forwarded messages.
  • Played with QR codes and also studied the encoding method, error correction etc. I have to say that I don't like many things I see, but QR codes are well done.
  • Reminded myself about the NUXI problem and the differences between little and big endian systems.
  • Checked out release documentation of App Engine 1.7.4: Expanded EU Support, Task Queue statistics, Traffic splitting (between application versions), LogsReader and Logs API, makes analyzing logs a lot easier.
    Expanded Datastore query support, DISTINCT queries. DISTINCT returns a distinct set of results, just like you can do in Python by forming a set from a list: set([1,2,3,2,4,3]) returns {1,2,3,4}, or simply in the Python shell {1,2,3,2,4,3} returns {1,2,3,4}. Distinct queries are based on an index, so it's a pretty good way to get the results. Without the distinct feature it would be better to normalize the data, except that Datastore doesn't support joins, so that might actually make things pretty slow, unless you're caching all normalized fields.
  • Just some fun:
    JavaScript:
    0==false   // true
    0===false  // false
    1=="1"     // true
    1==="1"    // false
    Python:
    1==True      # True
    1 is True    # False
    type(True)   # bool
    True is bool # False
    That's all basic stuff, we should all know it.
  • Python performance play:
    Each if statement is executed 10 million times.
    if True == True : 1.78s
    if True is True : 1.64s
    if None == None : 1.76s
    if None is None : 1.66s
    if True         : 0.93s
    if 1            : 0.93s
    pass only       : 0.93s

    I have been doing quite a lot of tests like this for my PCP caching class. I just hope I'll get it released soon. It needs a little final touch, and it requires just the right mood to dig into the details.
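
A minimal sketch of how timings like the ones above could be reproduced with the standard timeit module. This is not the original test script; the STATEMENTS table and the measure() helper are names made up for this example, and the absolute numbers will depend heavily on the Python version and hardware.

```python
import timeit

# Hypothetical table of the statements from the list above.
# Each one is timed as a complete statement with an empty body.
STATEMENTS = {
    "if True == True": "if True == True: pass",
    "if True is True": "if True is True: pass",
    "if None == None": "if None == None: pass",
    "if None is None": "if None is None: pass",
    "if True": "if True: pass",
    "if 1": "if 1: pass",
    "pass only": "pass",
}


def measure(number=10_000_000):
    """Return {label: seconds} for running each statement `number` times."""
    return {
        label: timeit.timeit(stmt, number=number)
        for label, stmt in STATEMENTS.items()
    }


def report(results):
    """Format results in the same 'statement : seconds' layout as above."""
    return "\n".join(
        f"{label:<16}: {seconds:.2f}s" for label, seconds in results.items()
    )
```

Calling print(report(measure())) runs the full 10 million iterations per statement; pass a smaller number for a quick check. Newer interpreters may optimize some of these statements differently, so the relative order is not guaranteed to match the figures above.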

How to correctly define character set information for static HTML files when using Google App Engine

posted Dec 30, 2012, 5:51 AM by Sami Lehtinen   [ updated Jan 6, 2013, 5:09 AM ]

I just tried several times to define the character set (inside the document) for static files served from Google App Engine and failed miserably. It seems that Google doesn't read meta tags like http-equiv from the file when serving it. Therefore the responses for static files didn't contain a correct Content-Type header with a charset definition.

The HTML file contains a hint for the web browser, but it's not passed on as an HTTP response header.

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Naturally, if I use a script (or a Python Google App Engine program), then I can get it delivered correctly in the response headers.

    self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'

Which returns the correct header line in the response headers:

    Content-Type: text/html; charset=UTF-8
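
For illustration, here is a framework-independent sketch of the same idea as a plain WSGI application, using only the standard library. It is not the actual App Engine handler; app and call_app are hypothetical names, and call_app is just a tiny test harness that invokes the application without starting a server.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Serve a small HTML page with the charset declared in the header."""
    body = "<!DOCTYPE html><title>\u00e4\u00f6</title>".encode("utf-8")
    start_response("200 OK", [
        # The charset travels in the HTTP header, not only in a meta tag.
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(application):
    """Call a WSGI app once and return (status, headers dict, body bytes)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal GET / request
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body
```

With this, call_app(app) returns the status line, the response headers and the body, so the Content-Type value can be inspected directly.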


To fix this for static files, you'll need to define the charset in the app.yaml file:

    - url: /
      static_files: root/create.html
      upload: root/create.html
      mime_type: text/html; charset=UTF-8

Now the Content-Type header for static files also correctly contains the character set information.

    Content-Type: text/html; charset=UTF-8
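
To double check a deployed file, the header can also be inspected with a few lines of Python. This is only a sketch using the standard library; charset_of, has_charset and fetch_content_type are hypothetical helper names, and the URL you pass to fetch_content_type is whatever your own application serves.

```python
import urllib.request
from email.message import Message


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def has_charset(content_type, expected="utf-8"):
    """True if the Content-Type value declares the expected charset."""
    found = charset_of(content_type)
    return found is not None and found == expected.lower()


def fetch_content_type(url):
    """Send a HEAD request and return the Content-Type header (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type", "")
```

For example, has_charset(fetch_content_type(url)) should be true for a static file once the mime_type line is in place, and false for a response that only says text/html.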

kw: GAE, Python, document, documents, file, files, encoding, mime type, html5, browser, server, headers, header, encodings, utf-8, character set.
