Blog

My personal blog about stuff I do, like and am interested in. If you have any questions, feel free to mail me! My views and opinions are naturally my own and do not represent anyone else or any other organization.

[ Full list of blog posts ]

ETag, gzip, HTTP, NSA, DHT, Integration, Specification, Bad code, Python 3.4 libs

posted Mar 2, 2015, 8:09 AM by Sami Lehtinen   [ updated Mar 2, 2015, 8:10 AM ]

New stuff:
  • NSA's firmware hacking - It seems that they're missing the fact that hard drives have much more space than the 'service area' alone, which is reserved for the drive's own operations. Spare sectors, the SSD wear-leveling area and all "free space" can be used to store a lot of data on disk, especially if it's not completely full. Even then, everything except the free space remains fully usable before the drive starts to fail. I also didn't like the claim that the data is stored unencrypted. I'm sure that if they bother to go that far, they'll also encrypt the data when storing it, just to be sure; not encrypting it would be silly. Ok, often obfuscating data is enough: it's much lighter and faster, yet makes the data such that it's not immediately obvious what it is. There's no reason why the ROM alone should be used to store documents. Even code which doesn't fit in ROM could be stored on disk and loaded on demand, so the code base for this kind of application can be larger than the storage space in ROM. Did somebody forget dynamically loaded segments, which were used with .exe apps a long time ago? The same address space is just swapped with different code loaded from disk if there's not enough room to load everything at once.
  • "One of the five machines they found hit with the firmware flasher module had no internet connection and was used for special secure communications." - This part reminded me of my secure system with its RS-232 serial connection. When I said that it's being used over a low speed serial link, I did mean it. The out-of-band attack channels are also disconnected: DCD, DSR, RTS, CTS, DTR, RI. Only transmit data, receive data and ground are connected, and the RD/SD direction is controlled using a switch on the cable, which makes the cable strictly unidirectional. If the other pins were connected, it would be possible to carry data over the CTS, RTS, DSR, DTR and RI pins without the LEDs indicating it. I'm using the DB9 pinout; with the DB25 pinout there would be many other pins which could be used to relay out-of-band data. As said, it's important to make relaying data in or out as visible and as hard as possible.
  • "The attack works because firmware was never designed with security in mind" - Made me smile. Well, that's true. In most cases, software is barely working. Who would want to spend additional resources making a program secure when adding those features could also make the system brittle and harder to manage & maintain? "Security isn't a priority" is the norm when creating software. There are much more important things to consider, like whether the program works at all and isn't crashing all the time. Anyway, aren't applications and security only for honest people? If somebody really wants to get in, they will.
  • Got a bit bored at home and wrote decorators for ETag and gzip handling for bottle.py (see the post below).
  • Enjoyed installing a few Fujitsu Server PRIMERGY RX200 systems with the LSI MegaRAID CacheCade Pro 2.0 SSD caching solution.
  • It seems that many storage solution sellers don't even understand the meaning of Hot-Warm-Cool-Cold tiered storage. There's no reason whatsoever to store archival data on expensive SLC RAID arrays. Only a small amount of hot data should be on the fastest possible disk tier and the rest can be stored on slower tiers.
  • Wondered how some server dealers try to sell you tons of stuff you don't need, included in a package, while leaving the stuff which isn't included openly priced. I think this kind of pricing model is just annoying and wastes everybody's time. Just give me a clear price list which includes everything and I can draw my own conclusions from that. I don't want to waste time negotiating stuff which doesn't really matter. If prices are too high or the service isn't what I would like it to be, then I'm not buying. Multiple negotiation rounds just waste everybody's time. Another thing which is pretty ridiculous nowadays is long contracts, like demanding a 36-month commitment. Ok, fine. If prices are lowered during the contract, do those price cuts also apply to existing contracts? No? Ok fine, I don't want that kind of deal. Some service providers do cut prices for existing contracts too, others do not. I don't like it. If you offer a backup solution, I'm interested to know whether it's an off-site backup. I would prefer an option where the backups can be fetched at any time without any assistance from the service provider, so that if required I can keep my own copies. How do we gain access to the backups in case of total data center loss? Yes, I know it's rare, but it has happened before and it will happen in the future too. Is invoicing clear and correct? Some service providers send horrible invoices with mistakes and unclear lines, others deliver clear invoices which are always right. Some provide a clear invoice every three months or so; others require advance payment per contract per month, which is horrible. Questions about how refunds are handled if the service is cancelled are always interesting too. Does the service provider offer flexible contracts where you can modify system resources as needed? If I need extra CPUs or memory for some heavy batch run, is that possible? Many service providers also offer SSD storage. Nice deal, but what if you don't need it? The same goes for tons of bandwidth included in the package which isn't needed either. I don't care if it's included, it's nice, but including bandwidth in the package shouldn't push its price to the pain point. I just wonder how much server capacity is sold using these kinds of Dilbert deals: lots of BS talk, few facts, and then let's just roll the monthly billing. Does anyone even know what they really need and what they're buying? Nope? I guess that's unfortunately true in many cases. Clueless customers and managers are truly clueless, and those are also the customers who keep these kinds of service providers running.
  • Quickly read through Flux article - It's yet another pub/sub messaging / queue solution. 
  • Had long discussions with friends about DHT, STUN, TURN, how to know if peers are alive, how ping and pong should actually work, and how often. How to prevent reflection and amplification attacks with UDP based solutions. How to manage peer information in a sane way. Listing 10k nodes is of no use if there's really high churn. Keeping a smallish list of known reliable nodes is a good idea: that way, if the bootstrap / seed / initial fixed-list nodes are under attack, the network won't fully collapse just because new peers can't find information about currently active peers to join the network. The list also shouldn't contain too many peers which are unreliable or short-term nodes. In most P2P networks it's really common to have extremely high churn rates; some peers might run just a few minutes in a month, and looking for those at a random time is quite unlikely to be successful. As an example: the client pings the server every 900 seconds. If there's no reply, enter test mode and send pings 6 times, every 10 seconds. If no pong is received, consider the server dead. And if the server doesn't receive any pings from the client in 1000 seconds, consider the peer dead (see the sketch below). Of course this is only for the idle state; during normal operation there's constant bidirectional communication as well as ACK / NACK packet traffic. Plus software engineering aspects and integration architecture consulting, lots of debugging, hanging threads, non-blocking socket I/O and all the general stuff. And a lot of discussion about NAT loopback: some devices support it and others do not, and some allow it to be configured freely. It's also known as NAT hairpinning or NAT reflection. I'm hosting several networks which do support loopback but a few networks do not. It's really annoying, because services can't be accessed using the name or public IP; you have to know the private IP address to access the service. Sometimes the NAT is also doing NAPT and translating the port number, so even the port number might be different for the LAN than for the "rest of the world".
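    As a rough sketch of that idle-state keepalive timing (a minimal illustration; the callback names are made up, not from any real implementation):

    import time

    PING_INTERVAL = 900   # in idle state, ping the peer every 900 seconds
    PROBE_INTERVAL = 10   # in test mode, probe every 10 seconds
    PROBE_COUNT = 6       # give up after 6 unanswered probes

    def idle_keepalive(send_ping, wait_pong):
        # send_ping() transmits a ping; wait_pong(timeout) returns True
        # if a pong arrives within timeout seconds.
        while True:
            time.sleep(PING_INTERVAL)
            send_ping()
            if wait_pong(PROBE_INTERVAL):
                continue                        # peer alive, stay in the idle loop
            for _ in range(PROBE_COUNT - 1):    # no reply: enter test mode
                send_ping()
                if wait_pong(PROBE_INTERVAL):
                    break                       # late pong, peer is alive after all
            else:
                return "dead"                   # no pong at all, consider the peer dead

    The server side mirrors this: if no ping arrives within 1000 seconds, it considers the client dead.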
  • Firefox 36.0 Release notes - Adds support for HTTP/2. After using this for a while, I don't know what they got wrong. This is just like what I cursed a few posts ago. Shit code is shit and you'll notice it. Firefox totally freezes and hangs and seemingly nothing is happening. The network is idle, the CPU is idle, there's plenty of RAM and disk I/O capacity etc. But alas, nothing happens. Why? Why? WHY?!
  • Zoompf guys wrote that they doubled their database performance by using multi-row SQL inserts (see the sketch below).
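    A minimal sketch of the idea using Python's sqlite3 (table and data made up for illustration); the win comes from parsing and executing one statement instead of hundreds:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
    rows = [(i, "event-%d" % i) for i in range(100)]

    # Naive version: one INSERT statement per row.
    # for row in rows:
    #     conn.execute("INSERT INTO events VALUES (?, ?)", row)

    # Multi-row version: a single INSERT carrying all the value tuples.
    sql = "INSERT INTO events VALUES " + ",".join(["(?, ?)"] * len(rows))
    conn.execute(sql, [v for row in rows for v in row])
    conn.commit()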
  • The first thing to remember with UDP is that its addresses can be spoofed, so data shouldn't be sent to a recipient without first verifying that the request is valid. This is exactly what TCP does by its nature. If this step is skipped, it's very easy to make such a program amplify and reflect attacks: I just send packets to all OB nodes claiming that some random IP and port just requested that huge image. It's usual to measure the amplification factor: if one 512 byte UDP packet can trigger sending 100 kB, then the amplification factor is roughly 200. If there are no measures whatsoever to prevent this (I know there are already at least some window limits), I could use my 1 Gbit/s connection to trigger a roughly 200 Gbit/s DDoS attack easily. And the targets wouldn't know it's me, even if I did it from home. So this is just a theoretical example. Sometimes even no amplification is enough for attackers; they're just happy with the masking features, using a few servers with high bandwidth to indirectly attack a site, making attack detection and mitigation harder. It's important that the recipient validation is done in a way that can't itself be spoofed.
  • Attended Retail and Café & Restaurant 2015 expo / convention / fair / conference. Same stuff as always, self service, mobile apps, RFID, digital signage, loyalty programs and retail analytics. 
  • Had once again an interesting discussion about customer data retention. Whatever information is received will be stored indefinitely and won't ever be removed. So when you use cloud storage, have you ever considered the fact that whatever you store there, you can't ever remove? Did you understand that? Maybe not? But you should really think about it. Yes, there might be a "delete button", but it's just a scam. Nothing is ever removed, it's just hidden from YOUR view. It's still there. These are very common practices and there's nothing new or special about this. Even all temporary files from back in 2013 are stored. When asked if those can be deleted, the answer was: nope, we don't ever delete any data we have once gained access to.
  • Enjoyed configuring Office 365 for one business & domain + installing Office 365 clients as well as configuring email accounts and SharePoint.
  • Replaced CRC32 ETags with Python's own hash-based ETags using base64 encoding. Computing them is about 7 times faster, and there are plenty more bits to avoid collisions. Roughly like the sketch below.
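    Something along these lines (a sketch only; make_etag is an illustrative name, and note that Python's hash() is randomized per process since 3.3, so these ETags stay stable only within one server process):

    import base64
    import struct

    def make_etag(body):
        # Pack the 64-bit hash and base64-encode it: far more
        # collision-avoidance bits than CRC32 and fast to compute.
        h = hash(body)
        return base64.urlsafe_b64encode(struct.pack("q", h)).decode().rstrip("=")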
  • Adding HTTP xz (LZMA) content-encoding compression support would also be trivial, but currently no browsers support it.
  • Requirements specification, all that joy. Fixed a few things for an old project. Well, 'fixing' is the wrong term; there wasn't anything wrong to begin with. The program worked exactly as specified. But after it had been in production for six months, the customer hit an unexpected situation which created NEW requirements. Then there's all that age-old and boring discussion about whether they should pay extra, because the integration isn't working. But they just don't get what's causing it not to work. In this case it was an especially boring one. Data is transported over HTTP as XML to another system. The structure is really simple and clear, and there are three systems interoperating via message passing. The problem? Well.
  • For some reason one system, let's call it N, doesn't accept messages from system S which are generated by system W. And the reason? Well, for some undefined reason system N can't handle data in tag T which contains information for several days, even though there's no reason whatsoever why it shouldn't.
    Example:
    <data>
      <day date="1">
        <stuff/>
      </day>
      <day date="2">
        <moar-stuff/>
      </day>
    </data>
    They insisted that there has to be a msg for each day, even if there's no technical reason for it and no documentation requires it. Of course this situation creates a problem only when there's data for several days to be delivered.
    So who made the mistake? Me? Them? Nobody? And who's going to pay for it? - All just so typical integration stuff.
    Well, I 'fixed it'. It was naturally trivial to fix, even if I still say that I didn't fix anything, because there was no mistake to begin with. I just open and close the msg between days. Totally pointless and practically doesn't change a thing, except that now it works.
    The funny thing about these cases is that sometimes it takes months of pointless discussion about how it should be fixed, even if fixing it takes just 5 minutes. Some companies and integrators just seem to be much more capable than others.
    In another case the situation was quite similar, but instead of the date it was the profit center. One major ERP vendor said that it's impossible to handle transactions from multiple profit centers in the same message, even though there's no technical limitation for it. In that case it wasn't even my app generating the data. I wrote a simple proxy which received one mixed message, split it up per profit center and then sent the per-profit-center messages forward (same pattern as the sketch below). Totally insane stuff, but it works. Both parties said that it's impossible to fix such complex things, which made me laugh: one party couldn't generate per-profit-center data and the other party couldn't receive mixed data. I think they both have pretty bad coders. Well, luckily there was someone who was able to deal with this 'impossible to solve' technical problem in a few hours.
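    The day-based fix boiled down to something like this (a sketch using the element names from the sample XML above):

    import xml.etree.ElementTree as ET

    def split_per_day(xml_text):
        # Take one <data> message containing several <day> elements
        # and yield one single-day <data> message per day.
        root = ET.fromstring(xml_text)
        for day in root.findall("day"):
            out = ET.Element("data")
            out.append(day)
            yield ET.tostring(out, encoding="unicode")

    The profit center proxy was the same pattern, just grouping transactions per profit center instead of per day.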
  • Studied the Bitcoin Subspace anonymous messaging protocol for direct P2P communication. I also wrote about it.
  • CBS got the same problem PBS had earlier:
    "Unfortunately at this time we do no accept any foreign credit cards. In order for you to make a donation you will have to use a bank issued credit card from the U.S."
  • Read about Payment Services Directive (PSD2)
  • Python: Problem Solving with Algorithms and Data Structures - Just read it.
  • Noticed that SecureVNC allows cipher called 3AES-CFB. Yay, AES256 isn't enough? Do we need 3AES already? What about using ThreeFish with 1024 bit keys? 
  • Checked out twister, a distributed P2P implementation of Twitter.
  • Checked out Transip servers in the EU - https://www.transip.eu/vps/ An excellent hosting option, similar to Digital Ocean, Vultr, OVH, Linode and so on.
  • Quru wrote about Stockmann's webshop. - It just sucks. Actually, just yesterday I proved it. My friend couldn't make her purchases from the store; I had to make the purchases, because the payment solution was so broken that a standard MasterCard didn't work with it. After entering the credit card information, nothing happened, nothing at all.
  • Now it's clear: the Samsung S6 doesn't even have an SD card slot. This was a really expected move, because even the old phones with an SD slot were crippled by firmware updates so that the SD card couldn't practically be used with applications. I just wonder why nobody made a bigger fuss about this. Devices you've already bought get downgraded via software 'updates', duh!
  • Python 3.4.3 released
  • Python 3.4 statistics library
  • Studied the Opus audio format - Because the latest VLC 2.2.0 - https://www.videolan.org/press/vlc-2.2.0.html - supports it.
  • Peewee ORM ala Charles Leifer - Techniques for querying list of objects and determining the top related item
  • dataset - Super lightweight ORM for Python
  • Python 3.4 tracemalloc - Track application memory allocation in detail 
  • Python PEP 448 - Additional Unpacking Generalizations
  • Google PerfKitExplorer - Track performance across clouds  
  • For some reason the multi-threaded version of par2 seems to crash on my new 16 core system: *** Error in `par2': 6429 Segmentation fault (core dumped) par2 c -u -r10 recovery.par2 *** Most interestingly, the crash happens after the Reed-Solomon matrix construction is completed, so there's probably some kind of simple addressing fail somewhere. I'm pretty sure it's a simple bug and not a hardware related issue. It also seems to happen quite often.
Phew, now my backlog is gone. I did it. Hooray!

Topic mega dump 2014 (3 of 3)

posted Mar 2, 2015, 7:50 AM by Sami Lehtinen   [ updated Mar 2, 2015, 7:54 AM ]

  • Backblaze Hard Drive Reliability Update
  • One investment company didn't use HTTPS for their customer pages in 2014. That's incredible. Many forms were also served using HTTP only and the results were then submitted over HTTPS. When I complained, they said it's ok because the information is submitted over HTTPS. But no user knows or notices if someone edits the page and removes the HTTPS. You also don't really know where the data is being submitted without checking the source. A MitM attack would easily allow modifying the form sent over HTTP, choosing arbitrary content for the questions as well as the destination for the form data.
  • Read TorCoin plan.
  • Read a good long post about Amazon's Elasticsearch. - Unfortunately I don't have real use cases for such a system right now. I also consider many Amazon services to be actually quite expensive compared to the competition.
  • Had a discussion about how to learn stuff. My view: "I think it would be better to learn the same skills on something concrete, being productive while learning and not spending resources only on learning. That's one of the reasons why I now study programming by deciding on a project which requires a certain skill set and level. Then I proceed to build it to at least alpha or MVP level, which allows me to create something useful as well as learn the required skills. Yes, this takes more effort than only 'skimming' a book on some specific topic, but then I know the topic a bit more deeply and hopefully have generated something useful while learning."
  • Something different: Semi-automatic transmission, Canard, Tricycle landing gear, Free piston engine, Wave disk engine - A free piston engine can be used as a linear generator; in such an engine the only moving parts would be the valves and the piston, with no crankshaft or physical output axle. - Iron Dome, Skyshield, Depleted uranium, Quad TiltRotor, Supervolcano, MANTIS, AMOS, CV90, Rutherford, V-3 cannon, Psychological Warfare, Ballista, Catapult, Trebuchet, Hall effect thruster, VASIMR, Inertial Navigation System, Anti-aircraft warfare
  • Also reminded myself about: Counterintelligence, Countersurveillance, Computer forensics, Forensics data analysis, Distraction, Cover-up, Disinformation.
  • Did a few short tests using Google Cloud Messaging and my phone. I had one specific project in mind, and I found out that the delivery latency as well as the latency jitter were totally unsuitable for the potential requirements of the project. But in general I really like the concept of one messaging solution which can be used to trigger events and so on. It naturally saves a lot of energy compared to running tons of different applications polling something constantly, or even keeping idle TCP connections open (with repeated ping/idle/alive messages), consuming CPU, bandwidth, memory and battery resources.
  • Facebook data center concepts Wedge and FBOSS as well as the disaggregated network. - Kw: switch, configuration management, statistics packages, environmental handling, microserver, modular enclosure, control logic, switching module, Open Compute Project (OCP).
  • Reminded myself about graph databases. - But the definition made me smile: "A graph database is any storage system that provides index-free adjacency." - Hmm? Index, that's a tough definitional question. I would say it's a "direct pointer" to data that doesn't require an index. But with current complex systems, that definition gets really blurry, because any lookup table could be considered an index, and therefore I believe most current systems simply can't provide index-free solutions; there are so many layers of indexing already in existence on modern systems. But if we return to legacy systems, an in-RAM graph database could be something where record A has direct pointers, as memory addresses, to the other records where data is stored. Compare with inodes in file systems: if the record contains a filename, that's a fail, because looking up an inode by filename requires using an index, and if an index isn't used, it means walking a list of filenames in the directory, which is even worse. In a way I really like legacy programming and C, because many high level systems hide from developers what's really happening; a really simple, naive legacy implementation is much cleaner. Dictionary, hashtable or whatever = index = fail; a direct memory pointer or disk address doesn't require an index. Except that if data is stored on an SSD or any modern system, there are already multiple layers of lookup tables and indexes, and the same applies to modern operating systems, paged memory and so on. Actually, when these modern systems are used and you listen to how high level developers describe them, it might sound like they don't know computing at all, and they might not be able to describe at a low level what's happening and how. The toy sketch below shows the difference.
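    A toy sketch of the difference in Python:

    class Node:
        def __init__(self, value):
            self.value = value
            self.neighbors = []    # direct pointers to other Node objects

    a, b = Node("A"), Node("B")
    a.neighbors.append(b)

    # Index-free traversal: each hop just follows a pointer.
    for n in a.neighbors:
        print(n.value)

    # Indexed alternative: each hop goes through a lookup table.
    adjacency = {"A": ["B"], "B": []}
    for name in adjacency["A"]:    # dict lookup = one index access per hop
        print(name)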
  • Read documentation about PostgreSQL / FreeBSD performance and scalability on a 40 core machine
  • Checked out Google Cloud Save, a cloud data storage for Android devices.
  • Google Cloud Platform - Cloud Endpoints. Just an additional layer making App Engine easier to use with Android for developers.
  • Checked out Google Cloud Monitoring - Would this be what's needed for future monitoring of cloud based services? Seems a bit lighter than the solution I'm currently running. But I liked the way they provide ready installation using Puppet, Chef and Ansible.
  • Google Dataflow - This is something I could use for my ETL tasks if required. Most of those tasks are currently running locally on the primary application server. But if there's too much data for that server to process, relocating the whole system into the cloud should be a future-proof option. It provides a data pipeline and data transformation layers, which I've currently implemented in my own integration module. Yet I don't really like the fact that it's Java only. I've written all of my latest code using Python 3 and left Java dusting where it should be.
  • Lightly checked out Google Polymer, Web Components, Meteor, and Mozilla X-Tags - This is something I could love. Something which is quite simple to use and makes web UI and application development much simpler. Current solutions combining Angular, web server side stuff and tons of different JS frameworks for the UI side make development a quite complex mess. You have to know so many different technologies well, as well as exactly how they fit together. On the other hand, those high level frameworks can add considerable load on the server as well as on the client side. Just like the guys mentioned in the Don't Use Angular post. - No link to a single post because there are multiple good posts on this topic. If I were a JavaScript programmer I might like the concept of Meteor a lot.
    It's a bit like the situation with cross-platform mobile application development: use something like Intel XDK and you'll get one slow, bloated application which performs poorly on all platforms.
    If you got interested also check out MEAN.
  • HTML Include - This is something I've been wanting to use since the early 90s. iframe came, but it's not the same as a simple include. Of course there were solutions for server side includes, and template engines do nested includes and stuff like that. But it's not the same as a simple include on the browser side.
  • It seems that someone else came to exactly the same conclusion as I did: why isn't Brython served by a global CDN, and why isn't it using (even optional) HTTPS? Delivering a JavaScript library to the whole world from one server at OVH, France isn't an optimal solution. The lack of HTTPS and IPv6 isn't so great either. My personal suggestion for a CDN would be cdnjs.
  • Digital Panopticon - You're being watched. What will the future be like?
  • Most popular programming languages 2014. - Python is strong, as is Java, even if I don't love it anymore. - Java seems simple, but you'll end up generating a lot of bloat code, diminishing development joy and efficiency.
  • I finally figured out why some of the stuff I was battling with Peewee ORM, PostgreSQL and Python didn't work at all. The reason? It's very simple and quite a traditional trap with ORMs, especially with dynamic programming languages and databases.
    Peewee ORM - Oh joy. It took me a while to figure out that Python's Peewee ORM handles default values and None differently than Python usually does.
    Usually None != True and None != False are both True, but in Peewee ORM queries they won't be. That's because None only matches None; for example, None == None is True. Now it's finally clear. It also seems that even if there's a default value defined for a Model, it isn't used when the referenced Foreign Key is missing. So you'll need to write X == None or X == False, and only then is it about the same as X != True, even if the default value for X is False. This is especially important to remember when doing outer joins.
    Did I feel stupid after this? Yes I did. It's just like SELECT * FROM table WHERE data = 0 and then you finally figure out that it returns a completely different number of records than SELECT * FROM table WHERE data = 0.0 - isn't that fun? This is exactly why you should know your tools well, or you'll end up with really nasty surprises. Even basic unit testing won't catch these unless you're specifically acknowledging that you should test for those cases. I assume part of this problem is the fact that Peewee ORM doesn't have an exact NOT operator; the ~ used by Peewee is only about the same.
    Of course there are silly workarounds for the problem: I could ask for the count of matching records, and if it's 0 then it's the same as == None, but that's silly. Or compile a sublist of potential join entries and then ask for key not in (sublist), which also excludes records which don't have references. Both of these solutions work, but they're quite suboptimal. Isn't this just what normal programmers do? Now it works, fine, let's continue, even if the solution is slow, crazy and doesn't make any sense. See the sketch below.
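    To make the trap concrete, a small sketch (model and field names are made up; Peewee translates == None into IS NULL):

    from peewee import BooleanField, Model, SqliteDatabase

    db = SqliteDatabase(":memory:")

    class Item(Model):
        flag = BooleanField(default=False, null=True)

        class Meta:
            database = db

    # After an outer join the field can come back as NULL, and in SQL
    # NULL is neither TRUE nor FALSE, so this query misses those rows:
    Item.select().where(Item.flag != True)

    # You have to ask for both cases explicitly:
    Item.select().where((Item.flag == False) | (Item.flag == None))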
  • Reminded myself about Enterprise Service Bus (ESB) stuff. I'm actually quite glad that many customers select simple, lightweight and more efficient integration methods. Some customers even clearly say: we have that ESB, but well, let's just make this work without it. Smile. That fits quite well with my current view of avoiding bloat and overhead when it isn't absolutely required.
  • Tor exit node operator prosecuted in Austria. - This battle between Internet freedom and surveillance will be long; we're living in interesting times.
  • hubiC - Excellent European cloud file storage service with a bit better pricing than what Box, Dropbox and many other alternatives provide. Data is also stored in three separate data centers for storage reliability and availability.
  • I thought that email would soon be a thing of the past. But it doesn't seem to be that way; new email clients are popping up all the time. Mailpile is one of those.
  • Python 3.4 asyncio with examples - A nice post about the new features. This is also one of the reasons I'm not using the (whatever) pipe / queue solutions from (whatever) providers. When servers are clustered together with great interconnectivity, it's pointless to pass data via the cloud, adding bandwidth costs and latency. And because it's so simple using Python alone, I don't want to mess up my projects with additional, needless dependencies. Those should be brought in only if they offer some killer advantage over the existing solution, which they currently don't. This is exactly the reason why most of my projects also use SQLite3 and only some projects use PostgreSQL.
  • Whoa, Hotmail and Outlook finally support smtps (TLS/SSL) SMTP transport. - I wonder why it took so long.
  • Google Compute Engine is providing persistent SSD disk storage as well as global load balancing. - Which is nice.
  • Vultr seems like a good competitor for Digital Ocean. Based on quick tests they provide an even better cost-performance ratio than Digital Ocean.
  • NSA targets the privacy-conscious - Even more interesting development. Maybe we do have something to hide? But who's we? Maybe the NSA will find out, maybe not.
  • I thought about a messaging client which would use a DHT for data storage. Everything in the DHT storage would be encrypted and all data would pass via the DHT storage using pseudo-random data access patterns. In some cases even the encryption key itself could be used for covert messaging: the payload is basically meaningless, it's all about the key which can decrypt it successfully.
  • PyCon 2014 - Multi-factor authentication, Postgres Performance for Humans
  • One guy said in a tech talk that his job is to do all the tasks that the engineers can't get done. - Made me lol so much. - I don't know why this sounds so familiar. - My work is to be a kind of SWAT team or special unit for when the other departments just can't handle it. - It's good and bad, because you're going to get all the very problematic cases to solve, which might require long monitoring, deep analysis, extensive logging and so on. (I'm actually working on one such complex case right now (Feb 2015). The issue has been analyzed for several months by others, but there aren't any real results. I guess I'll have to dig deeper than that.)
  • Checked out Google Drive Pricing and compared it to hubiC - Yes there are price differences.
  • COMSEC - Communications Security
  • It's important to have certain arrangements made beforehand, allowing you to maintain the capability to communicate securely even in a real major crisis: private out-of-band communications using multiple separate communication channels, without the need to rely on existing networks like mobile phones or Internet connectivity. - It's also a good idea to have a few anonymous Internet connections using 4G data.
  • I guess people with comsec, infosec, privacy, covert communication, system administration and good general IT knowledge and skills can be dangerous... if there's just a motivation for nefarious intent. But why bother if there's no good enough reason?
  • Cheap cloud services and optimized code could easily be used to generate such a flood of messages to the Bitmessage system that it would overwhelm most network peers. I don't know if proof of work is the right way to secure and limit network resources in the future.
  • NSA classifies Linux Journal readers, Tor and Tails Linux users as "extremists" - Are Linux users really that dangerous?
  • Maintaining covert identities is hard, really hard. It takes nothing more than a simple habit-based fail to ruin it all. It's something that needs to be practiced a lot to learn. If you just read about it and try it, you're going to fail, badly.
  • Actually I came up with this before the "Lorem Ipsum" stuff came out. My plan? A simple application which generates ciphertext which is then translated into viable looking normal plain text, so it wouldn't trigger "encrypted communication" alarms. The program should have pluggable dictionaries and language modules so that it could be used with multiple languages. It's a kind of steganography. The first point of this whole thing is not to trigger any suspicion at all. See stegano.net
  • Turned NLA, TLS and 128 bit encryption on for all systems when using RDP. - For some strange reason this prevents Remmina from connecting. I guess it has something to do with the high cryptography requirement, because Remmina does support TLS and NLA.
  • Are privacy enhancing tools a pro or a con? Maybe using some simple basics could keep you off the radar, instead of using well known yet efficient tools which arouse suspicion. I was thinking of building a really simple text steganography tool just for fun: embedding messages in text using the c&w method with compression, encryption and ECB. The result is text which doesn't seem suspicious but still contains a strongly secured message. Depending on the fill-in text, statistical analysis would of course pretty easily reveal that something is going on. Of course these questions relate to any privacy tools. If you're trying to keep things private and secret, you must have something to hide, right? Especially when privacy tools aren't so commonly used, it really sticks out when someone is using high grade privacy tools.
  • Stego - Text steganography tool.
  • *** different attacks and stuff like that... False Flag strikes? Who gets the blame game?
  • Subliminal channels - A way to pass communication over unencrypted links. Just like the time stamp modifications with PW.
  • Canary Trap - Creating different documents for different recipients to see which one leaked.
  • Charge Cycle - Battery tech, how many charge cycles can your batteries take?
  • KW: EDI envelope, SOAP envelope and Finvoice envelope.
  • JSON Resume standard - A nice way for hackers to represent data in a consistent way?
  • Tried Windows IP Ban service, but it didn't work out as well as it should. Didn't like it.
  • Xiki - Improved (Amazing?) shell - Had to play with it, but didn't see a need to use it for daily operations.
  • Credit Card Skimming - A list of different kinds of modern (?) skimmers. It's so silly that the magstripe is still being used.
  • Parsing the Accept-Language header using Python. I didn't use that one, I wrote my own version. It takes the list, sorts it by preference and then finds the first match in my list of available languages; roughly like the sketch below.
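    Roughly like this (a sketch from memory; the function name is illustrative):

    def pick_language(accept_language, available, default="en"):
        # Parse an Accept-Language header, sort by q-value and return
        # the first language that's actually available.
        prefs = []
        for part in accept_language.split(","):
            pieces = part.strip().split(";")
            lang = pieces[0].strip().lower()
            q = 1.0
            for param in pieces[1:]:
                param = param.strip()
                if param.startswith("q="):
                    try:
                        q = float(param[2:])
                    except ValueError:
                        q = 0.0
            prefs.append((q, lang))
        for _, lang in sorted(prefs, reverse=True):
            if lang in available:
                return lang
        return default

    pick_language("sv, fi;q=0.9, en;q=0.8", {"en", "fi"})  # -> 'fi'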
  • Python 2.x vs 3.x survey results
  • It's known that comparing cloud service pricing is really hard, sometimes nearly impossible. Some providers charge a lower price and yet provide 10x the performance. It's interesting to notice how bad the performance AWS actually provides is. If you compare AWS prices to Hetzner prices, the difference is mind blowing.
  • It's just horrible how many people won't take proper care of their PGP/GPG keys; when a hard drive crashes, they just generate new keys and assume that everyone should trust those right away. Sounds like a really bad practice.
  • Hacking Government Surveillance Malware - Totally awesome story including technical details!
  • Storing personal names - First name plus last name, a good idea? Well, it isn't. That's why I'm using only a single Unicode field for the name.
  • SSL CA information shouldn't be trusted - No news here
  • Reminded myself about Kaizen - That's something everyone should follow automatically.
  • Kaikaku - Disruptive innovation and change / pivot
  • XG-Fast - 10 gigabit links over copper. But as a logical drawback, distances get quite short.
  • KW: Enterprise Resource Management (ERM)
  • Open Data - Simply put, "personal data should belong to the people". If I store my data in some service, why can't I download it all easily?
  • Python is now the most popular programming language in top universities
  • Yet another file storage service. Amazon WorkDocs.
  • PyCon Taiwan @ YouTube
  • Amazon Cognito - A similar service to Google Cloud Save. Easily store application data for users in the cloud.
  • There is a clear bug in the Deluge BitTorrent client: the per-file connection limit doesn't work properly.
  • OSPFv3 vs OSPFv2: What is different? - A really nice post. I haven't used OSPFv3 yet, but reading this was a good intro. It's important to know that there are new LSA types and the possibility of multiple instances over the same link.
  • SQLite: Small. Fast. Reliable. Choose any three - Excellent post about SQLite3
  • Google Noto fonts for all languages.
  • Studying lossy image compression efficiency - One of my favorite topics. It remains to be seen if JPEG finally gets some viable alternative. I've also read about the JPEG patent fights; some open source projects are worried about JPG patents. Well, I won't miss JPEG, and there are already better options like WebP and BPG, which just haven't received wide adoption unfortunately. Here's an excellent image compression comparison site.
  • Is your application ready to handle CJK chars? It should be if it's UTF-8 compatible and uses the right fonts, but there might be some traps, like string length limits and so on.
  • We also see mojibake often in Finland, because some systems print the UTF-8 ÖÄÅöäå chars as ASCII, leading to interesting results. Anyway, the post offices are really good at deciphering those.
  • Shift JIS - Luckily we're not using anything like that in Finland. But this actually reminds me of the time when I wrote Code-39 and Code-128 barcode encoders. Code 128 allows (and for efficiency requires) shifting between different encoding modes called A, B and C. Basically there's a shift for a single capital letter and a caps-lock-like mode which permanently switches to another mode until told otherwise. The modes cover lower case, upper case, and a double digit mode for compression, which allows encoding two digits per barcode symbol (see the sketch below).
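    The mode C digit-pair compression is simple enough to show in a couple of lines (a toy sketch; no start codes, shifts or checksum):

    def code128_c_values(digits):
        # Mode C packs two digits per symbol: "201503" -> [20, 15, 3].
        assert digits.isdigit() and len(digits) % 2 == 0
        return [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]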
  • Unihan Han Unification - A way to make slightly different Asian symbols use the same font and visual representation instead of having different symbols for each.
  • A bit faster SSDs, the Fusion-io ioDrive Octal drives - Made me smile. Yet I don't have any use for such high end stuff.
  • iosnoop - An excellent tool for snooping disk I/O latencies per process. I've been using this on some servers whenever I suspect I/O related issues. Especially when using VPS servers, disk I/O can really tank below the level you'd expect.
  • Got a bunch of GTIN codes for one project.
  • Everyone is using ISBN-13 nowadays, but it wasn't always like that. I had to write EAN-13 to ISBN and back encoders/decoders back in the day.
  • How to be happy - I hope you're already happy, so you don't need to read this.
  • I'm very used to databases which provide full MVCC / snapshot isolation. It's a very good thing that I always test all critical sections separately. I found out that some older and simpler databases require additional LOCK TABLE statements to lock tables. Without those, simply starting a transaction doesn't provide any protection from other committing transactions. So the database doesn't provide repeatable reads without additional locking (see the sketch below).
    Actually, repeatable reads are not even the same as snapshot isolation, because they only lock the rows you have read so far. So if your transaction consists of multiple separate reads, it's possible that those reads don't give you a uniform image of the database as it was when the transaction started.
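    The extra locking step looks something like this (a sketch using psycopg2 and PostgreSQL-style LOCK syntax for illustration; the table and connection details are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()   # psycopg2 implicitly opens a transaction here

    # Starting a transaction alone doesn't stop other committing
    # transactions from changing what we re-read, so lock explicitly.
    cur.execute("LOCK TABLE accounts IN SHARE MODE")
    cur.execute("SELECT SUM(balance) FROM accounts")
    total = cur.fetchone()[0]
    # ... further reads in this transaction now see a consistent image ...
    conn.commit()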
  • Canvas Fingerprinting - Almost impossible-to-stop network tracking. Actually, it is possible to block it: don't run the scripts in the first place.
  • Terminal - Yet another Linux virtualization management tool
  • Reminded myself about protobuf, even if I don't have use cases for it. Also checked out Transit, which can encode/decode MessagePack or JSON formats. The world is so full of these 'solutions'.
  • Why blurring sensitive information is a bad idea. - This should be also quite obvious to everyone.
  • StartUp mistakes you shouldn't ever make.
  • hubiC fixed their upload speeds finally. I've been avoiding using hubic.com because upload speeds have been so lousy, less than 1Mbit/s, but now I'm uploading at 100Mbit/s which is good enough. 
  • Ekahau Spectrum Analyzer - Yes, it's just as cool as it sounds, and it does the job. Most guides on how to avoid WLAN / WiFi congestion and interference are quite bad, because most people don't realize there are many other interference sources than WLAN networks. Likewise, one heavily used network can be much worse than 10 lightly used ones. Or there might be a reason why there aren't WLAN devices on channels which are used by a wireless video surveillance system, and so on.
  • One project was designed to use WebServices a long time ago. But back then it was concluded to be so hard as to be nearly impossible. What resulted was that the project did silly things. It dumped the changes to be replicated to other databases into one table. Then this one table was dumped as XML files on disk. Then one client compressed these XML files into a ZIP file. Then there was a client which polled bidirectionally for these ZIP files and transferred them over an encrypted (of course DIY encryption and implementation) TCP connection. At the other end basically everything happened in reverse. When you think about this complex chain and the somewhat bad code, which doesn't lock files properly, doesn't check file integrity and randomly fails, you've got an excellent and reliable data transfer solution. Ehh, let's say NOT. All this because directly transferring the data would have been 'too complex and unreliable'. They just managed to add 10x overhead and even more unreliability. But we all know this is business as usual, and there's nothing new about this kind of stuff happening over and over again.
  • Planned Obsolescence - Great for consumerism, but bad for the environment. It's also a good policy for the software business. It could be hard to charge high maintenance fees unless customers feel that maintenance is needed. If everything worked without continuous manual fixing, customers might feel that it just doesn't make sense to pay maintenance fees.
  • Finished watching lecture series Thinking Like an Economist (TTC).
  • Reminded myself about Markov chains. The finite-state machine is also closely related. - Sometimes some programs just seem to feel more like infinite-state machines. Wait, what? That's because there's a nearly infinite number of different ways to fail.
  • One integrator has the lamest debugging tools I've seen so far. They used a program to dump communications in hex, but then... no, no automatic extraction / analysis. They had printed papers with the packet formats, and he used a calculator to manually convert between hex, dec and bin. Debugging took long and their team seemed frustrated. I don't have anything else to say about this, but I was a bit aghast. As you can see, there are different levels: something seems just bad, and then some cases are actually insanely bad.
  • Decentralization I want to believe - It has been seen over and over again that people don't want and don't care about decentralized systems. A major problem is that decentralized systems are basically mobile hostile. Some companies have used these solutions to limit the burden on their servers, pushing the burden to clients, which are then unhappy about it. Clients can consume a lot of CPU time, memory, disk space and disk access, cause a lot of network traffic, and potentially be used to generate DDoS attacks or malicious traffic etc. All great reasons not to use decentralized solutions. People also seem to totally forget that things like email are already decentralized!
    Zero configuration is also basically impossible, because you have to bootstrap the network somehow. Fully decentralised solutions still require bootstrap information, which is unfortunately already too much for many users and therefore works as an efficient show stopper.
    The last nail in the coffin is that most people really do not care about security at all. The user base is, after all, just a small bunch of delusional geeks.
    Otherwise, if people really preferred decentralization and secure communication, something like RetroShare and Bitmessage would be widely used.
  • Telehash - Yet another decentralization protocol 
  • Tor Traffic Confirmation Attack - Carefully studied the article
  • Remy - Even more TCP congestion control, except this one is so complex it's not actually viable. But it's interesting to see that really complex computer-generated rules can outperform simpler solutions.
  • Read about QUIC. But no time for this kind of stuff right now. Hopefully it will be out in the future.
  • Internet censorship is progressing, Russia passed new laws. No link, you'll find it if you're interested.
  • Some D-Link firewalls forward WAN UDP DNS queries to ISP. Really nice, works well for DNS DDoS amplification attacks even with spoofed addresses. No wonder some ISPs have been complaining about this. Devices are really easy to exploit.
  • IBM is building brain-like CPUs with 4096 cores.
  • IBM Research Report - Comparison of Virtual machines and Linux Containers (Native, LXC, Docker, KVM) - Yeah, virtualization is expensive. Yet another reason NOT to run "cloud" at all, if it's not required. It's better to run full servers with your software and proper automation and configuration management. Adding virtualization to this mess just lowers performance and adds costs.
  • Windows 8.1 tablets with InstantGo are really annoying if you're trying to save power. Sleep and Hibernation do have real role even with tablets.
  • What happens if you write TCP stack in Python - Nothing to add, except it seems that he wasn't quite up to the task.
  • How to validate your business idea by testing
  • Is there an anonymous call forwarding service which could use prepaids from multiple operators? You (A) call number B (forwarding service) and the call is forwarded via C (outbound forwarding service) and finally to D (final destination). This would make tracing calls much harder, especially because you can switch A-B and C-D independently. But because this is near real-time forwarding, it would have similar traffic confirmation characteristics to a VPN provider or a Tor relay: even if you can't directly link the A-B call to the C-D call, you can do it via statistical analysis of calls and timestamps.
  • A Tor relay proxy with intentional latency? Would this be a good idea? At least it could be used with Tor SMTP: a store and forward service which adds delay on purpose to skew statistical traffic confirmation analysis, and which could also alter the message size by expanding it with padding or dropping extra padding.
  • How hackers hid a botnet in Amazon - Well, if there are free resources, even small ones, which can be automatically harvested, it creates great potential for abuse. That should be pretty clear.
  • Watched two documentaries, one about the Israeli intelligence services and another about Ukraine and Syria.
  • In one security audit for someone: 1/4 (25%) of the database servers facing the Internet used the default login & password. Was direct database access blocked by a firewall? Of course the answer is no.
  • Studied Netvisor Web Service REST API for system integration.
  • OFF System (OFFSystem) - Anonymous P2P storing only non-copyrightable data - I actually studied this years ago, I just forgot to write about it. It raises interesting questions: especially, if I XOR two movies together and release the diff, what exactly am I releasing? This blog post actually contains several high security EC keys. What? Yeah, you'll just have to XOR this with the right 'random set of bits'. Lol.
  • Microsoft is going to give data to US agencies, even if the users are foreign and the data is not stored in the US. So if you think using MS European data center(s) provides privacy, you've got it wrong. This is going to set the lines for how much US cloud service companies can be trusted in the future. Trust is already very weak.
    It's quite likely that the same unfortunate rules apply to Google, Yahoo, Twitter and Facebook as well. The great question is whether it's enough that the company hosting the servers is American. If there's a small European business using Amazon servers in Europe, is all their data still fair game for the US?
    It became evident from the news that Google scans emails and attachments really carefully and reports to the authorities. Can that also be extended to other programs? Technically, sure. Wouldn't it be great if the operating system, anti-virus tools, NAS devices, etc., directly reported pirated content to the RIAA; it would save them a lot of trouble.
  • About some of the Tor node busts - So many fails. First, they failed to use Whonix or similar separation which forces all traffic to go only through Tor. Secondly, you shouldn't ever mix normal (daily use), secure (secret / confidential) and anonymous (no identifying information whatsoever) systems. All of those should be completely separate, down to the hardware level, preferably with individual Internet connections. For secure systems it's a good idea to use separation with extremely limited connectivity (an RS-232 serial cable in my case); it's enough to pass ASCII armored PGP messages. And for anonymous systems you use prepaid data with a burner phone and also replace the hardware from time to time. You also boot the system from read-only media, so whenever you reset it, it'll be clean again. But if you're lazy and don't care, it's easy to fail. If you use your normal system for all three settings, the results will probably be pretty bad. Also always check signatures; without signature checking it's trivial to hand you a version which contains, well, whatever.
  • Google Play gives a really bad UX when some updates keep getting installed automatically even when you've tried to uninstall and disable those apps.
  • Quickly checked out Azure DocumentDB
  • Technology 2014 Hype Cycle map
  • Sonera (a Finnish telco) failed basic access control when providing free benefits. They sent a text message saying that using this text message you'll get a 10€ benefit. Great, but they didn't include any code in the message, which allowed the same message to be used multiple times.
  • Cloudflare now supports WebSockets - Yay!
  • Offline first is the new mobile first? - This is both good and bad development. Some offline-first sites are actually quite ok to use after the initial loading, but using such technology can make the initial load ridiculously slow for visitors who aren't using the site daily. Been there and seen that happening.
  • Optional static typing for Python? - Is it worth it for the speed benefit? I guess it is in some cases, because the benefits can be drastic where it matters.
  • I thought I would write more about DNS-based Authentication of Named Entities (DANE) - It seems that not everyone is happy with DANE. Anyway, I have a good friend who's able to do DANE stuff; if you need such services, let me know. Yes, I did read RFC 6698.
  • Something different: Active Protection System [1, 2], Optimal Control, Sliding Mode Control, VA-111 Shkval (Supercavitating torpedo), Chemically strengthened glass
  • I thought I would write more about the LclBd scoring logic, but there's nothing to add. It factors in location, time, and tags and users weighted by a Naive Bayes implementation. Based on that it picks the latest local news which you're probably interested in due to the tags used in the post or because you've previously liked posts from the poster. Negative weights are also available so you can dislike stuff; that's something Google+, Facebook and Twitter won't allow you to do, and I don't like it. I want to be able to not like things. ;) Actually the currently deployed version has such bad usability issues that they cast a serious shadow over everything else. Maybe I'll be in the right mood to fix those some day when it's raining and dark. Is it worth it? Well, most probably it isn't. It's just some hobby tinkering.
  • Cell phone guide for US protesters updated 2014 edition - It's all there, how to use your mobile phone. 
  • It seems that Skype dropped support for old non-cloud based clients and is now forcing everyone to use their cloud storage and relay services. They also forced Ubuntu & Linux users to update Skype.
  • Submarine Cable Map - A great resource if you want to know where Internet is flowing under the sea. 
  • Intel released the first 8 core desktop CPU with 16 threads and DDR4 in the i7 series.
  • What are UTM tracking codes
  • Seamless WiFi roaming - Nice! You can configure many end devices to scan more often to roam; of course this doesn't make roaming seamless, but it's good enough. I'm actually curious whether this seamless roaming is actually seamless. It could be, but I don't see any proof, except that they claim it to be genuine seamless roaming, which would be pretty cool. Genuine seamless roaming would mean that there's no hiccup of any kind when switching base stations; the user wouldn't notice anything at all. Most often that's not required, but it can be beneficial if it's available.
  • The Skyline Problem - Yet another seemingly very simple programming challenge to tackle.
  • Mobile Privacy: Use only clean prepaid phones, and do not call outside the closed circle of those clean, unidentified phones. It would be interesting to analyze such data and correlate it with other calls and phones operating in parallel. I guess even in this kind of situation you could detect users who carry those anonymous phones together with other, identified phones, and where they're relaying the information when using the alternate phone. So even this isolation trick won't give you real privacy.
  • Similarly, many people seem to think that SSL provides security. But no, it doesn't. It only encrypts message content; it doesn't hide communication patterns. So when you open a web page and its resources are downloaded, all of those downloads can be monitored. When you then compare those against the different possibilities available, it's quite possible to know exactly which page you opened, even if the encryption wasn't broken.
  • Higher level dynamic programming, generic specific code, which prevents reuse of code, Non-uniform Memory Access (NUMA), multiprocessing, multithreading, shared nothing and so on. - I was supposed to write about this, but no can do right now. 
  • Absolutely great post: Visualizing Garbage Collection Algorithms - You just gotta check it out. 
  • I've been writing a lot of stuff using Markdown (MD) lately. See CommonMark
  • BankAPI - A secure solution specification for delivering messages between banks and other type of financial institutions. 
  • I would love to write about the early days of the Internet, when I used Windows 3.11 and Trumpet Winsock and stuff like: SLiRP, SLIP, PPP, packet traces, TCP flags, TCP window, RST, ACK, SYN, and other stuff I learned already back then. I miss my 14.4k modem. No, not really.
  • The Samsung Galaxy email app gets ridiculously slow at times. Deleting the cache helps. Bad code doesn't show up, until it does.
  • Actually I don't know why, but running par2 on my computer makes it incredibly slow for some reason, even though there seems to be no reason for it. Maybe it's memory contention? But afaik that should show up as CPU time; maybe it just doesn't on my current platform.
  • Boxcryptor pre-encrypts data before transferring it to the cloud - Now, I've just used 7-zip and GnuPG for this very successfully earlier, without any problems.
  • Studied Universal Description, Discovery and Integration (UDDI) - No, I'm not currently using it, nor do I see a need for it in the future either.
  • Sovereign @ Github - Tested it and I can't say it better than they do: "A set of Ansible playbooks to build and maintain your own private cloud: email, calendar, contacts, file sync, IRC bouncer, VPN, and more." 
  • Mail-in-a-Box - Yet another alternative if you don't mind configuring everything by yourself (as I did).
  • My generic guidelines for my own code: reusable, simple, use pre-existing components, Keep It Simple Stupid (KISS), only make optimizations when actually required. Aka focus on what really matters; keep the project profitable and relatively cheap. - I know a couple of guys who can spend months optimizing code that gets run monthly and takes 5 minutes to run. Is that wise?
  • Studied the BtrFS wiki - It seems that I just like reading it over and over again.
  • WPS Wi-Fi router security is in some cases ridiculously bad. Ok, WPS security is always bad, don't use it. This is nothing new; the whole protocol has been broken since its very beginning.
  • I'm not providing you enough interesting links? Ok, you asked for it. Here's a great list of Recommended Reading - compiled by someone else. Just enjoy reading all that stuff. 
  • Checked out: The Payment Application Data Security Standard (PA-DSS), formerly referred to as the Payment Application Best Practices (PABP) 
  • Was the Silk Road bust assisted by the NSA? Maybe? Who knows. Where are the packet logs? - And the story goes on: FBI's explanation
  • JSON Web Algorithms (JWA, JWS, JWE, JWK) - Standards for JSON encryption and signing. - JSON Web Tokens (JWT)
  • Just my short thoughts: "transport sftp, ftp, sftp, RESTful, HTTPS and data sources xml json csv sql mongodb key value storage or any other. Data source is just data source and I'm sure I can deal with it." 
  • Devops Days Helsinki - Like I've written earlier, DevOps isn't the ultimate solution, because DevOps teams often lack the set of other skills needed to sell, define, offer, build, maintain and support systems. 
  • This is exactly what I've been writing. Poor UI Design Can Kill
  • Just quoting some devops stuff (translated from Finnish): 'The basic idea is to bring the traditionally separate developers and system administrators into close collaboration. It's a major shift, thanks to which the pace of software production speeds up, quality improves and costs fall. "The essential thing is to understand that DevOps is above all about a business process. The fundamental idea of DevOps is to turn an idea into a product as efficiently and quickly as possible," Stordell says, describing his own view of the nature of DevOps.' - Good question among all that high-level blah blah: does it really make any practical difference? What if the same guys were responsible for everything, the software developer handling everything from start to the very end?  
  • I like the concept of chaos engineers. But what if everything is pure chaos even without them?
  • The laws of shitty dashboards - Just so true.
  • I was asked if I'm interested in a major project doing distributed WebRTC utilizing HTML5. Well, this time I wasn't available. Yet the project sounded interesting. 
  • Open Data Finland (Avoindata in English)
  • Why is Google in such a hurry to kill SHA-1?
  • A great post about Wifi beamforming
  • Some TLDs still don't support IPv6; one of those is the .at TLD. Nor do they support DNSSEC. 
  • Great post: The Art of Profitability
  • How handy would e-receipts be? Our tax authorities remind everyone to issue receipts.
  • Had interesting discussions with friends about whether a database should contain all available information or only the required information. This is actually quite a good question, because it depends on so many factors. In some cases it's really handy to have everything in the database. But from the performance point of view, having everything in the database can be really bad, especially if the whole record gets rewritten on update due to a bad database engine. It can drastically increase memory and disk I/O requirements due to database size growth. 
  • More closed source solutions, Google deprecates OpenID 2.0 and forces users to use Google+.
  • Latest OpenID specifications
  • Finland is planning to strengthen its national cyber warfare unit and its preparedness for hybrid wars. 
  • Making sure crypto stays insecure - An absolute must-read article. This is how things are ruined behind the scenes and the odds are set against you.
  • A great TED talk: Big data is better data by Kenneth Cukier

Not enough? See parts 1 and 2.

ETag and gzip decorators for Bottle.py

posted Mar 2, 2015, 7:32 AM by Sami Lehtinen   [ updated Mar 2, 2015, 8:17 AM ]

from functools import wraps
from gzip import compress
from base64 import b64encode

# Note: request and response come from Bottle; this import was missing
# from the original listing.
from bottle import request, response


def gzipped():
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwds):
            rsp_data = func(*args, **kwds)
            # Compress only when the client advertises gzip support.
            if 'gzip' in request.headers.get('Accept-Encoding', ''):
                response.headers['Content-Encoding'] = 'gzip'
                rsp_data = compress(rsp_data.encode('utf-8'))
            return rsp_data
        return wrapper
    return decorator


def etagged():
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwds):
            rsp_data = func(*args, **kwds)
            # Derive an 11 character ETag from Python's own hash();
            # adding 2**63 shifts the signed hash into the unsigned range
            # that to_bytes() requires. Beware: str/bytes hash() is
            # randomized per process (PYTHONHASHSEED) since Python 3.3,
            # so ETags change across server restarts unless the seed is
            # pinned.
            etag = '"%s"' % b64encode(
                (hash(rsp_data.encode('utf-8')) + 2**63)
                .to_bytes(8, byteorder='big')).decode()[:11]
            if etag == request.headers.get('If-None-Match'):
                response.status = 304
                return ''
            response.headers['ETag'] = etag
            return rsp_data
        return wrapper
    return decorator

I'm very aware that this ETag handling won't make things lighter for the server. But it still makes getting the response faster, especially for mobile clients.
If you have an easy way of validating the ETag before actually generating the content on the server, just move the ETag check above the rsp_data = func(...) line, so the call to the decorated function is completely avoided when you can return the 304 from that stage. Both of these decorators are designed to be used only with fully dynamic content. Both work well with templates and such. 
It's recommended to use max-age=0 instead of no-cache for content which should be cached but could still get invalidated quickly. ETags help with that.
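A minimal usage sketch (the /hello route and its payload are made up for illustration). Note that decorator order matters: etagged() goes innermost, so the ETag is computed on the uncompressed body before gzipped() compresses it.

from bottle import route, run

@route('/hello')
@gzipped()
@etagged()
def hello():
    return '<h1>Hello ETag + gzip</h1>'

run(host='localhost', port=8080)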

kw: uwsgi, bottle.py, python, programming, webdevelopment, websites, webdeveloper, http, header, headers, content compression, deflate, last modified.

Python hash function performance comparison

posted Mar 2, 2015, 7:13 AM by Sami Lehtinen   [ updated Mar 2, 2015, 7:26 AM ]

Many recommend using sha256, but it's pointless to use if it's not required. It generates a longer hash and is about 8x more expensive to compute than adler32. Python's own hash() function is the fastest of all; it's almost 4 times faster than adler32.

Python hash()    13 units
zlib.adler32     49 units
zlib.crc32       91 units
hashlib.md5     180 units
hashlib.sha1    179 units
hashlib.sha256  403 units

There's a reason why hash() is so much faster. Old fashioned CRCs (crc32) and sum functions (adler32) can't fully utilize the features of modern platforms, because they process data in small chunks. Computing a 64-bit hash using 64-bit blocks is naturally faster than computing a 32-bit hash using only 8-bit blocks like adler32 and crc32 do.
If I were re-inventing a light hash function, I could do something like this:
Sum or XOR (?) each 64-bit block of data into a buffer, then rotate the buffer left by one bit, wrapping the top bit around to the right, and repeat with the next block. Super simple and fast (?); I don't know if it behaves like a good hash should, but it's a quick way to find out if data has changed. A toy sketch of the idea follows.
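Here's a minimal toy rendering of that rotate-and-XOR idea (my own naive sketch of the description above, definitely not a vetted hash function):

def toy_hash(data: bytes) -> int:
    MASK = (1 << 64) - 1
    if len(data) % 8:  # zero-pad to a multiple of 8 bytes
        data += b'\x00' * (8 - len(data) % 8)
    h = 0
    for i in range(0, len(data), 8):
        h ^= int.from_bytes(data[i:i + 8], 'little')  # fold in a 64-bit block
        h = ((h << 1) | (h >> 63)) & MASK  # rotate left, wrapping the top bit
    return h

One obvious weakness of the zero padding is that inputs differing only by trailing zero bytes collide, which is one more reason this is only good for quick "has the data changed" checks.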

kw: python hash speed performance cycles timing time resources compared compare faster fastest slow slower compare functions hashing programming security

Generic thoughts about decentralized (p2p) systems

posted Mar 2, 2015, 7:07 AM by Sami Lehtinen   [ updated Mar 2, 2015, 7:08 AM ]

Just a quick thought dump:

It has been seen over and over again that people don't want and don't care about decentralized systems. A major problem is that decentralized systems are basically mobile hostile. Some companies have used these solutions to limit the burden on their servers, pushing the burden to clients, which are then unhappy about it. Clients can consume a lot of CPU time, memory, disk space and disk access, cause a lot of network traffic, and potentially be used to generate DDoS attacks, malicious traffic etc. All are great reasons not to use decentralized solutions. Zero configuration is also basically impossible, because you have to bootstrap the network somehow; fully decentralised solutions still require bootstrap information. That is unfortunately hard enough for many and therefore works as an efficient show stopper. The last nail in the coffin is that most people really do not care about security at all; the user base is after all just a small bunch of delusional geeks. Otherwise something like RetroShare would be widely used. I've been considering and wondering about these same questions for several years. For Bitcoin style messaging, check out Bitmessage. Yet it's also technically a bad solution, because it doesn't scale and the client and messaging require way too many resources, especially when considering mobile clients.
More stuff about distributed solutions: I was personally quite sure that trackers and other download sites would be replaced with distributed systems. Instead, sites like TPB and others continue to use centralized systems. I'm of course pro distributed & anonymous systems; that's the reason why I've been studying those for years. I'm kind of disappointed how little those are used, if at all.
When the TPB guys said they would make something cool, I naturally assumed they would release a fully decentralized, pseudonymous, anonymizing solution to replace BitTorrent, which is inherently dangerous because it leaks so much metadata and reveals to the whole world what you're downloading. Instead they released something (technically) lame, the PirateBrowser, which just allows accessing their site via Tor (basically working as a proxy) even if it's blocked locally.
I really do like the Freenet & GNUnet design, because relays store and cache data, quickly creating a lot of sources for popular data. Many systems have serious flaws; Bitmessage is not scalable with its current design. Most other fully decentralized systems suffer from flooding, Sybil attacks and metadata leaks. I personally played a little with Bitmessage, creating over 100k fake peers just for fun, and it worked beautifully, flooding the network with noise and generated junk control traffic. Because it's a distributed solution, up to 10k junk messages were stored on each peer, also wasting disk space. Decentralized design also makes nodes running the system vulnerable unless additional steps are taken to limit resource consumption. And once resources are limited, it's a great question what should be limited and how that affects the network.
Like the original article says, it's really hard to come up with a solution which wouldn't be bad in some way.
As mobile devices get more and more common, the fact is that it's really hard to beat the centralized solution. Of course the system can be partially distributed, so mobile clients connect to one spot only, with server hubs doing routing and data storage. But basically that's no different from email, DNS and the web at all. So no news here either. After all, email isn't a bad solution at all. It's secure (if configured correctly) and it's distributed. - Yes, I'm one of those running their own email servers, as are my friends.
The final question is how to protect metadata and hide communication patterns, as I mentioned in my post about Bleep. It didn't impress me at all. http://www.sami-lehtinen.net/blog/bittorrent-bleep-my-analysis
I'm not attacking the original article's writer. He has clearly done a lot of work, considered the options, and written clearly and well about it. I'm just thinking about these issues in purely technical terms. 
I'm happy to hear comments. I'm just cynical because distributed systems didn't turn out as popular as I would have expected and wanted.
What would I like to see? A perfect, fully decentralized P2P network running as an HTML5 app which wouldn't require installation. That's something I would currently rate as pretty cool.

My thoughts about Subspace anonymous messaging protocol for Bitcoin

posted Mar 2, 2015, 7:04 AM by Sami Lehtinen   [ updated Mar 2, 2015, 7:09 AM ]

Here are my thoughts about Subspace @ Github.

I'm going to completely ignore the presented use cases and write from a totally generic perspective, as if this were any other distributed storage system, thinking about the selected technical choices and potential problems.

Pros
  • Store-and-Forward - Potentially higher latency compared to low latency solutions, which means it's harder to link sender and recipient, at least in theory. This also allows communication when one of the parties isn't available, which is often quite a nice and desired feature. 
  • The idea of tagging and anonymity sets is something I like as a concept. Especially the possibility of a custom length parameter is interesting. But it can also lead to many problems which aren't immediately clear unless you really start thinking deeper. I'm covering a few of those models in this post.
Cons
  • Lack of Forward Secrecy - The same encryption keys are used all the time; there's no forward-secrecy ratcheting.
  • Store-and-Forward -> Potentially much slower than low latency solutions, depending on the implementation being used.
Questions
  • Server loading - If the server is doing all the message fetching, it could be seriously loaded. Caching is the key, but how often does the information need to be refreshed and how heavy is this operation? 
  • Caching - Should the server store fetched messages so it can deliver them to other clients too? If I were writing the server code, I would naturally cache stuff; disk space is so cheap and it would drastically improve user experience due to much lower latency. DHT lookups can easily take several seconds, and if a larger key space is requested it can take basically anything up to an hour. Depending on the sets used, client distribution and the values used for the length parameter, this could easily lead to a situation where all servers end up holding most of the messages in cache anyway.
  • Timing questions - Will the messages be fetched by the server as soon as the notification protocol announces that messages are available? Could the server also prefetch stuff? If stuff is fetched only when a client requests it, what are the performance and privacy implications? If data is prefetched, it of course improves performance. If data is not prefetched, it might again make client / data correlation easier (?) as well as make latency really poor.
  • Federation / notification protocol design and routing? - How would the notification protocol work? There has been some discussion that the servers should be notified about new data, but a DHT isn't great for that. Is there a parallel notification / gossip protocol available? There wasn't any description of one. This is basically what easily forms the next bottleneck: even if full messages aren't being tossed around, the amount of control traffic for the notification protocol could be really heavy if the network actually gets used. So it's not as distributed as it sounds at first. Without such a protocol, some keys (or key spaces) could be polled over and over again, which isn't good either, especially if the protocol doesn't have "no change, nothing new" packet types, an ETag style solution where the server can cheaply indicate there's nothing new without sending the actual data.
  • Can clients keep multiple key requests open using one long-poll request? What about requests which are really slow to fulfill?
  • What about traditional DHT poisoning and flooding attacks, which could be used to cripple the network? On servers without adequate control code, this could also lead to out-of-disk-space situations and huge consumption of other system resources.
  • Actually the design is such that it gives a high probability of indirect centralization - "Servers with low latency, high up-time, and robust anti-DDoS measures will attract more traffic" - which could quite easily lead to one service basically working as the backbone of the system. A server which fetches all messages and serves them instantly from cache would be the preferred choice for almost everyone; nobody wants to use a naive implementation which checks the DHT and then fetches messages before passing them to clients, which could easily take several hours, and of course all messages might not be available anyway, for multiple reasons. If Facebook allowed a federation protocol, what percentage of users would use servers other than Facebook's? Similar examples are Hotmail, Gmail, and so on: even though email is actually fully distributed, I guess the huge majority of email is handled by just a few of the largest providers. Or what about the Bitcoin mining pools which have had over 50% of the network joined under them? That's better than any of the other examples I could imagine: a fully decentralized system with one instance controlling more than 50%. Is it really distributed anymore? That breaks all the good basic governance rules of a distributed system.
  • Amount of information stored per client on the server - What kind of information needs to be stored at the client and at the servers? I have a clear image in my mind of how the server part would work. Clients only need to know which keys to request, so it's more up to the final client application what it does. But the server part could be pretty heavy. Long-polling requests from clients also take their toll if the service scales up.
  • Expiry? How long will the data remain on the original server or in caches? Ok, caches can figure out how to get rid of data, but is there any official policy for the "authoritative" server?

Just confusing stuff

  • "The messages would not be stored by the entire network, but rather only by the server to which they were uploaded." - Afaik, problematic design reducing network resiliency. Also this would create clear centralized targets to bring the network (functionally) down.
  • "but would operate as a node in the DHT, randomly download and store a configurable amount of messages from the network ― inserting its IP address into the DHT accordingly ― and serve them when asked." - What? Download messages? Wasn't it so that the messages are only stored by servers where those are uploaded directly to? A contradiction to previous statement.
  • "With enough nodes, each storing only a fraction of the total messages, we can guarantee all messages will remain reachable at all times, even if a server goes down." - Again doesn't match with the first statement. It would be also smart to mention that the fraction is overlapping fraction, not clear cut section of the message pie.
  • "Because messages are not stored by each node (like Bitmessage) we do not need to use a proof-of-work to prevent the network from being overrun with spam messages." - Flooding is fun and trivial if network is incorrectly designed. I'm waiting for the challenge.
  • "Because the network is server based, servers can implement traditional anti-spam measures without harming user experience." - Yes and no, who says you would attack this kind of implementation on "client level", you naturally would do it on inter server level aka on federation protocol.
  • "The protocol supports lightweight queries allowing the user to make the anonymity set/bandwidth trade-off." - This is nice thing.
  • "The protocol supports market based rationing mechanisms where necessary." - Much more details plz, this gets easily really complex.

My related post about decentralized systems.

KW: Bitcoin, OpenBazaar, P2P, IM, secure, anonymous, messaging, protocol, Subspace, federation protocol, server, servers, message, storage, distributed, messages, mesh, network, notification, announce, announcement.

Btw. I know there have been multiple discussions about Subspace already, but unfortunately I didn't have time to study those thoroughly, so my conclusions are based only on the originally provided documentation. There has also been some preliminary discussion about whether Subspace could be used as a storage system for OpenBazaar.

SaaS, Python, Decorators, Databases, Superfish, Gogo, Malware, IoT

posted Feb 22, 2015, 9:57 AM by Sami Lehtinen   [ updated Feb 22, 2015, 9:57 AM ]

  • SaaS business 40% rule 
  • What's new in Python 3.5
  • Python 3 logging - I just encountered an interesting thing. If I call logging.info('randomstuff') at any point, after that all calls to logging instances start to log to the console. Why is that? The documentation doesn't reveal any obvious reason, as far as I can tell. How did I find it out? Well, instead of the instance called logger, I accidentally wrote logging once, and that changed how all the other logging instances worked. Some discussion @ G+ - So after all it isn't a bug: the module-level call just installs a default handler on the root logger, which changes how records propagated to the root logger are handled by default. (See the sketch after this list.)
  • Simple Service Discovery Protocol (SSDP) - Enjoyed upgrading firmware on firewalls which were vulnerable to SSDP reflection and amplification attacks.
  • Checked out Gateway Load Balancing Protocol (GLBP)
  • Did some work with MS SQL 2012 Management Studio. I just can't believe how bloated it is compared to other similar tools. Installing it took a long time.
  • Asynchronous Python and Databases - Sometimes asynchronous stuff can save a lot of resources when the right subsystems are used. On the other hand, it can produce things which are really hard to debug. One of my projects now uses mostly fully asynchronous processing where there are just inbound and outbound queues. But in this particular case the solution is perfect. Why? Because everything is written truly asynchronously: requests are sent, and a response might arrive or not. When a response arrives, it's processed. So it's similar to most UDP related code; there's no state whatsoever being maintained. The queues are persisted to storage, so even killing the process won't screw things up. Of course a few responses could be lost, but that doesn't matter; the data will be requested again later. But the same rules apply: async isn't the ultimate solution. Should you always write multithreaded, multiprocessing or similar code? No? Yes? It depends on the situation. Using these technologies where they're not needed can make code seriously complex and brittle and cause totally new performance problems. (There's a tiny queue sketch after this list.) [AsyncIO, Coroutines, SQLAlchemy, Concurrency, event loop, tasks, I/O]
  • Similarly, the processes extracting data received from that subsystem are fully lock free. They run as parallel Python processes using a process pool and don't use any application or database level locks, providing maximal performance.
  • Something different: ARX 160, A2/AD for Submarines, Anti-shock body, FN SCAR, HK416, Watched The Real American Sniper by History Channel, Air purifier, Electrostatic precipitator
  • Laughed at how two guys tried to set a router's custom MAC address for over two hours. Then they came to me and asked if I could help, because they had an "invalid MAC address". I smiled and told them that there's no such thing as an invalid MAC address. What about just using another browser, so the JavaScript code wouldn't fail and you could set whatever MAC address you want? After they stopped using Chrome, which failed, and switched to old IE, suddenly all of the "invalid MAC addresses" became very valid.
  • Received a few questions about how NAT works with local connections and replied to it:
    I've encountered several different situations with that kind of configuration... It's really hard to say how certain devices work without knowing exactly what kind of configuration those are using.
    1) Smart routers route that traffic locally.
    2) With some routers it won't work at all.
    3) And some routers handle it like any other connection. Basically forwarding traffic to ISP router which bounces it back. Nice drawback is that you're consuming your Internet bandwidth with your own traffic.
    4) Proper firewalls usually allow you to configure it as you wish including all options mentioned above.
    5) In some cases even Carrier Grade NAT (CGN) can prevent loopback connections to the same network using public addresses, which is really annoying. I've encountered that a few times in the wild. The problem was fixed by using "internal addressing", but it's still very annoying that the public addresses won't work. Now that I mention it, I actually know one such network right now. It's really silly, because you can use server.example.com from everywhere except the actual network where the server is. When you're connected to that network you'll need to use 10.x.x.x addressing, which is, as far as I know, really duh. But no can do; it's not my network to administer.
    6) Of course NAT configuration allows all kinds of tricks which haven't been accounted for by most software or users: ports, IP addresses and anything else can be mapped on the fly. For example, all outbound traffic can use a different IP address than incoming traffic. Even if computers are in the same local subnet, they might have completely different external addresses, possibly from different subnets or even different service providers.
  • Explored yet another NFC based alternative payment solution. In this case the purchases are invoiced and the service provider also offers a credit option. So it's just another alternative credit card provider, not using the common payment infrastructure but 100% their own private infrastructure. As I've said, payment solutions aren't that complex at all. It seems that many are just making them overly complicated, or at least thinking they are complex, without understanding the details of the system. When you've implemented multiple payment systems and account & invoicing solutions, you notice that it's basically the same stuff, just in a different package: identifying the customer, getting strong authentication for the payment, and then actually collecting the money for credit purchases (credit card) from the customer. Of course this last step can also be done beforehand, aka prepaid cards.
  • Thank you Microsoft for improving the user experience. 2008 R2 says MSVCR100.dll is missing, but Windows 2012 gives just unclear crap about stuff not working. I just love totally pointless error messages. Things fail, live with it. 
  • Had to run sfc.exe /scannow to find out if some libraries are incorrect.
  • Had some problems with cx_Freeze and running those binaries on Server 2012; they work well everywhere else. Fixed the issue by installing the x86 libraries (vcredist_x86). This is related to the earlier whine about MSVCR100.DLL.
  • Had a very boring discussion with a guy who doesn't believe that for Samba and Server Message Block (SMB), TCP port 445 is the only port that needs to be opened for modern file & printer sharing. Yes, there are closely related potential security issues. I guess he has some other kind of configuration error and just believes the port stuff is what matters.
  • Design & Tech of BBC's new mobile site
  • Equation Group - There has been a lot of talk earlier about replacing device firmware with malware, but everybody's been saying that "it's not possible", even though it clearly is. This reminds me of early MBR viruses. - Link to older related stuff by a hacker.
  • The USB relay (Fanny) also shows clearly why a slow speed RS cable with LEDs is a better option than USB memory, which has too fast a connection & plenty of storage space.
  • Received "Purchace Order.exe" - Clearly targeted attack. No anti-virus tool detected it as a virus. But yet it doesn't make any other sense to send such files. Naturally I didn't execute it. - Heuristic analysis shows that it's probably Trojan / Spyware like application containing key logger, data stealing and botnet features. How nice!
  • I'm just so tired of reports which say that a program is taking 100%, 50% or 25% of CPU... Blah blah, all that BS! Those are absolutely meaningless values. If there's one thread which is in running state all the time, then there is one thread in running state all the time. It doesn't mean that the program is "using" 12.5%, 25%, 50%, 100% or whatever CPU utilization; that's totally dependent on the platform it's being run on. But still I see people talking about such things all the time, like "anti-virus product X is good, it's only using 25% of CPU"... Braindead. Did they ever consider that an anti-virus program using 100% of CPU at low priority might be much faster and yet "load" the system less than the anti-virus program which is "only" using 25%? I've seen so many darn programs which take 4 hours to process data, and when you look for the bottleneck it's locking and stuff like that. So basically there's no clear single bottleneck like disk I/O, CPU or memory; the programs mostly idle because they're written so badly. But is that a good thing? Because now the programs are consuming less resources, right?
    Btw. I see that multi-core programming has mind f... incompetent developers. They don't understand things like process priority at all. I'm so sad.
  • Tuned many templates related to one project and fixed ridiculous typos. It's funny what kind of stuff you can produce when you just write things out as fast as possible and never check them. I guess you've noticed that when reading this blog. Smile.
  • Finished a few data transfer projects at work. In these cases the customer needed a reliable and quick way to get data from A to B. The old solution was slow and unreliable.
  • 7 Key Saas Metrics You Need To Be Tracking - Old stuff, nothing to add
  • A new development environment for the Finnish National Service Channel (using ESB aka X-Road technology; Kansallinen Palveluväylä in Finnish) has been opened. - If I were unemployed, this is something I would be exploring in detail, mapping the possibilities it would offer.
  • Did some work with css, png, base64 encoding and basic stuff like that. Boring daily tasks.
  • A few incidents again. Data security is a total joke, and cloud or private cloud has nothing to do with it. Even public cloud services from major players are simply darn secure compared to all kinds of screw-ups by helpdesks and even direct, plain unauthorized use of data for totally different purposes. I always laugh when people tell me about their logging and other solutions. What if the backup administrator simply takes a full copy of the system and then does whatever he wants with the data? "We have this blah blah logging." Well, nothing shows up in that logging when people with the right privileges do their job. Or screw up... Stealing data and delivering it to third parties without authorization from the rightful data owner. Many guys don't even understand that third-party rights are also being violated there. Let's say you've been buying stuff from webshop X, and the company administering the webshop just releases that webshop's whole database. Now the privacy of that webstore's customers is violated as well. Does it really matter? Most probably not, but there can be unforeseen consequences. Do people understand stuff like this? Well, usually they don't understand, nor care.
  • Read articles about the NSA SIM key theft from Gemalto as well as several Superfish (Lenovo) related technical stories. All this is something that was thought to be "impossible", lol. The Intercept: The Great SIM Heist
  • Found SuperFishIEAddon.dll on a few Lenovo laptops. It seems that Windows Defender removes neither those nor the related registry keys.
  • Enterprise Application Integration (EAI)
  • Retail, web stores, online marketing: how do consumers shop today? This is how I do it - It depends on the retail sector, but I personally always compare and choose the product online first, and then I might pick it up from a store to save delivery costs. Of course daily groceries are different, but everything else is bought using this compare-and-select approach, then just picked up. Of course some dealers offer really competitive pricing + free shipping, which basically means I don't need to bother picking it up and can save time and money. That's the preferred option. The only stuff I don't usually order online is clothes and daily groceries.
  • Gogo is doing MitM and serving fake SSL certs - Noticed a similar issue on a Norwegian flight. I'm not sure if they were using Gogo; if not, they're still doing the same stuff. I've also seen some hotels doing it. At least one 5 star hotel in Slovakia served me invalid certs for my own server and broke banking security as well.
  • Internet of Crappy Things - I couldn't agree more! I see all the time that even businesses are absolutely full of seriously flawed and fatally mis-configured systems. IoT will make things at least 100x worse, even if we don't count in the NSA / intelligence agency tinfoil factor.
  • It seems that hubiC file upload is still broken - hubic.com hasn't yet fixed the annoying UX fail. If you upload a too large file, guess what: they'll just fail it at the end and *restart* the upload because it failed. That's just so... awww... Uploading a 2 GB file can consume 20 gigs of bandwidth when you just leave it running for several hours in a background tab and wonder when it's going to complete. Absolutely horrible user experience.
  • Fine tuned my bottle.py projects to use decorators extensively for stuff like ETags, per-URL user rights management, etc. Bottle is very similar to Flask and other light Python WSGI style frameworks like the Google App Engine SDK. The only question is what you should use to form the ETag. It's easy to mask all ETags to look similar using something fast like crc32. The problem is that per function (URL being served), you have to decide the best method for detecting whether the page has changed, or else fully render the page and only then conclude that the result is the same as earlier, wasting a lot of server resources. For these shortcuts I had to add one project-wide shared ETag validator configuration parameter. If it's changed, all existing ETags are invalidated at once, forcing full page reloads for pages which otherwise use something like a DB last-modified timestamp to lighten ETag processing. (A small sketch of this follows after the list.) If this isn't done correctly you'll end up with a crappy site, as so many are. If the site admins say you should empty the browser cache to get the site working, it only means the admins and webmasters there are totally incompetent. If a change is coming, they should shorten TTL values as well as take care of invalidating all existing ETags (at least where it matters) to force cache content to be properly refreshed.
  • It seems that I found two computers affected by   one Lenovo Yoga 2 and one   tablet. Thank you . You can't simply trust anything nowadays.
  • Checked out the HTTP/2 draft (httpbis http2) as well as the related HTTP/2 HPACK header compression. These things add just so much complexity. After a while the situation will be like it is now with TLS libs, kernels and many other complex things: only a few proper implementations exist, because things are so complex nobody wants to make their own from scratch. And even those few can be seriously buggy, as we have seen in the past.
  • Excellent post about Python, ORM and peewee by Charles Leifer.
  • Tori.fi seems to exclude some special characters from the password, so they don't store the password the user gave. This is just as bad as the stuff I've described earlier with some sites. Like Nordnet, which drops case information from the username but forgets to tell the user about it. Or at least they dropped it earlier; I haven't tried lately.
  • Dshell, a forensic analysis framework by the US Army Research Lab, using Python.
  • A good write-up about Facebook and its spying power. - Get your loved ones off Facebook
  • How to disable social features in Mozilla Firefox - I just wonder why every god... application has to be so full of s... social stuff. I hate bloat.
  • When going through a few competitors' programs, I noticed that at least two companies have ripped some parts of my Python code and used them in their own projects. Naturally without any attribution, but personally I'm just happy about it. I'm glad they liked it so much.
  • Checked out Honeytoken - Nothing new, just one adaptation of old stuff. 
  • It seems that my bank has really bad triggers; they're sending me the same loan offer several times a week by traditional mail. It's interesting to see what kind of transactions trigger automated marketing from which entities. But firing the same trigger on 4 days of the same week and receiving the same paper mail over and over again seems just ridiculous to me. Ok, they've failed earlier too. Sometimes they send me targeted offers for a service I'm already using. So they're using really simple triggers and skipping all the sanity checks that should be done.
  • Backblaze released the HD reliability report dataset
  • Memcached vs Redis comparison - A bit silly comparison, because Redis can do so much more than work as a simple cache, while memcached is designed to be just an expiring, distributed, fast key-value store.
  • Very nice list of non-standard data structures for Python
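
The logging pitfall mentioned above as a minimal sketch (the 'myapp' logger name is made up; the behavior follows from the standard library's automatic basicConfig() call):

import logging

logger = logging.getLogger('myapp')
logger.setLevel(logging.INFO)
logger.info('silent: no handlers are configured anywhere yet')

# Accidentally calling a module-level convenience function quietly runs
# basicConfig(), which attaches a console StreamHandler to the root logger.
logging.warning('oops')

# From now on, every logger whose records propagate to the root logger is
# emitted through that newly installed console handler.
logger.info('suddenly visible on the console')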
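
And a tiny sketch of that fire-and-forget queue pattern (all names are illustrative, newer async/await syntax is used for brevity, and real code would persist the queues to disk as described above):

import asyncio
import random

async def sender(outbound):
    # Fire-and-forget: push requests out without waiting for replies.
    for i in range(5):
        await outbound.put('request-%d' % i)

async def receiver(inbound):
    # Process whatever responses happen to arrive, stateless UDP style.
    while True:
        print('processed', await inbound.get())

async def flaky_network(outbound, inbound):
    # Simulates a lossy transport; lost responses don't matter, because
    # the data would simply be requested again on a later round.
    while True:
        req = await outbound.get()
        if random.random() < 0.8:  # ~20% of responses never arrive
            await inbound.put(req.replace('request', 'response'))

async def main():
    outbound, inbound = asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.ensure_future(receiver(inbound)),
               asyncio.ensure_future(flaky_network(outbound, inbound))]
    await sender(outbound)
    await asyncio.sleep(0.1)  # let in-flight responses drain
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)  # reap workers

asyncio.get_event_loop().run_until_complete(main())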
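
Finally, a small sketch of the project-wide ETag validator idea mentioned above (the names and the crc32 choice are just illustrative):

import zlib

ETAG_VALIDATOR = 'deploy-42'  # bump this to invalidate every ETag at once

def make_etag(last_modified):
    # Mix the global validator into a cheap per-page freshness token
    # (e.g. a DB last-modified timestamp), so pages don't have to be
    # fully rendered just to answer an If-None-Match check.
    raw = ('%s|%s' % (ETAG_VALIDATOR, last_modified)).encode('utf-8')
    return '"%08x"' % zlib.crc32(raw)

print(make_etag('2015-02-22 09:57'))  # all ETags look alike: short hex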

Modern SQL, ALPN, uWSGI, Malware, WiFi, Robots, AI, Blind Trust, Python, Big Data

posted Feb 15, 2015, 12:25 AM by Sami Lehtinen   [ updated Feb 15, 2015, 1:05 AM ]

  • Virtual Memory Intro - Nothing new here, but if you're not familiar with this stuff, it's still a good intro for you.
  • Something different Auto GCAS, Beriev Be-200, Steyer AUG, ADS Rifle, FAMAS, FN P90
  • Played a bit with the Mapillary service and uploaded about 200 photos around Helsinki.
  • The Cloud Conspiracy @ 31c3 - It's clear that they're doing whatever they want and can behind our backs. As the EFF has pointed out, they have no intention of telling us about any potential guidelines, so we must assume there are none.
  • Things in Pandas I Wish I'd Had Known Earlier - Nice post, good reading if you're interested about Pandas.
  • 25 tips for intermediate Git users - More good reading, with some beneficial tips which you don't often encounter even if you use git daily.
  • Modern SQL in PostgreSQL - I really loved this presentation. It's a great example of how things can be made drastically faster if developers bother to think about (or even understand) how their programs should work. Of course this requires that you bother to do that and know the features of the database engine you're using.
  • Checked out sqlcipher - If and when encrypted databases are needed. - Some suggested that there are newer object databases, but so what; SQLite3 is excellent. I guess it's one of the most widely used databases worldwide. That's why the main benefit of SQLite is wide compatibility. If you want to use objects, there are plenty of ORM solutions available which work with SQLite without breaking generic file & SQL compatibility.
  • Once again reviewed the backup procedures, making sure they work as they're supposed to and that automated monitoring is also working without problems.
  • Artificial Intelligence (AI) part 2 - kw: ASI, AGI, AI, Singularity, Kurzweil, AI Takeoff. - This is surely something to think about.
  • Got sick'n'tired of hopeless programmers, developers and engineers again. Let's see: we have problem X, which happens semi-randomly. Their fix? Well, you should manually rename the files, move them to another directory, then monitor whether new files appear and repeat the steps if they do, and so on. - Well, what about fixing the program so it would just work, without continuous manual monitoring and action steps? Have you ever considered that? Well, no. They don't get that crappy programming and bad design are causing these problems. The problem isn't that helpdesk or operations aren't continuously checking for "problem" files, restarting services and moving and renaming files to other path(s). Root cause analysis, anyone?
  • Once again got very frustrated by how slow Windows is. Some things are just ridiculously slow on Windows compared to similar tasks on Linux. Luckily I've got three workstations, so when the Windows one lags out I can use the others to keep going. Even multitasking works so badly that it's better to have separate physical workstations.
  • Studied ALPN RFC7301 - which should replace NPN for HTTP/2.
  • Boston Dynamics Spot - Our robotic overlords have arrived? - Nah, not yet, but progress seems to be pretty impressive.
  • Once again had not-so-interesting discussions about data quality. People checking statistics and analyzing data complain that the data is useless. Yes, that's true when whatever happens to get entered into the system. I've been thinking about multiple big data projects, and even if there's plenty of data, it's pointless to analyze it if the input is garbage quality. So only automated and well-thought-out data sources should be used, and everything should be fully automated. If there are manual steps, like reading a visitor counter every hour and submitting the data manually, it's going to fail so hard that it's pointless to even plan something that silly. Yes yes, in this case the data is logged and you could enter the data for a whole week or month at once retroactively, but it doesn't matter: it's just not going to happen reliably and correctly. Another thing which should be really simple and straightforward is doing store / stock location inventory. In theory it's a simple task: just enter the correct values and that's it. But there are just so many dumb ways to fail. Using RFID to read stock data might be expensive, but at least it would be up to date. Sorry for the consultant who's claiming that to be the perfect solution; it isn't. First of all, there might be tags which aren't readable for some reason. And even if the RFID tags were read 100% correctly, that doesn't guarantee correctness. How so? Well, people might have tagged the products incorrectly. I've seen it happen over and over again. So that's not the magic bullet we're all looking for.
  • Btw. one of the major problem cases I remember was one of the big Finnish tobacco companies printing incorrect barcodes on their products. Did they pull back the batch? Nope, they didn't. Then we received countless calls about the POS system malfunctioning and incorrectly identifying products. Which of course was complete BS from the customers; our system read and matched the barcodes exactly right. But it seemed quite hard for people to grasp that products could carry invalid barcodes. It would be so much fun if Coca-Cola shipped millions of 0.5 litre bottles with Pepsi 1.5 litre bottle barcodes. I can just imagine how messed up people would be about that, even though it's a really simple thing: incorrect code, and that's it, there's nothing to wonder about. - Well, that's just life.
  • Another thing I'm worried about is that people sometimes seem to blindly trust the systems. They just can't get the point that their system is malfunctioning and they shouldn't trust it. If I take the license plates from an expensive sports car, put them on some old junk and take it to the dealer, do they just read the license plate and decide they're giving me 200k€ for it? I guess not, but in some cases people really are that silly. The system says this is an expensive sports car, so it has to be.
  • What reminded me about all this stuff? Well, I tried to order stuff from a local web shop. I asked them to deliver it to my local post office. Their website offered only somewhat distant alternatives. So I sent email to their customer support asking: if you're using the standard national mail, why can't you deliver to my local post office? Customer support replied that "we can deliver to the following post offices", and the list in the email contained exactly the same somewhat distant options I had already seen. Then I was again like WTF: but why can't you deliver to the post office near me? It's the same network; there's no sane reason not to deliver. Is there some specific reason why you're pissing people off on purpose, discriminating against people living in certain places, or are you just playing dumb, or really that dumb? I included a link to the official list of post offices in that mail, highlighting the local post office. Then they replied: oooh, sorry, we didn't know there is a post office out there. And then they added the post office to their delivery system's option list. Why is nobody responsible for keeping the information up to date? Why did they give me a pointless reply from customer service the first time? Are they really so incompetent that they can't check the list of post offices before claiming there's no such post office? ... Or is this just typical customer service from people who blindly trust their malfunctioning and badly configured systems? - After one week it turned out that they can't deliver to that post office after all, because their current system doesn't support it. I've got no (positive) words for their great achievement.
  • Encountered yet another web trading place with totally incompetent and lame developers. Their fail? This is the same fail I've reported earlier, but now with the password instead of the username. So when I create a user account, I give a password like [x7/9-8'%O*(jPkQe7+y . Everything goes well, but when I try to log in, they claim the password is invalid. Then I use password recovery and they send my password back by mail. Guess what my password is: x798 . WTF again? First of all, where are the special characters? And what happened to the rest of the password after the ' sign? Is it really so hard to scrypt a unicode string using a random salt? (There's a small sketch at the end of this post.) That task seems to be quite impossible for some elite developers. Also, there shouldn't be password recovery at all, at least not the kind which returns the given password. Of course no user is so stupid that they would reuse the same password on multiple sites. But the whole point is that email based password recovery makes account hijacking just so much easier, so it shouldn't be available at all.
  • Played a bit more with MemSQL. The truth is that I got everything working pretty well with Python 3.4 64-bit (Ubuntu). I don't have any other feedback than the fact that with current dataset sizes I'm totally happy with SQLite3 using ':memory:' storage. When SQLite won't cut it, there's PostgreSQL. For simpler memory based key-value tasks, I just use a standard Python dictionary. Yet there's nothing about MemSQL I would be unhappy with. The main fact remains: I just don't have any real use cases for MemSQL right now with my current data sources and systems.
  • Once again wondered about engineering, server and network monitoring best practices. I just don't get how ... people are. Why do they install a system using the default configuration and then disable the firewall, before configuring any of the security or authentication options? When I told them they're doing things really wrong, they just explained that they're installing the system. Well, that doesn't cut it. It's BS. How about first configuring the system correctly and THEN opening the firewall? These kinds of things seem to be incredibly hard for engineers, administrators and most people to get. Yes, I might sound a bit annoyed, because I am. I simply can't stand people who make no sense at all. But this is only the very small tip of the iceberg. Most best practices seem to be common jokes: everybody knows them, but nobody cares or actually follows any rules. Funniest of all are the total excuses, which make no sense at all, for why the rules can't be followed. There's no reason whatsoever why things have to be done incorrectly; doing them wrong doesn't provide any benefit or time savings at all. It's just that nobody cares... Like in this example case. The funniest part is that they then start claiming there's nothing wrong with what they're doing. Yep, which further proves the point that they're clueless about what they're doing.
  • It's like the standard procedure when lighting up the fireplace, or the wood stove in the sauna, or whatever. First you light it up... Then, when the smoke alarm goes off, you start to think about whether you should open the outgoing valve (damper) in the flue. Really recommended and smart move. As is closing it as soon as there's no visible flame. "What's wrong with that? The valve works just fine if we do so." Well, that wasn't the point, but these guys are just way too clueless to even get it.
  • Let's play "so many dumb ways to die" here. - Nothing important, it's just so funny. If you fail hard, there's only one final resting place. The only good thing is that when you fail hard enough, it's not your problem anymore. 
  • This also proves the point about using a private cloud. Well, you can use a large public cloud provider, which hopefully follows at least some of the best practices. Or you can be a fool and think that a private cloud is much more secure, even though everything there is actually done with the attitude that if it barely works, it's more than good enough. So is the private cloud a smart move? It only creates a situation where you can be pretty darn sure that everything is more or less insecure, but it just seems to barely work somehow.
  • With a private cloud, you probably won't even use SSL certs for administration panels and stuff like that, purely because it costs money and the hassle doesn't seem reasonable for small projects.
  • The Top Mistakes Developers Make When Using Python for Big Data Analytics - The usual stuff. Luckily my data isn't so big that I would especially need to focus on speed. But I very much acknowledge the problem with experimentation when things run slowly. The article also really nicely explains the repeated failure patterns, like lack of understanding of timezones, as well as lack of automation as a source of constant semi-random problems due to human error. I've personally been doing these fully automated "hybrid" solutions. Just as she said: mixed original data source, Java filtering & preprocessing, and Python finalization & coordination. I really prefer to avoid that if possible. I always do it so that the Python task is launched first and then automatically calls and controls the other subsystems. The problem is that when such a system fails, it might take a while to find where the problem is. "Lack of data provenance" made me smile; it's such a usual story and happens all the time. The customer asks for immediate changes to production, and then at the end of the month they complain that something should be checked. But the truth is that the monthly figures can contain mixed results from multiple software versions and processes. What's worse, even recreating the data with the latest version won't solve the problem, because the data entry processes and such could have changed simultaneously. This happens over and over again in hasty, badly managed projects. Customers who require only monthly aggregate data make this much worse: with daily aggregates we could iterate daily, but with monthly in-production testing it can easily take up to a year to get things even close to working. And to make matters worse, the customer can require additional changes and change their processes during this time, making the task even harder. Lack of regression testing? Been there, done that. The problem is that nobody knows what the potential input values can be; you'll find out later. It's extremely rare to get proper examples beforehand. And as mentioned, even if things worked with the perfect examples, reality changes and you'll end up with a more or less broken program anyway. I did exactly that, well, not today but yesterday: I did exactly what the customer asked, but I'm quite sure they're not happy with it, and they'll soon be making more requests and asking whether old data can be reprocessed according to the new rules, and so on. Of course reinventing the wheel is also very common in many companies, and because of it, the best practices and commonly used methods aren't properly applied.
  • Checked out luigi - Couldn't summarize it better than they do: 'Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.'
  • I have to quote Karolina Alexiou: 'Working well with any tool comes down to knowing its shortcomings (and especially the ones that most affect your situation).' - I personally just couldn't agree more.
  • Read a few issues of The Economist - I just can't stop loving it.
  • Finland is leading in credit card payments - Do you always pay using a credit card? Do you even carry cash? I personally don't even remember when I last used cash. I keep only a few 50€ notes in my wallet for situations when cards won't work (extremely rare, even though I use cards several times daily). Basically I'm not using cash at all. I guess I didn't withdraw cash even once during 2014 (in Finland). But when travelling you inconveniently have to use cash. Except when I visited Sweden: I didn't have any kronor, because I always used a credit card.
  • Had annoying problems with the Windows clock. It's easy to fix time problems on networked computers using w32tm or net time, but in this case the computers aren't connected to a network, yet their real time clocks return semi-bad time. I've only received random suggestions from HP on how to fix it; it seems that they don't even know how to get the right time. So sad: modern computers, yet no reliable time. And I'm not talking about ns / ms resolution here, I'm talking about getting the time within +/- 1 hour!
  • Checked out WebRTC 1.0 draft - This is awesome HTML5 development and could potentially allow fully browser based peer 2 peer & mesh solutions.
  • Dark Patterns - How to trick users. It's annoying when mistakes happen due to bad or missing UI design. But it's even worse when sites are designed to deceive you.
  • Checked out GnuPG 2.1.2 release notes - No Windows binaries for latest versions yet. But soon ECC will take over old DSA/RSA keys.
  • Tinkered with Twitter integration APIs. Nothing more to say, but got it working pretty nicely.
  • Had my first major Python namespace collision problem. It took me something like 15 minutes to figure out why the logger module almost did what I wanted, but not exactly. Why did it take so long? Well, the new logger module inherited the old logger module, so it only slightly modified the old functionality; yet the new logger module came from a completely different project, so I didn't even know it was there. A great example of the dangers of 'from stuff import *' when you really don't know 'stuff' well enough. Pretty classic fail. (There's a tiny demonstration at the end of this post.)
  • Antivirus tools won't block malware efficiently - Sure. That's exactly why a whitelist is a much better approach than a blacklist. You can use AppLocker to lock the system down so that only a small number of known and actually required binaries can run. All other files are blocked.
  • Really neat project on how to Visualize Wifi (WLAN) Signals - It's not exactly news that signals vary greatly based on location, but this guy went a bit further than the usual simple wifi signal testing. I've talked a lot about Wifi signals, at work and in general, and I know the factors affecting them, but most people don't seem to realize how complex stuff Wifi is. There are no simple answers to Wifi reliability matters, channel selection and so on. - Signal variance is nothing new. I personally remember noticing in the late 90's that moving my GSM (900MHz) phone just about 5 centimeters on a table could take it from full signal to no reception at all. With analog radios you also notice how easily the signal changes with location; if you've ever played with a TV antenna in a bad reception area, moving your hand in another room might block the TV signal or make it crystal clear, even when you would swear there's no connection whatsoever. With 2.4GHz people often forget that interference from other sources can contribute significantly, so signal quality and signal strength aren't the same thing at all. Getting to the root of all these things requires a professional, which I'm not. So a single measurement labeled "signal strength" probably misleads you badly. Is it a good idea to select a wifi channel that doesn't have any other wifi boxes on it? Well, the reason might be that the channel is totally overpowered by local wireless CCTV or phones. That's the reason why nobody's using it for WiFi, and then you think it's a great idea to select that free Wifi channel?
  • Radio stuff is (truly) really tricky. With higher frequencies it's just like light: why are some things in shadow while others are well lit?
  • Hackers steal millions using malware - What do these guys use their skills for? Well, this is pretty bad. But it clearly shows how dangerous APT threats are. Even if "the database would be encrypted", it wouldn't make any difference in this kind of case. I've lately blogged about how dangerous this kind of threat is, especially when it's related to computer systems which people blindly trust. I've often seen unbelievable levels of trust in systems. People just can't get their heads around the fact that a computer system can tell them whatever lies and they shouldn't believe it. Anyone thinking of SCADA systems or any other fully computerized systems like ATC? (scada, atc, ics, control systems, control network)
  • My site had some downtime due to uWSGI changes. Suddenly it couldn't find libraries from the standard path and the python34_plugin.so file wasn't loaded. The only quick fix I found was to place the library in the start path of the project, which is silly. I really don't know why the --plugin-dir parameter didn't work, nor why uWSGI couldn't find the plugin from the standard path or even from a custom path given with the parameter. So annoying. At least my server wasn't in debug mode, leaking all kinds of stuff. I'll need to explore this problem a bit more later, but I can't let it cause more downtime than it already has.
    Exact error message: open("./python34_plugin.so"): No such file or directory [core/utils.c line 3675]
  • DownOrNot down? - It seems that they're using App Engine with Debug mode on and dumping some config information... Lol...
"Traceback (most recent call last):
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/webapp/_webapp25.py", line 715, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/wmc/12.348534319863004081/don.py", line 306, in get
    self.work()
  File "/base/data/home/apps/wmc/12.348534319863004081/don.py", line 407, in work
    cloud = dc.fetch(wordcount, offset)
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2160, in fetch
    return list(self.run(limit=limit, offset=offset, **kwargs))
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2329, in next
    return self.__model_class.from_entity(self.__iterator.next())
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/datastore/datastore_query.py", line 3389, in next
    next_batch = self.__batcher.next_batch(Batcher.AT_LEAST_OFFSET)
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/datastore/datastore_query.py", line 3275, in next_batch
    batch = self.__next_batch.get_result()
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 613, in get_result
    return self.__get_result_hook(self)
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2973, in __query_result_hook
    self._batch_shared.conn.check_rpc_success(rpc)
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1340, in check_rpc_success
    rpc.check_success()
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 579, in check_success
    self.__rpc.CheckSuccess()
  File "/base/data/home/runtimes/python/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 134, in CheckSuccess
    raise self.exception
OverQuotaError: The API call datastore_v3.RunQuery() required more quota than is available.
"

Locking, Deadlock, Backdoors, Learning, TOR, Concurrency, Isolation, Ansible, Salt, MemSQL

posted Feb 7, 2015, 2:52 AM by Sami Lehtinen   [ updated Feb 7, 2015, 2:53 AM ]

  • Reminded myself about PostgreSQL explicit locking
  • It's interesting to see how some topics pop up again and again. One developer was wondering why everything worked well in his development environment but hung semi-constantly in production. He didn't know about the situation called a deadlock. See deadlock for dummies. (There's a minimal deadlock sketch at the end of this post.)
  • CPU backdoors - Post about potential hidden backdoors in CPUs.
  • Why learning to code is so hard - Learning this takes time, and people run into the same 'tricky things' over and over again. Like the deadlock stuff above.
  • Once again encountered a web designer's dream: an absolutely horrible mobile site implementation. First I open page X from the site, and the full-size page loads completely, taking forever. Then there's a popup suggesting that I should use the mobile site. Fine, I'll use it. Next thing I notice, I'm thrown back to the front page of the site, and not to the mobile version of page X. I just don't get what kind of guys develop sites like this. There are multiple fails: why load the full site at all? Why show the popup only after loading it? Why redirect the user back to the front page? Hopeless. Let's see if they're able to fix it. And this site, www.yrittajat.fi (a Finnish entrepreneur site), isn't the only site designed this badly. I've seen multiple others, and I've even dropped some from my bookmarks, because I just hate crappy sites. Thank you so much for delivering very bad UX, wasting bandwidth, showing popups and finally frustrating users when they have to look for the article / page again.
  • Updated my own mail system to use the latest versions of postfix, dovecot, roundcube, gpg and so on. Also fixed issues with the RoundCube date column, which was empty with the previous version even though time zone information was configured in php.ini for apache2 & RoundCube. Still using sqlite3 with RoundCube, because it's actually the optimal solution in this case, compared to overly heavy, bloated database solutions.
  • A Dell server RAID controller blew up. The good thing was that the customer was using RAID1, so I was able to recover data from the disks directly without using the RAID controller at all. This is one of the benefits of RAID1 which might be forgotten at times; recovering data from more exotic RAID configurations might be practically impossible. Was there a backup? No, why would there be, we're using a redundant and reliable disk system. ;) - Business as usual.
  • Read MongoDB 3.0 documentation and especially focused on the $isolated operator, which lets a write that modifies multiple documents complete without other operations interleaving. This is actually very close to how Google App Engine Datastore entity groups work by default. Also checked its concurrency FAQ: a “readers-writer” lock is also known as a “multi-reader” or “shared exclusive” lock. (A short $isolated sketch is at the end of this post.)
  • Security Now #493, TOR Not So Anonymous - Again, one of the major problems with TOR is its low latency, and it doesn't protect against traffic confirmation attacks.
  • Check out Practical Data Science in Python - It uses pandas, scikit-learn (sklearn) and numpy. kw: Naive Bayes, Nearest Neighbour (kNN), Bagging (Random Forests). Nice charts, example data and code. An excellent resource if you need to bootstrap your Python data science.
  • There's a reason why donating to GPG is really important. Here's a great article about it. - Maybe Werner should just ask how much the NSA would be ready to pay for trivial access to encrypted data.
  • Still baffled how most developers and operations guys seem to think that sending email is such a complex thing. They don't understand that some SMTP servers are firewalled and others require credentials. On top of that, they don't understand how SPF works and why it can prevent recipients from receiving emails that have been forged using unauthorized email services. The same fun goes on every year, day after day. After endless discussions, results are poor and no basic understanding exists of how things should be done. It's wonderful to notice how incredibly complex some simple things are. Another interesting thing is that people don't understand that mail can be delivered directly to the recipient's server; there's no need to use SMTP relay server(s) when delivering email to a single destination. (See the direct-delivery sketch at the end of this post.)
  • Received some pretty ridiculous offers from HP about system service. I wonder who said that Apple pricing is expensive? Ha, that's nothing compared to HP service pricing. 
  • Continued to check out different orchestration aka configuration management solutions. [ Ansible, SaltStack, Chef, Puppet, Heat/HOT, Juju, BoxStarter & Chocolatey, Fabric, Invoke, Capistrano, Func and Docker ]. Created and pushed one (private) test container to Docker Hub.
  • Some projects contain multiple complex dependencies which require compilation and installation. It would be easier to deliver everything in one Docker container, ready to run.
  • SaltStack or Ansible? - Ansible and Salt: A detailed comparison - Still need more time to check these out when required. Currently there's no real need, but the situation is nearing the point where it would be helpful to have an easy way to get everything installed. In most cases a ready image is used, so there's no need to configure the system separately, but that's not always a viable option.
  • Based on quick checks I might try BoxStarter and Ansible; those are very different solutions, so it remains to be seen which one is better for our needs. I assume both solutions require writing some custom PowerShell scripts to manage many of the system security, access and related settings, but that's not a problem for me.
  • Some cloud service providers, like UpCloud, have their own API (the UpCloud API) for managing virtual servers and resources.
  • Played with MemSQL for a few hours. - Official MemSQL site - Nice, if you need it. In my case I don't currently see any use. A database is a database, and if I've got something in memory, I most likely have it in some of the standard Python data structures: data in one primary dictionary, plus link dictionaries that 'index' the keys I need to look up in the primary dictionary (see the indexing sketch at the end of this post). Works nicely so far. MemSQL might simplify some of the multiprocess & sharding and synchronization / transaction / locking stuff, but I haven't yet had any problems with those.
  • PyPy 2.5.0 released, which includes support for Python 3.2.5. For most of my projects I'm using the current 3.4 version, but often there's no reason why the code wouldn't run with Python 3.2.5 or PyPy, except for a few native binary libraries which would require some work before being compatible.
  • Public NTP service (Finnish) provided in Finland by MIKES
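
A minimal sketch of the deadlock mentioned above, using plain Python threads and only the standard library. Two database transactions taking row locks in opposite order get stuck in exactly the same way:

    import threading, time

    lock_a = threading.Lock()   # think: row A
    lock_b = threading.Lock()   # think: row B

    def worker(first, second, name):
        with first:
            time.sleep(0.1)     # give the other thread time to grab its lock
            print(name, 'waiting for second lock...')
            with second:        # never acquired once both threads are waiting
                print(name, 'done')

    t1 = threading.Thread(target=worker, args=(lock_a, lock_b, 'T1'), daemon=True)
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a, 'T2'), daemon=True)
    t1.start(); t2.start()
    t1.join(timeout=2)          # both threads are now stuck: a deadlock
    print('deadlocked:', t1.is_alive() and t2.is_alive())

Databases detect this and abort one of the transactions; plain locks just hang, which is exactly the 'hangs semi-constantly in production' symptom.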
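
A short sketch of the $isolated operator from the MongoDB bullet, using pymongo's classic update() call. The collection and field names are made up, and this assumes a local, non-sharded mongod ($isolated doesn't apply to sharded clusters):

    from pymongo import MongoClient

    db = MongoClient().testdb   # assumes a local mongod

    # $isolated goes into the query document: once the first matching
    # document is written, other readers/writers won't interleave with
    # the remaining writes of this multi-document update.
    db.accounts.update(
        {'status': 'active', '$isolated': 1},
        {'$inc': {'balance': 10}},
        multi=True,
    )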
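
A minimal sketch of delivering mail directly to the recipient's server, as mentioned in the email bullet above. The MX lookup assumes the dnspython package; everything else is standard library, and all addresses are placeholders. Note that the receiving side may still check SPF against your sending IP:

    import smtplib
    from email.mime.text import MIMEText
    import dns.resolver   # pip install dnspython

    def deliver_direct(sender, recipient, subject, body):
        # Find the recipient domain's mail exchangers, best preference first.
        domain = recipient.split('@', 1)[1]
        mx = sorted(dns.resolver.query(domain, 'MX'), key=lambda r: r.preference)
        msg = MIMEText(body)
        msg['From'], msg['To'], msg['Subject'] = sender, recipient, subject
        # Hand the message straight to the recipient's own MX host,
        # no relay / smarthost in between.
        with smtplib.SMTP(str(mx[0].exchange).rstrip('.')) as server:
            server.sendmail(sender, [recipient], msg.as_string())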
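
And the primary-dictionary-plus-link-dictionaries pattern from the MemSQL bullet, sketched with made-up record fields:

    # Primary dictionary: record id -> record.
    users = {
        1: {'name': 'alice', 'city': 'Helsinki'},
        2: {'name': 'bob',   'city': 'Espoo'},
        3: {'name': 'carol', 'city': 'Helsinki'},
    }

    # Link dictionary that 'indexes' one field: city -> set of ids
    # to look up in the primary dictionary.
    by_city = {}
    for uid, rec in users.items():
        by_city.setdefault(rec['city'], set()).add(uid)

    # Lookup via the index, then fetch the full records.
    helsinki_users = [users[uid] for uid in by_city.get('Helsinki', ())]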

Small file handling & file queues (Linux vs Windows)

posted Feb 6, 2015, 11:24 PM by Sami Lehtinen   [ updated Feb 11, 2015, 8:05 AM ]

Got a bit annoyed due to poor disk I/O performance with one of my projects when using Windows. I just synced one of my git repositories with about 2000 files and it seemed to take a long time.
 
I found out that it's really true. Even a server with an SSD gives 3x worse disk I/O performance than an el cheapo VPS. The difference is even bigger when we compare the results to a Linux server with an SSD. The test generates files, gets a list of those files and then deletes the files by name, just to test out file queue (files in a directory) processing speeds, using Python 3.4 64-bit and the standard library. A sketch of the test loop is right below.
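
A minimal sketch of the kind of test loop used here (standard library only; the file count, file size and directory name are made-up parameters). Note that nothing calls fsync, on purpose, as discussed below:

    import os, time

    def bench(directory, count=10000, size=4096):
        payload = b'x' * size
        os.makedirs(directory, exist_ok=True)

        t = time.perf_counter()
        for i in range(count):
            with open(os.path.join(directory, '%05d.dat' % i), 'wb') as f:
                f.write(payload)        # no f.flush() / os.fsync() on purpose
        create = time.perf_counter() - t

        names = os.listdir(directory)   # list the 'queued' files
        t = time.perf_counter()
        for name in names:
            os.remove(os.path.join(directory, name))
        delete = time.perf_counter() - t
        print('Create %.3f' % create, 'Delete %.3f' % delete)

    bench('filequeue_test')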
 
[Chart: 10k files, 4k / file; lower is better. Raw numbers below.]


Production systems run on Windows servers, but when I'm bored I often develop stuff at home on Linux, and I'm not going to use Windows at home. The funny thing is that a really slow el cheapo USB flash stick actually gives you better performance on Linux than a server SSD gives on Windows. Yeah, that's life. You might wonder about the results, but the point is that the test app does NOT force fsync to disk, as most programs won't.

This leads to a situation where the same program works a lot faster on Linux than on Windows. Of course this might be a problem if you assume that all data gets persisted to disk even without using fsync. I've written about this topic earlier.
 
Raw data and extended tests with larger files. Times are in seconds required to run the test loop.
 
Windows Server (SSD) NTFS:
Create 3.185
Delete 0.955
 
Linux Server (SSD) ext4:
Create 0.276
Delete 0.051
 
Linux Server (HDD using Writeback) ext4:
Create 0.270
Delete 0.054
 
Linux Server (Slow (4MB/s write) USB Flash) NTFS:
Create 1.735
Delete 0.393
 
Linux VPS 1 ext4, el cheapo, minimal resources:
Create 1.061
Delete 0.324
 
Linux VPS 2 ext4, el cheapo, minimal resources:
Create 0.916
Delete 0.279
 
Same tests with 100x 16M / file:
 
Windows server (SSD) NTFS:
Create 10.359
Delete 0.124
 
Linux Server (SSD) ext4:
Create 2.416
Delete 0.379

Additional edit 2015-02-11:

Same test with NTFS using the same slow USB2 stick with Windows 8.1. This is a good apples-to-apples comparison, because the same program was run on Windows and on Linux using exactly the same file system and storage medium:
Create 133.55
Delete 51.88
This really should blow your socks off. No, it's not 70% slower, it's over 70 times slower than the Linux run. Deleting files is over 130x slower.
