Blog

My personal blog is about stuff I do, like and dislike. If you have any questions, feel free to contact me. My views and opinions are naturally my own personal thoughts and do not represent my employer or any other organization.


AWS, PostgreSQL, Cloudflare, Equinix, Pyflame, Infrastructure, OVH DDoS, Cloudfront

posted Nov 26, 2016, 3:00 AM by Sami Lehtinen   [ updated Nov 26, 2016, 3:01 AM ]

  • Amazon AWS is adding new regions in Europe: London, Paris, etc. That's great, yet many Finnish sites are still served from Dublin.
  • PostgreSQL 9.6 Released - Awesome! I really like PostgreSQL. There are so many technologies you can add to your project, but the fewer you can make do with, the better. Each new technology adds more to know and more ways to fail, miserably. Been there, done that. Especially the PostgreSQL full text search features are great for most of my purposes with not-so-massive data sets (see the query sketch after this list). Friends have been suggesting Apache Lucene (Solr, Elasticsearch), Xapian and stuff like that to me, but no thanks so far. Of the new features, progress reporting sounds great, and index-only scans for partial indexes also sound pretty beneficial for certain use cases. Binary hot-backup improvements are always welcome; backups are one of the most annoying things with some databases. kw: pgsql, postgres, indexing, Full Text Search (FTS).
  • Cloudflare finally opens its Helsinki, Finland data center. - This is something I've been waiting for a long time. Yet as blogged earlier, latency to Stockholm wasn't that bad anyway. It seems that pretty many ISPs are still being served from Stockholm, but I guess they're doing it like they did with Moscow: gradually running it in and load testing it. Also, they chose Luanda over Lagos; that's interesting. AFAIK Lagos would have been better connected. Actually Cloudflare's network is now really dense. It's hard to see how many more PoPs would make any meaningful difference anymore. If I remember right, their first goal was to bring everyone within 20 ms reach. Stockholm was only 8 ms from Helsinki, so now they're adding PoPs (data centers in their terms) for sites which were already really well connected. This was obvious to me already when they added Copenhagen and Oslo, because those were already really close to existing sites in Europe. Even areas like South America and Africa have been 'sufficiently covered', which is really awesome. So many CDN networks do not have any PoPs in those regions, or have just one.
  • Equinix is coming to Finland too. This is just more awesome technology news.
  • Played a little with Pyflame. Not a really useful tool for me right now, but a very good tool to know if and when basic profiling is required. You never know when you'll start lacking required performance in production without having the one obvious failure point which is clearly the source of all the trouble.
  • Three infrastructure mistakes your company must not make - Great post! I was at one point really considering AWS and Google App Engine, because those are being hyped so much. But there are restrictions on how things work, and the cost profile gets quite bad, just as the article says. One way is of course to start building all kinds of workarounds which lower the costs. But building those workarounds wastes resources and makes everything more brittle and complex. Also, Google App Engine is a real "Cloud Jail". If you build on it, you don't have an alternative. If they kick you out, ouch. If they raise prices, ouch. If they terminate the service, ouch. No can do! If you do a bang-for-the-buck analysis on AWS, the results are quite horrible. Everything is seriously overpriced. I think many people using AWS services don't actually pay the bills themselves. It's easy to say something is nice if you don't really care about the cost. All those board & gross margin discussions are very familiar and great examples. The same applies to technical debt: we're spending a ton of money on systems which totally fall apart. Yes, that's the debt's interest being paid now. - Yet just as bad is using random hipster tools which might not have a future. - The monitoring parts were also obvious; traceability and accountability are sometimes really hard in distributed systems if something simple like a unique id is missing. I've written about this over and over again. Application Performance Monitoring (APM) & distributed tracing.
  • OVH's DDoS protection is clearly working, because we didn't notice anything at all when they were hit with a 1.5 Tbps attack. I could imagine many providers would be having a bad day.
  • Many Finnish sites are also using Cloudfront, even if it's really non-optimal for Finnish users. It's also a bad choice for sites whose main language is Finnish, because it's pretty sure that 99%+ of users visiting such a site are actually physically in Finland. News in Finnish isn't widely followed globally. It's not like CNN or BBC.
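
A minimal sketch of the kind of full text search query I mean, assuming psycopg2 and a hypothetical docs(id, body) table; the GIN expression index in the comment is the usual companion for making it fast:

    # Assumes: pip install psycopg2-binary, and a table docs(id serial, body text).
    # The usual companion index:
    #   CREATE INDEX docs_fts ON docs USING GIN (to_tsvector('english', body));
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # connection string is an assumption
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
            FROM docs, plainto_tsquery('english', %s) AS query
            WHERE to_tsvector('english', body) @@ query
            ORDER BY rank DESC
            LIMIT 10;
        """, ("postgres full text search",))
        for row in cur.fetchall():
            print(row)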

Scaling Micro Services, Jugaad, Performance, Test Mode, Failure Testing, Old Code, Build vs Buy

posted Nov 26, 2016, 2:56 AM by Sami Lehtinen   [ updated Nov 26, 2016, 2:57 AM ]

  • GOTO 2016 - What I Wish I Had Known Before Scaling Uber to 1000 Services. The Uber engineering talk said it well: everything is a trade-off. It is, like building around problems. Haha, been doing that over and over again. It's horrible. Technical debt: throw in more semi-bad code to fix really bad code. Ouch!
  • Another great thing was something I've mentioned so many times: when you've got experts, they're experts in their own field and don't want to know anything else, which causes serious issues in "understanding your thing in the larger context". Yep, seeing the big picture. As well as understanding that the small change we're asking you to do, which you claim is impossible because it isn't the right thing, could still save 100x the work because nothing else would need to be changed. That's what I've also been doing over and over again. Thing X does 99.8% of what's needed. Can you add that 0.2%? No, it can't be done. Ok, then I'll rewrite that 99.8% badly so it does the 49.4% I needed for that project, re-implementing most of the code we already had, adding that 0.2% and leaving many things out because they're out of my implementation scope. Now if anyone uses my code and needs any of the features I left out, they're going to have a really bad time. Because the code I wrote is focused on getting the 0.2% done which was the key factor, the rest of the code might be bad and, as said, it might completely fail on the features I haven't even planned or implemented. Boom. Great, just great. But all this because the 0.2% was impossible to add where it should have been added. The only way to remain sane with those tasks is to think: this is good generic learning. I'll learn in detail how the system works and can say I've implemented feature X, even if in the big picture this kind of coding doesn't make any sense at all. - If we take a car analogy: can we add optional connectors for a roof rack to the car? No, it's impossible. Ok, let's hack something together using an old scooter, bicycle parts, some iron pipes and an old bed. Now it's done, and we can transport things which could have been transported using the car's roof rack. Task done. - Horrible, yes, it's really that bad. But at least I got it done jugaad style; it works and does what's required. It also added a ton of technical debt and multiple potential future failure points. If you Google for jugaad technology you'll find many great examples. Yet I often do the key parts so well that it actually works much better than expected from the car example. It's still a solid scooter engine module, there's nothing wrong with using old bicycle parts for wheels and a bed frame as the rack. It just works. But it ain't pretty.
  • The performance discussion and points were also great. But I guess it's universal in software: nobody wants to fix code which seems to work but wastes huge amounts of resources. It works, I don't want to touch it. So what if it consumes tons of disk space, taxes the network, overloads memory and wastes CPU cycles. A linked list works just as well as a dictionary, you'll just need to add more cores. Or something like that. Performance fanout was a great example in the talk. Distributed tracing. Repeated calls to semi-slow code. Just way too familiar.
  • The option for a 'test' mode in production. That should have been obvious. I'll also often do logging so that if something bad happens, I can replay the data (see the replay sketch after this list). It has proven to be very useful in solving rarely occurring issues. It's also great for testing, because a test case can be 'replayed' against the system.
  • Failure testing, ouch! That one hurts. Sure, for critical parts it's required. I've had my share of 'surprising failures' which technically aren't that surprising at all, usually. Like execution ending at any random point. It just can happen, nothing special about that. If your code misbehaves in such a case, that's too bad.
  • They talk about old code being 6 months old. Ouch, in my case it's more like 15+ years old. And I still remember well how it works, so it isn't that old yet. ;)
  • Build vs buy trade-off. That's always a great question. Also: do we co-operate or compete with company X? - Generally a good point about keeping strict focus, because wasting time on something which isn't strictly your product is just consuming resources from building your product.
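
A sketch of the replay logging mentioned above; the file name and the process() stand-in are made up for illustration:

    import json, time

    LOG_PATH = "requests.jsonl"  # hypothetical append-only log

    def process(request: dict) -> dict:
        return {"ok": True, "echo": request}  # stand-in for the real logic

    def handle(request: dict) -> dict:
        # Log every incoming request before processing, so a failing case
        # can be replayed later against a test instance.
        with open(LOG_PATH, "a", encoding="utf-8") as f:
            f.write(json.dumps({"ts": time.time(), "request": request}) + "\n")
        return process(request)

    def replay(path: str = LOG_PATH) -> None:
        # Re-run every logged request, e.g. against a system in 'test' mode,
        # to reproduce rarely occurring issues.
        with open(path, encoding="utf-8") as f:
            for line in f:
                process(json.loads(line)["request"])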

Unique ID, Sanity Checks, I/O Latency (Ceph), Yuan/SDR, Databases and locking

posted Nov 19, 2016, 10:12 PM by Sami Lehtinen   [ updated Nov 19, 2016, 10:13 PM ]

  • It's wonderful how people refuse to face the facts. Once again had a very long call about identifying transactions without a unique identifier. I think it's totally insane to try to process data in global systems with some random crap identifiers. The only way to make it work is to use solid unique identifiers (see the ID sketch after this list). To be honest, this is a topic which has now been discussed for several years. The secondary issue is the user experience. Current operators in the field, with their legacy software and bad processes, usually cause a delay of hours to days in data processing. They're totally not prepared for real-time processing and everything is based on decades-old technology. This is a perfect example of why it would sometimes be much better to clear the desk and throw out the old junk. Building a new system from scratch would be much easier than trying to add new flashy features to extremely old technology which nobody wants to maintain or modify.
  • Tutanota 2.11 finally allows viewing of email headers, that's awesome. The lack of it was one of the reasons why I really didn't like their service at all.
  • N+2 won't help anything at all if the systems are seriously misconfigured. There are just so many different failure vectors. Is there automatic protection against misconfiguration? Something like unit tests, checking whether the system will still be functional with the new configuration? That's a good question. - This is related to the previous Admin Post / DNS issues, but it's also a generic question. Why does a system accept a configuration which is clearly insane and would crash everything? How about doing some sanity checks?
  • Just wondering why the DNS server is configured as Server: 127.0.1.1 until I run sudo dhclient, after which it shows the stuff in dhclient.conf. Interesting. I guess that's also obvious if you just know what you're doing. I clearly don't.
  • Worked with HD Tune, bonnie++ and fio to check out some disk system performance issues. The results are still a bit unclear, but I'll try to get more analytics. The problem is the usual case: users claim the system is slow, I can see that disk I/O is maxed out all the time, but system administration claims there's nothing wrong with the system or the performance. Sigh. I hate being in this position, but this is all too common. It's almost always like this. Nobody ever says: oh yeah, it sucks, I'll fix it. Instead they'll spend two weeks working on reasons not to do a day's work. If I can collect some data I can publish suitably, so it won't annoy anyone too much, I'll do it. It seems that most tests write and read the same data repeatedly, which is very bad if your performance problem is caused by the tiered storage's random cold reads. Naturally the problem never shows up when you run the tests, because the tests don't test for that particular case. With HD Tune you can get results if you run it on a freshly booted server without any other tasks causing data to be cached, and let the backend systems 'cool down' for long enough. I think I'll get some results if I simply dd the whole drive to null; that forces reading also the areas of the disk which aren't commonly read. In some cases the I/O latency for these cold data hits is in the tens of seconds. - Some tests reveal that empty areas on disk are read much faster than the areas containing data. - Also, disk system jitter is huge at times: lat (msec): 100=2.75%, 250=0.96%, 500=0.13%, 750=0.02%, 1000=0.01%, 2000=0.05%, >=2000=0.07%. If you happen to hit a few of those over-2000 ms latencies in a row, it's going to be a bad day; this is exactly the problem I was complaining about. Disk I/O can suddenly stall for several seconds or even tens of seconds. Storage is backed by Ceph Block Storage.
  • Yuan to be included in SDR, interesting. China is clearly making important progress on global power scale.
  • Even more discussions with database engineers about extremely simple issues like counters. They were wondering why there are failures in incrementing the counter. This should be such basic CS stuff that I don't even know what there is to discuss. If you require a sequential counter, then there can be only one update running at a time, and that's it. Yes, it causes a rate limit. There's nothing to discuss about this matter. Yet they're still wondering why this is happening. - I thought CS people should be fairly logical with stuff like this. Another option is of course locking which prevents any parallel access to the data, but that's even worse than opportunistic locking which fails on commit - which I prefer in all of my projects, for a good reason (see the compare-and-swap sketch after this list). Opportunistic locking can be problematic if the database doesn't provide snapshot isolation at the beginning of the transaction. But knowing in detail how the database works, you can layer this as an additional layer on top of it. Compare and swap is the easiest way to implement "opportunistic locking". Yet in some cases, if there are latencies involved, this can make things go from bad to worse. Always form an overall picture of the situation before making decisions; there's no silver bullet out there which would make everything work magically.
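
A minimal sketch of what I mean by a solid unique identifier; UUID4 needs no coordination between systems, and collisions are not a practical concern:

    import uuid

    def new_transaction_id() -> str:
        # UUID4 is random; prefixing a timestamp would make IDs roughly
        # sortable, but plain UUID4 is already globally unique in practice.
        return str(uuid.uuid4())

    print(new_transaction_id())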
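
And a sketch of compare-and-swap style opportunistic locking for a counter, assuming psycopg2 and a hypothetical counters(name, value) table; the UPDATE only succeeds if nobody changed the row in between, otherwise we re-read and retry:

    import psycopg2

    def increment(conn, name: str, retries: int = 10) -> int:
        for _ in range(retries):
            with conn, conn.cursor() as cur:
                cur.execute("SELECT value FROM counters WHERE name = %s", (name,))
                (old,) = cur.fetchone()
                cur.execute(
                    "UPDATE counters SET value = %s WHERE name = %s AND value = %s",
                    (old + 1, name, old),
                )
                if cur.rowcount == 1:   # CAS succeeded, nobody raced us
                    return old + 1
            # lost the race; retry with a fresh read
        raise RuntimeError("counter too contended, giving up")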

Admin post - Massive (communication) DNS fail, Serious server availability problems

posted Nov 19, 2016, 10:03 PM by Sami Lehtinen   [ updated Nov 19, 2016, 10:18 PM ]

  • In one request we asked to update DNS and remove all information about a listed bunch of A records. First of all, it took them about a week to process this simple request. As always when doing something important, I paused for a while and rechecked the request before submitting it. I thought about it for a while and was happy with it. Several days later, at 15:28, they acknowledged the ticket and closed it, without mentioning what they had actually done. Around 16:30 I got the first alerts that systems were shutting down and things were going wrong. I was like: what is this? Is it DDoS? Because I had seen that during the same day on a few servers. I started looking for network metrics data and basic stuff like that. Checked the hosting provider's status boards, nothing. Checked network latencies, traced a few routes, checked if DNS works, checked network performance logs and charts. Then I noticed that some DNS queries were taking a bit longer than expected. Logs also showed that only one domain was being affected. I thought for a while: what if they've mishandled the DNS update request? But things still partially worked, so I wasn't too worried. Then I got a few DNS timeouts. I thought it's a bit strange. Are the DNS servers being attacked? Then I started to look up stuff with dig, and there it was. They had replaced our domain's DNS server entries (NS) with the names of the servers that were supposed to be removed from the DNS zone file. Sure, a slight misunderstanding. There's a double language barrier between the provider and us. Plus the server names being removed were named slightly confusingly, clearly not ns or dns but something quite similar. Darn. They had set the new DNS servers with a 7200 TTL, and it had been just over two hours since the DNS update at that point. I started frantically calling contacts at the DNS / domain administration to get this fixed. Finally I got the right guy on the line, and it took just about 5 minutes to get the problem fixed. Luckily it was something like 18:30 at that point and basically nobody noticed this major mishap. Phew. It would have been just so interesting to have waited until 9 am before doing anything about this, and then waited those two hours for the information to expire from caches and refresh. That wouldn't have been fun at all. So, a small misunderstanding, but automated monitoring, alerts + quick analysis of the situation saved the day. One thing which delayed me quite a bit was that I wasn't on a proper computer when the first failures got logged and alerts were sent. I just did very basic checks on mobile, to see if all the key things were working. As mentioned, things were working, because data was still being cached by the DNS servers in use. It delayed the actual start of analysis by more than an hour. After that I reached a suitable place to really dig in. The He.net Network Tools mobile app is just awesome; that's what I used for the first checks. Am I happy with the result? Of course not, it shouldn't have happened in the very first place. Yes, I got it sorted out reasonably once noticed. It's good to remember that I'm only a hobbyist with these things. - This is like an air crash investigation case: there were several overlapping things which caused the response to be much slower than expected. Also it was clear that if things were done manually, I would have expected them to check whether the new DNS servers respond to DNS queries, which they of course didn't, since those weren't DNS servers at all (a sketch of such a pre-flight check follows this list).
If they thought the request wasn't clear enough, why didn't they call me or confirm its true intent? Without proper confirmations anyone could easily spoof that kind of request on purpose. After the issue got fixed and there was a postmortem of the case, all of their guys wondered how it happened, because the request was quite clear. Probably the reason was that I sent the request in Finnish and the guys handling it spoke Swedish as their native language, even though it's a Finnish company. Had I sent it in English, I guess we wouldn't have this story.
  • Actually, I'm only a hobbyist in practically everything I do. Why? I do so many things that I can't really dig deeply into anything. More like a tech generalist. Sometimes it's greatly beneficial, sometimes it's really bad, because I'm practically incompetent in everything I do.
  • More complex and painful server issues. Servers just end up dead in the water. No reason whatsoever. Just so wonderful. Then the people who have access to the platform refuse to give any details. This is just the usual case; I've encountered similar situations over and over again. There's a suspected source, or it might have some relation to something else, but the people responsible for that part won't do anything. So the next step is basically trying to circumvent the problem somehow, or to change that component / platform / code / person / organization / supplier, etc. out of the chain. Pretty frustrating, especially in cases where the problem is quite small but still bad, like a server crashing rarely but corrupting a lot of data at the same time, or a delivery company losing shipments completely, and so on. It can't be tolerated in general. A totally horrendous day: server issues, DDoS issues, RDP issues, Windows being just dead, nothing in logs. But that's nothing. Also read the next bullet; it's basically the same problem I've been dealing with lately.
  • *Black Screen of Death*, Windows ends up dead in the water. Does anyone know what's causing this on VPS servers? I've been trying to analyze the problem, but it's really hard, because there's really little to analyze. Nothing in logs, etc. Often even the processes listening on sockets do technically work to some degree, but the logs don't show anything. I'm personally thinking it could have something to do with disk I/O being slow, or being so slow that writes / reads time out. But I don't know. It's super hard to get anything to analyze about this problem. Maybe I'll need to write a separate debug program which monitors all these aspects and reports over a socket about disk parameters etc. (see the watchdog sketch after this list). But what if it also dies without any sign? Any pro-tips? I can make some poor nerdy guesses, but it's really hard to come up with proper facts backing those guesses up. I've already placed basic monitoring which checks CPU, RAM, disk read and write. So far it hasn't caught any of the problems. The most annoying part of all this is of course the fact that it happens rarely, which makes debugging much slower and harder. Naturally the systems where I'm running the extended system monitoring haven't had any problems. (sigh) Sometimes even RDP works but shows a black screen for several minutes and then disconnects. This should be a pretty clear indication of what's wrong, for someone who knows the leading causes of such problems on VPS servers. Nagios Core also rocks.
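
A sketch of the pre-flight check I would have expected before committing an NS change: verify the new names actually answer authoritatively for the zone. Assumes the dnspython package (2.x); the host names are hypothetical:

    import dns.resolver  # pip install dnspython

    def responds_as_nameserver(host: str, zone: str) -> bool:
        try:
            # Resolve the candidate NS host to addresses, then ask it
            # directly for the zone's SOA; a real NS must answer this.
            addrs = [a.to_text() for a in dns.resolver.resolve(host, "A")]
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = addrs
            resolver.resolve(zone, "SOA")
            return True
        except Exception:
            return False

    for ns in ("ns1.example.net", "ns2.example.net"):  # proposed new NS set
        print(ns, responds_as_nameserver(ns, "example.net"))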
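
And a sketch of the kind of debug watchdog mentioned in the last bullet: time a small fsync'ed write in a loop and report latencies over UDP, so that even if the box wedges, the silence itself is a signal to the collector. Host, port and file name are assumptions:

    import os, socket, time

    COLLECTOR = ("192.0.2.10", 9999)   # hypothetical monitoring host
    PROBE_FILE = "probe.tmp"

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        start = time.monotonic()
        with open(PROBE_FILE, "wb") as f:
            f.write(os.urandom(4096))
            f.flush()
            os.fsync(f.fileno())       # force the write to actually hit disk
        latency_ms = (time.monotonic() - start) * 1000
        sock.sendto(f"disk_write_ms={latency_ms:.1f}".encode(), COLLECTOR)
        time.sleep(10)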

Storage I/O, 6to4, RDP, Project Shield, Abstractions, FAST, Schannel, Google Edge Network, Spam

posted Nov 19, 2016, 9:50 PM by Sami Lehtinen   [ updated Nov 19, 2016, 9:50 PM ]

  • How bad is too bad storage I/O performance for Windows? I'm wondering if anyone knows the answer. I guess there are multiple factors, but even a guess at the generic breaking point would be beneficial. I would guess it's somewhere between 5 and 120 seconds. - I've got a case where disk I/O is really slow due to (serious) storage back end issues. I'm wondering how much latency Windows will tolerate by default before giving up on writes and corrupting databases and file systems. - I've tried to look for that information for a few days, but nobody seems to have a clear answer. - So if a storage system write takes more than N seconds, is it a write failure and skipped? I assume that will happen at some point, but what is the exact breaking point? It will naturally lead to a data integrity disaster.
  • 6to4 routings changed suddenly again, now traffic is going via He.net @ Stockholm. Earlier traffic was going often via Funet and at times via Trex.
  • Liked this quote: "Good is too expensive; all I want is better (quickly)."
  • Even more potential Windows Remote Desktop (RDP) problems. It seems to be really hard to locate the source of the problems. Remote desktop simply fails, but why? That's an excellent question. No information about it in the logs, and the connecting client doesn't get any error message either. So frustrating, but this is very typical: if there were a clear error message, it would have been resolved a long time ago and wouldn't be a problem at all. - Nobody seems to know the source of these annoying issues. Just more logging, testing, reading logs and failing to resolve the issue. Sigh.
  • One IPv6 link is also going up and down more or less randomly between routers. Oh joy, at least we're not running out of networking issues anytime soon. - Maybe that's positive?
  • Learned about Google's Project Shield. Yet another Distributed Denial of Service (DDoS) mitigation service.
  • Nice article about DRY and KISS and the cost of wrongful abstractions. Very classic questions; getting it right always depends on so many circumstances. Some projects contain tons of copy-paste code with limited abstractions, and some are total ravioli without repeating anything, using tons of single-line abstractions. Perfection lies somewhere between these two extremes. - Yet with many projects it's extremely annoying that when you report something like a % calculation or VAT (tax) calculation failure, they'll fix it, but only in the part of the code you reported. Should these kinds of simple things be abstracted or not? It's a similarly bad issue that basically the same 'abstract thing' has been implemented in tens of different ways by different coders, several of which have some kind of niche bug or edge case failure in them (see the sketch after this list). Currency conversion could be a great example. Or, just as I wrote a few posts earlier, email address validation: a simple task, but everyone's got a different implementation, even though there should be just one correct way of doing it. In many cases even the same site might have multiple different validation rules. I've written about one site which had different password validation rules: they let you set a password with lower and upper case, but when logging in, they won't allow uppercase. Duh! That really drove me bonkers. Greets go to their 'elite coders'.
  • Something different: the FAST Radio Telescope is ready. Checked the Wikipedia article about it, as well as the RATAN-600 Radio Telescope.
  • Once again problems with Microsoft Schannel TLS, sigh. This is one of the reasons why people prefer plain text over encrypted connections. Because the encrypted connections won't work.
  • And even more problems with Remote Desktop Protocol (RDP), it's very unreliable and problem prone. I just wish they would make it even a bit more robust.
  • Reminded myself about the Google Edge Network with Data Centers, PoPs and Edge Cache locations with Google Global Cache (GGC).
  • Microsoft's spam filter fails and sucks again. Now it's rejecting normal email from Outlook to Outlook again, but in the other direction than previously. It seems that they're totally clueless about how their own stuff works. That's just great. Thank you for that too. - Isn't that great? I'm at least being grateful about it. But maybe I've just got a hint of sarcasm.
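
A tiny sketch of the "abstract the simple thing once" point above: one shared, Decimal-based VAT helper instead of N slightly different float versions scattered around the code base. The function name and the 24% rate are illustrative:

    from decimal import Decimal, ROUND_HALF_UP

    def gross_from_net(net: Decimal, vat_rate: Decimal = Decimal("0.24")) -> Decimal:
        # One place to fix rounding rules, instead of hunting down every copy.
        return (net * (1 + vat_rate)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

    print(gross_from_net(Decimal("19.99")))  # 24.79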

Fails, Data Security, Monitoring, Network Speed, Email Address Validation, Xitroo

posted Nov 12, 2016, 7:45 AM by Sami Lehtinen   [ updated Nov 12, 2016, 7:46 AM ]

  • Elite coders... Key control buttons hidden behind other UI components. A list with one entry shows as empty, and another fails. To sum it up: omfg.
  • Data security officers called me, and their call made me laugh, so much. I didn't dare to laugh during the call, but afterwards, almost rofl. They weren't sure what the documents were; they personally knew the person responsible for the security of those documents in that organization. Yet they didn't do any of the handling themselves; as official officers they tried to delegate the job back to me. Ok, in this case I just forwarded the email myself, but really, wtf. Was that next step really my job? - Ridiculous. - What should be learned from this? Reporting incidents doesn't interest anyone, and they'll try to delegate it back to you and not take it forward internally. As many researchers have said, that's the norm. So the options are: sell the information on the black market if someone's willing to pay for it - great! - or just publish it anonymously for lulz. If there's a sh*t storm after that, just laugh. Nobody cared before; maybe they will now? - Reporting it isn't worth your time, and nobody really cares.
  • Lots of discussion with several friends about the benefits and drawbacks of virtualization, and how it's nearly impossible to compare vendors without extensive testing, monitoring and logging. Added software on several servers which records and tests server performance repeatedly and reports it to a central monitoring system with alerts. - Way to go. - Yet I wish things worked well enough that this wouldn't be necessary.
  • Constant monitoring and logging is important. One network connection dropped from 12 Mbit/s to 8 Mbit/s without any notice. The only way to find that out was from the logs. We even found the exact time stamp when it happened. - The speed returned to normal without anyone doing anything. The physical network path is the same, etc. Really strange and annoying at the same time.
  • Following Brian Krebs' DDoS story closely. Akamai kicked the blog out. Hmm, nice PR.
  • VDSL2 is incredibly crappy technology. I don't really get why someone prefers it over fiber or Ethernet. But it seems that some ISPs just can't stop loving it.
  • More fun: it seems that the Microsoft guys can't even validate email addresses correctly. I created a temporary address: !#$%&'*+-[]\\\/=?^_{|}~@sami-lehtinen.net yet Outlook doesn't allow me to send email to that address; they claim it's invalid. No it wasn't, it was totally valid. Lol, even Tutanota doesn't allow ~ in an email address. Why not? It's incredible how many services don't get something like email addresses right (see the validation sketch after this list). Lots of great engineers and programmers out there...
  • Other interesting observations: Thunderbird automatically adds quotes to addresses which require them. So if the address is []@sami-lehtinen.net Thunderbird changes it to "[]"@sami-lehtinen.net But why is <>@sami-lehtinen.net not valid, when <@sami-lehtinen.net is? Also []@sami-lehtinen.net is valid... Very strange, maybe a bug? Gotta review RFC 5322 again. - It seems that the <> delimiters are used for address identification.
  • I just can't stop loving administrators and programmers without any logic. One email service claims that an email address can't contain #... Yet their SMTP server accepts an rcpt to address with #. What's the logic? And no, it didn't deliver email addressed to XXX#YYY to the XXX part.
  • Why does Thunderbird say <>@sami-lehtinen.net is invalid, while Postfix says that <<>@sami-lehtinen.net> is valid? Go figure. Postfix also accepts <>@sami-lehtinen.net without brackets after "rcpt to:". I don't even know what's right, but it seems everyone's got a different implementation of the same standard.
  • Tested a new service, Xitroo.com beta. It's always so fun to find ways to screw with projects. Managed to take over their start & welcome pages quite easily, defacing completed. Trivial. So much fun. No, it wasn't anything serious, but it worked like a charm. Found also other interesting usability issues: no default domain, and some inactive buttons were annoyingly visible and asking to be clicked, even though they wouldn't work. How about not showing the button in such cases? The back button was broken, requiring two clicks to get back. And several other small remarks. Well, it's always easy to judge others' work. But those are all minor things and can be trivially fixed for a better future.
  • Had a way too long discussion about JS CDNs like cdnjs, googleapis, asp.net, MaxCDN, keycdn. Phew.
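
A sketch of RFC 5322 'atext' validation for the unquoted local part of an address; this dot-atom rule is the part so many validators get wrong. (Characters like [ ] \ are only legal inside a quoted local part, which needs extra logic.)

    import re

    ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
    DOT_ATOM = re.compile(rf"^{ATEXT}+(\.{ATEXT}+)*$")

    def valid_unquoted_local_part(local: str) -> bool:
        return bool(DOT_ATOM.match(local))

    for lp in ("simple", "!#$%&'*+-/=?^_`{|}~", "with~tilde", "a..b", "[]"):
        print(lp, valid_unquoted_local_part(lp))
    # The first three are valid dot-atoms; 'a..b' and '[]' are not.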

I/O Performance IOPS, Cloud Pricing, TLS 1.3, Dark Web Underground Trade

posted Nov 12, 2016, 7:37 AM by Sami Lehtinen   [ updated Nov 12, 2016, 7:41 AM ]

  • Even more extremely annoying performance issues. Systems lagged to death, disk I/O stalling: 0 IOPS, 0 KB/s, etc. As well as claims that there's nothing wrong with it. Well, either there's something wrong with it, or they can keep their junk and we're moving out. The issues seem to be correlated between multiple servers and not isolated incidents. The next question is how much lag is too much; when does Windows fail? Maybe it's just Windows which is intolerant of I/O pauses? I've seen that happening earlier; maybe Linux performs much better? If I run any constant performance tests, everything works well. Maybe the problem is triggered by the system hitting some cold data in the tiered storage system, and that's extremely slow? That shouldn't be a problem, but then Windows blows up and ends up partially dead in the water. Which is also very, very annoying. So it's working, but it isn't. - In a way an interesting case, but at the same time an enraging waste of time. - Stuff should just work; is it my job to figure out why it doesn't? Well, it kind of is. But now we're talking about the platform causing issues, not my code. - Sigh. - Well, I'll be posting updates. Either I'll move my stuff out or they'll get their s*t together. - After some wondering and testing, I think it's just what I've said. An extremely simple way to verify this is a file system walk, dir /s or ls -R, and let's see what happens. If there are LONG pauses between directories, several seconds, then there's something wrong. Of course this test can be run only once; after that the cold data isn't cold for that server anymore. But it shows that a 'randomish' cold data walk can be really slow (see the sketch after this list). kw: Ceph, Tiered Storage System
  • This also reflects on the pricing discussions. Some people claim that servers at X are so much cheaper than at Y. But the real question is: do you get bang for the buck? Cheap servers can actually be a lot more expensive than the 'more expensive servers' when you compare performance in detail and do some calculations, persistent performance testing, etc. What's the cost of performance hiccups? Pricing can be interpreted very misleadingly if all the related factors aren't considered. Also, spot performance isn't the same thing as performance over a month or so. The situation can also change: for several months performance was good, but now it's bad, and so on. There's no ultimate answer other than persistent data collection and analytics.
  • It seems that Finland is looking for a legal way to hack individual users and organizations in other countries. - Nice. I wonder if there will ever be any international law about that; so far it seems that hacking ICT systems is actually totally ok and acceptable, and everyone's doing it as much as they can.
  • Introduction to TLS 1.3. Yet it might take ages before web browsers (not even mentioning other HTTPS clients) start to support it, as we've seen over and over again with older versions and ciphers. Some browsers don't support AES256 in GCM mode at all, some didn't support ECDH, and of course these restrictions also apply to the server side. Wikipedia says that no browser supports TLS 1.3 so far. - Let's see the technical changes of the protocol so far: the 0-RTT mode is of course really nice, and 0.5-RTT data from the server side is nice too. No more DSA, no more SHA-1 in signatures, no more weak and rare curves, no more MD5 / SHA-224 signatures, no more RC4, no more custom DHE groups, no more compression, no more non-AEAD ciphers. - Another short list of removed features is almost the same as what I just listed: static RSA keys, CBC mode ciphers, RC4 stream cipher, SHA-1 hash function, arbitrary Diffie-Hellman groups and export ciphers.
  • Tried one side project called shim with friends, but it failed. It should have worked; we'll be retrying later, until there's success. - Actually, the reason why I needed the shim resolved itself, so I didn't need it in the first place.
  • Watched a few documentaries about dark / deep web trade. Nothing new. If there's demand, there's a seller. That's how the world runs, even if there's no official free market.
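
A sketch of the cold data walk test described above: walk the tree once, timing each directory, and log the ones where metadata reads stall. The threshold and root path are assumptions; run it on a long-idle system, since a second run hits warm caches:

    import os, time

    THRESHOLD_S = 2.0          # what counts as a LONG pause, an assumption
    ROOT = "/data"             # hypothetical path to walk

    for dirpath, dirnames, filenames in os.walk(ROOT):
        start = time.monotonic()
        for name in filenames:
            os.stat(os.path.join(dirpath, name))   # force cold metadata reads
        elapsed = time.monotonic() - start
        if elapsed > THRESHOLD_S:
            print(f"{elapsed:6.1f}s  {dirpath}")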


TMD, Skype, e-Receipts, Ficolo, Capnova, DMARC, Networks, NFPA 704, Super-Seeding

posted Oct 30, 2016, 9:13 AM by Sami Lehtinen   [ updated Oct 30, 2016, 9:14 AM ]

  • Played around at one site with a Tomographic Motion Detector (TMD) network being installed. Worked really well, and I also liked the concept. It provides a nice visual field disturbance map, which makes it very easy to check from the video feed what's triggering the motion detection, with quite high accuracy. It can be used to automatically lock on to and follow targets if there are turret (pan & tilt & zoom) cams, as well as work as an external secure perimeter protection sensing system. In many of these cases one of the primary goals is to provide prewarning of possible impending intrusion or surveillance near the area. One of the neat features of the TMD network was that people can carry RF identifier transmitters which disable the alarm only within a small perimeter around them. This allows monitoring large spaces so that people can be working in the area while the alarm is armed. If anything / anyone is moving in the area without such a device, the alarm will go off. No more sneaking around or tailgating. kw: security, motion detection, perimeter intrusion detection system, electronic fencing, laser fencing, security system, IDS.
  • Yet for most physical security, the same rules apply as on the computer & network side. Nobody really cares. The ones who really care are extremely rare exceptions.
  • Lol, Microsoft. Now it seems that the bridging between Skype for Business and regular Skype is broken. All contacts are just "updating status". After a very long wait the state of Skype users turned into Unknown. - It's ridiculous how much Microsoft can fail at these quite simple tasks, integrating two of their own products. I would understand the issues if it were just two developers from two different companies. But this is plenty of developers from a large corporation. They should get this kind of stuff done quickly and well. - This is a great example of the efficiency question I asked earlier: why do some small companies get so much done, while some huge companies with vast resources don't seem to get much done at all?
  • S-Ryhmä, a large Finnish retailer, started to use e-Receipts with their mobile app. But the information on the receipt isn't detailed at all. It's only an "ambiguous name of the article" + price. No article code, VAT group / percentage, etc. - I wonder if this is done on purpose, to hide some of the things which made retailers afraid of electronic receipts.
  • Business as usual. First they complain that something is expensive. Then they complain it's not as good as the expensive option. Then they complain again about the better and more expensive option being more expensive. - I guess I can't really comment on this, because I don't have anything positive to say about it.
  • Studied Ficolo's services: Finnish co-location in an underground data center facility. As well as Capnova.
  • Wondered if I should bother to set up DMARC, but so far it seems that I won't. It's not that necessary. I wonder why Outlook.com leaks out private addresses and causes SPF failures. I think it's pretty much a fail. Wtf are they doing? "received-spf: SoftFail (protection.outlook.com: domain of transitioning outlook.com discourages use of 10.152.4.59 as permitted sender)" - I think their spam protection is totally failing on multiple levels.
  • Laughed at one operator which seems to think that VDSL2 is superior technology compared to single-mode optical fiber and Cat6 Ethernet cabling. Of course it's possible to run VDSL2 over Cat6 cabling, but ... Phew. It seems that even 10 Gbit/s should be well possible over 20 meters of Cat6 cabling without major problems in this kind of residential setup. Yet the cables run alongside power lines, so that might be (?) a potential source of problems.
  • Studied the NFPA 704 Diamond. Good to know, even if it's usually best to avoid any situations where there are hazards around, or environmental parameters generally out of the ordinary range.
  • Reminded myself about Super-Seeding. They claim it's something special, even if it's something extremely obvious if you want to minimize your bandwidth utilization. Just like the algorithm where you always pick the rarest block to be downloaded, or use weighted probabilities to prefer 'rarer' blocks (see the sketch after this list). Always picking the rarest could also lead to situations where almost all peers start to download that single rare block. Yet in some cases, that's something which might save a file from becoming permanently only partially available.
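
A sketch of the weighted rarest-first selection mentioned above: prefer rarer blocks probabilistically instead of always taking the single rarest one, so the whole swarm doesn't pile onto the same block at once. The 1/availability weighting is an illustrative choice:

    import random

    def pick_block(availability: dict) -> int:
        # availability: block id -> number of peers that have it (all >= 1)
        blocks = list(availability)
        weights = [1.0 / availability[b] for b in blocks]  # rarer => heavier
        return random.choices(blocks, weights=weights, k=1)[0]

    # Block 7 is held by a single peer, so it gets picked most often,
    # but not always, which avoids the stampede described above.
    print(pick_block({1: 20, 2: 15, 7: 1}))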

ATM skimmers, AnyConnect, Queues (fail), OVH, Private Cloud, OpenBazaar

posted Oct 29, 2016, 6:37 AM by Sami Lehtinen   [ updated Oct 29, 2016, 6:37 AM ]

  • Periscope ATM Skimmers - The joke goes on. Banks claimed that chip cards are secure, yet the magnetic strip and human-readable numbers are still there. What a joke.
  • Cisco AnyConnect VPN is blaah! To make it work, you have to uninstall the client, reboot Windows, re-install the client and then use it. Why? Well, one customer uses domain.example/identifier and the stupid client claims that it's an invalid address. But after a re-install it works again. So lame. Also, Cisco doesn't provide updates directly from their website, which stinks too. - Can't stop loving bad software and bad support.
  • Message processing queues: yet another fail. In one batch transfer process there's usually one file per day, but the file name is a counter, so basically it doesn't matter if there are 0 or more files per day. Yet someone has coded the rest of the processing pipeline so that it actually processes only ONE file from that path and after that removes all files. Then someone claims that my software hasn't sent all the data. Yes it has, I can show you the exact logs. You've just lost it. Not my problem, don't call me. Go and seek your lost data. - But as we all know, this is business as usual. Because this kind of event occurs rarely, basically only when the process has been started manually out of schedule, I'm pretty sure they won't fix it. Therefore we'll end up having this same fruitful conversation again.
  • Got more or less interesting issues with OVH. It's just like all the other helpdesks out there: extremely simple issues get fixed, but more complex questions seem to get redirected to /dev/null if the task is more complex than the reset monkey pressing the reset button on a server.
  • Had interesting discussions with one private cloud provider. They provide managed private cloud platforms in a deep underground bedrock cave bunker, protected from EMP and other stuff. I'm afraid this falls in the category of way cool stuff, but why pay for it? After radioactive fallout starts and EMP has wiped out most electronics and power plants, it might well be that people really don't care too much about their servers anymore. It's history anyway. Systems might also look pretty affordable when built for "optimum capacity", but the truth is that capacity requirements change all the time. Which can be a problem, because when you run out of capacity, especially in a quite small setup, the additional cost of getting more capacity is quite high. You can't get just a little more capacity, because the unit cost is too high; you'll have to get "a lot more capacity" as a percentage of the previous capacity, and then you'll end up with under-utilization for quite a while. - This is one of the reasons why an affordable public cloud is actually awesome. Capacity management becomes much easier when the scale is almost astronomically larger compared to small setups.
  • Reminded myself about sponge functions. Useful for both encrypting data and hash creation, and as a pseudorandom number generator (PRNG) in the form of a deterministic random bit generator (DRBG), because the internal state remains hidden. Using different stirring algorithms can easily alter the output, and using a bad stirring algorithm makes the sponge function pretty much broken (see the toy sketch after this list).
  • Also, due to the latest science news, checked out Ganymede and Europa; phys.org has a lot of interesting stuff. IEEE Spectrum is also worth checking out.
  • OpenBazaar 1.1.8 released with new features: listing pinning, hidden listings, maximum quantity, addresses, images, shipping and misc stuff.
  • Something different? WU-14 - Chinese experimental hypersonic glide vehicle (HGV). As well as American counterpart project Tactical Boost Glide (TBG).
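
A toy sponge construction to illustrate the absorb/squeeze structure mentioned above. The "stirring" function here is just SHA-256 acting as a stand-in, so this is not a real sponge permutation and not secure; it only shows how the hidden capacity part works:

    import hashlib

    RATE = 16        # bytes of state exposed to input/output
    CAPACITY = 16    # hidden bytes; these never touch input or output directly

    def stir(state: bytes) -> bytes:
        # Stand-in stirring function; a real sponge uses an invertible permutation.
        return hashlib.sha256(state).digest()[:RATE + CAPACITY]

    def sponge(message: bytes, out_len: int) -> bytes:
        state = bytes(RATE + CAPACITY)
        # Pad with 0x80 plus zeros up to a multiple of RATE, then absorb:
        msg = message + b"\x80"
        msg += bytes(-len(msg) % RATE)
        for i in range(0, len(msg), RATE):
            block = msg[i:i + RATE]
            state = bytes(s ^ b for s, b in zip(state[:RATE], block)) + state[RATE:]
            state = stir(state)
        # Squeeze: emit the rate part, stir, repeat until enough output.
        out = b""
        while len(out) < out_len:
            out += state[:RATE]
            state = stir(state)
        return out[:out_len]

    print(sponge(b"hello", 32).hex())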
