Blog

My personal blog is about stuff I do, like and dislike. If you have any questions, feel free to contact me. My views and opinions are naturally my own personal thoughts and do not represent my employer or any other organization.


Python 3, List vs Deque, When to use and why

posted Jun 24, 2017, 11:54 PM by Sami Lehtinen   [ updated Jun 25, 2017, 12:16 AM ]

This post is a continuation of the post covering Python 3.6 new features among other things.

PyClockPro test: 10,000 lookups with 1,000 cache records, using different key distributions. Let's look at the results and then the conclusions. (In the output below, the LT/LH columns are LRU time and hit ratio, PT/PH are the PyClockPro equivalents, and DT/DH the relative differences between the two.)

Full test runs using the PyClockPro benchmark.

Using list:

Test set generation time: 1.83
M: R T:   0 #: 1 LT: 0.04 LH:   0.0 PT: 1.15 PH:   0.0 DT: 2851%  DH: 100.0%
M: R T:  25 #: 1 LT: 0.04 LH:  25.0 PT: 5.42 PH:  24.9 DT: 12061% DH:  99.8%
M: R T:  50 #: 1 LT: 0.04 LH:  49.9 PT: 2.54 PH:  50.0 DT: 6264%  DH: 100.1%
M: R T:  75 #: 1 LT: 0.03 LH:  75.1 PT: 0.94 PH:  75.4 DT: 2737%  DH: 100.4%
M: R T: 100 #: 1 LT: 0.02 LH: 100.0 PT: 0.20 PH: 100.0 DT: 840%   DH: 100.0%
M: G T:   0 #: 1 LT: 0.04 LH:   0.0 PT: 1.15 PH:   0.0 DT: 2836%  DH: 100.0%
M: G T:  25 #: 1 LT: 0.05 LH:  25.0 PT: 5.20 PH:  27.2 DT: 11284% DH: 108.5%
M: G T:  50 #: 1 LT: 0.04 LH:  50.0 PT: 2.49 PH:  52.9 DT: 5880%  DH: 105.7%
M: G T:  75 #: 1 LT: 0.03 LH:  75.1 PT: 0.86 PH:  77.6 DT: 2533%  DH: 103.3%
M: G T: 100 #: 1 LT: 0.02 LH: 100.0 PT: 0.20 PH: 100.0 DT: 840%   DH: 100.0%
M: P T:   0 #: 1 LT: 0.04 LH:   0.0 PT: 1.14 PH:   0.0 DT: 2852%  DH: 100.0%
M: P T:  25 #: 1 LT: 0.04 LH:  25.0 PT: 3.42 PH:  33.1 DT: 7809%  DH: 132.5%
M: P T:  50 #: 1 LT: 0.04 LH:  50.0 PT: 2.55 PH:  55.4 DT: 6299%  DH: 110.8%
M: P T:  75 #: 1 LT: 0.04 LH:  75.0 PT: 0.87 PH:  77.0 DT: 2414%  DH: 102.6%
M: P T: 100 #: 1 LT: 0.02 LH: 100.0 PT: 0.20 PH: 100.0 DT: 833%   DH: 100.0%
LRU time:  0.55   LRU hitr: 50.01 %
PCP time: 28.33   PCP hitr: 51.56 %
Dif time: 4555.7 % Dif hitr: 104.2 %


Using deque:

Test set generation time: 1.97
M: R T:   0 #: 1 LT: 0.04 LH:   0.0 PT: 1.11 PH:   0.0 DT: 2767%  DH: 100.0%
M: R T:  25 #: 1 LT: 0.04 LH:  25.0 PT: 5.29 PH:  25.3 DT: 12578% DH: 101.4%
M: R T:  50 #: 1 LT: 0.04 LH:  50.1 PT: 2.66 PH:  50.1 DT: 7080%  DH:  99.9%
M: R T:  75 #: 1 LT: 0.04 LH:  75.0 PT: 0.99 PH:  74.9 DT: 2827%  DH:  99.8%
M: R T: 100 #: 1 LT: 0.02 LH: 100.0 PT: 0.19 PH: 100.0 DT: 804%   DH: 100.0%
M: G T:   0 #: 1 LT: 0.04 LH:   0.0 PT: 1.13 PH:   0.0 DT: 2747%  DH: 100.0%
M: G T:  25 #: 1 LT: 0.04 LH:  25.1 PT: 5.25 PH:  27.2 DT: 11830% DH: 108.5%
M: G T:  50 #: 1 LT: 0.04 LH:  50.0 PT: 2.65 PH:  52.5 DT: 6685%  DH: 105.0%
M: G T:  75 #: 1 LT: 0.04 LH:  75.1 PT: 0.92 PH:  77.1 DT: 2475%  DH: 102.8%
M: G T: 100 #: 1 LT: 0.02 LH: 100.0 PT: 0.19 PH: 100.0 DT: 800%   DH: 100.0%
M: P T:   0 #: 1 LT: 0.04 LH:   0.0 PT: 1.11 PH:   0.0 DT: 2711%  DH: 100.0%
M: P T:  25 #: 1 LT: 0.04 LH:  25.0 PT: 3.32 PH:  33.1 DT: 7554%  DH: 132.6%
M: P T:  50 #: 1 LT: 0.04 LH:  50.1 PT: 2.66 PH:  55.3 DT: 6893%  DH: 110.3%
M: P T:  75 #: 1 LT: 0.03 LH:  75.1 PT: 0.83 PH:  77.5 DT: 2447%  DH: 103.1%
M: P T: 100 #: 1 LT: 0.02 LH: 100.0 PT: 0.19 PH: 100.0 DT: 797%   DH: 100.0%
LRU time:  0.55   LRU hitr: 50.03 %
PCP time: 28.48   PCP hitr: 51.53 %
Dif time: 4733.1 % Dif hitr: 104.2 %


Thoughts:
If we compare just the Pareto 25% (LRU target) hit ratio rows, here are the differences:

With list:
M: P T:  25 #: 1 LT: 0.04 LH:  25.0 PT: 3.42 PH:  33.1 DT: 7809% DH: 132.5%
With deque:
M: P T:  25 #: 1 LT: 0.04 LH:  25.0 PT: 3.32 PH:  33.1 DT: 7554% DH: 132.6%

Using 1,000 cache entries, list still seems slightly faster than deque overall (total PCP time: 28.33 vs 28.48 seconds).

This once again shows how hard some of these choices are, because in some cases list is much faster and in other cases deque is much faster. Even after a few tests I couldn't conclusively say whether deque or list would be faster for the cache: deque is faster for inserts and pops at the ends, in random order, but it's still much slower for generic indexed access.
What's the root cause of this dilemma? Access by index is very slow on a deque, because it has to walk through a linked list, instead of having direct memory access using pointer math in a single array.

In this case list is much faster than deque (random index access; lst is a list, dq a deque):

lst[512*1024]

dq[512*1024]

And in this case deque is much faster than list (basically the same as popleft):

del lst[0]

del dq[0]

The differences only grow as the data set size grows. It's got everything to do with the underlying data structure and implementation, and this is the root cause why deque isn't actually faster than list here: most PyClockPro accesses hit the list at a random index.

Why? A deque is a linked list (of blocks, in CPython), and 'walking the linked list' is slow. With a traditional compact in-memory list you can find the Nth entry directly by computing its offset from the base pointer.
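Both asymmetries above can be verified with a minimal timeit sketch (the names lst and dq are my own, and the exact numbers will of course vary per machine):

```python
import timeit
from collections import deque

N = 512 * 1024  # half a million entries, as in the example above
lst = list(range(N))
dq = deque(lst)
g = {'lst': lst, 'dq': dq, 'N': N}

# Index access deep into the container: O(1) for list (pointer math
# into one array), but a walk over linked blocks for deque.
t_list_index = timeit.timeit('lst[N // 2]', globals=g, number=1000)
t_deque_index = timeit.timeit('dq[N // 2]', globals=g, number=1000)

# Removing the first element: list shifts everything left (O(n)),
# deque just unlinks it (O(1), same as popleft). The append keeps
# the container size constant between iterations.
t_list_del = timeit.timeit('del lst[0]; lst.append(0)', globals=g, number=1000)
t_deque_del = timeit.timeit('del dq[0]; dq.append(0)', globals=g, number=1000)

print(f'index  - list: {t_list_index:.5f} s  deque: {t_deque_index:.5f} s')
print(f'del[0] - list: {t_list_del:.5f} s  deque: {t_deque_del:.5f} s')
```

On any data set this size, the list should win the index test by orders of magnitude, and the deque the deletion test.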

Credentials, Security, Superintelligence, AI, Python 3.6 what's new

posted Jun 24, 2017, 11:53 PM by Sami Lehtinen   [ updated Jun 24, 2017, 11:56 PM ]

  • Even more wonderful discussion about using default credentials. People just say that setting device / customer specific credentials is way too hard and complex to maintain, and that it's just better if everyone uses admin / admin or admin / password. - Sigh. But I'm sure this is nothing new to anyone. It's totally OK to use default credentials everywhere; it makes life so much better and simpler.
  • Quite a nice article about AI and superintelligence - a collection of approaches to the superintelligence question from multiple vantage points. It's totally natural to wonder what the 'true motivation' behind a superintelligence will be, and how well it can be controlled. Most people seem to care way too much about things which don't actually make sense in purely logical terms. Sometimes it would be a good idea to check the root cause when you say you've got to do something. I personally have a hard time figuring out the motivations for some of the stuff I do. Like this blog: why? What's the actual point, and so on.

The rest is jargon about Python 3.6's new features.

  • Python 3.6 - What's new? - Yes, I do like f-strings. Yet it seems that people are still looking for solutions for lazy, deferred execution, so that f-strings could be used as templates. Type hints are also an awesome and very welcome feature. Asynchronous generators sound great, but as I don't usually need to write very high-performance code, I'm still confused about how these async things actually improve performance as long as the GIL is present. I've actually seen some projects run into huge problems: writing everything in a way cool asynchronous way and then finding that it doesn't actually allow ANY parallel execution. It's hard to see how this would bring better performance than a bunch of traditional blocking threads. But as said, these things can be done in at least hundreds of different ways. Maybe I should write some experimental async code to see the real benefits, but I probably won't before I see any practical use for it in my own projects. Pathlib is nice and simplifies path handling. Using UTF-8 on Windows as the default, finally! It has been a constant struggle on Windows compared to Linux. The new faster, memory-conserving and order-preserving dict is also an awesome improvement. The new secrets module is great: no need to use os.urandom for that anymore. Deque pickling is nice; no need to convert deques to standard lists anymore before saving. Decimal.as_integer_ratio for fractional (rational) number representation is also very neat, finally. Neat, hashlib now packs BLAKE2 and SHA-3 as well as scrypt. Yahoo, ChaCha20-Poly1305 support was added to the SSL library. Lots of generic optimizations, very nice.
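A few of the 3.6 features mentioned above in one small sketch (the names and values are made up purely for illustration):

```python
import secrets

# f-strings (PEP 498): expressions interpolated directly in string literals.
name, count = 'cache', 3
msg = f'{name} has {count} entries, squared: {count ** 2}'
print(msg)  # cache has 3 entries, squared: 9

# secrets (PEP 506): cryptographically strong tokens without os.urandom.
token = secrets.token_hex(16)  # 32 hex characters
print(token)

# Variable annotations (PEP 526): informational, not enforced at runtime.
limit: int = 1000
```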

Dictionary vs list vs deque vs order preserving dictionaries. Lot of speculation.

  • The new 3.6 order-preserving dictionary (dict). I actually missed one thing: because dict preserves order, there should also be a way to access items in a dict by index. In PyClockPro some items are tracked using both a list and a dict, because index positions are important. If dict allowed access by index, there would be no need to maintain a parallel list, which would naturally improve performance quite a lot. It is possible to pop from a dictionary, as well as get items or keys, or use iter(dict). But none of these methods allow direct index access without using a list representation as a proxy. Hmm... I wonder why they haven't thought about this.
  • Just my social media post about the topic: Python 3.6 dictionaries are order-preserving. I'm just wondering why there's no way to access dictionary items by index directly. When a dictionary retains order, I would love to see it be accessible just like a list. It would be quite logical.
    Of course you can get keys() from a dict, or items(), but I would think that's an inefficient way if I know I want to access the 5th item in the dictionary.
  • Compact dictionary - Again, the classic reasons: less memory used -> more cache-friendly code. These fundamental changes in Python are awesome. This clearly shows how extremely important the basic building blocks, on which everything else builds, are.
  • PyClockPro is one example which seems to be one of the worst use cases for the new dictionaries, because it repeatedly removes and inserts keys to keep the cache dictionary at the maximum allowed size.
  • The art of optimization, sigh! At least on rather small cache data sets, like 100 items, it seems that deque is slower than a traditional list. Let's try the same, but with a larger cache. Got so deep into this that the next post will contain the details.
  • Some continuation to my blog post on social media:
    Maybe I've missed some essential Python feature or syntax, but afaik there's no good way to access a dictionary by index right now. - That's exactly why I'm asking. collections.OrderedDict doesn't provide by-index access either. Also keys, items and iter are all very slow methods for accessing a specific index. With lists, access by index is fast.
  • With the PyClockPro project I'm now maintaining a list of keys for index access and a dictionary for fast key access. But if the ordered dictionary allowed direct index access, there would be no need to maintain a separate list for index access. This is something I have to reconsider later, when Python 3.6 is mainstream.
  • Any potential pro tips about this would be appreciated. This is just the way I've found to be fastest so far for this use case, but I'm always looking for improvements.
    I'm just asking if there's a better way. I'm not proposing any syntax; the obvious option would of course be a new method to access / get by index.
  • As a bonus, it might also be that this compact dictionary provides better 'random deletion' performance than list or deque, which both have bad random-delete performance. Or is there a doubly linked list with good random delete / insert performance available in the Python standard library? After quick testing, it seems that deque also performs well when the item being removed is not the first (left) or last (right) item in the queue. Yet accessing a deque with a large index number is very slow, probably due to the linked-list walk happening behind the scenes.
  • If this wasn't enough, here's even more: Python 3 - List vs Deque - When to use and why
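For lack of direct index access, roughly the best a plain 3.6 dict offers is advancing an iterator, which is O(n). A sketch of both routes discussed above (nth_key is a hypothetical helper of my own):

```python
from itertools import islice

d = {'key%d' % i: i for i in range(10)}  # insertion-ordered in 3.6+

def nth_key(mapping, n):
    """Return the n:th key in insertion order. O(n): the iterator
    has to be advanced step by step, there is no direct index access."""
    return next(islice(iter(mapping), n, None))

print(nth_key(d, 5))  # key5

# The parallel-list workaround described above: O(1) access,
# at the cost of keeping a second structure in sync with the dict.
keys = list(d)
print(keys[5])  # key5
```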

KeepassX, Skip Lists, Flash Sector Size, Let's Encrypt / SSL / TLS, SaaS Startup

posted Jun 18, 2017, 1:27 AM by Sami Lehtinen   [ updated Jun 18, 2017, 1:28 AM ]

  • KeepassX 2.0 has a serious security issue. No, it's not about encryption; it's about a user error the program doesn't warn about. It's nice that the new password option now requires an accept click before the old password is replaced with a new one. But another serious issue is that removing entries from the password storage gives no warning at all. Let's assume I've got two applications open: KeepassX and email or a file manager. Now I'm doing something, someone calls me. OK, the file or email I want to remove is highlighted on screen. I just hit delete. Oh, didn't work. Let's re-click the item and delete. Phew, now it worked. But at the same time I managed to delete one key entry from KeepassX. From a software engineering perspective this is of course an 'invalid user error'. But as we all know, most people make mistakes. It would be so nice to get an option to prompt on entry removal instead of just silently dropping it. - A bug, no? - Stupid user, yeah, sure. - A potential problem, probably yes. - But this is just so typical for some errors: it's not a (software) error, but it still might cause serious problems. Good design and a warning would mitigate this problem.
  • Very nice article about Skip Lists: Done right - But in general it's a bad idea to reinvent the wheel. Sometimes I dislike tons of dependencies and the bloat they bring, but in other cases it would be insanity to try to replicate something which is already well implemented. Chances that your own implementation will be orders of magnitude worse than the existing one are very, very real. In this sense I'm extremely happy about the CLOCK-Pro (PyClockPro) implementation: it's only marginally slower than LRU even though it's pure Python, and it provides higher hit ratios than the native LRU cache implementation in the standard library. Which is actually pretty surprising; I didn't expect performance that good at all. Still, implementing something like skip lists can be an excellent exercise. There are reasons why many languages implement deques, hash tables and dictionaries by default. The stdlib is also one of the reasons why I love Python so much.
  • When some Flash / SSD vendors talk about smaller allocation units, I'm wondering whether the operating system will actually ruin everything. Is there even an SSD with a 512-byte sector size actually available? What's the benefit? Will the operating system issue minimal 512-byte write requests when the OS is using a 4096-byte cluster / block? I mean, if I read a 4096-byte cluster/block, update one byte in it and write it back, will the operating system issue 4096 bytes or 512 bytes to the SSD? Probably it doesn't matter, but with a large number of small updates it might still reduce write amplification by 8x. I tried Googling around, but got no answer. Of course the SSD can also do the read-modify-write internally so that it only modifies the changed 512 bytes or even less; there are many optimizations which can be done. Another reason for this question is SATA channel utilization: if I've got a really fast SSD and I'm delivering small random write updates, less data means less utilization on the channel and more IOPS. There are just so many postings about file systems and operating systems, but this kind of simple question doesn't seem to be covered by most of the articles. My guess is that the cluster/block is the minimum 'update' unit. Yet this raises the question why some vendors boast about 512-byte sectors, if it doesn't really matter whether the sectors are 4096 or 512 bytes.
  • Configured two systems to use Let's Encrypt. One with Windows 2016 Server with ACMESharp and another on Ubuntu 16.10 (Yakkety Yak) with official Certbot. Of course using automated renewals as recommended every 60 days, at random time + automated retry in case of failure.
  • Bootstrapping a SaaS Startup - That's a pretty good quote: "If You Love Writing Code, You’re Going to Hate Running a SaaS Business". I can honestly say I like tinkering with interesting tech stuff and coding, and don't like so much all the administrative / management tasks which get dropped on you. I'm also bad at marketing and human relations; I just don't care enough about that stuff. But I guess that's nothing new. There were several aspects about sending email etc.
  • I do agree with that. I've tried MailChimp, MailGun and SendGrid. But because I'm a nerd, I don't see any problem running my own MX. Actually that's one of the projects which I might upgrade in the fall. Replace the current system with a new one; don't know yet which one.

Outlook, Social-Credit, Galileo, Maya, USB Flash, Negative Work, Boring Code & Tech

posted Jun 10, 2017, 11:29 PM by Sami Lehtinen   [ updated Jun 10, 2017, 11:29 PM ]

  • Thank you, Microsoft, again. Now your server EUR01-VE1-obe.outbound.protection.outlook.com was misconfigured and returned all outbound emails with the error "Relay access denied". Come on! I thought Microsoft would have at least somewhat competent developers, operators and system administrators, but I'm getting different Outlook-related issues all the time. What's even more ridiculous is that your spam filter keeps on failing continuously - flagging mail as spam and completely deleting it so it never gets delivered, and the sender doesn't even get a bounce. - That's awesome from you guys.
  • Learned the latest hype word: Reinforcement Learning. Earlier it was Deep Learning.
  • About international and inter-cultural communication: yes, all kinds of idioms are usually a very bad idea. It's better to write things out verbosely and simplify things a bit.
  • Read the State of the Internet Q3 2016 report by Akamai. IPv4 address re-mappings between RIRs grow routing tables. Ouch! Finland isn't even in the top 10 in IPv6 adoption. No surprise: there are three huge national carriers and only one of them provides native IPv6 right now. The only good news is that Finland is eighth (8th) on the global list of countries with the fastest Internet connections, and fourth (4th) on the European list. The report also shows that it's a tradition to shut down the Internet when there's political disturbance in a country.
  • The Economist got great article about China's social-credit system. Will it be powering the digital totalitarian state? - Economist on Snapchat Discover. Ugh!
  • European Galileo satellite navigation system is finally on-line and available. Public signal location accuracy is 4 meters and paying customers get nice centimeter accuracy.
  • Did some reading about management theory and trends. Everything is relative and there's no single right solution. But I kind of like globalization, because I've always been against subsidies and tariffs. Afaik, products and services should be made wherever it's best to get them done.
  • Maya Library - Yet another way to deal with date / time in Python. This is the eternal problem with modules and functions: a complex, full-featured library might seem complex and hard to learn, yet a simplified version might be too simple. But it's also easier to create a large number of simpler versions, because making something full-featured is of course quite a complex job. I sometimes write my own wrappers for more complex libraries, so I can use them in a simple and sane way. Which actually creates yet another very light way of dealing with things. For example, URL handling and date/time are things I very often wrap.
  • USB speed issues again. Many instructions say to align partitions. But I'm using NTFS and I don't know how well aligning works with it. With ext4 I know the file system prefers 8-megabyte extents, which is actually great if the alignment is right. Then again, allocation without extents would be more efficient when writing several small files, since each small file wouldn't occupy its own extent. This also leads to free-space fragmentation, and after a certain point, when every extent is in use, new writes are going to be scattered.
  • Beware of Developers Who Do Negative Work - Amazing article. Yes, that's why I prefer something simple. Writing extremely complex, cool and advanced code can be really bad at times: it's hard to understand and probably bug-prone due to all that additional complexity. It of course sounds cool if you wrote 200+ pages of documentation for a single module, but that's not the way I prefer to do things. - This is also one of the reasons why premature optimization can be really bad: you can make some code much faster by using more complex logic, but it might be very hard to understand. Quote: "They want to feel like their time was spent on something worthwhile. For developers that means delivering software that brings value." - I just couldn't agree more!
  • A perfect example is the CLOCK-Pro algorithm I implemented. Traditional LRU is extremely simple and easy to implement from scratch when required, with a circular buffer. But CLOCK-Pro is from another planet: it's more efficient, but requires much more complex code. Yet when used from a library, that complexity is luckily masked. - Yes, I'm very well aware that the CLOCK-Pro code could use an ordered dictionary, and that it contains "magic numbers". But the truth is that replacing magic numbers with variables makes the code run slower. Because this is a cache library, I think that's a reasonable trade-off: magic numbers can make processing a single if statement 10% faster, which I would consider a significant resource saving. It also doesn't mean that the magic numbers aren't documented in the source code.
  • Yet, I'm very happy to learn new things. I guess that should be obvious from my blog. But when doing production stuff under tight deadline, boring is the way to go.
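The thin wrapper idea from the Maya bullet above can be sketched roughly like this, assuming UTC ISO-8601 timestamps are the common case; the helper names (utc_now_iso, parse_iso) are my own, purely illustrative:

```python
from datetime import datetime, timezone

def utc_now_iso():
    """Current UTC time as an ISO-8601 string, second precision."""
    return datetime.now(timezone.utc).isoformat(timespec='seconds')

def parse_iso(text):
    """Parse an ISO-8601 string back into an aware datetime.
    (datetime.fromisoformat requires Python 3.7+.)"""
    return datetime.fromisoformat(text)

stamp = utc_now_iso()
print(stamp)  # e.g. 2017-06-10T23:29:00+00:00
roundtrip = parse_iso(stamp)
```

The point isn't the two lines of code, it's pinning one sane convention (aware UTC datetimes, one string format) so the rest of the program never touches the full datetime API surface.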

33C3 notes & keywords part 8

posted Jun 3, 2017, 11:28 PM by Sami Lehtinen   [ updated Jun 3, 2017, 11:29 PM ]

  • Do as I Say not as I Do: Stealth Modification of Programmable Logic Controllers I/O by Pin Control Attack - Industrial Control System hacking, Process control, system level protection, firmware integrity, logic checksum, doppelganger, symbiote defense, autoscopy, PLC Runtime, PLC controls I/O, Rootkit.
  • A New Dark Age - How information, technology, understanding and function changes over time. Helikite surveillance. Weather forecasting and massive data processing using early computer systems like ENIAC. Weather forecasting and nuclear bomb simulations run on same computers. Weather modification, cloud seeding. Automation of medicine research. Computer / Human / Algorithmic synergy will produce very powerful results Go / Chess. Many of the AI based tools are created to alter reality, like change photographs, video and or sound. New kind of literacy required, to understand the new world.
  • Talking Behind Your Back - Ultrasound Tracking System - Cross-Device Tracking (XDT). Comments, I just happened to read an article from technology archive, where they wrote that TV remotes used ultrasound from mid 50s to early 80s. So that's hardly anything new. uBeacons. Ultrasound framework, ultrasound advertising tracking network allowing ad targeting. Device pairing using ultrasound. Tor de-anonymization attack sample with tor browser. How to solve this in long-term. Abusing could bring lot of negative publicity.
  • An Elevator to the Moon (and back) - Space Elevator is a nice idea. But does it work practically. Beyond Rockets. Rocket Equation, Physics of Space Transportation. Benefits of Moon elevator, feasible ribbon material, only few artificial satellites, no (human-made) space debris, no rope erosion by atmosphere. Elevator Cable is the problem and challenge, climber isn't. Hacking celestial mechanics. Space hotels, space power stations, etc. Space Elevator Wiki, Global Exploration Roadmap. Satellite and Space Elevator Simulator. Btw. Excellent speaker, loved even the questions section. Being professional speaker and lecturer does make a difference. Space bombing, Military use. (Sources: ESA, European Space Agency, Unclassified, For Official Use) Wikipedia links: Space Elevator, Launch Loop, Space Fountain, Orbital Ring
  • The Moon and European Space Exploration - Nice talk about European Space Exploration and what it is all about. Yet as said, the time was very short.
  • Decoding the LoRa PHY - Nice examples how to intercept and inject stuff to wireless wide area network wan. Applied security research. Applied on cutting edge wireless IOT protocol. Software defined radio. Fast Fourier Transform (FFT). Local provisioning, gateways. 3G requires lot of power. Of course this is relative term. 3G IoT Standards: LTE-M/NB-LTE Release 13. LPWAN for IoT. Like Sigfox, LoRa, nwave, lte-m, nb-lte, weightless, ieee 802.11ah, ec-gsm, zigbee3.0, dash7 alliance, bluetooth 4.0. Not for everyone, because of duty-cycling, sparse datagrams and serious rate-limits. SIGFOX provides "only", 140 12-byte datagrams / day. That's just bit more than 1.5 Kilobytes. LoRaWAN MAC/NWK stack, LoRa Alliance. Concentrator / Gateway. Roaming supported. NwkSKey and AppSKey used for encryption in security architecture. Uses ISM band. Commercial Networks and Crowdsourced networks. LoRaHam. Radio fundamentals crash course. OSI Model PHY physical layer, energy being sent over RF medium. Radio frequency energy, electromagnetic radiation. Software defined (SR) radio in software implemented using CPU or FPGA. Amplitude, Frequency, Phase. Digital modulation. Symbol presents state. FSK symbols. Spread spectrum. Microchip LoRa RN2903 and Ettus B210 (SDR). Analyzing spectrogram with time and frequency. Chirp Spread Spectrum (CSS) provides resilience and lower power, it's also resistant to multi-path and Doppler effect. GNU Chirp Sounder.  OSIN. AN1200.18 and AN1200.2 app notes. Bandwidth, Spreading factor, Chirp rate. FM modulated chirps. Dechirping the signal. Demodulation and data extraction. Overlapping FFT buffers to synchronize timing for first SFD symbol. Normalizing data. Data transformation to improve OTA resiliency using encoding. Symbol gray indexing. Data whitening. Interleaving. Forward Error Correction (FEC). Cracking the decoder. Hamming(N,4) algorithm. Reverse engineering challenge. Documentation is full of lies. (Nothing new?). 
PHY packet contains PHY header, Preamble Symbols, Header Symbols, CRC and of course the Payload and Payload CRC. Header is only present in Explicit mode. Implicit mode omits the header and sends data alone. Whitening sequence changes between these modes. Header uses coding rate 4/8 and spread factor is always 2. Optional Low Data Rate mode. GR-LORA @ GitHub provides a LoRa encoder, LoRa modulator, etc. as GNU Open Source. Adafruit SDR radio transmitter live demo. See PoC||GTFO 0x13 if you're interested in the details. - Comments: Thank you for sharing! It was a really awesome talk!
  • From Server Farm to Data Table - Networks of New York - An Internet Infrastructure Field Guide. Neal Stephenson, essay, Global fiber network. Hacker tourists. Uncensored Google Data Center Satellite Pictures.

Python 3.6 and Performance, Security Policies, Reporting Bugs, WeChat

posted Jun 3, 2017, 11:22 PM by Sami Lehtinen   [ updated Jun 3, 2017, 11:22 PM ]

  • Python 3.6 improvements. I personally think one of the largest benefits for my programs will be the improved dictionary (dict), because almost everything I do uses dicts extensively. Another extremely nice thing is the 'utf-8' default on Windows. It has been so, so frustrating to notice that print('€uro') makes your program crash because the characters in the string can't be printed on the terminal. Of course you can deal with it, but it's still extremely awkward. I'm almost sure way too many developers have suffered from that simple fail on several levels. Hail the legendary CP437 and Windows-1252.
  • Sometimes user account security policies are just totally and absolutely ridiculous. One organization absolutely required that there must not be any shared accounts and that all accounts must always be personal, no exceptions. After a while they whined about how many accounts were required after all. But that's not all, it gets even better. When the accounts were finally created, they sent one email to all the account holders which contained every account's login credentials, of course including the passwords. That was pretty funny, but it got much funnier still. I thought, fine, I'll go and change my personal password now. The best part is that there's no way for the users to change the passwords for these accounts. Exactly how does this differ from having one or a few shared accounts between all the users? I just don't get it. And before you ask: no, there's no 2FA or anything other than the user name & password required for access. This is a pretty perfect example where they can boast about having such high-tech solutions and policies when in reality it's all BS and doesn't matter at all. The only good thing I found in this was that the initial passwords they set were random, and not something like a default pwd, which I unfortunately see way too often. Yes, I notified them about this. And they didn't give me any reasonable explanation for why they did what they did.
  • As said, it's also interesting to follow what kind of reaction you get when you report something.
    Is it like:
    1) Thank you for reporting. We've just fixed it.
    2) Hey come on, don't waste our time. Nobody gives a bleep about that. It's the way we do stuff here.
    3) Or the most common one, where you don't get any reply at all ever. Nobody cares enough to even read the reports.
  • If I were paranoid, I would think that the user list which was delivered to the users was a honeypot trap. But in reality all the accounts do work, and I'm pretty sure they didn't plan this to be a trap. But you never know; in some other circumstances that kind of setup could be exactly that, a setup.
  • Watched an awesome video about WeChat in China and how that one application does it all. From a privacy perspective that's absolutely horrible. Privacy is eroding faster than ever; I wonder what kind of privacy will be left in a decade or two. - Well, these privacy issues are also being thought about with e-receipts.
  • Something different: Recoilless rifle
  • Finally, some Python 3.6 related performance testing to end some discussions.
All that discussion about + vs % vs format in Python. Here are some timings and code.

Results:

Test pass: 0.05777428500005044
Test format: 0.2573048029989877
Test percent: 0.05647015100112185
Test plus: 0.25693506800053
Self check
test 1 test
test 1 test
test 1 test
test 1 test


Code:

import timeit

def test_pass():
  return 'test 1 test'

def test_format():
  return 'test {0} test'.format(1)
 
def test_percent():
  return 'test %s test' % 1

def test_plus():
  return 'test ' + str(1) + ' test'

print("Test pass:", timeit.timeit(test_pass))
print("Test format:", timeit.timeit(test_format))
print("Test percent:", timeit.timeit(test_percent))
print("Test plus:", timeit.timeit(test_plus))
print('Self check')
print(test_pass())
print(test_format())
print(test_percent())
print(test_plus())


Conclusions:

% clearly seems to be the best way to deal with this; the other options are much slower. That's something you should keep in mind if you're doing something performance-related. I've got a bad habit of using the + construction from older languages. Yet I hear that many C++, C# and Java programmers also use the plus construction because it's 'so clear'. But that's of course debatable.
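For completeness, Python 3.6's f-strings could be dropped into the same harness. A sketch (no timing numbers claimed here, since they vary per machine and interpreter version):

```python
import timeit

def test_fstring():
    # Same output as the other test functions above.
    return f'test {1} test'

print('Test f-string:', timeit.timeit(test_fstring))
print(test_fstring())
```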
kw: Plus vs Percent vs Format, Python string formatting speed, performance, Python, timing, timeit, format.

IoT, IPv6, Dislike, Opinions, NoSQL injection, Windows Server 2016 memory compression

posted May 27, 2017, 11:01 PM by Sami Lehtinen   [ updated May 27, 2017, 11:02 PM ]

  • People seem to love IoT and all the pervasive monitoring it brings. Automated IoT devices monitoring your driving habits and tracking where you go all the time. What's strange is that people really love it. Why? Because they get a large discount on car insurance. This development was pretty expected and will lead to a total loss of privacy for most people in the future. Yet as said, the mobile phone was the first monitoring and tracking device widely accepted by the population. They loved it, so it's pretty obvious other monitoring & tracking will also be gladly accepted. Even funnier, people are ready to pay large sums and frequently upgrade their tracking devices to the latest version. - The same applies to social networks: people are happy and eager to report everything they do directly to intelligence agencies. And they love it. - Great.
  • Even more IPv6 fun. It seems that some routes between Finland and the Central European data center we're using are nicely looping via New York and Chicago. Honestly, this is daily WTF stuff. Instead of getting around 36 ms latency, we're getting a neat 152 ms. Oh joy!
  • Be careful about what you dislike. - Great post! So true. As I've said, you've gotta be flexible. If you disliked or liked something in the past, it doesn't mean you'll dislike or like it in the future, especially when your personal views as well as other factors change. We just talked today at work about adequate quality again. Doing things too well is expensive and wastes resources. Doing things badly is even more expensive and leads to all kinds of resource loss on multiple levels, as well as internal conflict, loss of motivation and other nasty stuff. But it's very hard to get the quality just right. Also, the same thing might be viewed completely differently depending on your own position in the argument, even if I prefer a 'neutral' and 'generalized' view which acknowledges both parties and the complete overview. Also, your position and role might make holding some opinions officially impossible, because it just isn't your task or job. But hey, we all know that. The CEO of an oil company just can't say certain things about oil, energy or the environment. That's part of the job.
  • I thought data is data, and instructions are instructions. Today I saw for the first time a new term which really made me laugh: NoSQL injection. - Ouch!
  • More discussion about IPv6 issues, 6rd and 6to4. Yes, unfortunately not all providers provide native IPv6 or even bother to run 6rd.
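The latency gap mentioned above (36 ms vs 152 ms) is easy to spot-check even without ICMP ping by timing a TCP handshake. A minimal sketch; the example host name in the comment is a placeholder, not a real endpoint:

```python
import socket
import time

def tcp_connect_rtt(host, port, timeout=5.0):
    """Rough latency estimate: time to complete a TCP three-way
    handshake with host:port, in milliseconds. No ping required."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

# Example usage against a hypothetical server in the target data center:
# print(tcp_connect_rtt('server.example.net', 443))
```

Connect time includes one full round trip, so it tracks ping latency closely enough to notice a route looping via New York.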
A friend claimed that Windows Server 2016 doesn't use memory compression. Well, it does support memory compression, and it can be trivially enabled using PowerShell.
PS C:\Users\Administrator> Get-MMAgent

ApplicationLaunchPrefetching : False
ApplicationPreLaunch         : False
MaxOperationAPIFiles         : 256
MemoryCompression            : False
OperationAPI                 : False
PageCombining                : False
PSComputerName               :


Let's "fix" that, in case we conclude it's better to use memory compression than to add more memory or rely on swapping.

PS C:\Users\Administrator> Enable-MMAgent -mc
PS C:\Users\Administrator> Get-MMAgent

ApplicationLaunchPrefetching : False
ApplicationPreLaunch         : False
MaxOperationAPIFiles         : 256
MemoryCompression            : True
OperationAPI                 : False
PageCombining                : False
PSComputerName               :


That's it. Now memory compression is enabled, which reduces the amount of data being swapped to disk.
kw: enable memory compression Windows Server 2016, swap compression, data compression.

For comparison, the same info from the latest Windows 10 desktop.

PS C:\WINDOWS\system32> Get-MMAgent

ApplicationLaunchPrefetching : True
ApplicationPreLaunch         : True
MaxOperationAPIFiles         : 256
MemoryCompression            : True
OperationAPI                 : True
PageCombining                : True
PSComputerName               :


Of course this once again comes down to the balance of storage I/O, RAM, CPU etc. Compression can be amazing or a disaster, depending on the combination of so many different factors. There's no magic way to tell in advance whether it's good or bad.

So it's just like zram or zswap on Linux: it's inherently there, but not enabled by default. Also see: compcache.

Duplicati 2 related observations and thoughts

posted May 20, 2017, 9:37 PM by Sami Lehtinen   [ updated May 20, 2017, 9:37 PM ]

Some random ramblings about experimenting with Duplicati 2 backup software.

When run, it didn't create full-size blocks where only the last one would have been smaller. This raises a curious question: why were so many smaller blocks created at once? I thought it would create data blocks like 'all the other applications' do, meaning a number of maximum-size blocks per run and then one block which isn't maximum size, because it's not 'full'. After using the program for a while, it seems that something strange happened with that particular run, because since then Duplicati 2 has been behaving just as I expected. Maybe I missed that block files are only 1/3 of the files being created.

This raises some positive thoughts: what actually is in those blocks, and whether Duplicati 2 already does some of the optimizations I wrote about earlier, like trying to separate static vs dynamic data into different blocks. That would make compacting, and generally bandwidth and storage management, more efficient.
 
Most programs seem to use the full file path to order data, but some use alternate methods; 7-zip, for example, orders by file extension by default. For a backup system, I think the last-modified date would probably be the best ordering, and after that statistical data on how often something is being inserted / updated / deleted. It's good to note that with data blocks, inserting actually doesn't matter; only updates and deletions do, because those create stale data. Another good question is whether existing partial blocks are themselves being updated, but I wouldn't suspect that, because that kind of updating should be done when compacting, not while adding data to the block storage. There's also a difference between which data is put into which block and how the data is arranged inside the block. The 7-zip default ordering aims at efficient compression dictionary usage.
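The last-modified ordering idea above could be sketched like this. This is a hypothetical helper to illustrate the ordering, not how Duplicati actually arranges its data:

```python
import os

def files_by_mtime(root):
    """Walk a directory tree and return file paths ordered by
    last-modified time, oldest first. The idea: files that change
    rarely end up clustered together, so the blocks holding them
    go stale less often and need less compacting."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                entries.append((os.path.getmtime(path), path))
            except OSError:
                continue  # file vanished mid-walk; skip it
    return [path for _mtime, path in sorted(entries)]
```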

Some IT guys and colleagues asked me why I'm using Duplicati 2, since there are simpler tools which could be used. Here's the reasoning why it's better than many.
 
I've been personally using 7-zip a lot for backups. For local backups. But it's very slow if there's a lot of data, and it's not suitable for maintaining a proper version history unless using a whole lot of disk space.
I needed a clearly better solution. Duplicati 1 provides that, but it still requires a "full backup" every now and then, to keep the history short enough for a reasonable restore time, as well as to reduce the required disk space. This monthly full backup also requires substantial bandwidth, though much less than a daily full backup. As I (of course) prefer efficient off-site backups, it also means that the required bandwidth is precious.

So why should you use Duplicati 2? Because it provides the following key features, which I'm personally really after:
 
1) Data de-duplication on block level
2) Maintaining efficient incremental versioning
3) Secure, encrypted backups
4) Multiple storage / transport options
 
This means that when daily backups are run with reasonably sized data sets, it takes just a few minutes and it's done, consuming a minuscule amount of bandwidth: basically only the changes to the files & databases, even if the source backup data set might be tens or hundreds of gigabytes. Only about 1-5% of the backup data gets updated daily, and therefore we can maintain 3 months' worth of daily versioning data for all of our key systems. Duplicati 2 does this very efficiently.
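Block-level de-duplication (feature 1 above) boils down to hashing fixed-size chunks and storing each unique chunk only once. A toy sketch of the idea, not Duplicati's actual on-disk format; the block size here is just an illustrative value:

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # illustrative; real tools make this configurable

def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def store(data, block_store):
    """Store data into block_store (hash -> bytes) and return the list
    of block hashes that reconstructs it. Blocks already present cost
    no extra space: that is the de-duplication."""
    refs = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)
        refs.append(digest)
    return refs

def restore(refs, block_store):
    """Rebuild the original data from its block hash list."""
    return b''.join(block_store[d] for d in refs)
```

Two backup versions sharing most of their content then share most of their blocks, which is why daily runs only upload the changes.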
Of course we could have had snapshots of the databases or even full server disks, and only kept a remote database copy for disaster recovery (DR) purposes. But in that case we would need a lot more storage space on the servers, and that's just a waste of expensive, fast, high-performance storage.
Now backups are stored on 8 TB archive drives. I've been thinking of moving this to the cloud too, maybe Backblaze or some other cheap storage service. But the current solution works well, and in terms of costs it's extremely cheap to run.
I think I've blogged way too much about backups. But that's one of the most important and basic tasks of an information management department.

Links: Duplicati, Duplicati @ Wikipedia, hubiC, Backblaze B2.

Looking for similar but alternate projects? Check out: Bup and Attic.

ROTI, GitLab, VPN, Email Flood, UDP vs TCP, Deep Web, Compact Fusion Reactor

posted May 20, 2017, 9:30 PM by Sami Lehtinen   [ updated May 20, 2017, 9:30 PM ]

  • I've been thinking about traveling and meetings, and what the ROTI for those is. Payback time or Return On Investment (ROI) is one way to put it, but Return On Time Invested (ROTI) is another way to see it. Is it really worth traveling to meetings, and how much time would be saved if that didn't happen? Often meetings don't provide enough extra value to make them worthwhile. I've also mentioned several times that if there are problems with a project, customers want to have meetings to discuss and dig into the reasons. But if the problems are caused by lack of resources, every additional explanation and/or meeting just reduces the resources available for actually solving the issues. This is also why I don't like ambiguous meetings and/or meeting agendas. Clear background data and decision / action lists are the right way to get it done. Usually that alone is enough to make the actual meeting unnecessary.
  • A new own platform for GitLab? A very nice article about all the considerations which go into running your own hardware versus using cloud platform(s). They say it once again: running in the cloud is very expensive as soon as you reach some reasonable scale.
  • Tested three VPN providers: Ivacy, PureVPN and Private Internet Access (PIA). Got all three working with Linux without any problems. My use is quite rare and very random. All of the providers have VPN gateways at several of the locations I prefer. Support for five parallel devices is of course a nice feature, as is unlimited bandwidth. These are also very cheap with a yearly subscription. On the logging and privacy front it's hard to say; each of these providers 'promises' privacy, but who knows. Like I've written in an earlier post, I'm sure that privacy is limited if you keep pushing the boundaries. You just need to push hard enough, and it's all an illusion.
  • One loopy (buggy) system sent over 10 emails / second for a whole weekend. It seems that some email service providers aren't quite happy about receiving over one million emails to a free account user. Ha ha. The email servers even accumulated huge backlogs of those mails when the final destination server started tarpitting etc. I gave them permission to just delete the email backlog so the load on their email relays would be alleviated.
  • Watched the Deep Web documentary. Nothing to add to this discussion; it's a complex topic and I don't have any kind of definite opinion about it.
  • Nvidia drivers again, so much crap. I've been thinking about replacing the display adapter, but maybe it's not worth it. I'm not watching many movies anyway, so maybe it just doesn't matter enough for me to care.
  • UDP vs TCP - an age-old discussion. TCP is simple and easy, UDP can be a real pain. Use it only if required.
  • Read a long document about 5 GHz Wi-Fi / WLAN & weather radar C-band co-existence issues. Also checked which WiFi channels overlap with the publicly announced weather radars in the area. Interestingly, the OPERA radar database doesn't contain the Helsinki Kumpula weather radar; the FMI radars are listed. There have been examples of how badly configured 5 GHz WiFi can practically render weather radars useless.
  • Something a bit different? Checked out the Lockheed Martin Compact Fusion Reactor. Getting compact fusion reactors at reasonable cost would change a lot of things. 1 gram of deuterium provides as much power as 10 tons of coal, around 18 megawatt hours, and can power one household for more than a year. Also the temperature differences inside the reactor are amazing: superconducting magnets are kept near zero kelvin while the plasma is around 100 million kelvin. Think about the insulation required. This subject was so awesome that I had to watch a few hours of lectures about it. - Will compact fusion reactors work? If they do, it's pretty much guaranteed to change the world. Many people talk about coal, gas, oil and peat power, but the fact is, those are all just thermal power plants using burning to generate heat energy to run turbines. If the heat source is compact and safe enough, any existing thermal power plant can be easily converted to a nuclear plant. Or how about Planet Nine from outer space?
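A quick sanity check of the 'more than a year' claim. The 18 MWh figure is the one quoted above; the household consumption is my own assumption (a rough European ballpark, not a measured value):

```python
# 18 MWh per gram of deuterium is the figure quoted in the text.
DEUTERIUM_GRAM_MWH = 18.0
# Assumption: a household uses roughly 10 MWh of energy per year.
HOUSEHOLD_MWH_PER_YEAR = 10.0

years = DEUTERIUM_GRAM_MWH / HOUSEHOLD_MWH_PER_YEAR
print('1 g of deuterium ~', years, 'household-years of energy')
```

With those numbers one gram covers just under two years, so the claim holds under this assumption.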

OTP, 2FA, VPN, SQL, Bluetooth5, SAFE Network, IoT, ORM, PowerShell

posted May 13, 2017, 8:52 PM by Sami Lehtinen   [ updated May 13, 2017, 8:54 PM ]

  • Just so many web stores have lost sales from me, because they require paper-backed OTP tokens, which I'm naturally not carrying with me. Whenever I want to buy something, everything goes well and I'm ready to purchase, until: you have to provide this strong secure authentication to confirm the purchase. - In many cases, especially if buying something not so important, this means I'm never going to return. Lost sales, due to making the purchase process too complex and annoying for the customer. - Why am I blogging about this again? - Well, there was some stuff on offer which I would have liked to buy, but nope, they made it too hard, so I dropped the purchase. And by the time I have the paper-backed OTP codes, it's too late. Good thing I just saved a couple of hundred euros by not buying that stuff. - From a conscious-purchasing point of view that's awesome. People should always first pre-book everything they want, then spend a week thinking about whether they really need it, and then confirm the purchase a week later. - It would be just so smart. No more impulse buying junk you don't really need.
  • The same also applies to many other applications which require high-security 2FA authentication. I'll log in to the service, but only when actually required. Preferably weekly or monthly, not daily, because I'm just not interested in logging in that often.
  • Reminded myself about the strengths & weaknesses of old VPN protocols, including PPTP, L2TP / IPsec, SSTP, OpenVPN (SSL) and IPsec (IKEv2), and their transports & encryption. It seems that some documentation fails ridiculously: they claim that ESP transfers data over UDP port 50. That's not right. How about IP protocol 50, which is NOT UDP; everyone knows that the UDP protocol number is 17. It's also good to note that IKE uses UDP port 500 for key exchange, and IPsec AH uses protocol number 51. It's also easy to forget that TCP uses protocol number 6, and ICMP is protocol number 1. So often people only talk about TCP & UDP and port numbers. I guess most net users don't even know what protocol numbers are.
  • Nice, latest PostgreSQL supports table partitioning. That's something I've been waiting for a long time.
  • Bluetooth 5 - lots of very nice improvements, along expected lines: longer range, lower power, larger broadcast size, reliability, and better co-operation / interference avoidance with other wireless protocols.
  • Passive WiFi, where an antenna just modulates signals produced by a signal generator, could reduce the power requirements of 'passive' WiFi devices a lot. The signal from the generator could even be used to power the low-power WiFi devices. Sounds like a nice idea, but in practice it could just make WiFi more congested and add extra RF pollution.
  • Reminded myself about the SAFE Network: a distributed P2P network as a platform to build services on. That's nice. I wish all the best for projects like this.
  • Read an ebook about the Industrial Internet of Things (IoT). Unsurprisingly it didn't contain any new information. Yes, it's possible to collect lots of sensor data, do proactive maintenance, machine learning, monitor industrial process efficiency, get better quality etc. There were also several success stories of how companies utilizing such technologies have gained in efficiency and reduced costs. Sure, I can believe that, but that's nothing new; it's just basic stuff if you're interested in improving your business with such technologies, and it isn't anything new anymore. The best part of industrial processes is that they are often quite expensive, and things like maintenance and downtime are really expensive. So even if implementing all this costs a lot of money, it's probably, or at least hopefully, going to save more. But sometimes it's not, and that's why there's an efficient frontier for all of this technology, savings and costs.
  • Spent some time studying PonyORM - it's nice. If I were starting a new Python project now which needed an ORM, I probably would use Pony. Yet many of the older projects use Peewee ORM very successfully, and I'm happy with it.
  • Studied some PowerShell stuff for Windows Server 2016 remote management. - This will be an awesome tool, but I'll need to study it quite a bit more before it's 'production ready'. - I've been looking for a suitable solution, and I guess this will be the one with minimal overhead. So far so good. Yet again, this is stuff which still requires quite a lot of knowledge and maintenance. So even if it's low overhead, it's by no means zero overhead.
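The protocol numbers from the IPsec bullet above can be checked straight from Python's socket module. The ESP and AH constants are platform-dependent, hence the literal fallbacks:

```python
import socket

# IANA IP protocol numbers: these live in the IP header's protocol
# field and are not port numbers.
protocols = {
    'ICMP': socket.IPPROTO_ICMP,                 # 1
    'TCP':  socket.IPPROTO_TCP,                  # 6
    'UDP':  socket.IPPROTO_UDP,                  # 17
    'ESP':  getattr(socket, 'IPPROTO_ESP', 50),  # 50, not "UDP port 50"
    'AH':   getattr(socket, 'IPPROTO_AH', 51),   # 51
}
for name, number in protocols.items():
    print(name, '= protocol number', number)
```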
I'm doing double posts per week again, because I've got a backlog of posts for roughly a year.
