Making of DNSKV.com - The DNS Key Value Storage

One day I were thinking that I could do some kind of hobby project just for fun. First I thought I could run a few Tor relays or something like that. But would that actually make any difference? Or be particularly fun? - No it wouldn't. Boring basic ICT sysadmin work.

I thought what I could do which would potentially help people whom need to communicate and there are currently some limitations / obstacles, excluding the existing mainstream options. I asked a few guys about ideas about running ODoH relay in Finland. Everyone said running ODoH is pointelss, and so on... Just use Tor, etc.

I forgot the thing for a while, but then I happened to read some news about DNS tunneling attacks and leaking sensitive information over DNS. DNS tunneling is nothing new, I knew that very well. But with combination to that, I've also thought about messaging apps, privacy, security, anonymity and encryption a lot lately.

I asked on Twitter and on several hacker / tech forums, if anyone knows any application and or service for sending messages / data over DNS trivially, without installing or configuring any software. I want to try it quickly and spend a few minutes testing and configuring it. Everyone practically laughed, it can't be done, you'll need all kind of domains, server infrastructure, databases, DNS servers, a lot's of configuration, testing, etc.

Several replies said something like: "it won't work", "it's not possible", "it's pointless", "nobody would use that".

Hmm, awesome. That's actually what I'm looking for. Now it sounds like a fun thing to do. If everyone believes it won't work, then it'll come as a surprise to them and they're not prepared for it. This shuold be a nice and fun challenge to complete.

I decided to make a two pieces of software, which both are practically trivial to do, if you're familiar with the stuff.

1) Hosted DNS Key Value Database @ dnskv.com
2) DNS Messenger aka dnsmsg @ dnskv.com/dnsmsg

What does the combination of these pieces of software allow everyone / users do?

Well, it allows anyone to send and receive messages using only DNS protocol and communication. This means that if you've got some IP / domains / services blocked (of course excluding dnskv.com), you can send and receive messages from the network. Even if HTTPS, HTTP, SSH, VPNs and so on would be blocked.

I'm sure that some hackers and similarly minded people have used this method as somewhat covert communication channel for a long time. But as mentioned, it hasn't been easy enough for average or beginner users. Dnskv just makes it very easy and available for everyone.

I had several key value based designs and approaches in mind how to implement it. But finally I ended up with unidirectional communication channel design and message daisy chaining.

When using dnsmsg which allows sending and receiving messages on encrypted, obfuscated channels. To receive messages, all you need is the channel name. To send messages, all you need is the channel name. But, if you don't specify the channel update key, then the channel will be write-once and read-only until it expires.

I also added features to the dnsmsg for posting and receiving small binary files as well as Unicode text. The channel's (default) cache TTL is 3600 seconds, which means that there's basically no point of checking for new messages any more often than hourly. The default still remains at one hour, and that's not a problem, due to low bandwidth, high overhead and covert nature of the communication. Program also contains auto-receiver mode, where it will hourly check for new messages on selected channel(s).

After I found out that it wouldn't cause loading issues on the server side, I considerably lowered the user configurable TTL limits. I just were worried that getting hit by hacker news crowd would cause load problems.

Some users complained about the (intentionally) slow transfer rates. It's that way on purpose. Because the service is supposed to be only used in situations when there are no other viable options like Tor, Matrix, email or HTTPS available. There are some rate limits on server side configured as well, but those are designed to block clear flooding DoS attempts and do not affect normal users.

I still don't know if anyone is going to use it, but it remains to be seen. Hopefully someone finds it useful. Isn't that exactly what interesting experimental projects are supposed to be like?

At least in some cases it could be very helpful, because now you can send messages using any device which does DNS queries. This means that any device which allows you define DNS name like www.google.com can send messages, even if it gives error after that, the message could have been potentially sent at that point. I can see some specific situations where this can be tremendously helpful, providing a covert communication channel. As example browser does DNS lookup and then blocks the HTTPS request. But the actual message is already out. It's also possible to check if a specified key is defined using A / AAAA queries alone. Which could allow receiving information even if DNS TXT queries aren't possible.

Actual setup

Challenges and results:
1) Custom DNS server back-end written in Python @ ns.dnskv.com.
2) Updating DNS root servers took a while, especially annoying if you're not 100% sure that everything is configured correctly.
3) Found some interesting aspects about DNS which I didn't know earlier. This is why these experiments are also great ways to learn about things.
4) DNSSEC was turned on by default. I didn't implement DNSSEC. I had to wait the disabling to get activated as well which also took around 24 hours.
5) I also tried configuring BIND as reverse DNS server, but it didn't work out, it was a really bad setup, that combined with the DNSSEC issues was hard to troubleshoot.
6) I had issues with DNS glue, but I got it finally configured correctly.
7) It was bit hard to know when to send additional records in response and when to send actual response records, with NS type queries. When to send SOA? etc.

But nobody does these kind of projects anyway, unless:
1) It's your job, and you have to bear it
2) You're just looking for trouble and enjoy it (?)

Issues with BIND

I needed three simple things.1) BIND to answer to domain requests2) BIND to forward requests dynamically to back-end sub-server3) BIND not to answer any other queries

I found out this to be incredibly hard. It's like CSS, it's almost there, but nope, not quite right after all.

After very long tuning (a few days). I managed to get the configuration so that everything seems to be working. I'm now forwarding all domain requests to the back-end sub-server. Yet the BIND is doing DNS flattening for some CNAMEs. It was easy to get to the stage, where BIND worked, or didn't work. But usually when it worked, it allowed global recursive queries. That's something which I really don't want to do. After a lot of experimentation, I managed to find configuration which works as desired.

But there's still one huge problem, when I enable my name server code, everything, absolutely everything works. Except DNS trace breaks down. Why? I do have glue and SOA in place. I've copied everything (as in DNS records in replies / queries) in as much detail as I can from other places. But probably BIND is doing something, which nobody seems to know about. Some niche stuff, used to validate DNS servers? Because that's missing, the DNS chain gets broken. Sure I've got NS records as well as additional records with name & ip info. But nope, it just won't work. Now I'm down to debugging queries and responses with TCPdump, BIND does something which my server doesn't. What?! I've got no idea. When I'm using BIND, it works, when not, it breaks down (after a while). That's the most annoying part, it keeps working for quite a good while. This also makes DNS stuff extremely annoying. Configuration might be broken or good, you just don't know it until some time has passed.

Two ways later... Analyzing the logs and reading tons of specification documentations and the source code to figure out what the magic trick is. Afaik, I've done everything exactly as other services do it, I've checked all configuration over and over, tested all queries, everything works. Except it still stops working after a while.

Based on things I saw in the logs, my current best guess is that it's a few magic PTR queries, I've seen in debug logs, which are the key for accepting the glue data.

After very careful analysis, the PTR queries and responses got nothing to do with it. It seems that the problem once again is the BIND which replies like the domain wouldn't exist. This is totally hopeless, but I've got one final idea to try. Now it's so small problem anymore, I guess it's just impossible to get this to work with BIND.

When running self written raw custom authoritative server, I found out stuff about additional records I didn't even know about. As example how MX queries are potentially affected by those. It seems that some servers use the info directly from those. And it might lead to some unexpected results in some situations. Lot's of tcpdump, packet analysis, and comparisons from different services and configurations. Dig is priceless tool. Thanks BIND, that's the reason why I initially got so confused and frustrated, because it seems that BIND also removes and mangles some key information when forwarding / caching information. Which was the reason my domain repeatedly went completely down.

Other alternatives

Then I got sick'n'tired of BIND which is causing all these problems. And started looking into dnsmasq as light weight and much better option. It was interesting, but lacked required rate limit options for preventing mass abuse, which were mandatory on my personal requirements list. I didn't implement any rate limits in the custom backend code. Then I checked second option, CoreDNS. It looks good, lots of features, but clearly not designed as authoritative internet DNS server. Next thought how about PowerDNS? Ok, that's the next stop, I installed in using Docker.

After a lot's of testing I ended up using the awesome dnsdist as the clearly the best DNS reverse cache option.

More problems and learning

I also found out that some clients are very aggressively sending HTTPS SVCB (Type 65) queries, which are used with Encrypted Client Hello. Ended up reading draft-ietf-dnsop-svcb-https documentation, and thinking how to implement it. Now it's done. Even if the site itself doesn't support ECH.

Dealing with non-IDNA2008 compatible domain names, initially those requests were directly rejected. It seems that there's some other binary encoding scheme, but I personally don't know what the practical benefits are. I just changed settings so that if the domain name doesn't match IDNA encoding rules, it'll be rejected. Some characters are bit problematic like the € character. I guess there's some kind of issue with domain encodings, because some encoders seem to accept € character and upper case letters like ÖÄÅÖÄÅ and some others won't. I guess it's the non IDNA2008 binary mode, which allows basically anything to be encoded in domain name. Need to read more about it, if I'm going to implement it. At least it's clear that the current framework I'm using, can't deal with it.

Another discussion was the system load management. I did get about 60k page visits very quickly after publishing the service. It seems that on the DNS level I got about 100k static requests and 9k dynamic requests, totaling 110k requests when rounded. Which split into categories 3k dynamic (database miss) requests, 5k (database hit) requests. Only 700 database inserts and 1050 database updates (to existing records). I were expecting a lot more database activity, the server is actually tested to support 5k database lookups / updates per second. I were initially worried that it might not be enough, and weren't quite happy with the performance. It turns out that the worry was totally misplaced. It seems that during the highest load spike there were about 10k requests per hour, which is basically nothing compared to the offered performance.

Thank you: Habbie, #dns, freenode. Now unset values won't return NXDOMAIN anymore, just empty NOERROR.

Later additions and comments

I've got a few questions why I'm doing it like this, and not as a PowerDNS extension. Well, one thing with crazy side hobby projects is to do things in strange ways and learn in the process. I've learned many things already, and I'll probably learn soon more.

I enjoyed dealing with SO_REUSEADDR and SO_REUSEPORT under load as well as number of file descriptors. Yes, basic stuff, but can bite you. Especially if trying to quickly restart services under load to get the latest version getting served.

Read and understood: rfc8020, NXDOMAIN, rfc8020, rfc2308, NCACHE, rfc2308. Most of it I were familiar to me, but there were a few but critical technical specification details I weren't aware about. - If you want to understand something, reading rfc isn't enough. Write production implementation and many hidden things might get revealed.

Added new options for the program to support time units, second, minute, hour, day, week. As well as allowing users to set custom expiry and ttl or records using the new option t for Time-To-Live (TTL).

Tried to put all files related to dnsmsg all into singe Windows binary container. To do that I needed to use pyinstaller. I tried pyinstaller with print('Hello World!') one liner and even that resulted with binary which was practically unusable. Why? Well, Windows seems to regard it as virus, as many other scanners. Which is a false positive of course.

Strange observations, some recursive DNS resolvers seem to make A query before TXT query. Don't know why, but some others won't. Maybe it's a feature of some specific software / configuration? At least Telia DNS servers in Finland seem to be doing that. If you know the reasons behind this behavior, I'm curious to hear more.

Added ANAME and ALIAS support, to provide multiple addresses efficiently for the naked / apex domain. No two IPv4 and two IPv6 addresses are announced for web-browsers and IPv4 and IPv6 addresses for the DNS service.

After all got a bit surprised about the end result. It seems that there are basically three groups.
1) Those who didn't understand at all how DNS works. The which was the largest group.
2) Those who understood it so well that they weren't interested because anyone whom needs it, can of course build a similar personal service. This is quite a small hacker / infosec group.
3) And the smallest group of all - A few curious ones whom actually understood how it works used and use the service.

It became quickly evident that a simplified usage / user tutorial and instructions with examples was badly needed for people whom aren't familiar with DNS. Including screenshots using mobile devices and client software.

Some things were really confusing when building the Windows binary version. As example some libraries behave differently, if run directly with Python and when run after CX_Freeze. Totally enraging and annoying. When run directly, I'll get nice unicode string, when run after CX_Freeze, I'll get raw string with first byte indicating the length. Crazy, didn't change anything in the code. But now my code needs to take into account this kind of inconsistent stuff. Agonizing, frustrating and illogical, enraging. Fixed using some horrible magic fix code, but the root cause is still totally unknown. SIGH!

Responses to the privacy questions

1) I've got asked a several times about no logs and statistics logging. Direct answer to direct questions. There are no logs for normal requests. Only requests which trigger exception in the handling, are logged. The logs will be automatically deleted after 24 hours, or when manually debugging when the potential issue is found and fixed. Yet there hasn't been any issues during several last months.

2) What about statistics, some sites save usage statistics which are even worse than logging. In this case, the statistics are accumulated on hourly level, and doesn't contain any other information than timestamp, number of requests, number of responses, number of hits and misses on the database lookup level, number of inserts and number of updates to the database. As you can see, there are no user / ip / port or what ever data being stored at all. Ie. classic we're spying on you kind of statistics. The statistics is just used to monitor the system performance, reliability and load management on very coarse (hourly) level.

Whoa, you got this far? Now go and see the primary site: https://dnskv.com

2022-06-19