Latest Posts (11 found)
Blargh 1 month ago

The strange webserver hot potato — sending file descriptors

I’ve previously mentioned my io-uring webserver tarweb. I’ve now added another interesting aspect to it. As you may or may not be aware, on Linux it’s possible to send a file descriptor from one process to another over a unix domain socket. That’s actually pretty magic if you think about it. You can also send unix credentials and SELinux security contexts, but that’s a story for another day.

I want to run some domains using my webserver “tarweb”. But not all. And I want to host them on a single IP address, on the normal HTTPS port 443. Simple, right? Just use nginx’s stream module with ssl_preread? Ah, but I don’t want nginx to stay in the path. After SNI (read: “browser saying which domain it wants”) has been identified, I want the TCP connection to go directly from the browser to the correct backend. I’m sure somewhere on the internet there’s already an SNI router that does this, but all the ones I found stay in line with the request path, adding a hop. A few reasons:

- Livecount has an open websocket to every open browser tab in the world reading a given page, so they add up. (No, it doesn’t log. It just keeps count.)
- I want the backend to know the real client IP address from the socket itself.
- I don’t want restarting nginx to cut existing connections to backends.
- I’d like to use TLS keys that the nginx user doesn’t have access to.
- Last time I got blog posts on HackerNews, nginx ran out of file descriptors because of livecount, and started serving 500s even for plain old static files on disk. For now I’ve moved livecount to a different port, but in the long run I want it back on port 443, and yet isolated from nginx so that the latter keeps working even if livecount is overloaded.

So I built a proof of concept SNI router. It is a frontline server receiving TCP connections, on which it then snoops the SNI from the TLS ClientHello, and routes the connection according to its given rules. Anything it reads from the socket is sent along to the real backend, along with the file descriptor. So the backend (in my use that’s tarweb) needs cooperating code to receive the new connection. It’s not the cleanest code, but it works. I got ChatGPT to write the boring “parse the TLS record / ClientHello” parts. Rust is a memory safe language, so “how bad could it be?”. :-) It seems to work for all the currently used TLS versions.

As I said, it requires the backend to be ready to receive “hey, here’s a file descriptor, and here’s the first few hundred bytes you should treat as if you’ve read them from the client”. File descriptors don’t have an operation to “unread”. If they did, then this would be easier. Then it would “just” be a matter of giving a backend webserver a file descriptor. For some use cases that could mean starting a new webserver process that reads and writes from stdin/stdout. Not super efficient to go back to the fork-exec-per-connection model from the previous century, but it would achieve the direct connection. But the details are academic. We do need to pass along the snooped bytes somehow, or the TLS handshake won’t succeed. Which means it does need cooperation from the backend.

Because the SNI router never writes to the client, and therefore doesn’t perform a TLS handshake, it doesn’t need any private keys or certificates. The SNI router has no secrets, and sees no secrets.

I also added a mode that proxies the TCP connection, if some SNI should be routed to a different server. But of course then it’s not possible to pass the file descriptor. So encrypted bytes will bounce on the SNI router for that kind of flow. But still the SNI router is not able to decrypt anything. A downside is of course that bouncing the connection around the world will slow it down, add latency, and waste resources. Having all bytes bounce on the SNI router also triples the total number of file descriptors for the connection (one on the backend, then one each on the router for upstream and downstream). There are limits per process and system wide, and the more you have the more you need to juggle them in code. It also wastes CPU and RAM. So pass the file descriptor where possible.

So now my setup has the SNI router accept the connection, and then throw the very file descriptor over to tarweb, saying “you deal with this TCP connection”. Tarweb does the TLS handshake, and then throws the TLS session keys over to the kernel, saying “I can’t be bothered doing encryption, you do it”, and then actually handles the HTTP requests.

Well actually, there’s another strange indirection. When tarweb receives a file descriptor, it uses io-uring “registered files” to turn it into a “fixed file handle”, and closes the original file descriptor. On the kernel side there’s still a file descriptor of course, but there’s nothing left in the process’s file descriptor table. This improves performance a bit on the linux kernel side.

The SNI router does not use io-uring. At least not yet. Its job is much smaller (it doesn’t even do a TLS handshake), much more brief (it almost immediately passes the file descriptor to tarweb), and has much less concurrency (the connections are so short lived as far as it’s concerned), so it may not be worth it. In normal use the SNI router only needs these syscalls per connection:

- an accept for the new connection,
- a read of a few hundred bytes of ClientHello,
- a sendmsg of the same size to pass it on,
- a close to forget the file descriptor.

At the risk of going off on an unrelated tangent, HTTP/3 (QUIC-based) has an interesting way of telling a client to “go over there”. A built in load balancer inside the protocol, you could say, sparing the load balancer from needing to proxy everything. This opens up opportunities to steer not just on SNI, and is much more flexible than DNS, all without needing the “proxy” to be inline. E.g. say a browser is in Sweden, and you have servers in Norway and Italy. And say you have measured, and find that it would be best if the browser connected to your Norway server. But due to peering agreements and other fun stuff, Italy will be preferred on any BGP anycasted address. You then have a few possible options, and I do mean they’re all possible:

- Have the browser connect to a Norway-specific hostname, with Norway-specific IP addresses. Not great. People will start bookmarking these URLs, and what happens when you move your Norway servers to Denmark? The “Norway” hostname now goes to servers in Denmark? (You can redirect with Javascript, but this still has the same problem.)
- Use DNS based load balancing, giving Swedish browsers the Norway unicast IPs. Yes… but this is WAY more work than you probably think. And WAY less reliable at giving the best experience for the long tail. And sometimes your most important customer is in that long tail.
- Try to traffic engineer the whole Internet with BGP announcement tweaks. Good luck with that, for the medium to long tail.
- Install servers in Sweden, and any other place you may have users. Then you can anycast your addresses from there, and have full control of how you proxy (or packet by packet traffic engineer over tunnels) them. Expensive if you have many locations you need to do this in. Some traffic will still go to the wrong anycast entry point, but pretty feasible though expensive.

The two DNS-based ones also have the valid concern that screwing up DNS can have bad consequences. If you can leave DNS alone, that’s better.

Back to HTTP/3. If you’ve set up HTTP/3 it may be because you care about latency. It’s then easier to act on information you have about every single connection. On an individual connection basis you can tell the browser in Sweden that it should now talk to the servers in Norway. All without DNS or anycast. Which is nice, because running a webserver is hard enough. Also running a dynamic DNS service or anycast has even more opportunities to blow up fantastically. I should add that HTTP/3 doesn’t have the “running out of file descriptors” problem. Being based on UDP, you can run your entire service with just a single file descriptor. Connections are identified by IDs, not 5-tuples.

So why didn’t I just use HTTP/3? No support for that (yet). And besides:

- HTTP/3 is complex. You can build a weird io-uring kTLS based webserver on a weekend, and control everything (except TLS handshakes). Implementing HTTP/3 from scratch, and controlling everything, is a different beast.
- HTTP/1 needs to still work. Not all clients support HTTP/3, and HTTP/1 or 2 is even used to bootstrap HTTP/3 via its Alt-Svc header.
- Preferred address in HTTP/3 is just a suggestion. Browsers don’t have to actually move.

What about ESNI and ECH? From some skimming, ESNI should “just work”, with just a minor decryption operation in the SNI router. ECH seems harder. It should still be doable, but the SNI router will need to do the full handshake, or close to it. And after taking its routing decision it needs to transfer the encryption state to the backend, along with the file descriptor. This is not impossible, of course. It’s similar to how tarweb passes the TLS session keys to the kernel. But it likely does mean that the SNI router needs to have access to both the TLS session keys and maybe even the domain TLS private keys. But that’s a problem for another day.

Fun fact: I passed file descriptors between processes in injcode, but it was only ever a proof of concept that only worked on 32bit x86, and the code doesn’t look like it actually does it? Anyway, I can’t expect to remember code from 17 years ago.

Related posts:

- tarweb was first written in C++.
- livecount keeps long lived connections.
- tarweb rewritten in Rust, and using io-uring.

0 views
Blargh 2 months ago

Ideal programming language

My last post about Go got some attention. In fact, two of my posts got attention that day, which broke my nginx since I was running livecount behind nginx, making me run out of file descriptors when thousands of people had the page open. It’s a shame that I had to turn off livecount, since it’d be cool to see the stats. But I was out of the country, with unreliable access to both Internet and even electricity in hotels, so I couldn’t implement the real fix until I got back, when it had already mostly died down. I knew this was a problem with livecount, of course, and I even allude to it in its blog post.

Anyway, back to programming languages. The reactions to my post can be summarized as:

- “Oh yes, these things are definite flaws in the language.”
- “What you’re saying is true, but it’s not a problem.”
- “Your post is pointless. You’re dumb. You don’t understand Go. Here let me explain your own blog post to you […]”

I respect the first two. The last one has to be from people who are too emotionally invested in their tools, and take articles like this as a personal attack of some sort. They go out of their way to be offended, and then start screaming “but I don’t fucking want guitar lessons!”. They want to counter attack against another programming language, thinking I would take it personally too. Maybe this heretic is a Java programmer, and that’s why he’s stupid? (Bad guess.) It also reminded me of PHP programmers back in the PHP 3.x days who would die on the hill of defending PHP as an awesome language, while admitting that they knew literally no other language. What should they know of England who only England know?

I’m not offended. Those replies are not offensive; they’re boring. There’s nothing to learn from the comments, and probably not from the people making such comments in general, either.

Also, it seems that somebody managed to get my whole blog comment database deleted from disqus. Either disqus itself was hacked, or just my account with them. Or someone tricked disqus into deleting it. They managed to restore it, though.

Keeping this third type of commenter in mind, I got an email a few days later asking what programming languages are “closer to ideal”, and if I maybe have a blog post in me about that. I don’t know who’s asking, so I replied with the long version, while still taking the question at face value.

Ideal… Well, this is getting into the space of “what is the best programming language”, which doesn’t have a perfect answer. To do what? To make an Android app (something I’m not an expert in; I’ve just made one simple one), I think Kotlin seems nice. But I don’t know it very well. For web development, something I also don’t do much, it’s probably Typescript. For maximum portability for systems programming, C or C++ (depending on whether all your target platforms (e.g. embedded stuff) support C++) is probably best. But these are practical answers. Some people like Lisp. Others like Haskell. Rust strikes a good position between practical, low level control, safe, and a high level type system. If the Rust compiler supports your platform, then it’s pretty much as portable as C/C++.

I’ve written about the deficiencies of Java and Go because the ways they are deficient are interesting. I don’t find the ways C++ is deficient to be interesting. C++ is what it is. I happen to enjoy coding C++. I also have thoughts about the trap of accidentally writing too much in bash. I have a draft of things wrong with Rust, but so far I think they are all fixable. (E.g. no placement syntax has been defined yet.) But I have no interest in writing a blog post about the lack of memory safety of C++. It is what it is.

Java’s deficiencies are interesting because they are the best guesses of the future that 1990 had to offer. And those guesses were almost all wrong. Go’s deficiencies are interesting/frustrating because it (almost entirely) is the best that the 1980s to early 1990s had to offer. And yet it launched in 2009.

I don’t enjoy coding Javascript. So I’m experimenting with writing frontend stuff in Rust, compiled to WASM. So far it works for me, but it is not something I’d recommend for anyone who wants to get anything done.

But no, I won’t be writing up which programming language is “ideal”, because it’s one of those “it depends”.

Blargh 4 months ago

Go is still not good

Previous posts Why Go is not my favourite language and Go programs are not portable have me critiquing Go for over a decade. These things about Go are bugging me more and more. Mostly because they’re so unnecessary. The world knew better, and yet Go was created the way it was. Readers of previous posts will find some things repeated here. Sorry about that.

Here’s an example of the language forcing you to do the wrong thing. It’s very helpful for the reader of code (and code is read more often than it’s written) to minimize the scope of a variable. If by mere syntax you can tell the reader that a variable is used in just these two lines, then that’s a good thing. (Enough has been said about this verbose repeated boilerplate that I don’t have to. I also don’t particularly care.) So that’s fine. The reader knows the variable is here and only here. But then you encounter this: Wait, what? Why is the variable reused for something else? Is there something subtle I’m not seeing? Even if we change that to a fresh declaration, we’re left to wonder why the variable is in scope for (potentially) the rest of the function. Why? Is it read later? Especially when looking for bugs, an experienced coder will see these things and slow down, because here be dragons. Ok, now I’ve wasted a couple of seconds on the red herring of the reused variable. Is it perhaps a bug that the function ends with this? Why does the scope of the variable extend way beyond where it’s relevant? The code would have been so much easier to read if only its scope had been smaller. But that’s not syntactically possible in Go. This was not thought through. Deciding on this was not thinking, it was typing.

Look at this nonsense: Go was not satisfied with one billion dollar mistake, so they decided to have two flavors of nil. “What color is your nil?” — The two billion dollar mistake. The reason for the difference boils down to, again, not thinking, just typing.

Adding a magic comment near the top of the file for conditional compilation must be the dumbest thing ever.
Anybody who’s actually tried to maintain a portable program will tell you this will only cause suffering. It’s an Aristotelian approach to the science of designing a language: lock yourself up in a room, and never test your hypotheses against reality. The problem is that this is not year 350 BCE. We actually have experience that, air resistance aside, heavy and light objects fall at the same speed. And we have experience with portable programs, and would not do something this dumb. If this had been the year 350 BCE, then this could be forgiven. Science as we know it hadn’t been invented yet. But this is after decades of very widely available experience in portability. More details in this post.

What does this print? Probably not what you meant. Who wants that? Nobody wants that. Ok, how about this? If you guessed the right answer, then you know more than anybody should have to know about the quirks of a stupid programming language.

Even in a GC language, sometimes you just can’t wait to destroy a resource. It really does need to run as we leave the local code, be it by normal return, or via an exception (aka panic). What we clearly want is RAII, or something like it. Java has it (try-with-resources). Python has it (the with statement). Though Python is almost entirely refcounted, so one can pretty much rely on the finalizer being called. But if it’s important, then there’s the syntax. Go? Go makes you go read the manual and see if this particular resource needs to have a defer function called on it, and which one. This is so dumb. Some resources need a defer destroy. Some don’t. Which ones? Good fucking luck.

And you also regularly end up with stuff like this monstrosity: Yes, this is what you NEED to do to safely write something to a file in Go. What’s this, a second Close? Oh yeah, of course that’s needed. Is it even safe to double-close, or does my defer need to check for that? It happens to be safe on an os.File, but on other things: WHO KNOWS?! (Largely a repeat of part of a previous post.)

Go says it doesn’t have exceptions.
Go makes it extremely awkward to use exceptions, because they want to punish programmers who use them. Ok, fine so far. But all Go programmers must still write exception safe code. Because while they don’t use exceptions, other code will. Things will panic. So you need — not should, NEED — to write code like: What is this stupid middle endian system? That’s dumb, just like putting the day in the middle of a date is dumb. MMDDYY, honestly? (Separate rant.)

But panic will terminate the program, they say, so why do you care if you unlock a mutex five milliseconds before it exits anyway? Because what if something swallows that exception and carries on as normal, and you’re now stuck with a locked mutex? But surely nobody would do that? Reasonable and strict coding standards would surely prevent it, under penalty of being fired? The standard library does that, in places. And the standard library HTTP server does that, for exceptions in the HTTP handlers. All hope is lost. You MUST write exception safe code. But you can’t use exceptions. You can only have the downsides of exceptions be thrust upon you. Don’t let them gaslight you.

If you stuff random binary data into a Go string, Go just steams along, as described in this post. Over the decades I have lost data to tools skipping non-UTF-8 filenames. I should not be blamed for having files that were named before UTF-8 existed. Well… I had them. They’re gone now. They were silently skipped in a backup/restore. Go wants you to continue losing data. Or at least, when you lose data, it’ll say “well, what (encoding) was the data wearing?”. Or how about you just do something more thought through, when you design a language? How about doing the right thing, instead of the obviously wrong simple thing?

Why do I care about memory use? RAM is cheap. Much cheaper than the time it takes to read this blog post. I care because my service runs on a cloud instance where you actually pay for RAM.
Or you run containers, and you want to run a thousand of them on the same machine. Your data may fit in RAM, but it’s still expensive if you have to give your thousand containers 4TiB of RAM instead of 1TiB. You can manually trigger a GC run with runtime.GC(), but “oh no don’t do that”, they say, “it’ll run when it has to, just trust it”. Yeah, 90% of the time, that works every time. But then it doesn’t. I rewrote some stuff in another language because over time the Go version would use more and more memory.

We knew better. This was not the COBOL debate over whether to use symbols or English words. And it’s not like when we didn’t know at the time that Java’s ideas were bad, because we did know Go’s ideas were bad. We already knew better than Go, and yet now we’re stuck with bad Go codebases.

This post was discussed on HackerNews on 2025-08-22.

Further reading:

- Data race patterns in Go
- Lies we tell ourselves to keep using Golang
- I want off mr golang’s wild ride

Blargh 5 months ago

Setting clock source with GNU Radio

I bought a GPS Disciplined Oscillator (GPSDO), because I thought it’d be fun for various projects. Specifically I bought this one. I started by calibrating my ICOM IC-9700. I made sure the GPSDO got a GPS lock, and connected it to the 9700’s 10MHz reference port, with a 20dB attenuator inline, just in case. Ok, the receive frequency moved a bit, but how do I know it was improved? My D75 was still about 200Hz off frequency. Segal’s law paraphrased: “Someone with one radio knows what frequency they’re on. Someone with two radios is never sure”. Unless, of course, that person has two radios with disciplined oscillators. Which I do. I also have a USRP B200 with an added GPSDO accessory. Sidenote: wow, that’s gotten expensive. Today I’d probably use the same GPSDO from DXPatrol instead (it has four outputs). Note that if you do have the GPSDO installed in the B200, then you cannot use an external 10MHz reference. It’s a known issue. Then again, if you paid this much, why would you not use it? First I thought that surely the best reference would be the default, so I should be able to just send a tone, having configured only frequency and output gain. But that doesn’t seem right. It was off by almost a kHz, compared to the 9700! That’s worse than the D75. Let me try with the example in RustRadio, where I know I set the clock source. Yeah, that’s perfect. So GNU Radio is not setting the clock source, but instead defaulting to the undisciplined internal clock. I looked all over the GNU Radio Companion, testing both the current source block and the deprecated one. I added various clock source arguments to driver strings and parameters. No change. Then I went to check the source. The block does have a clock source parameter, but it’s hidden. Looking further, it’s not only hidden, but also unused. I guess they kept it only to prevent old flowgraphs from having syntax errors. But there is a way to set the clock source: by sending a message. So I created a flowgraph that strobes a clock source command message at the source block. It worked.
You can clearly see where the strobe message told the Source block to stop being a silly goose. The internal clock can be seen to be almost a kHz off, while the GPSDO is so perfect you can’t see any error, even with the waterfall zoomed in as far as it goes. I expected the 9700 to only support a one-off calibration, but it looks like it locks on to the external reference, and keeps tuning forever. And it has a button saying “Cancel Sync”. It even keeps it enabled across reboots. Nice. Does this mean that this injection board is obsolete? My understanding didn’t come out of nothing; here’s a blog post that corroborates my memory. I guess it was solved in a firmware update? My firmware version is 1.44. Oh I see, a comment from 2022 says that this has now been fixed. Which… corroborates what I’m seeing now. So yeah. No need for an injection board. Just plug a 10MHz reference into the reference port.

Blargh 5 months ago

Software defined KISS modem

I’ve kept working on my SDR framework in Rust called RustRadio, which I’ve blogged about twice before. I’ve been adding a little bit here, a little bit there, with one of my goals being to control a whole AX.25 stack. As seen in the diagram in this post, we need:

- Applications, client and server — I’ve made those.
- An AX.25 connected mode stack (OSI layer 4, basically) — the kernel’s sucks, so I made that too.
- A modem (OSI layers 1-2), turning digital packets into analog radio — the topic of this post.

Applications talk in terms of streams. The AX.25 implementation turns that into individual data frames. The most common protocol for sending and receiving frames is KISS.

I’ve not been happy with the existing KISS modems for a few reasons. The main one is that they just convert between packets and audio. I don’t want audio, I want I/Q signals suitable for SDRs. On the transmit side it’s less of a problem for regular 1200bps AX.25, since either the radio will turn audio into an FM-modulated signal, or if using an SDR it’s trivial to add the audio-to-I/Q step. On transmit you do have to trigger PTT, though. You can do VOX, but it’s not optimal. But on the receive side it’s a completely different matter. Once it’s audio, the information about the RF signal strength is gone. That makes it impossible to work on more advanced reception strategies such as whole packet clock recovery, or soft decoding. Soft decoding would allow things like “CRC doesn’t match, but this one bit had a very low RF signal strength, so if flipping that bit fixes the CRC, then that’s probably correct”.

Once you have a pluggable KISS modem you can also innovate on making the modem better. A simple example is to just run the same modem in multiple copies, thereby increasing the bandwidth (both in the Hz sense and the bps sense). Since SDRs are not bound to audio as a communication medium, they can also be changed to use more efficient modulations. Wouldn’t it be cool to build a QAM modulation scheme, with LDPC and “real” soft decoding?

Yes, an SDR based modem does have two main challenges:

- Power. SDRs don’t transmit at high power, so you need to get the signal through a power amplifier.
- Duplex. Most TX-capable SDRs have two antenna ports. One for TX, one for RX. You’ll need to have two antennas, or figure out a safe way to transmit on the same antenna without destroying the RX port.

For the duplex problem, the cheap and simple solution is to use frequencies on different bands, and put a band pass filter on the receive port, thus blocking the transmitted power. SDR outputs are not clean, so you’ll need a filter on the transmit path too anyway. In other words, you can just use a diplexer. It gets harder if RX and TX need to be on the same band, or worse, the exact same frequency. Repeaters tend to use cavity filters. But that’s a bit bulky for my use cases. And in any case they don’t work if the frequency is exactly the same. More likely a better fit here is to use half duplex, with a relay switching from RX to TX and back. But you need to synchronize it so that there’s no race condition that accidentally plows 10W into your receive port, even for a split second. But that’s a problem for the future. For now I’m just using two antennas.

I’ve implemented it. It works. It’s less than 250 lines of Rust, and the actual transmitter and receiver are really easy to follow. Well… to me at least.

In order to not introduce too many things at a time, here’s how to use the regular Linux kernel stack with my new bell202 modem. Bell202 is the standard and most used amateur radio data mode, often just referred to as “1200bps packet”. The steps:

- Build and start the modem.
- Create the Linux AX.25 config.
- Attach the kernel to the modem.
- Now use it as normal.

Blargh 5 months ago

QO100 early success

I have heard and been heard via QO-100! As a licensed radio amateur I have sent signals via satellite as far away as Brazil. QO-100 is the first geostationary satellite with an amateur radio payload. A “repeater”, if you will. Geostationary means that you just aim your antenna (dish) once, and you can use it forever. This is amazing for tweaking and experimenting. Other amateur radio satellites are only visible in the sky for minutes at a time, requiring you to chase them across the sky to make a contact before it’s gone. They also fly lower, meaning they can only see a small part of the world at a time. QO-100 has constant line-of-sight to all of Africa, Europe, India, and parts of Brazil. Other “birds” (satellites) can be accessed using a normal handheld FM radio and something like an arrow antenna. Well, you should actually have two radios, so that you can hear yourself on the downlink while transmitting. There are also linear amateur radio satellites. For them you need SSB radios, which narrows down which radios you can use. And you still need to chase them across the sky. For QO-100, though, you don’t just need another modulation, you need more complicated frequencies. The uplink for QO-100 is 2.4GHz, and the downlink is 10.4895GHz. (I’m only looking at the narrowband transponder, for now.) There are many options. I chose mine because I want to do as much as possible with software defined radios, and I want my components to work for other experiments, if this one doesn’t work out. First I needed a dish. I bought a small 35cm dish from passion-radio, because of its portability. Small dishes are not as good at focusing the signal, but in return they are easier to aim, since they have a wider beam. Basically nothing reasonable can receive 10GHz directly, so the next thing I needed was an LNB. It takes a block of high frequencies, and “moves it down” to lower ones. I got the othernet Bullseye, since it has good specs and reviews.
This converts the downlink signal down to a more manageable 739.5MHz (the 10489.5MHz downlink minus the LNB’s 9750MHz local oscillator). The LNB needs a bias-tee of 12V to power it. This is like a “power over ethernet injector”, for those of you who build networks. This shifting of frequencies also has the benefit that lower frequencies have lower losses while going through coax. I plugged this into a USRP B200 (if buying from ettus, you’ll need its enclosure too). There are cheaper options out there, but I bought this 10 years ago, so why not use it. I also tried a LimeSDR Mini. It worked very well too, but my B200 has a GPS disciplined oscillator, which reduces any unknowns about being on frequency. After the investment of the dish and the LNB, I brought the receive setup to the top of a tall building, to see if it worked. It was harder than I expected to point the dish. It needs to be very precisely aimed. In theory you can just use this website and a compass, but compasses don’t like nearby metal (including the metal in the dish), so I found that it was more effective to eyeball it from google maps and the alignment of nearby buildings, and hunt from there. The second problem is wind. Wind will catch the dish, and tip it over. I recommend a better plan than to simply dedicate one hand to holding on to it. Eventually I found the beacons where I expected them. The best way I found to fine tune the direction was to output FT8 as audio, and maximize the apparent volume, while making very small adjustments. Even though the B200 was GPS locked, and presumably bang on frequency, that’s not the case for the LNB. It’s not bad, but it looks like it’s a problem for very frequency sensitive modes like FT8. I used pipewire virtual audio cards to get the signal from GNU Radio into WSJT-X. There we see the problem clearly. Those yellow blocks at the bottom should be perfect rectangles, and not lean to the side. The fact that they all lean the same way tells me it’s a problem on my end, not with the person transmitting.
I have a plan for fixing this, but not today. So in short: LNB in dish - bias-tee - USRP B200. The transmit side has a different problem than the receive side. 2.4GHz is easy enough to generate using an SDR like the B200 (or LimeSDR). But the output of SDRs doesn’t have any power. Before amplifying, though, it needs to be filtered. SDRs can generate any signal in a huge frequency range, but in the end any transmitter needs an analog filter. And analog filters are physical. They are band specific. I found some nice band pass filters on aliexpress, and bought a bunch for different bands. For this project we need a 2.4GHz band pass filter. For power amplification I bought a 10W amplifier, but I’m still trying to make it work. In the meantime, I decided to use an analog devices CN0417. It’s only a 1W amplifier, but I was hoping that it could work for morse code or FT8. In the past I’ve noticed that the USRP B200 produces a pretty garbage signal (“nonlinearities” may be the word) when it has its output gain set too high. I can set the gain lower, but the 1W power amplifier only amplifies by 20dB. So I added a 30dB LNA before the power amplifier. I had only used LNAs on the receive side before, but it turns out they work very well to amplify the low power of an SDR for transmit, too. For the antenna, the best option seems to be a helix antenna, since it sits “around” the LNB, and bounces off the same dish. In short: B200 - 30dB LNA - CN0417 - Helix antenna. At this point I have the uplink and downlink signal in the B200. As one does, I created a mess of a GNU Radio graph. It only uses standard blocks. The transmit side takes audio and just sends it to the SDR. For transmit the SDR is most easily tuned directly. It just needs analog filtering and power amplification. The receive side takes the whole narrowband downlink, and lets me tune around for what I want to decode. Or save the whole spectrum to disk.
To connect to WSJT-X I used the same virtual audio cables I’ve blogged about in the past. Some fighting is needed to route the audio to the right loopback. WSJT-X doesn’t need to trigger PTT, since this setup is full duplex.

Sure, a bigger dish would help, as would putting the dish on the roof. But since I need to tweak it and have short cables for efficiency, putting it on the roof means a risk I’m not willing to take. So at least for now the setup is near ground level.

Reception works better than expected. Sure, not as clean as some other ground stations , but that’s to be expected.

For transmit, I first thought the small dish and weak amplifier were just not enough. I could not see the morse code beacons I send using RustRadio . But then I saw and heard the beacons on websdr! So they do get there.

What about FT8? As mentioned above, FT8 decoding isn’t great. Some decode, but most (by far) don’t. They sound OK, but don’t decode. I tried switching to lower sideband, in case that’s how people transmitted, but that didn’t seem to help. I think it’s all due to needing to compensate for LNB frequency instability.

I was heard, though:

So yes, I can receive stronger signals, and I can be heard with my weaker signals. I just can’t hear my own weak signals.

A bigger dish would help. I just need to figure out the best way to install it. Maybe this can wait until it’s all “perfect” and I can hire someone to install it on the roof.

More power. I need to get this 10W amplifier working. That should make me able to hear myself.

I need to figure out the downlink frequency stability. Either I can try to get an LNB with an external GPS disciplined clock, or I could use the satellite’s beacons — presumably stable — to lock on and fix the LNB instability in software. That’d be ideal. What can be done in software should be done in software.
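The software fix amounts to estimating the LNB’s frequency error (say, from the beacon) and mixing it out by multiplying the complex baseband samples with a counter-rotating phasor. A minimal sketch of the mixing step, with a constant offset (a real implementation would track the offset continuously):

```rust
// Shift complex baseband samples down by delta_hz by multiplying with
// e^(-j*2*pi*delta_hz*t). This is the standard frequency-translation
// trick; samples are (i, q) pairs.
fn shift_frequency(samples: &[(f32, f32)], delta_hz: f32, sample_rate: f32) -> Vec<(f32, f32)> {
    samples
        .iter()
        .enumerate()
        .map(|(n, &(i, q))| {
            let phase = -2.0 * std::f32::consts::PI * delta_hz * n as f32 / sample_rate;
            let (s, c) = phase.sin_cos();
            // Complex multiply: (i + jq) * (c + js)
            (i * c - q * s, i * s + q * c)
        })
        .collect()
}

fn main() {
    // A pure tone at +1000 Hz, shifted down by 1000 Hz, should land at DC.
    let rate = 48000.0;
    let tone: Vec<(f32, f32)> = (0..48)
        .map(|n| {
            let ph = 2.0 * std::f32::consts::PI * 1000.0 * n as f32 / rate;
            (ph.cos(), ph.sin())
        })
        .collect();
    let dc = shift_frequency(&tone, 1000.0, rate);
    println!("first corrected sample: {:?}", dc[0]);
}
```

With the offset derived from the beacon instead of hard-coded, the same multiply corrects the whole downlink before it ever reaches WSJT-X.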

Blargh 7 months ago

io_uring, kTLS and Rust for zero syscall HTTPS server

Around the turn of the century we started to get a bigger need for high capacity web servers. For example there was the C10k problem paper.

At the time, the kind of thing done to reduce work per request was pre-forking the web server. This means a request could be handled without an expensive process creation. Because yes, creating a new process for every request used to be something perfectly normal.

Things did get better. People learned how to create threads, making things more light weight. Then they switched to using / , in order to not just spare the process/thread creation, but the whole context switch. I remember a comment on Kuro5hin from anakata, the creator of both The Pirate Bay and the web server that powered it, along the lines of “I am select() of borg, resistance is futile”, mocking someone for not understanding how to write a scalable web server.

But / also doesn’t scale. If you have ten thousand connections, that’s an array of ten thousand integers that need to be sent to the kernel for every single iteration of your request handling loop.

Enter ( on other operating systems, but I’m focusing on Linux here). Now that’s better. The main loop is now:

All the syscalls are pretty cheap. only deals in deltas, and it doesn’t have to be re-told the thousands of active connections. But they’re not without cost. Once we’ve gotten this far, the cost of a syscall is actually a significant part of the total remaining cost. We’re here going to ignore improvements like and , and instead jump to…

Instead of performing a syscall for everything we want to do, commanding the kernel to do this or that, io_uring lets us just keep writing orders to a queue, and letting the kernel consume that queue asynchronously.

For example, we can put into the queue. The kernel will pick that up, wait for an incoming connection, and when it arrives it’ll put a “completion” into the completion queue. The web server can then check the completion queue.
If there’s a completion there, it can act on it.

This way the web server can queue up all kinds of operations that were previously “expensive” syscalls by simply writing them to memory. That’s it. And then it’ll read the results from another part of memory. That’s it.

In order to avoid busy looping, both the kernel and the web server will only busy-loop checking the queue for a little bit (configurable, but think milliseconds), and if there’s nothing new, the web server will do a syscall to “go to sleep” until something gets added to the queue. Similarly on the kernel side, the kernel will stop busy-looping if there’s nothing new, and needs a syscall to start busylooping again.

This sounds like it would be tricky to optimize, but it’s not. In the end the web server just puts stuff on the queue, and calls a library function that only does that syscall if the kernel actually has stopped busylooping.

This means that a busy web server can serve all of its queries without even once (after setup is done) needing to do a syscall. As long as queues keep getting added to, will show nothing .

Since CPUs today have many cores, ideally you want to run exactly one thread per core, bind it to that core, and not share any read-write data structure. For NUMA hardware, you also want to make sure that a thread only accesses memory on the local NUMA node. This netflix talk has some interesting stuff on NUMA and high volume HTTP delivery. The request load will still not be perfectly balanced between the threads (and therefore cores), but I guess fixing that would have to be the topic of a future post.

We will still have memory allocations though, both on the kernel and web server side. Memory allocations in user space will eventually need syscalls. For the web server side, you can pre-allocate a fixed chunk for every connection, and then have everything about that connection live there.
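The pre-allocation scheme can be sketched like this (the sizes and field names are made up for illustration; this is not tarweb’s actual layout):

```rust
// All connection state is allocated once at startup. A free list hands
// out slot indices, so accepting a connection never touches the
// allocator, and releasing one is just pushing the index back.

const MAX_CONNECTIONS: usize = 4;
const BUF_SIZE: usize = 4096;

struct Connection {
    buf: [u8; BUF_SIZE], // everything about the connection lives here
    used: usize,
}

struct Pool {
    slots: Vec<Connection>,
    free: Vec<usize>, // indices of currently free slots
}

impl Pool {
    fn new() -> Self {
        Pool {
            slots: (0..MAX_CONNECTIONS)
                .map(|_| Connection { buf: [0; BUF_SIZE], used: 0 })
                .collect(),
            free: (0..MAX_CONNECTIONS).rev().collect(),
        }
    }
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop() // None means "at capacity", not "out of memory"
    }
    fn release(&mut self, idx: usize) {
        self.slots[idx].used = 0;
        self.free.push(idx);
    }
}

fn main() {
    let mut pool = Pool::new();
    let c = pool.acquire().expect("out of connection slots");
    pool.slots[c].used = 5;
    pool.release(c);
    println!("slot {} recycled; {} slots free", c, pool.free.len());
}
```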
That way new connections don’t need syscalls, memory doesn’t get fragmented, and you don’t run the risk of running out of memory.

On the kernel side each connection will still need buffers for incoming and outgoing bytes. This may be somewhat controllable via socket options, but again it’ll have to be the subject of a future post. Try to not run out of RAM. Bad things tend to happen.

kTLS is a feature of the Linux kernel where an application can hand off the job of encryption/decryption to the kernel. The application still has to perform the TLS handshake, but after that it can enable kTLS and pretend that it’s all sent in plaintext. You may say that this doesn’t actually speed anything up, it just moves where encryption was done. But there are gains:

This means that can be used, removing the need to copy a bunch of data between user space and kernel space.

If the network card has hardware support for it, the crypto operation may actually be offloaded from the CPU onto the network card, leaving the CPU to do better things.

Another optimization is to avoid passing file descriptors back and forth between user space and kernel space. The mapping between file descriptors and io_uring apparently has overhead. So in comes descriptorless files via . Now the supposed file descriptor numbers that user space sees are just integers. They don’t show up in , and can only be used with io_uring. They’re still capped by the file descriptor limit, though.

In order to learn these technologies better, I built a web server incorporating all these things . It’s named because it’s a web server that serves the content of a single tar file.

Rust, io_uring, and kTLS. Not exactly the most common combination. I found that io_uring and kTLS didn’t play super well together. Enabling kTLS requires three calls, and io_uring doesn’t support (until they merge my PR , that is). And the crate, part of , only allows you to call the synchronous , not export the needed struct for me to pass to my new io_uring . Another PR sent .

So with those two PRs merged, it’s working great.

tarweb is far from perfect. The code needs a lot of work, and there’s no guarantee that the TLS library (rustls) doesn’t do memory allocations during handshakes. But it does serve https without even one syscall on a per request basis. And that’s pretty cool. I have not done any benchmarks yet. I want to clean the code up first.

One thing making io_uring more complex than synchronous syscalls is that any buffer needs to stay in memory until the operation is marked completed by showing up in the completion queue. For example when submitting a operation, the memory location of those bytes must not be deallocated or overwritten.

The crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either. I feel like I’m back in C++, where any mistake can blow your whole leg off. It’s a miracle that I’ve not seen a segfault.

Someone should make a crate or similar, using the powers of pinning and/or borrows or something, to achieve Rust’s normal “if it compiles, then it’s correct”. Edit: I’m not the first to point it out .

This post was discussed on HackerNews on 2025-08-22 .
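Until such a crate exists, one workable pattern is to give every submitted operation ownership of its buffer, keyed by the value that io_uring echoes back in the completion, and only drop the buffer once the completion is reaped. A sketch of that bookkeeping (not tarweb’s actual code; the id scheme is made up):

```rust
use std::collections::HashMap;

// Each submitted operation parks its owned buffer here. The kernel
// would be handed buf.as_ptr()/buf.len(); the Vec itself must stay
// untouched until the matching completion shows up.
struct InFlight {
    next_id: u64,
    buffers: HashMap<u64, Vec<u8>>,
}

impl InFlight {
    fn new() -> Self {
        InFlight { next_id: 0, buffers: HashMap::new() }
    }
    // Returns the id to attach to the submission queue entry.
    fn submit(&mut self, buf: Vec<u8>) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(id, buf);
        id
    }
    // Called when a completion with this id is reaped; dropping the
    // returned Vec is now safe.
    fn complete(&mut self, id: u64) -> Option<Vec<u8>> {
        self.buffers.remove(&id)
    }
}

fn main() {
    let mut inflight = InFlight::new();
    let id = inflight.submit(b"HTTP/1.1 200 OK\r\n\r\n".to_vec());
    // ... later, the completion for `id` arrives ...
    let buf = inflight.complete(id).expect("unknown completion");
    println!("freed {} bytes after completion", buf.len());
}
```

This is runtime discipline, not compile-time safety, which is exactly the gap a pinning/borrowing crate could close.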

Blargh 8 months ago

Exploring RISC-V vector instructions

It finally happened! A raspberry pi like device, with a RISC-V CPU supporting the extension. Aka . Aka vector instructions. I bought one, and explored it a bit.

First some background on SIMD. SIMD is a set of instructions allowing you to do the same operation to multiple independent pieces of data. As an example, say you had four 8 bit integers, and you wanted to multiply them all by 2, then add 1. You could do this with a single operation without any special instructions.

Success, right? No, of course not. This naive code doesn’t handle over/underflow, and doesn’t even remotely work for floating point data. For that, we need special SIMD instructions.

x86 and ARM have gone the way of fixed sized registers. In 1997 Intel introduced MMX , to great fanfare. The PR went all “it’s multimedia!”. “Multimedia” was a buzzword at the time. This first generation gave you a whopping 64 bit register size, that you could use for one 64-bit value, two 32-bit values, four 16-bit values, or 8 8-bit values. A “batch size” of 64 bit, if you prefer.

These new registers got a new set of instructions, to do these SIMD operations. I’m not going to learn the original MMX instructions, but it should look something like this:

So far so good. The problem with SIMD is that it’s so rigid. With MMX, the registers are 64 bits. No more, no less. Intel followed up with SSE, adding floating point support and doubling the register size to 128 bits. That’s four 32-bit floats in one register.

So now we have 64-bit registers, 128-bit registers, and uncountably many instructions to work with these two sets of new registers. Then we got , , . Then , (256 bit registers), and even (512 bit registers).

512 bit registers. Not bad. You can do 16 32-bit floating point operations per instruction with that. But here’s the problem: Only if your code was written to be aware of these new registers and instructions!
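The “single operation without any special instructions” trick mentioned above can be written as SWAR (SIMD within a register): four u8 lanes packed into one u32, each lane computed as x*2+1 at once. It also demonstrates the overflow problem, since lanes silently wrap:

```rust
// SWAR version of "multiply four bytes by 2, add 1" in one pass.
// Masking off each byte's top bit stops the shift from carrying into
// the neighboring lane; the low bit of each lane is zero after the
// shift, so OR-ing in 1 per lane is safe.
fn mul2_add1_swar(x: u32) -> u32 {
    ((x & 0x7F7F_7F7F) << 1) | 0x0101_0101
}

fn main() {
    let packed = u32::from_le_bytes([1, 2, 3, 200]);
    let out = mul2_add1_swar(packed).to_le_bytes();
    // The 200 lane wraps to 145 instead of saturating — the
    // over/underflow problem that real SIMD instructions solve.
    println!("{:?}", out); // [3, 5, 7, 145]
}
```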
If your production environment uses the best of the best, with AVX-512, you still can’t use that, if your development/QA environment only has AVX2. Not without having separate binaries.

Or you could maintain 20 different compiled versions, and dynamically choose between them. That’s what volk does. Or you could compile with (gcc) or (Rust), and create binaries that only work on machines at least as new as the one you built on.

But none of these options allow you to build a binary that will automatically take advantage of future processor improvements. Vector instructions do.

Instead of working with fixed sized batches, vector instructions let you specify the size of the data , and the CPU will do as many at once as it can. Here’s an example:

Write once, and when vector registers get bigger, your code will automatically perform more multiplies per batch. That’s great! You can use an old and slow RISC-V CPU for development, but then when you get your big beefy machine the code is ready to go full speed.

The RISC-V vector spec allows for vector registers up to 64Kib=8KiB, or 2048 32-bit floats. And with (see below), that allows e.g. 16384 32-bit floats being multiplied by 16384 other floats, and then added to yet more 16384 floats, in a single fused multiply-add instruction.

RISC-V has 32 vector registers. On a particular CPU, each register will be the same fixed size, called . But the instruction set allows us to group the registers, creating mega-registers. That’s what the in is about. If we use instead of , that gives you just four vector registers: v0, v8, v16, and v24. But in return they are 8 times as wide. The spec calls this batching number . Basically a pairwise floating point vector multiplication in mode effectively represents:

Bigger batching, at the cost of fewer registers. I couldn’t come up with a nice way to multiply two complex numbers with only four registers. Maybe you can? If so, please send a PR adding a function to my repo .
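The vector-length-agnostic loop structure described above (spec the element type, let the CPU pick the batch) can be modeled in plain Rust. VLMAX here is a stand-in for whatever the hardware reports, and the inner loop stands in for what would be a single vector instruction per pass:

```rust
// Portable model of RVV strip-mining: each iteration asks "how many
// elements can you do this pass?" (the vsetvli step), processes that
// many, and advances. The tail is handled automatically because the
// last pass just gets a smaller count.
const VLMAX: usize = 8; // stand-in for the hardware's vector length

fn mul2_add1(input: &[f32], out: &mut [f32]) {
    let mut i = 0;
    while i < input.len() {
        let vl = VLMAX.min(input.len() - i); // elements this pass
        for j in i..i + vl {
            out[j] = input[j] * 2.0 + 1.0; // one vector op in real RVV
        }
        i += vl;
    }
}

fn main() {
    // 13 elements = one full pass of 8 plus a tail of 5.
    let input: Vec<f32> = (0..13).map(|n| n as f32).collect();
    let mut out = vec![0.0; input.len()];
    mul2_add1(&input, &mut out);
    println!("{:?}", out);
}
```

On hardware with a bigger VLMAX the same loop simply takes fewer passes, which is the whole point.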
Until then, the version is the biggest batch. may still not be faster, since the version of is a little bit faster than the version in my test.

Like with SIMD, there are convenient vector instructions for loading and handling data. For example, if you have a vector of complex floats, then you probably have real and imaginary values alternating. will then load the real values in v0, and the imaginary values in v1. Or, if was called with , the real values will be in v0 through v3, and imaginary values in v4 through v7.

Curiously a “stride” load ( ), where you can do things like “load every second float”, seems to not have very good performance. Maybe this is specific to the CPU I’m using. I would have expected the L1 cache to make them fairly equal, but apparently not.

So yes, it’s not perfect. On a future CPU it may be cheap to load from L1 cache, so your code should be more wasteful about vector registers, to be optimal. Maybe on a future CPU the stride load is faster than the segmented load. There’s no way to know.

It seems that the CPU, a Ky X1, isn’t known to llvm yet. So you have to manually enable the extension when compiling. But that’s fine. I filed a bug with Rust about it, but it seems it may be a missing LLVM feature. It’s apparently not merely checking for the features in question, but needs the name of the CPU in code or something.

It seems that the vector registers (VLEN) on this hardware are 256 bit wide. This means that a single multiplication instruction can do 64 32-bit floating point operations. Multiplying two vector registers in one instruction multiplies half a kibibyte (256 bytes per aggregate register). 64 32bit operations is a lot. We started off in SIMD with just 2.

And as I say above, when future hardware gets bigger vector registers, you won’t even have to recompile.

That’s not to say that the Orange Pi RV2 is some sort of supercomputer. It’s much faster than the VisionFive 2 , but your laptop is much faster still.
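For reference, the segmented load described above does in hardware, in bulk, what this scalar deinterleave does one element at a time: split alternating real/imaginary floats into separate real and imaginary sequences.

```rust
// Model of a two-field segmented load: interleaved (re, im, re, im, ...)
// floats split into a "register" of reals and a "register" of imaginaries.
fn deinterleave(iq: &[f32]) -> (Vec<f32>, Vec<f32>) {
    let re = iq.iter().step_by(2).copied().collect();
    let im = iq.iter().skip(1).step_by(2).copied().collect();
    (re, im)
}

fn main() {
    let iq = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0];
    let (re, im) = deinterleave(&iq);
    println!("re={:?} im={:?}", re, im);
}
```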
I started a Rust crate to test this out. The “rust” version is normal Rust code. The code is Rust code where the compiler is allowed to use vector instructions. As you can see, Rust (well, LLVM) is already pretty good. But my hand coded vector assembly is faster still.

As you can see, the answer for this small benchmark is “about 3-5 times faster”. That multiplier probably goes up the more operations that you do. These benchmarks just do a couple of multiplies. Note that I’m cheating a bit with the manual assembly. It assumes that the input is an even multiple of the batch size.

To experiment with target settings, I modified the target spec. This is not necessary for normal code (you can just force-enable the extension, per above), but could be interesting to know. In my case I’m actually using it to turn vectorization off by default, since while Rust lets you enable a target feature per function, it doesn’t let you disable it per function.

I have found that the documentation for the RISC-V vector instructions is a bit lacking, to say the least. I’m used to reading specs, but this one is a bit extreme.

fails with illegal instructions when in mode. That’s strange. Works fine in through . Can we check the specs on the CPU? Doesn’t look like it. It’s a “Ky X1”. That’s all we’re told. Is it truly RVV 1.0? Don’t know. Who even makes it? Don’t know. I see guesses and assertions that it may be by SpacemiT, but they don’t list it on their website. Maybe it’s a variant of the K1? Could be. The English page doesn’t load, but the Chinese page seems to have similar marketing phrases as Orange Pi RV2 uses for its CPU.

Ah, maybe this is blocked by the spec:

The EMUL setting must be such that EMUL * NFIELDS ≤ 8, otherwise the instruction encoding is reserved. […] This constraint makes this total no larger than 1/4 of the architectural register file, and the same as for regular operations with EMUL=8.
So for some reason you can’t load half the register space in a single instruction. Oh well, I guess I have to settle for loading 256 bytes at a time. It’s a strange requirement, though. The instruction encoding allows it, but it just doesn’t do the obvious thing.

ARM uses SIMD like Intel. I vaguely remember that it also has vector instructions (SVE?), but I’ve not looked into it.

Yes, qemu has both RISC-V support and support for its vector instructions. I didn’t make precise notes when I tried this out months ago, but this should get you started.

Vector instructions are great. I wasn’t aware of this register grouping, and I love it.

An old but interesting LLVM deck from 2019

Blargh 9 months ago

Rebuilding FRR with pim6d

Short post today. Turns out that Debian, in its infinite wisdom, disables in . Here’s a short howto on how to build it fixed. Then you can enable pim6d in and restart frr.

Not that I managed to get IPv6 multicast routing to work over wireguard interfaces anyway. Not sure what’s wrong. Though it didn’t fix it, here’s an interesting command that made stuff like look like it should work:

Blargh 9 months ago

Pike is wrong on bloat

This is my response to Rob Pike’s words On Bloat .

I’m not surprised to see this from Pike. He’s a NIH extremist. And yes, in this aspect he’s my spirit animal when coding for fun. I’ll avoid using a framework or a dependency because it’s not the way that I would have done it, and it doesn’t do it quite right… for me. And he correctly recognizes the technical debt that an added dependency involves.

But I would say that he has two big blind spots.

He doesn’t recognize that not using the available dependency is also adding huge technical debt. Every line of code you write is code that you have to maintain, forever. The option for most software isn’t “use the dependency” vs “implement it yourself”. It’s “use the dependency” vs “don’t do it at all”. If the latter means adding 10 human years to the product, then most of the time the trade-off makes it not worth doing at all.

He shows a dependency graph of Kubernetes. Great. So are you going to write your own Kubernetes now?

Pike is a good enough coder that he can write his own editor (wikipedia: “Pike has written many text editors”). So am I. I don’t need dependencies to satisfy my own requirements. But it’s quite different if you need to make a website that suddenly needs ADA support, and now the EU forces a certain cookie behavior, and designers (in collaboration with lawyers) mandate a certain layout of the cookie consent screen, and the third party ad network requires some integration. And then you’re told that the database needs to be run by the database team, because there’s FIPS certification aspects that you absolutely don’t have time to work on, and data residency requirements with third party auditability feature demands (not requests).

What are you going to do? Demand funding for 100 SWE years to implement it yourself? And in the mean time, just not be able to advertise during BFCM? Not launch the product for 10 years? Just live with the fact that no customer can reach your site if they use Opera on mobile?
So yeah. The website is bloated. I feel like Pike is saying “yours is the slowest website that I ever regularly use”, to which the answer is “yeah, but you do use it regularly”. If the site hadn’t launched, then you wouldn’t be able to even choose to use it.

And comparing to the 70s. Please. Come on. If you ask a “modern coder” to solve a “1970s problem”, it’s not going to be slow, is it? They could write it in Python and it wouldn’t even be a remotely fair fight. Software is slower today not because the problems are more complex in terms of compute (yet they very very very much are), but because the compute capacity of today simply affords wasting it, in order that we are now able to solve complex problems.

Ironically, a lot of slow bloated websites (notably banks and airlines) run on mainframes with code written in… the 1970s! When supposedly men were men, and code was fast? That part we could fix. Just give me 10,000 programmer years, and I’ll have us back to square 1, except a little bit faster.

People do things because there’s a perceived demand for it. If the demand is “I just like coding”, then as long as you keep coding there’s no failure.

Pike’s technical legacy has very visible scars from these blind spots of his.

Blargh 1 years ago

Connection coalescing breaks the Internet

Connection coalescing is the dumbest idea to ever reach RFC status. I can’t believe nobody stopped it before it got this far. It breaks everything.

Thus starts my latest opinion post.

It’s specified in the RFC for HTTP/2 as connection reuse, but tl;dr: If the IP address of host A and B overlap (e.g. host A and B both resolve to 192.0.2.16), and host A presents a TLS cert that also includes B (via explicit CN/SAN or wildcard cert), then the client is allowed to send HTTP requests directed to B on the connection that was established to A. To save roundtrips and TLS handshakes.

It seems like a good idea if you don’t think about it too much.

I’ll resist just yelling “layering violation”, because that’s not helpful. Instead I’ll be more concrete. Performing connection coalescing is a client side (e.g. browser) decision. But it implicitly mandates a very strict server architecture. It assumes that ALL affected hostnames are configured exactly the same in many regards, and indeed that the HTTP server even has the config for all hostnames.

Concrete things that this breaks:

The server can’t have a freestanding TLS termination layer, that routes to HTTP servers based on SNI.

The HTTP server can’t reference count HTTP config fragments, since requests can come in for anything.

Hosts with stricter TLS config and/or mTLS cannot prevent the client from leaking headers into a less secure connection by inadvertent request smuggling. Good luck not logging secrets, while still detecting it properly.

I’m sure there are more ways that it breaks everything. It commits all servers everywhere forever to be locked in to how it works. Countless possible architectures can never be, because connection coalescing has already committed all servers into a very specific implementation.

Not really. It has a handwavy “oh the server can(!) send HTTP 421, and the client is then allowed to retry the request on a fresh connection”. But how is the server even supposed to know? This forces a HUGE restriction on the server even detecting this happening. And it’s too late! The secret requests with cookies and other secret tokens have already been leaked to the wrong server!

Not to mention that some clients don’t implement handling 421, even if it were possible for the server to detect the situation. Which it can’t, in the general case.

For any nontrivial server setup, you should probably:

Reject all requests on a connection that don’t match the first request. And “hope” that SNI matches the first request. Or better yet, verify SNI against header.

Don’t put more than one FQDN in your TLS certs, and definitely don’t use wildcard certs. “Hope” that you catch all cases.

Always use separate IP addresses per hostname. Like in the pre-SNI 1900’s. Again “hope” that you catch all cases.

And obviously hope is not a strategy. Nobody does any of these workarounds. The Internet (well, the web) will be broken forever.

Let’s pretend that you can detect all cases of misrouted requests. What do you do? The spec allows you to return 421. But it’s a free Internet, you can do whatever you want.

If you return 421 then some clients will handle this correctly. Others will have not implemented 421 handling (it’s not mandatory), and will break in some other way. (but remember. It’s already too late. The client has already sent you the secret request that may contain PII)

Arguably you should return some 5xx code, so that you can more easily detect when you’ve screwed up with your certs or other SNI routing. This assumes that you monitor for 500s, in some way. Basically the logic is that it’s better to work 0% of the time than 98% of the time, since you’ll be sure to fix the former, but won’t even know why some people keep complaining when it happens to work just fine for you.

The RFC says “[421] MUST NOT be generated by proxies”. Presumably this only means forward proxies?

“A 421 response is cacheable by default”. What does a cached 421 even mean? A 421 is a layering violation. You might as well say that a TCP SYN is cacheable.

Connection coalescing considered dumb and harmful.

https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/
https://blog.cloudflare.com/connection-coalescing-experiments/
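For the “verify SNI against the header” workaround, the check itself is simple; the hard part is plumbing the SNI from the TLS layer down to the HTTP layer at all. A sketch of the check (exact-match only; wildcard certs and IPv6 literal authorities are deliberately ignored):

```rust
// Compare the SNI the TLS layer saw against the :authority / Host
// value of each request, and answer 421 on mismatch per RFC 9113.
fn misdirected(sni: &str, authority: &str) -> bool {
    // Strip a trailing :port before comparing (IPv6 literals ignored).
    let host = authority.rsplit_once(':').map_or(authority, |(h, _)| h);
    !host.eq_ignore_ascii_case(sni)
}

fn status_for(sni: &str, authority: &str) -> u16 {
    if misdirected(sni, authority) { 421 } else { 200 }
}

fn main() {
    println!("{}", status_for("a.example.com", "a.example.com:443")); // 200
    println!("{}", status_for("a.example.com", "b.example.com")); // 421
}
```

Even with this in place, note the post’s point stands: by the time the mismatch is detected, the misdirected request (cookies and all) has already arrived.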
