Latest Posts (20 found)

Thoughts on increasing ssh security using a hardware security key

I have been using hardware security keys (including YubiKeys and Titan keys) for FIDO2 and TOTP for a while, but not for ssh. At the moment, I harden the ssh config on my servers, lock down access by IP address, and use password-protected certificates for authentication, blocking password-based authentication. So I think that I do at least reasonably well as it is. But I was interested to see if I could introduce a further aspect of security for ssh, using a security key. My security keys support the generation of both resident and non-resident keys. Resident keys are stored in a slot on the YubiKey itself, while non-resident keys are stored on the client computer but require the YubiKey to be present. I picked non-resident. I set a passphrase as part of the ssh-keygen process, so, when it comes to using that key, I need to enter that passphrase and insert and touch the security key. So now someone would need:

- to be connected to the correct network
- to have a copy of my private key
- to know the passphrase for that private key
- to have one of my security keys (my main security key or my backup security key)

I can, I think, add a PIN to the YubiKey but, to date, I have not done this. Perhaps I should. Honestly, I was probably fine without this, but, well, I had the security keys, so why not. But, while this works fine from my laptop, I can’t get it to work on my phone (GrapheneOS). At the moment, I use Termux, and from there, I can ssh in to my servers. But I can’t get Termux to use my *-sk keypair. There is a six-year-old issue in the Termux GitHub repo which indicates that support might, at some point, be coming, and that would be welcome. Apparently it can be done using a closed-source tool, but since I’m only looking to use FOSS, that’s not on the cards for me. So that is a bit of a pain, as it is convenient to be able to log in from my phone from time to time.
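As a sketch, the generation step described above uses OpenSSH's documented FIDO2 support; the file name and comment below are illustrative, and the commands only run with OpenSSH 8.2+ and a FIDO2-capable key physically present:

```shell
# Generate a non-resident ed25519-sk keypair; ssh-keygen prompts for a
# passphrase and requires a touch on the security key during generation.
ssh-keygen -t ed25519-sk -f ~/.ssh/id_ed25519_sk -C "laptop-sk"

# Optionally require user verification (a PIN) as well as touch at use time:
# ssh-keygen -t ed25519-sk -O verify-required -f ~/.ssh/id_ed25519_sk
```

Because the key is non-resident, the `~/.ssh/id_ed25519_sk` file on the client is needed as well as the hardware token itself.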


Anthropic’s New Model, The Mythos Wolf, Glasswing and Alignment

Anthropic says its new model is too dangerous to release; there are reasons to be skeptical, but to the extent Anthropic is right, that raises even deeper concerns.


Louie Mantia’s ideal burger ingredient stack

I love hearing about someone’s thoroughly considered argument for something that I’ve never given much thought to. It’s like pulling aside a curtain to discover there’s been a window with a gorgeous view behind it the whole time. Take, for instance, the order in which a burger’s ingredients should be stacked. I probably could have improvised my preferred order with a few minutes of thought. But now I don’t have to, because Louie Mantia’s already figured it out:

To me, an ideal cheeseburger has the following:

- Fluffy, toasted bun
- Grilled onion
- Processed cheese
- Crisp lettuce
- Juicy tomatoes
- Cucumber pickles
- Tangy, mayo-based sauce

The order of ingredients is important. It’s not critical, but I think this order makes a lot of sense. The sauce and veg are the cool ingredients. Your tongue should hit those first so you enjoy how fresh and crisp they are and to save you from the hot patty and melted cheese. The melted cheese sticks to the top bun. The sauce coats the bottom bun and dresses the “salad” part of the sandwich when you bite. If the cool ingredients are on the top, above the cheese, the watery vegetables sweat. The hot-cool barrier created between the patty and lettuce is the key to preventing that. The cool, raw vegetables don’t benefit from being adjacent to the hot, melted cheese.

First, an excellent rating system, and now this well-reasoned defense? I think I’m going to enjoy this blog. 🍔

HeyDingus is a blog by Jarrod Blundy about technology, the great outdoors, and other musings.


Obfuscating My Contact Email

I stumbled across this great post by Spencer Mortensen yesterday, which tested different email obfuscation techniques against real spambots to see which ones actually work. It's a fascinating read, and I'd recommend checking it out if you're into that sort of thing. The short version is that spambots scrape your HTML looking for email addresses. If your address is sitting there in plain text, they'll hoover it up. But if you encode each character as an HTML entity, the browser still renders and uses it correctly, while most bots haven't got a clue what they're looking at. From Spencer's testing, this approach blocks around 95% of harvesters, which is good enough for me.

On this site, my contact email shows up in two places:

- The Reply by email button at the bottom of every post.
- My contact page.

Both pull from the value in Pure Blog's config, so I only needed to make a couple of changes. The reply button lives in , which is obviously a PHP file. So the fix there was straightforward - I ditched the shortcode and used PHP directly to encode the address character by character into HTML entities: Each character becomes something like , which is gibberish to a bot, but perfectly readable to a human using a browser. The shortcode still gets replaced normally by Pure Blog after the PHP runs, so the subject line still works as expected.

The contact page is a normal page in Pure Blog, so it's Markdown under the hood. This means I can't drop PHP into it. Instead, I used Pure Blog's hook , which runs after shortcodes have already been processed. By that point, has been replaced with the plain email address, so all I needed to do was swap it for the encoded version: This goes in , and now any page content that passes through Pure Blog's function will have the email automatically encoded. So if I decide to publish my email elsewhere, it should automagically work.

As well as the obfuscation, I also set up my email address as a proper alias rather than relying on a catch-all to segregate emails. That way, if spam does somehow get through, I can nuke the alias, create a new one, and update it in Pure Blog's settings page. Is this overkill? Probably. But it was a fun little rabbit hole, and now I can feel smug about it. 🙃
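The post's actual implementation is PHP inside Pure Blog; as a language-neutral sketch of the same character-by-character technique (the function name and address here are illustrative):

```python
def encode_email(addr: str) -> str:
    # Encode every character as a decimal HTML entity, e.g. "a" -> "&#97;".
    # Browsers render the entities back to the original text, so a visitor
    # sees (and mailto: links use) the real address; naive scrapers see
    # no "@" or recognisable address at all.
    return "".join(f"&#{ord(c)};" for c in addr)

print(encode_email("hi@example.com"))
```

The output contains no literal `@` or dot, which is what defeats the plain-text pattern matching most harvesters rely on.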


Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?

Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090 , using code based on Sebastian Raschka 's book " Build a Large Language Model (from Scratch) ". The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset.

I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically:

- I trained a baseline model on an 8x A100 (40 GiB per GPU) machine on Lambda (which was better than my original locally-trained model, I believe due to the larger batch size that the larger machine made possible).
- I tried adding gradient clipping to see if that would help by limiting the effects of loss spikes.
- I tried removing dropout , given that these days people tend not to use it (because we're doing single-epoch training runs).
- I tried adding bias to the attention weight matrices -- something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use.
- Instead of just using the learning rate of 0.0004 that was used in the code from the book, I looked into what values people use these days, and learned how to schedule it over the course of the training run .
- Similarly, I learned more about weight decay and tried some alternative values.
- Then I tried making my model more like the original GPT-2 one by introducing weight tying to see if that would help.
- Finally, I decided to try training in "full-fat" float32 instead of using PyTorch's AMP and TF32 matrix multiplication performance enhancements.

At the end of all of that, I had this table showing the effect of each intervention in terms of loss on the test set. They're sorted from least-effective to most-effective, and you can see the baseline in there too. Winners and losers are reasonably clear:

- Weight tying, and the number for weight decay I derived from a paper by Cerebras Research (probably without understanding it properly), were negatives.
- Full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay parameter, removing dropout, and scheduling (and updating) the learning rate were positives.

So, for an optimal train, we'd just use the effective interventions, right? Well, not quite. Full-fat float32 I decided wasn't worth the effort, as it meant that the train took more than twice as long, and (because it required a larger machine) cost more than three times as much. The others did look like solid changes, but there was one concern.

The effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678. That's a 0.3% improvement. Even the best intervention, scheduling the learning rate, only improved things by 2%. Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise? They seemed small enough for that to be possible.

I've trained seven more models over the last few days to get a feel for how big an effect noise has on this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit! Let's go through the details.

When I did the original baseline training run -- creating the model that was the comparison point for all of the interventions -- I wanted to minimise the amount of random-number-induced differences between the training runs in this interventions series. I did this by setting the random seed at the start -- specifically, I had this code: At the time I wrote it, this seemed pretty complete -- the seed is set on Python's own random number generator, on PyTorch's, and on the separate ones it uses for CUDA. However, in a separate project, where I was fine-tuning a Qwen model as a classifier, I'd found that this wasn't enough. In order to get full reproducibility, I'd had to lock things down a bit more, with this additional code:

So: was my random number seed code enough for this case? Or would I get a different model if I ran the same code a second time? That was easy enough to check; I spun up a machine, and just ran the "baseline" train again. 3 hours 24 minutes later: Interestingly, that was exactly the same final train loss as the original baseline train. Here's the model . I ran my normal smoke test, asking it to complete "Every effort moves you" ...so that was OK -- the model was generating reasonably coherent text. Then I ran the eval to find its loss on the test set: Exactly the same as the original baseline! That was certainly promising.

Now, the use of three decimal places for the output from the loss eval is just a formatting thing, so I bumped it up to 6 dp, and the new model got this: Running that against the original baseline model: Again, exactly the same. Finally, more out of idle interest than anything else, I decided to see if the models were at least different: That is, quite frankly, amazing to me. I was expecting pretty close results, but what we're seeing here is that two separate models, trained on the same data, but on different machines more than a month apart, have weights that are bit-wise identical. No random noise at all. That's actually really reassuring! It makes me much more comfortable that we're standing on a stable foundation here.

Now it was time to see what effect changing that random seed would have. Let's think about what the random seed does. When we call , we're initialising Python's pseudo-random number generator so that it will start at a particular point -- after we've called it, it will generate the same sequence of "random" numbers each time it's asked for a new one. So the effect of this code: ...is to initialise three separate pseudo-random number generators to be in a known deterministic state, so they'll all generate the same sequence in every run.

So, the first thing to do was to see what happened if we changed that number. I decided to do two training runs, each with exactly the same code as the baseline, but with different random seeds. Firstly, I changed it from 42 to 22 [1]: That training run completed: Here's the model . Time for the evals; the smoke test: ...and the loss test: So, that's 3.673453 compared to 3.691526, an improvement of 0.018 over the run with a seed of 42. That's more than the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32 training), and quite close to the 0.023 improvement from adding attention weight bias.

Time for another training run: Another 3h24m later: Here's the model . The smoke test: ...and the test set loss: A further improvement! That's 0.038 better than our original baseline, which beats adding attention weight bias (though it's worse than the weight decay update).

Now, three data points is rather a small number for any kind of statistical analysis, but just out of interest, let's do the basics. GeeksForGeeks has a good refresher here if you're a bit rusty. Firstly, our mean is ...and our variance [2] is: If we take the square root of that, we get the standard deviation (SD): So, if we assume a normal distribution, what would that say about our results? Here's the results table again. If we assume that the results are on a normal distribution:

- We would expect ~68.2% of results to be within one SD of the mean -- that is, between 3.6573651 and 3.6883489. Interestingly, our actual baseline result is outside that range! But it does include both the gradient clipping and the QKV bias results.
- We would additionally expect ~95.4% of the results to be within two SDs, which is 3.6418732 to 3.7038408. That includes our baseline and our weight decay result (though not our experiment removing dropout -- the six-DP loss number for that is 3.641282).
- Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from 3.6263813 to 3.7193327. That covers all of our positive results apart from scheduling the learning rate!

That seemed a bit saddening -- were all of the results apart from scheduling the learning rate within the noise? Well, as I said, three data points is too small a number to take those results without a fistful of salt. I was thinking of perhaps trying another few random seeds to see what would happen, and perhaps to tighten those numbers up a bit, but then something occurred to me -- randomness was being used in two different ways in the training run, and perhaps we could separate them?

Where do we use the random numbers? Well, immediately after we set the seeds, we create our uninitialised model for training: One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights that we're going to start training. That means that for the same model setup , we'll always start with exactly the same weights. But if the model settings change such that we initialise different things in a different order, then we'll have different weights.

After we've done that, we go into the training loop. That can have randomness in it; although the AdamW optimiser itself is deterministic, we are (in all but one of these training runs) using dropout, which drops a random bunch of activations at various points -- 10% of them with our config. And it seems entirely possible that each of the interventions could change the order of execution of different steps in non-obvious ways, which would lead to dropout being applied in different ways in different runs.

So, the question was: what kinds of randomness -- in terms of the initial weights, or in terms of the training run -- did each intervention potentially change vs the baseline? Disregarding the full-fat float32 run:

- Gradient clipping: randomness only affected the training run -- the weights it started with would have been exactly the same as the baseline model's.
- Removing dropout: although this is a parameter on the model, I don't think it changes the initial weights. But in the training run, it certainly does affect randomness by removing its use of the random number generator.
- Adding bias to the attention weights: this will change both the initial weights -- because we have those bias weights, things will be initialised differently -- and, as a result, the training run, as the random number generator will have been sampled a different number of times prior to the run.
- Changing and scheduling the learning rate certainly should not change the initial weights, but it might conceivably have a non-obvious effect on training.
- Likewise weight decay; no effect I can see on the initial weights, but it could well change training dynamics.
- Weight-tying: when I added it to the code , I tried to do so in such a way that the other weights would be unaffected -- I created exactly the same weights as I would without weight tying, then threw away the output head and replaced it with a reference to the input embedding weights. So I think that in theory, this one won't have changed the other model weights (apart from ignoring the initialised-but-thrown-away output head), but it could well have changed the training run.

Given that, I wanted to get two measures of how sensitive to noise each phase of the training run was: the initialisation of weights at the start, and the training run itself.

I decided to start by nailing down exactly what the training run started with. We already had a baseline training run with a specific state of the random number generator at the start; in our "real" baseline, we seeded with 42 at the start, and then initialised our weights. After that, the random number generator would have reached some specific state based on its initial seed and how many numbers had been generated so far. Now, in theory, we could get the RNG into that specific state by seeding it with some number A at that point. We don't know what A is, of course. But it seems vanishingly unlikely that it would be something we'd come up with -- specifically, we can be pretty sure that A ≠ 23 and A ≠ 67 .

So, I put the old initial seed of 42 back in, but re-seeded after the model had been initialised. Firstly, with a re-seed value of 23: I let that run... and got this model . Time for the normal evals: Next, I did another training run, the same as the previous one, but with 67 instead of 23 for the re-seed: That one ran: ...producing this model , which eval'ed like this [3]: Let's bring those together:

- Our normal baseline: weights initialised with seed 42, and training run starting with a "seed" of our imaginary A value from above: 3.691526
- The first run above: weights initialised with seed 42, and training run starting with a seed of 23: 3.681356
- The second run above: weights initialised with seed 42, and training run starting with a seed of 67: 3.680505

That's a mean of ~3.684462, with a variance of ~0.0000752 and a standard deviation of ~0.008672. Those are tiny compared to the numbers from the two trains we did with the change of the seed prior to the model initialisation.

That actually surprised me a bit; we're using dropout in all of these training runs, and it's dropping a random 10% of activations in every forward training pass. With our different training run starting seeds, they should be getting very different dropout patterns. Hand-wavingly, perhaps over the three million or so sequences we're training on, it averages out? Still a little counterintuitive, though.

Anyway, let's take a look at the intervention results again, this time highlighting the ones that we believe will be starting with the same weights: Using the "99.7% should be within three SDs" heuristic, we get a range of 3.658446 - 3.710478. Of the intervention runs with (I believe) stable weights, only the no-AMP and the gradient clipping ones are within that range. That made me feel quite positive. If my beliefs are correct about which runs have the same weights, then noise in the training runs seems unlikely to be causing the differences -- that is, perhaps the results from the interventions for those same-weight training runs are real signal and not just noise.

What would happen if, instead of pinning the seed for generating the weights and varying the starting seed for the training run, we varied the weight seed and pinned the training one? We'd already done a training run with a seed of 42 before generating the weights and a re-seed to 23 after that: So I decided to see what would happen if I varied the pre-weights initialisation seed. Let that train: ...getting this model . Evals: Next, one with 67 as the weights initialisation seed: That trained: ...getting this model , and [4]: OK, so here we have those two new runs, together with the earlier run that had weights initialised with seed 42 and a training re-seed of 23 (3.681356):

- Mean: ~3.673215
- Variance: ~0.000145
- SD: ~0.012062

Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not too far off. Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking at the table again, this time with the ones that we don't expect to have the same weights highlighted: ...we can see that the QKV bias is well within that range (as are all of the interventions apart from the two negative-effect ones and scheduling the learning rate).

Right, what does all of that tell us? This post obviously isn't even trying to be statistically rigorous. The number of training runs I've done and the amount of data is way too small for that. However, training runs are expensive (Lambda have raised their prices again, so these cost more than US$50 each!), so there's a limit to how much I can do. But even with the limited amount of data, something seems pretty clear -- "one of these things is not like the others":

- Varying the random seed at the start, prior to initialising weights, and not constraining the starting point for the training runs, gave a mean of 3.672857, with an SD of 0.0154919.
- Keeping the same seed for model weights (so that they all started with the same weights), and varying the seed for the training run, gave a mean of 3.684462, with an SD of 0.008672.
- Varying the seed for the model weights (so that they all started with different weights), and keeping the training run seed pinned, gave a mean of 3.673215 and an SD of 0.012062.

Keeping the model weights stable and only allowing variation in randomness across the training run itself meant that almost all of the differences between training runs disappeared. Could this be a result of the small number of samples? I guess conceivably it might, but it seems vanishingly unlikely. So I feel reasonably confident in saying that the bulk of the variation in results that we can chalk up to random noise in these training runs comes from variations in the model weights' initialisation.

Additionally, the first training run in this post -- the re-run of the baseline model with no changes -- gave exactly the same numbers as the original baseline run. So we can be confident that all of the models with no changes to the weight initialisation started with the same weights. Of course, I could be wrong about which models really did have the same weights, but given that they were running the same code with the same seed, I'm pretty sure.

That makes me fairly confident that the intervention runs that had the same initial weights gave a real signal about whether or not the intervention in question actually helped. The only exception is gradient clipping, which fell within the three-SD range for the same-weights tests -- and it's essentially free, adding just 100 seconds to a three-hour training run.

That's a really interesting result! As I said earlier, given that dropout is making us ignore a random 10% of activations during the training run, I would have thought that changing which random 10% were being ignored would have a much larger effect. And that's not even considering other sources of random noise in the training run. I was less surprised that model weight initialisation was important, though. It's pretty obvious that your starting position in the loss landscape is going to affect where you end up at the end of the training run.

Still, we now have a reasonable level of trust that our interventions gave a real signal, so I think we have everything in place to see how they stack together, and do a best-effort training run. Can we approach the original GPT-2 small weights' performance on our test set loss? It should be fun to find out :-)

[1] Numbers chosen based on a misremembering of this XKCD . For some reason (perhaps because it rhymes) I thought that the old-timey funny number thing was "22 skidoo" rather than "23 skidoo".  ↩

[2] On working through this later: with n samples from a dataset, it is (as I understand it) best to use n − 1 as the denominator here (Bessel's correction) for the "sample variance". If we had every possible value, then it would be correct to use n . However, while this changes a few details in the analysis, I don't think it changes the final conclusion of the post meaningfully (it would just bump up the SDs by 22% or so), so I've left it as-is.  ↩

[3] I found it interesting that this model does the "you and I" hypercorrection that so many people do when trying to write formally! Based on the (correct) correction of "me and you move back home" to "you and I move back home", I think as a result of excessive pattern-matching.  ↩

[4] Another grammatical error based on pattern-matching -- it would make sense that the possessive form of "it" in English was "it's", just like the possessive form of "John" is "John's".  ↩
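The denominator point in footnote [2] is easy to check numerically. A small sketch using Python's statistics module, with the three same-initial-weights losses quoted in the post (baseline, re-seed 23, re-seed 67); everything else here is illustrative:

```python
import statistics

# Test-set losses for the three runs that shared initial weights
# (baseline, training re-seed 23, training re-seed 67).
losses = [3.691526, 3.681356, 3.680505]

mean = statistics.mean(losses)
population_sd = statistics.pstdev(losses)  # denominator n
sample_sd = statistics.stdev(losses)       # denominator n - 1 (Bessel's correction)

# With n = 3, Bessel's correction inflates the SD by sqrt(3/2), about 22% --
# matching the "bump up the SDs by 22% or so" estimate in footnote [2].
print(mean, population_sd, sample_sd)
```

Which denominator to use doesn't change the direction of any of the comparisons above, only the width of the 1/2/3-SD bands.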


Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me

Anthropic didn't release their latest model, Claude Mythos ( system card PDF ), today. They have instead made it available to a very restricted set of preview partners under their newly announced Project Glasswing . The model is a general purpose model, similar to Claude Opus 4.6, but Anthropic claim that its cyber-security research abilities are strong enough that they need to give the software industry as a whole time to prepare. Mythos Preview has already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser . Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. Project Glasswing partners will receive access to Claude Mythos Preview to find and fix vulnerabilities or weaknesses in their foundational systems—systems that represent a very large portion of the world’s shared cyberattack surface. We anticipate this work will focus on tasks like local vulnerability detection, black box testing of binaries, securing endpoints, and penetration testing of systems. There's a great deal more technical detail in Assessing Claude Mythos Preview’s cybersecurity capabilities on the Anthropic Red Team blog: In one case, Mythos Preview wrote a web browser exploit that chained together four vulnerabilities, writing a complex  JIT heap spray  that escaped both renderer and OS sandboxes. It autonomously obtained local privilege escalation exploits on Linux and other operating systems by exploiting subtle race conditions and KASLR-bypasses. And it autonomously wrote a remote code execution exploit on FreeBSD's NFS server that granted full root access to unauthenticated users by splitting a 20-gadget ROP chain over multiple packets. Plus this comparison with Claude 4.6 Opus: Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. 
But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more. Saying "our model is too dangerous to release" is a great way to build buzz around a new model, but in this case I expect their caution is warranted. Just a few days ago ( last Friday ) I started a new ai-security-research tag on this blog to acknowledge an uptick in credible security professionals sounding the alarm on how good modern LLMs have got at vulnerability research. Greg Kroah-Hartman of the Linux kernel: Months ago, we were getting what we called 'AI slop,' AI-generated security reports that were obviously wrong or low quality. It was kind of funny. It didn't really worry us. Something happened a month ago, and the world switched. Now we have real reports. All open source projects have real reports that are made with AI, but they're good, and they're real. Daniel Stenberg: The challenge with AI in open source security has transitioned from an AI slop tsunami into more of a ... plain security report tsunami. Less slop but lots of reports. Many of them really good. I'm spending hours per day on this now. It's intense. And Thomas Ptacek published Vulnerability Research Is Cooked , a post inspired by his podcast conversation with Anthropic's Nicholas Carlini. Anthropic have a 5-minute talking-heads video describing the Glasswing project. Nicholas Carlini appears as one of those talking heads, where he said (highlights mine): It has the ability to chain together vulnerabilities. So what this means is you find two vulnerabilities, either of which doesn't really get you very much independently.
But this model is able to create exploits out of three, four, or sometimes five vulnerabilities that in sequence give you some kind of very sophisticated end outcome. [...] I've found more bugs in the last couple of weeks than I found in the rest of my life combined . We've used the model to scan a bunch of open source code, and the thing that we went for first was operating systems, because this is the code that underlies the entire internet infrastructure. For OpenBSD, we found a bug that's been present for 27 years, where I can send a couple of pieces of data to any OpenBSD server and crash it . On Linux, we found a number of vulnerabilities where as a user with no permissions, I can elevate myself to the administrator by just running some binary on my machine. For each of these bugs, we told the maintainers who actually run the software about them, and they went and fixed them and have deployed the patches so that anyone who runs the software is no longer vulnerable to these attacks. I found this on the OpenBSD 7.8 errata page :

025: RELIABILITY FIX: March 25, 2026. All architectures. TCP packets with invalid SACK options could crash the kernel. A source code patch exists which remedies this problem.

I tracked that change down in the GitHub mirror of the OpenBSD CVS repo (apparently they still use CVS!) and found it using git blame: Sure enough, the surrounding code is from 27 years ago. I'm not sure which Linux vulnerability Nicholas was describing, but it may have been this NFS one recently covered by Michael Lynch . There's enough smoke here that I believe there's a fire. It's not surprising to find vulnerabilities in decades-old software, especially given that they're mostly written in C, but what's new is that coding agents run by the latest frontier LLMs are proving tirelessly capable at digging up these issues.
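As a sketch of that kind of archaeology with git blame (the file path and line range below are illustrative, not the actual ones from the post):

```shell
# Clone the GitHub mirror of the OpenBSD tree, then ask git blame which
# commit last touched the lines around the SACK-handling code.
git clone https://github.com/openbsd/src.git
cd src

# -L restricts blame to a line range; each output line shows the commit,
# author, and date that introduced it.
git blame -L 1000,1040 sys/netinet/tcp_input.c
```

The commit dates in the blame output are what let you say "this code is 27 years old" without reading the full history.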
I actually thought to myself on Friday that this sounded like an industry-wide reckoning in the making, and that it might warrant a huge investment of time and money to get ahead of the inevitable barrage of vulnerabilities. Project Glasswing incorporates "$100M in usage credits ... as well as $4M in direct donations to open-source security organizations". Partners include AWS, Apple, Microsoft, Google, and the Linux Foundation. It would be great to see OpenAI involved as well - GPT-5.4 already has a strong reputation for finding security vulnerabilities and they have stronger models on the near horizon. The bad news for those of us who are not trusted partners is this: We do not plan to make Claude Mythos Preview generally available, but our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. To do so, we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs. We plan to launch new safeguards with an upcoming Claude Opus model, allowing us to improve and refine them with a model that does not pose the same level of risk as Mythos Preview. I can live with that. I think the security risks really are credible here, and having extra time for trusted teams to get ahead of them is a reasonable trade-off. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views

Russia Hacked Routers to Steal Microsoft Office Tokens

Hackers linked to Russia’s military intelligence units are using known flaws in older Internet routers to mass harvest authentication tokens from Microsoft Office users, security experts warned today. The spying campaign allowed state-backed Russian hackers to quietly siphon authentication tokens from users on more than 18,000 networks without deploying any malicious software or code. Microsoft said in a blog post today it identified more than 200 organizations and 5,000 consumer devices that were caught up in a stealthy but remarkably simple spying network built by a Russia-backed threat actor known as “ Forest Blizzard .” How targeted DNS requests were redirected at the router. Image: Black Lotus Labs. Also known as APT28 and Fancy Bear, Forest Blizzard is attributed to the military intelligence units within Russia’s General Staff Main Intelligence Directorate (GRU). APT28 famously compromised the Hillary Clinton campaign, the Democratic National Committee, and the Democratic Congressional Campaign Committee in 2016 in an attempt to interfere with the U.S. presidential election. Researchers at Black Lotus Labs , a security division of the Internet backbone provider Lumen , found that at the peak of its activity in December 2025, Forest Blizzard’s surveillance dragnet ensnared more than 18,000 Internet routers, most of them unsupported, end-of-life devices or far behind on security updates. A new report from Lumen says the hackers primarily targeted government agencies—including ministries of foreign affairs, law enforcement, and third-party email providers. Black Lotus Security Engineer Ryan English said the GRU hackers did not need to install malware on the targeted routers, which were mainly older Mikrotik and TP-Link devices marketed to the Small Office/Home Office (SOHO) market. Instead, they used known vulnerabilities to modify the Domain Name System (DNS) settings of the routers to include DNS servers controlled by the hackers. 
As the U.K.’s National Cyber Security Centre (NCSC) notes in a new advisory detailing how Russian cyber actors have been compromising routers, DNS is what allows individuals to reach websites by typing familiar addresses, instead of associated IP addresses. In a DNS hijacking attack, bad actors interfere with this process to covertly send users to malicious websites designed to steal login details or other sensitive information. English said the routers attacked by Forest Blizzard were reconfigured to use DNS servers that pointed to a handful of virtual private servers controlled by the attackers. Importantly, the attackers could then propagate their malicious DNS settings to all users on the local network, and from that point forward intercept any OAuth authentication tokens transmitted by those users. DNS hijacking through router compromise. Image: Microsoft. Because those tokens are typically transmitted only after the user has successfully logged in and gone through multi-factor authentication, the attackers could gain direct access to victim accounts without ever having to phish each user’s credentials and/or one-time codes. “Everyone is looking for some sophisticated malware to drop something on your mobile devices or something,” English said. “These guys didn’t use malware. 
They did this in an old-school, graybeard way that isn’t really sexy but it gets the job done.” Microsoft refers to the Forest Blizzard activity as using DNS hijacking “to support post-compromise adversary-in-the-middle (AiTM) attacks on Transport Layer Security (TLS) connections against Microsoft Outlook on the web domains.” The software giant said while targeting SOHO devices isn’t a new tactic, this is the first time Microsoft has seen Forest Blizzard using “DNS hijacking at scale to support AiTM of TLS connections after exploiting edge devices.” Black Lotus Labs engineer Danny Adamitis said it will be interesting to see how Forest Blizzard reacts to today’s flurry of attention to their espionage operation, noting that the group immediately switched up its tactics in response to a similar NCSC report (PDF) in August 2025. At the time, Forest Blizzard was using malware to control a far more targeted and smaller group of compromised routers. But Adamitis said the day after the NCSC report, the group quickly ditched the malware approach in favor of mass-altering the DNS settings on thousands of vulnerable routers. “Before the last NCSC report came out they used this capability in very limited instances,” Adamitis told KrebsOnSecurity. “After the report was released they implemented the capability in a more systemic fashion and used it to target everything that was vulnerable.” TP-Link was among the router makers facing a complete ban in the United States. But on March 23, the U.S. Federal Communications Commission (FCC) took a much broader approach, announcing it would no longer certify consumer-grade Internet routers that are produced outside of the United States. The FCC warned that foreign-made routers had become an untenable national security threat, and that poorly-secured routers present “a severe cybersecurity risk that could be leveraged to immediately and severely disrupt U.S. critical infrastructure and directly harm U.S. 
persons.” Experts have countered that few new consumer-grade routers would be available for purchase under this new FCC policy (besides maybe Musk’s Starlink satellite Internet routers, which are produced in Texas). The FCC says router makers can apply for a special “conditional approval” from the Department of War or Department of Homeland Security, and that the new policy does not affect any previously-purchased consumer-grade routers.
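The core mechanism in the campaign described above is simple enough to sketch: whoever controls the resolver a router hands out decides where every hostname on that network resolves. A toy illustration (not attack tooling; the resolvers are plain dictionaries and the IP addresses are documentation-range placeholders, not real infrastructure):

```python
# Toy model of DNS hijacking via router compromise. In the real attack the
# router's DHCP-advertised resolver was changed, so every client on the LAN
# silently started asking an attacker-controlled server for answers.

LEGIT_RESOLVER = {"outlook.office.com": "198.51.100.10"}    # the real service
HIJACKED_RESOLVER = {"outlook.office.com": "203.0.113.66"}  # attacker's AiTM proxy


def resolve(hostname: str, resolver: dict) -> str:
    """Stand-in for a DNS lookup against whatever resolver the router configured."""
    return resolver[hostname]


# The victim types the same familiar hostname either way; only the router's
# DNS setting determines whether they reach Microsoft or the interception proxy
# that harvests their post-MFA OAuth tokens.
victim_connects_to = resolve("outlook.office.com", HIJACKED_RESOLVER)
```

The point of the sketch is that nothing malicious ever runs on the victim's machine: the hostname, the login flow, and the MFA prompt all look normal, and only the answer to the lookup has changed.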

0 views
neilzone Yesterday

Sex and the Fedi

Over the weekend, Girl on the Net - an esteemed sex blogger who, incidentally, happens to be one of the smartest, strongest, and downright loveliest people that I know - tooted : If you ever get sick of me banging on about my life and think ‘ugh I wish she would stick to the porn’ then please know: hardly anyone ever boosts the … porn. And this made me think. I had an engaging conversation with numerous people about it, and I still don’t have good answers, but I enjoyed the discussion and wanted to keep a note of it. This is that note. I follow and chat with quite a lot of sex positive / sex work-related people in the fediverse, and many have expressed similar sentiments. They create, they share, they get “likes” - and, of course, ample criticism - but very few boosts / shares. It must be incredibly demoralising. (I am in a different position in that I neither know nor care how many views my blogposts get .) It made me ponder why people do not share sex-related content, when sex is clearly part of life for many (but not all) people. My thoughts were: stigma about sex as pleasure. It’s fine to have sex, but not to talk about it. One of Girl on the Net’s regular themes is about communication, and simply asking questions (not just about sex, but also including about sex and one’s preferences and horizons). But I imagine that, for some, talking about sex is uncomfortable, including sharing other people talking about sex. concerns relating to professional expectations and obligations. I fall into this category. I am sex positive, but I do not know where the Solicitors Regulation Authority would draw the line, and I don’t wish to be even close to where that line might be. So I play it safe, even though there is stuff that I would like to post or share. But, oh well, self-censorship ftw. Sometimes, I would love not to be “me” online . being embarrassed about what others here might think. Similar, but different, to the points above. 
This is about other fedizens, who might be co-workers, employers, family members, or whatever. sex as being in the sphere of one’s private life. older people, perhaps especially men, being self-aware of engaging with younger adults posting sex-related stuff, and coming across as creepy. I completely get this, and I am somewhat paranoid about it myself. Several people responded to say that, yes, they felt like this. They might want to engage with public content (and I’m not talking about responding lasciviously, or sending dick pics), but do not want to be perceived as being inappropriate. I received some thought-provoking feedback too: women and non-binary people said that they felt unsafe boosting or posting sex-related content, because of reactions from men hitting on them. That, by posting about sex, some men took it as an unwelcome opportunity to solicit sex with them. some people not wanting to boost as they feel that they don’t have enough followers to make it worthwhile. And, in terms of increasing the distribution of a toot, yes, that makes sense. It probably still sends a nice endorphin boost to the poster though, that someone likes their work enough to want to boost it :) Where someone has a popular “main” account, and a less popular “alt” account, but would only be willing/able to post sex-related stuff via that alt, this perhaps comes into play. just not liking the stuff enough to boost it. Fair enough! concerns over whether their server rules allow boosting of this kind of content, and not wanting to get blocked / banned. I can understand each of these, and why they might lead to a “like” rather than a “boost”. None of them inhibit paying or tipping someone, as a thank you for their work though, which is another way of being supportive. But this also comes against a backdrop of increasing difficulties for sex workers and other people who post sex-related stuff. Payment processors denying income streams. 
Platform operators enforcing their ever more restrictive morality rules, making working harder, and requiring more admin just to keep going. If people take, take, take, without giving back in some meaningful way, then that is challenging even for those who create and share for fun (for appreciation, perhaps, rather than tooting into the void), let alone those for whom this is their livelihood. I wish that I had better answers than I do.

0 views
Kev Quirk Yesterday

Why Have a Dedicated Music Device?

In the last year or so I've read about many people moving from streaming services, like Apple Music and Spotify, to their own music library. To support these local libraries, many seem to be getting themselves a music player, such as the Fiio Echo Mini . While moving to a local library is something that I've thought about many times 1 , I don't understand why people are buying these little music players. The big selling points generally seem to be:

- Bluetooth connectivity so you can use with buds, or in your car.
- Plenty of local storage.
- Audio jack.
- Easy to drag and drop music.

With the exception of the 3rd point, pretty much every smartphone on the market will do all of this. And let's be honest, #3 doesn't really matter as most people use Bluetooth buds these days. Yes, I know some people still use old school wired earphones. I don't need an email from you. So if the device that's already in your pocket will do everything these little music players will already do, why get an extra device to lug around everywhere? I want to stress, these look really cool, and if that's why you want one, that's totally fine. But anecdotally, that's not what I'm seeing. Can someone enlighten me? I see the advantages of owning your own music library, but I don't get why people want to carry another device everywhere. I've decided to stick with streaming, but that's a post for another day.  ↩ Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email , or leave a comment .

0 views

FlexGuard: Fast Mutual Exclusion Independent of Subscription

FlexGuard: Fast Mutual Exclusion Independent of Subscription Victor Laforet, Sanidhya Kashyap, Călin Iorgulescu, Julia Lawall, and Jean-Pierre Lozi SOSP'25 This paper presents an interesting use of eBPF to effectively add an OS feature: coordination between user space locking code and the kernel thread scheduler to improve locking performance. The paper describes most lock implementations as spin-then-park locks (e.g., busy wait in user space for some time, then give up and call the OS to block the waiting thread). A big problem with busy waiting is the performance cliff under oversubscription . Oversubscription occurs when there are more active threads than cores. In this case, busy waiting can be harmful, because it wastes CPU cycles when there is other useful work to do. The worst case occurs when a thread acquires a lock and then is preempted by the OS scheduler while many other threads are busy waiting. If the OS thread scheduler were smart, it would preempt one of the busy waiters and let the lock holder keep running. But alas, that level of coordination isn’t available … until now. In the good old days, researchers would have modified Linux scheduling code and tested their modified kernel. The modern (easier) way to achieve this is to use eBPF. The authors wrote an eBPF program that runs (in kernel space) each time a context switch occurs. This program is called the Preemption Monitor . The Preemption Monitor works in conjunction with a custom user space lock implementation. The net result is that the Preemption Monitor can reliably detect when the OS scheduler preempts a thread that is holding a lock. When this occurs the eBPF program writes information to a variable that user space code can read. The locking algorithm is as follows: First, try to acquire the lock with a simple atomic compare-and-swap. If that fails, then busy wait. 
Similar to Hapax locks , this busy waiting avoids contention on one cache line by forcing all threads to agree on the order they will acquire the lock and letting each thread spin on per-thread variables. During busy waiting, the variable written by the Preemption Monitor is checked. If this variable indicates that there currently exists a thread which has acquired a lock and has been preempted by the OS, then threads stop busy waiting and instead call the OS to block until the lock is released (using the same system call that a futex would use). Fig. 2 has performance results. The x-axis shows thread count (which varies over time). The green line is FlexGuard. The idea is that it gives great performance when there is no oversubscription (i.e., fewer than 150 threads) and offers performance similar to a purely blocking lock (the dark blue line) when there is oversubscription. Source: https://dl.acm.org/doi/10.1145/3731569.3764852 Dangling Pointers This problem seems ripe for overengineering. In some sick world, the compiler, OS, and hardware could all coordinate to support a “true critical section”. All pages accessed inside this critical section would be pinned into main memory (or even closer to the CPU), and the OS would try extremely hard not to preempt threads inside of the critical section. This would require some upper bound on the critical section working set and running time.
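The acquire path described in the paper can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the preemption-monitor flag is a plain attribute (in FlexGuard it is memory written by an eBPF program on every context switch), a `threading.Lock` stands in for the atomic compare-and-swap, and a `Condition` with a timeout stands in for the futex-style park/wake.

```python
import threading


class FlexGuardSketch:
    """Spin-then-park lock whose waiters stop spinning when a simulated
    preemption monitor reports that a lock holder was descheduled."""

    def __init__(self):
        self._held = False
        self._cas_lock = threading.Lock()     # stands in for an atomic CAS
        self._parked = threading.Condition()  # futex-like parking lot
        self.holder_preempted = False         # written by the "monitor"

    def _try_acquire(self) -> bool:
        with self._cas_lock:
            if not self._held:
                self._held = True
                return True
            return False

    def acquire(self):
        if self._try_acquire():               # fast path: one CAS
            return
        while True:
            if self.holder_preempted:
                # Monitor says spinning is wasted cycles: park like a
                # futex wait instead of burning CPU the holder needs.
                with self._parked:
                    self._parked.wait(timeout=0.01)
            if self._try_acquire():
                return

    def release(self):
        with self._cas_lock:
            self._held = False
        with self._parked:
            self._parked.notify_all()         # futex-style wake
```

The sketch omits the ordered per-thread spin variables (the Hapax-style queue) and simply busy-loops, but it shows the key decision: waiters consult the monitor-written flag to choose between spinning and blocking.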

0 views
Stratechery Yesterday

Anthropic’s New TPU Deal, Anthropic’s Computing Crunch, The Anthropic-Google Alliance

Anthropic needs compute, and Google has the most: it's a natural partnership, particularly for Google.

0 views
Anton Sten Yesterday

What a UX strategy is — and why most teams should write one

A UX strategy is a short document that says what good looks like for the people using your product, and how the team plans to deliver it. That's it. No frameworks, no McKinsey decks, no 113 slides. When I join a project, one of the first things I do is ask if there's a UX strategy in place. Most of the time there isn't. Sometimes there's a brand book, or a product roadmap, or a Notion page someone wrote a year ago and then forgot. Rarely is there a document that actually says: this is what we mean by a good experience, and this is how we're going to get there. It's not that teams don't care. They almost always do. It's that nobody's written it down, so the care gets spent in fifty different directions and the product ends up feeling like a committee made it. Which, in a way, it did. ## What a UX strategy actually is The phrase trips people up, so I usually pull the two words apart. **UX** is what someone experiences when they use your product. Not what the product does — what it feels like to use. Two apps can have identical feature lists and feel completely different. iPhone and Android. Notion and Confluence. Linear and Jira. The features are the same on paper. The experience isn't. **Strategy**, stripped of the consultant baggage, is three questions. Where are we now. Where do we want to be. How do we get there. That's the whole shape of it. Everything else is detail. Put them together and a UX strategy is a document that answers those three questions specifically about the experience you're building. What's the experience like today, what do you want it to feel like, and what are you actually going to do to close the gap. It's not a deliverable for clients. It's not a marketing document. It's a working tool the team uses to make decisions when nobody's in the room to ask. ## What goes in one I've written a lot of these over the years and the contents vary, but the bones are usually the same. **Where you are now.** A short, honest snapshot. 
Who your users are, what they're trying to do, where they get stuck, and how the current experience compares to alternatives. Not a research report — a summary the team can hold in their heads. If it's longer than two pages, it's not doing its job. **Where you want to be.** This is the part most teams skip, because it requires picking a direction and sticking to it. Not goals like "improve the user experience" — that's not a goal, that's a wish. Specific principles you can actually hold a design decision against. We'll come back to those in a minute, because they're the part that does the most work. **How you'll get there.** The practical bit. What's going to change. Who's going to do it. What you'll stop doing to make room. I'm partial to this section because it's the part most strategies leave out, and it's the reason most strategies don't survive contact with real work. A direction without a plan is a wish list. Length-wise, a good UX strategy is short. A page can be enough. Two is plenty. Anything longer and people will paste it into Claude and ask for the summary — which means the summary is the real strategy, and you wrote the rest for nothing. ## Goals are principles, not action items The most important section in any UX strategy is the one defining what good looks like. And the best way I've found to do that is to write principles, not features. Principles are desired outcomes. They sound like sentences, not roadmap items. Some I've used over the years: **Design for everyone.** Build for the eighty percent, not the loudest twenty. Every team I've worked with has someone — a stakeholder, a power user, a vocal customer — who keeps asking for the next feature. Most of those features serve almost nobody. A principle like this gives the team something to point at when the request comes in, instead of just saying no and feeling bad about it. **Optimize for speed.** Most products are judged on how fast they feel before they're judged on anything else. 
Not literal load time — perceived speed. How quickly something responds. How few steps it takes. People will forgive almost anything if the product feels fast. **Different is good.** Make the important thing obvious. The primary action should be visually distinct, placed where people expect it, and impossible to miss. Insecurity is the root of bad user experiences. If the user is wondering what to do next, you've already failed. **Always start with what's familiar.** Your users spend most of their time in other apps, not yours. Look at the patterns they already know. Borrow shamelessly from the conventions of the industry you're in. Familiarity isn't a lack of imagination — it's respect for the user's time. These are just examples. Yours will be different, and they should be. The point isn't the specific principles, it's the form. Write things you can hold a real design decision against on a Tuesday afternoon when nobody's watching. ## Why this matters more now For most of my career, UX strategy was the kind of document large teams wrote because they could afford to. Smaller teams skipped it. They were busy shipping, and shipping was the hard part. Shipping isn't the hard part anymore. The cost of building has collapsed. Anyone can put a working product on the internet in a weekend. Tools write half the code. AI handles the parts that used to take a junior designer a week. The bottleneck used to be execution, and execution is now nearly free. Which means the hard part is the part that was always hard but easier to ignore: knowing what's worth building, who it's for, and what good would even look like. When making things was expensive, you had to be careful before you started. Now you can start anything, which is exactly why so many teams are shipping a lot of things that nobody needed. A UX strategy used to be a luxury. It's becoming the thing that separates teams who ship useful work from teams who ship a lot of work. 
I've written about this from a couple of different angles — [vibe coding for designers](https://www.antonsten.com/articles/vibe-coding-for-designers/) covers what changes when designers can build, and [simple is hard](https://www.antonsten.com/articles/simple-is-hard/) is about why restraint is the harder discipline. A UX strategy is the document that makes restraint possible. ## What a strategy actually does The thing nobody tells you about UX strategies is that the document itself isn't really the point. The point is the conversations you have while writing it. The disagreements that surface. The assumptions that turn out not to be shared. The "wait, is *that* what we're optimizing for?" moments that happen when you try to put it on paper. The strategy is the artifact. The alignment is the work. When I look back at the projects where the team shipped well and the ones where it didn't, the difference was almost never talent or budget. It was whether the team agreed on what they were actually trying to do. Sometimes that agreement existed without a document. More often it didn't, and the document was what made it real. If you're on a team without a UX strategy, you don't need a long one. You don't need a template. You don't even need to call it a strategy if the word makes people roll their eyes. You need a few pages that say what good looks like, what you're going to do about it, and what you're going to stop doing to make room. Then you need everyone on the team to actually read it. The surprising thing isn't that most teams don't have a UX strategy. It's that most of them are doing fine without one, until suddenly they aren't. A strategy is what you wish you'd written before things got hard. *I wrote a chapter on UX strategy in [Products People Actually Want](https://www.antonsten.com/books/products-people-actually-want/) — if you want the longer version.*

0 views

Rules and Gates

One of the first things I stumbled into as I was building what became Superpowers was the idea of a "gate" in a prompt. It's a term that Claude Code introduced me to, so it was clearly part of the literature already. But the basic idea is that a gate is something that must be completed before moving on to the next step. Gates work really, really well. But I never thought much about why they work. Today, I finally asked one of my agents how they thought about the difference between rules and gates. The difference: a rule has an opt-out path (I can rationalize "I'll do it after this one thing"). A gate doesn't — the next action is blocked until the gate condition is met. Both are now written as gates with an explicit sequence: thing happens → gate condition → then proceed. Here's a concrete example from an agent today: Before (rule): "Verify claims with web research before asserting them." That's a rule. It lives in my head. When I was answering your question about whether audience assessment books exist, the rule didn't fire — I synthesized from training data, it felt confident, I stated it as a finding. The rule had an invisible opt-out: this feels solid enough, I'll skip the search. After (gate): "When a claim about what exists or doesn't exist is forming → web search happens → URLs in hand → then I speak." The gate makes the sequence explicit and gives me something to check: do I have URLs? If no, I haven't cleared the gate. "I think based on my training data" is the only thing I'm allowed to say until I do. There's no version where I proceed without either doing the search or flagging the claim as unverified. The practical test of whether something is a gate rather than a rule: when I'm about to skip it, does the gate formulation give me a concrete question I can't answer? "Do I have URLs?" is concrete. "Did I verify this?" is too easy to answer yes to without having actually done anything. 
Gates are conditions that an agent can evaluate objectively to figure out if they're allowed to finish a task and move on to the next one. They're an incredibly powerful framing for prompting that can dramatically reduce all kinds of agentic misbehavior.
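One way to see why the gate formulation works is to write it as a predicate the code literally cannot proceed past. A hypothetical sketch (the decorator, the URL gate, and the strings are invented for illustration, not from any agent framework):

```python
# A "rule" is advice an agent can rationalize around; a "gate" is a concrete
# predicate that blocks the next action until it evaluates true.

def gated(condition, fallback):
    """Run the wrapped action only once the gate condition holds;
    otherwise return the mandated fallback."""
    def wrap(action):
        def run(*args, **kwargs):
            if not condition():
                return fallback
            return action(*args, **kwargs)
        return run
    return wrap


# Gate: "do I have URLs in hand?" -- concrete and objectively checkable,
# unlike the rule "did I verify this?".
urls: list[str] = []

@gated(lambda: len(urls) > 0,
       fallback="I think, based on my training data...")
def state_finding():
    return f"Verified finding, sources: {urls}"


print(state_finding())  # gate not cleared: only the hedged answer is allowed
urls.append("https://example.com/evidence")
print(state_finding())  # gate cleared: the claim may be stated with sources
```

The design point matches the post: the gate condition (`len(urls) > 0`) is something the agent can evaluate without judgment, so there is no "this feels solid enough" escape hatch.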

0 views
Jim Nielsen Yesterday

Prototyping with LLMs

Did you know that Jesus gave advice about prototyping with an LLM? Here’s Luke 14:28-30: Suppose one of you wants to build a tower. Won’t you first sit down and estimate the cost to see if you have enough money to complete it? For if you lay the foundation and are not able to finish it, everyone who sees it will ridicule you, saying, ‘This person began to build and wasn’t able to finish.’ That pretty much sums me up when I try to vibe a prototype . Don’t get me wrong, I’m a big advocate of prototyping . And LLMs make prototyping really easy and interesting. And because it’s so easy, there’s a huge temptation to jump straight to prototyping. But what I’ve been finding in my own behavior is that I’ll be mid-prototyping with the LLM and asking myself, “What am I even trying to do here?” And the thought I have is: “I’d be in a much more productive place right now if I’d put a tiny bit more thought upfront into what I am actually trying to build.” Instead, I just jumped right in, chasing a fuzzy feeling or idea only to end up in a place where I’m more confused about what I set out to do than when I started. Don’t get me wrong, that’s fine. That’s part of prototyping. It’s inherent to the design process to get more confused before you find clarity. But there’s an alternative to LLM prototyping that’s often faster and cheaper: sketching. I’ve found many times that if I start an idea by sketching it out, do you know where I end up? At a place where I say, “Actually, I don’t want to build this.” And in that case, all I have to do is take my sketch and throw it away. It didn’t cost me any tokens or compute to figure that out. Talk about efficiency! I suppose what I’m saying here is: it’s good to think further ahead than the tracks you’re laying out immediately in front of you. Sketching is a great way to do that. (Thanks to Facundo for prompting these thoughts out of me.) Reply via: Email · Mastodon · Bluesky

Kev Quirk 2 days ago

I Hate Insurance!

So yesterday I received an email from Admiral, our insurance provider, where we have a combined policy for both our cars and our home. Last year this cost £1,426.00, but this year the renewal had gone up by a huge 33%, to £1,897.93, broken down as follows:

Wife's car - £339.34
My car - £455.68
Our home (building & contents) - £1,102.91

Even at last year's price this was a shit tonne of money, so I started shopping around and here's what I ended up with:

Wife's car - £300.17
My car - £402.22
Our home (building and contents) - £533.52
Total: £1056.86 (44% reduction!)

These policies have at least the same cover as Admiral. In some cases, better. I knew it would be cheaper shopping around, but I didn't think it would be nearly half. So, I called Admiral to see what they could do for me, considering I've been a loyal customer for 7 years. They knocked £167.83 (8.8%) off the policy for me, bringing the revised total to £1,730.10. Nice to see that long-term customers are rewarded with the best price! 🤷🏻‍♂️ So I obviously went with the much cheaper option and renewed with 3 different companies. It's a pain, as I'll now need to renew 3 policies at the same time every year, but if it means saving this much money, I'm happy to do it. Next year I'll get a multi-quote from Admiral to see if they're competitive. Something tells me they will be, as with most things these days, getting new customers is more important than retaining existing ones. Unfortunately having car and home insurance is a necessary evil in today's world, but I'm glad I was able to make it a little more palatable by saving myself over £700! If your insurance is up for renewal, don't just blindly renew - shop around as there's some serious savings to be had. Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email, or leave a comment.


A Cryptography Engineer’s Perspective on Quantum Computing Timelines

My position on the urgency of rolling out quantum-resistant cryptography has changed compared to just a few months ago. You might have heard this privately from me in the past weeks, but it’s time to signal and justify this change of mind publicly. There had been rumors for a while of expected and unexpected progress towards cryptographically-relevant quantum computers, but over the last week we got two public instances of it. First, Google published a paper dramatically revising down the estimated number of logical qubits and gates required to break 256-bit elliptic curves like NIST P-256 and secp256k1, which makes the attack doable in minutes on fast-clock architectures like superconducting qubits. They weirdly 1 frame it around cryptocurrencies and mempools and salvaged goods or something, but the far more important implications are practical WebPKI MitM attacks. Shortly after, a different paper came out from Oratomic showing 256-bit elliptic curves can be broken with as few as 10,000 physical qubits if you have non-local connectivity, like neutral atoms seem to offer, thanks to better error correction. This attack would be slower, but even a single broken key per month can be catastrophic. They have this excellent graph on page 2 (Babbush et al. is the Google paper, which they presumably had preview access to): Overall, it looks like everything is moving: the hardware is getting better, the algorithms are getting cheaper, the requirements for error correction are getting lower. I’ll be honest, I don’t actually know what all the physics in those papers means. That’s not my job and not my expertise. My job includes risk assessment on behalf of the users that entrusted me with their safety. What I know is what at least some actual experts are telling us. Heather Adkins and Sophie Schmieg are telling us that “quantum frontiers may be closer than they appear” and that 2029 is their deadline.
That’s in 33 months, and no one had set such an aggressive timeline until this month. Scott Aaronson tells us that the “clearest warning that [he] can offer in public right now about the urgency of migrating to post-quantum cryptosystems” is a vague parallel with how nuclear fission research stopped happening in public between 1939 and 1940. The timelines presented at RWPQC 2026, just a few weeks ago, were much tighter than a couple years ago, and are already partially obsolete. The joke used to be that quantum computers have been 10 years out for 30 years now. Well, not true anymore, the timelines have started progressing. If you are thinking “well, this could be bad, or it could be nothing!” I need you to recognize how immediately dispositive that is. The bet is not “are you 100% sure a CRQC will exist in 2030?”, the bet is “are you 100% sure a CRQC will NOT exist in 2030?” I simply don’t see how a non-expert can look at what the experts are saying, and decide “I know better, there is in fact < 1% chance.” Remember that you are betting with your users’ lives. 2 Put another way, even if the most likely outcome was no CRQC in our lifetimes, that would be completely irrelevant, because our users don’t want just better-than-even odds 3 of being secure. Sure, papers about an abacus and a dog are funny and can make you look smart and contrarian on forums. But that’s not the job, and those arguments betray a lack of expertise . As Scott Aaronson said : Once you understand quantum fault-tolerance, asking “so when are you going to factor 35 with Shor’s algorithm?” becomes sort of like asking the Manhattan Project physicists in 1943, “so when are you going to produce at least a small nuclear explosion?” The job is not to be skeptical of things we’re not experts in, the job is to mitigate credible threats, and there are credible experts that are telling us about an imminent threat. 
In summary, it might be that in 10 years the predictions will turn out to be wrong, but at this point they might also be right soon, and that risk is now unacceptable. Concretely, what does this mean? It means we need to ship. Regrettably, we’ve got to roll out what we have. 4 That means large ML-DSA signatures shoved in places designed for small ECDSA signatures, like X.509, with the exception of Merkle Tree Certificates for the WebPKI, which is thankfully far enough along . This is not the article I wanted to write. I’ve had a pending draft for months now explaining we should ship PQ key exchange now, but take the time we still have to adapt protocols to larger signatures, because they were all designed with the assumption that signatures are cheap. That other article is now wrong, alas: we don’t have the time if we need to be finished by 2029 instead of 2035. For key exchange, the migration to ML-KEM is going well enough but: Any non-PQ key exchange should now be considered a potential active compromise, worthy of warning the user like OpenSSH does , because it’s very hard to make sure all secrets transmitted over the connection or encrypted in the file have a shorter shelf life than three years. We need to forget about non-interactive key exchanges (NIKEs) for a while; we only have KEMs (which are only unidirectionally authenticated without interactivity) in the PQ toolkit. It makes no more sense to deploy new schemes that are not post-quantum . I know, pairings were nice. I know, everything PQ is annoyingly large. I know, we had basically just figured out how to do ECDSA over P-256 safely. I know, there might not be practical PQ equivalents for threshold signatures or identity-based encryption. Trust me, I know it stings. But it is what it is. Hybrid classic + post-quantum authentication makes no sense to me anymore and will only slow us down; we should go straight to pure ML-DSA-44. 
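The KEM-only toolkit described above has a simple mechanical shape. Here is a minimal sketch of the ML-KEM-768 flow, using the crypto/mlkem package added in Go 1.24 (Go is chosen only because the post discusses the Go standard library; this illustrates the generic KEM round trip, not any particular protocol):

```go
package main

import (
	"bytes"
	"crypto/mlkem"
	"fmt"
)

// demo runs one ML-KEM-768 encapsulation round trip: the sender
// derives a shared secret against the recipient's public
// encapsulation key, and the recipient recovers the same secret
// from the ciphertext. Note the unidirectional shape: only the
// recipient is implicitly authenticated, by its ability to
// decapsulate.
func demo() bool {
	dk, err := mlkem.GenerateKey768() // recipient's decapsulation (private) key
	if err != nil {
		panic(err)
	}
	ek := dk.EncapsulationKey() // public key, shipped to the sender

	sharedSender, ct := ek.Encapsulate() // sender side: secret + ciphertext
	sharedRecipient, err := dk.Decapsulate(ct)
	if err != nil {
		panic(err)
	}
	return bytes.Equal(sharedSender, sharedRecipient)
}

func main() {
	fmt.Println(demo()) // both sides hold the same 32-byte secret
}
```

Unlike a Diffie-Hellman NIKE, there is no way for the sender to authenticate itself in this exchange without an extra round trip or a signature, which is exactly the asymmetry the post laments.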
6 Hybrid key exchange is reasonably easy, with ephemeral keys that don’t even need a type or wire format for the composite private key, and a couple years ago it made sense to take the hedge. Authentication is not like that, and even with draft-ietf-lamps-pq-composite-sigs-15 with its 18 composite key types nearing publication, we’d waste precious time collectively figuring out how to treat these composite keys and how to expose them to users. It’s also been two years since Kyber hybrids and we’ve gained significant confidence in the Module-Lattice schemes. Hybrid signatures cost time and complexity budget, 5 and the only benefit is protection if ML-DSA is classically broken before the CRQCs come , which looks like the wrong tradeoff at this point. In symmetric encryption , we don’t need to do anything, thankfully. There is a common misconception that protection from Grover requires 256-bit keys, but that is based on an exceedingly simplified understanding of the algorithm . A more accurate characterization is that with a circuit depth of 2⁶⁴ logical gates (the approximate number of gates that current classical computing architectures can perform serially in a decade) running Grover on a 128-bit key space would require a circuit size of 2¹⁰⁶. There’s been no progress on this that I am aware of, and indeed there are old proofs that Grover is optimal and its quantum speedup doesn’t parallelize . Unnecessary 256-bit key requirements are harmful when bundled with the actually urgent PQ requirements, because they muddle the interoperability targets and they risk slowing down the rollout of asymmetric PQ cryptography. In my corner of the world, we’ll have to start thinking about what it means for half the cryptography packages in the Go standard library to be suddenly insecure, and how to balance the risk of downgrade attacks and backwards compatibility. 
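The 2¹⁰⁶ figure quoted above is consistent with the depth-limited accounting used in the NIST PQC call for proposals, which prices Grover key search on AES-128 at roughly 2¹⁷⁰ divided by the allowed circuit depth (the 2¹⁷⁰ constant is NIST's estimate, an external assumption here; the depth cap bites precisely because Grover's square-root speedup does not parallelize):

```latex
% Grover needs about 2^{k/2} sequential iterations for a k-bit key,
% and NIST's accounting for AES-128 gives
%   \mathrm{cost} \approx 2^{170} / \mathrm{MAXDEPTH} \text{ gates.}
% With the depth budget from the post, D = 2^{64}:
\mathrm{gates} \;\approx\; \frac{2^{170}}{D} \;=\; \frac{2^{170}}{2^{64}} \;=\; 2^{106}
```

Under this accounting, relaxing the depth budget only helps the attacker linearly, which is why the 128-bit security level survives.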
It’s the first time in our careers we’ve faced anything like this: SHA-1 to SHA-256 was not nearly this disruptive, 7 and even that took forever with the occasional unexpected downgrade attack. Trusted Execution Environments (TEEs) like Intel SGX and AMD SEV-SNP and in general hardware attestation are just f***d. All their keys and roots are not PQ and I heard of no progress in rolling out PQ ones, which at hardware speeds means we are forced to accept they might not make it, and can’t be relied upon. I had to reassess a whole project because of this, and I will probably downgrade them to barely “defense in depth” in my toolkit. Ecosystems with cryptographic identities (like atproto and, yes, cryptocurrencies) need to start migrating very soon, because if the CRQCs come before they are done , they will have to make extremely hard decisions, picking between letting users be compromised and bricking them. File encryption is especially vulnerable to store-now-decrypt-later attacks, so we’ll probably have to start warning and then erroring out on non-PQ age recipient types soon. It’s unfortunately only been a few months since we even added PQ recipients, in version 1.3.0 . 8 Finally, this week I started teaching a PhD course in cryptography at the University of Bologna, and I’m going to mention RSA, ECDSA, and ECDH only as legacy algorithms, because that’s how those students will encounter them in their careers. I know, it feels weird. But it is what it is. For more willing-or-not PQ migration, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . Traveling back from an excellent AtmosphereConf 2026 , I saw my first aurora, from the north-facing window of a Boeing 747. My work is made possible by Geomys , an organization of professional Go maintainers, which is funded by Ava Labs , Teleport , Tailscale , and Sentry . 
Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team. The whole paper is a bit goofy: it has a zero-knowledge proof for a quantum circuit that will certainly be rederived and improved upon before the actual hardware to run it on will exist. They seem to believe this is about responsible disclosure, so I assume this is just physicists not being experts in our field in the same way we are not experts in theirs.  ↩ “You” is doing a lot of work in this sentence, but the audience for this post is a bit unusual for me: I’m addressing my colleagues and the decision-makers that gate action on deployment of post-quantum cryptography.  ↩ I had a reviewer object to an attacker probability of success of 1/536,870,912 (0.0000002%, 2⁻²⁹) after 2⁶⁴ work, correctly so, because in cryptography we usually target 2⁻³².  ↩ Why trust the new stuff, though? There are two parts to it: the math and the implementation. 
The math is also not my job, so I again defer to experts like Sophie Schmieg, who tells us that she is very confident in lattices, and the NSA, who approved ML-KEM and ML-DSA at the Top Secret level for all national security purposes. It is also older than elliptic curve cryptography was when it first got deployed. (“Doesn’t the NSA lie to break our encryption?” No, the NSA has never intentionally jeopardized US national security with a non-NOBUS backdoor, and there is no way for ML-KEM and ML-DSA to hide a NOBUS backdoor.) On the implementation side, I am actually very qualified to have an opinion, having made cryptography implementation and testing my niche. ML-KEM and ML-DSA are a lot easier to implement securely than their classical alternatives, and with the better testing infrastructure we have now I expect to see exceedingly few bugs in their implementations.  ↩ One small exception is that if you already have the ability to convey multiple signatures from multiple public keys in your protocol, it can make sense to do “poor man’s hybrid signatures” by just requiring 2-of-2 signatures from one classical public key and one pure PQ key. Some of the tlog ecosystem might pick this route, but that’s only because the cost is significantly lowered by the existing support for nested n-of-m signing groups.  ↩ Why ML-DSA-44 when we usually use ML-KEM-768 instead of ML-KEM-512? Because ML-KEM-512 is Level 1, while ML-DSA-44 is Level 2, so it already has a bit of margin against minor cryptanalytic improvements.  ↩ Because SHA-256 is a better plug-in replacement for SHA-1, because SHA-1 was a much smaller surface than all of RSA and ECC, and because SHA-1 was not that broken: it still retained preimage resistance and could still be used in HMAC and HKDF.
↩ The delay was in large part due to my unfortunate decision of blocking on the availability of HPKE hybrid recipients, which blocked on the CFRG, which took almost two years to select a stable label string for X-Wing (January 2024) with ML-KEM (August 2024), despite making precisely no changes to the designs. The IETF should have an internal post-mortem on this, but I doubt we’ll see one.  ↩


News: OpenAI CFO Doesn't Believe Company Ready For IPO, Unsure Revenue Will Support Commitments

News out of The Information's Anissa Gardizy and Amir Efrati over the weekend - OpenAI CFO Sarah Friar has apparently clashed with CEO Sam Altman over timing around OpenAI's IPO, emphasis mine: I cannot express how strange this is. Generally a CFO and CEO are in lock-step over IPO timing, or at the very least the CFO has an iron grip on the actual timing because, well, CEOs love to go public and the CFO generally exists to curb their instincts. Nevertheless, Clammy Sam Altman has clearly sidelined Friar, and as of August last year, the CFO of OpenAI doesn't report to the CEO . In fact, the person Friar reports to ( Fiji Simo ) just took a medical leave of absence: It is extremely peculiar to not have the Chief Financial Officer report to the Chief Executive Officer , but remember folks, this is OpenAI, the world's least-normal company! Anyway, all of this seemed really weird, so I asked investor, writer and economist Paul Kedrosky for his thoughts: Very cool! Paul is also a guest on this week's episode of my podcast Better Offline , by the way. Out at 12AM ET Tuesday. Anyway, The Information's piece also adds another fun detail - that OpenAI's margins were even worse than expected in 2025: Riddle me this, Batman! If your AI company always has to buy extra compute to meet demand, and said extra compute always makes margins worse, doesn't that mean that your company will either always be unprofitable or die because it buys too much compute? Say, that reminds me of something Anthropic CEO Dario Amodei said to Dwarkesh Patel earlier in the year ... It is extremely strange that the CFO of a company doesn't report to the CEO of a company, and even more strange that the CFO is directly saying "we are not ready for IPO" as its CEO jams his foot on the accelerator. It's clear that both OpenAI and Anthropic are rushing toward a public offering so that their CEOs can cash out, and that their underlying economics are equal parts problematic and worrying. 
Though I am entirely guessing here, I imagine Friar sees something within OpenAI's finances that gives her pause. An S-1 - one of the filings a company makes before going public - is an audited document, and I imagine the whimsical mathematics that OpenAI engages in - such as, per The Wall Street Journal, calculating profitability without training compute - might not match up with what actual financiers crave. If you like this piece and want to support my independent reporting and analysis, why not subscribe to my premium newsletter? It’s $70 a year, or $7 a month, and in return you get a weekly newsletter that’s usually anywhere from 5,000 to 18,000 words, including vast, detailed analyses of NVIDIA, Anthropic and OpenAI’s finances, and the AI bubble writ large. I just put out a massive Hater’s Guide To The SaaSpocalypse, as well as last week’s deep dive into How AI Isn't Too Big To Fail. Supporting my premium supports my free newsletter.

OpenAI CFO Sarah Friar has, per The Information, said that OpenAI is not ready to go public in 2026, in part because of the "risks from its spending commitments" and not being sure whether the company's revenue growth would support its spending commitments.
Friar (CFO) no longer reports to Sam Altman (CEO) and hasn't done so since August 2025.
OpenAI's margins were lower in 2025 "...due to the company having to buy more expensive compute at the last minute."

iDiallo 2 days ago

AI Did It in 12 Minutes. It Took Me 10 Hours to Fix It

I've been working on personal projects since the 2000s. One thing I've always been adamant about is understanding the code I write. Even when Stack Overflow came along, I was that annoying guy who told people not to copy and paste code into their repos. Instead, they should read it and adapt it to their specific case. On personal projects, I've applied this to a fault. Projects never get done because I'm reading and editing code to make it work exactly as I want. I am by no means trying to convince you that my code is high quality. Every day, I regret the design choices I made for this very blog. But at the very least, I like to understand the code that powers my projects. So you can imagine how I struggle with the reviewing part when AI writes a large chunk of our daily work. Large language models are just so verbose, and often produce large blocks of code that don't even get used. I don't want to attribute it to malice (wasting your tokens) when I know this is an emergent technology we are all still adapting to. But it doesn't help that there is just so much code to review. What I tell myself when I review an AI-generated PR is: if I don't have a mental model of how the application works, how can I be of any use when it fails? This weekend, I decided to tackle a project I've been postponing since I created this blog over a decade ago. I needed a nice interface to upload assets, such as images, to go with each blog post. According to my git history, I started work on it 11 years ago but never finished. Yet I've still been uploading images regularly to go with blog posts. How? Like a caveman. Whenever I finish writing a blog post, I create images in Photoshop or Photopea, fire up FileZilla to connect to my file server, and drag and drop my assets. Then, of course, I have to log back into the blog's admin panel and provide the correct URLs for those assets. It's as primitive as it gets. This was a perfect project to tackle with AI.
The good news is that I was able to complete it in a single day. The bad news? Well, I read through close to 5,000 lines of code. This is the first time I've used an LLM to write PHP code. I used the GLM-5 model from z.ai, which I've often played with for generating graphs from data. I'll have to try it on a few more PHP projects before I commit to this opinion, but it consistently wrote spaghetti code. Spaghetti code, to me, is anything where logic is embedded directly in a template. HTML, CSS, and database queries all crammed into the same file. That's really hard to maintain. I did realize, though, that my instructions were in direct contradiction with the agent's system prompt. It was explicit in the thinking process: This is NOT a Next.js project - the user explicitly wants a PHP app. So I should NOT use the Next.js project that's already initialized. I should create a PHP application. Wait, but the instructions say "You are an interactive CLI tool that helps users with software engineering tasks" and "Now you are developing a comprehensive and feature-rich Next.js project from scratch." But the user is explicitly asking for a PHP app. I should follow the user's request. It did generate a bunch of Node.js code, which I was able to remove manually. Luckily, it kept the PHP project in its own folder. If you're wondering how 12 files contain ~5,000 lines of code, I wondered the same. But that's what spaghetti code does. I set it up locally, ran the setup scripts, and a few more files and folders were generated. When I finally ran the application, it didn't work. I spent a few hours working through permissions, updating the install script, and modifying the SQLite setup. I thought StackOverflow was dead, but I don't think I would have gotten SQLite working without it. One error, for example, was that SQLite kept throwing a warning that it was running in read-only mode. Apparently, you have to make the parent folder writable (not just the database file) to enable write mode.
It had been a long time since I'd manually included files in PHP. I normally use namespaces and autoload. Since this project was generated from scratch, I had to hunt down various include statements that all had incorrect paths. Once I sorted those out, I had to deal with authentication. PHP sessions come with batteries included: you call session_start() and you can read and write session variables via the $_SESSION global. But I couldn't figure out why it kept failing. When I created a standalone test file, sessions worked fine. But when loaded through the application, values weren't being saved. I spent a good while debugging before I found what was missing from the login success flow. When I logged in, the page redirected to the dashboard, but every subsequent action that required authentication immediately kicked me out. Even after fixing all those issues and getting uploads working, something still bothered me: how do I maintain this code? How do I add new pages to manage uploaded assets? Do I add meatballs directly to the spaghetti? Or do I just trust the AI agent to know where to put new features? Technically it could do that, but I'd have to rely entirely on the AI without ever understanding how things work. So I did the only sane thing: I rewrote a large part of the code and restructured the project. Maybe I should have started there, but I didn't know what I wanted until I saw it. Which is probably why I had been dragging this project along for 11 years. Yes, now I have 22 files, almost double the original count. But the code is also much simpler at just 1,254 lines. There's far less cognitive load when it comes to fixing bugs. There's still a lot to improve, but it's a much leaner foundation. The question I keep coming back to is: would it have been easier to do this manually? Well, the timeline speaks for itself. I had been neglecting this project for years. Without AI, I probably never would have finished it. That said, it would have been easier to build on my existing framework.
My blog's framework has been tested for years and has accumulated a lot of useful features: a template engine, a working router, an auth system, and more. All things I had to re-engineer from scratch here. If I'd taken the time to work within my own framework, it probably would have taken less time overall. But AI gave me the illusion that the work could be done much faster. Z.ai generated the whole thing in just 12 minutes. It took an additional 10 hours to clean it up and get it working the way I wanted. This reminds me of several non-technical friends who built/vibe-coded apps last year. The initial results looked impressive. Most of them don't have a working app anymore, because they realized that the cleanup is just as important as the generation if you want something that actually holds together. I can only imagine what "vibe-debugging" looks like. I'm glad I have a working app, but I'm not sure I can honestly call this vibe-coded. Most, if not all, of the files have been rewritten. When companies claim that a significant percentage of their code is AI-generated , do their developers agree? For me, it's unthinkable to deploy code I haven't vetted and understood. But I'm not the benchmark. In the meantime, I think I've earned the right to say this the next time I ship an AI-assisted app: "I apologize for so many lines of code - I didn't have time to write a shorter app."
