Filippo Valsorda 1 week ago

The 2025 Go Cryptography State of the Union

This past August, I delivered my traditional Go Cryptography State of the Union talk at GopherCon US 2025 in New York. It goes into everything that happened at the intersection of Go and cryptography over the last year. You can watch the video (with manually edited subtitles, for my fellow subtitles enjoyers) or read the transcript below (for my fellow videos not-enjoyers). The annotated transcript was made with Simon Willison's tool. All pictures were taken around Rome, the Italian countryside, and the skies of the Northeastern United States.

Welcome to my annual performance review. We are going to talk about all of the stuff that we did in the Go cryptography world during the past year. When I say "we," it doesn't mean just me: it means me, Roland Shoemaker, Daniel McCarney, Nicola Morino, Damien Neil, and many, many others, both from the Go team and from the Go community, who contribute to the cryptography libraries all the time. I used to do this work at Google, and I now do it as an independent as part of, and leading, Geomys, but we'll talk about that later.

When we talk about the Go cryptography standard libraries, we talk about all of those packages that you use to build secure applications. That's what we make them for. We do it to provide you with encryption and hashes and protocols like TLS and SSH, to help you build secure applications.

The main headlines of the past year: We shipped post-quantum key exchanges, which is something that you will not have to think about and will just be solved for you. We have solved FIPS 140, which some of you will not care about at all and some of you will be very happy about. And the thing I'm most proud of: we did all of this while keeping an excellent security track record, year after year.

This is an update to something you've seen last year: the Go Security Track Record. It's the list of vulnerabilities in the Go cryptography packages.
We don't assign a severity, because it's really hard; instead they're graded on the "Filippo's unhappiness score." It goes shrug, oof, and ouch. Time goes from bottom to top, and you can see how, as time goes by, things have been getting better. People report more things, but they're more often shrugs than oofs, and there haven't been ouches. More specifically, we haven't had any oof since 2023. We haven't had any Go-specific oof since 2021. When I say Go-specific, I mean: well, sometimes the protocol is broken, and as much as we want to be ahead of that too by limiting complexity, you know, sometimes there's nothing you can do about it. And we haven't had ouches since 2019. I'm very happy about that.

But if this sounds a little informal, I'm also happy to report that we had the first security audit by a professional firm. Trail of Bits looked at all of the nuts and bolts of the Go cryptography standard library: primitives, ciphers, hashes, assembly implementations. They didn't look at the protocols, which is a lot more code on top of that, but they did look at all of the foundational stuff. And I'm happy to say that they found nothing.

Two of a kind t-shirts, for me and Roland Shoemaker.

It is easy though to maintain a good security track record if you never add anything, so let's talk about the code we did add instead. First of all, post-quantum key exchanges. We talked about post-quantum last year, but as a very quick refresher: we focused on post-quantum key exchange because the key exchange defends against the most urgent risk, which is that somebody might be recording connections today, keeping them saved on some storage for the next 5-50 years, and then using future quantum computers to decrypt those sessions. I'm happy to report that we now have ML-KEM, which is the post-quantum key exchange algorithm selected by the NIST competition, an international competition run in the open.
You can use it directly from the crypto/mlkem standard library package starting in Go 1.24, but you're probably not gonna do that. Instead, you're probably going to just use crypto/tls, which by default now uses a hybrid of X25519 and ML-KEM-768 for all connections with other systems that support it.

Why hybrid? Because this is new cryptography, so we are still a little worried that somebody might break it. There was one that looked very good and had very small ciphertexts, and we were all like, "yes, yes, that's good, that's good." And then somebody broke it on a laptop. It was very annoying. We're fairly confident in lattices. We think this is the good one. But still, we are taking both the old stuff and the new stuff, hashing them together, and unless you have both a quantum computer to break the old stuff and a mathematician who broke the new stuff, you're not breaking the connection.

crypto/tls can now negotiate that with Chrome and can negotiate that with other Go 1.24+ applications. Not only that, we also removed any choice you had in the ordering of key exchanges, because we think we know better than you and— that didn't come out right, uh. … because we assume that you actually want us to make those kinds of decisions. You can still turn it off. But as long as you don't, we'll default to the post-quantum stuff to keep your connections safe from the future.

Same stuff with x/crypto/ssh, starting in v0.38.0. SSH does the same thing, they just put X25519 and ML-KEM-768 in a different order, which you would think doesn't matter—and indeed it doesn't matter—but there are rules where "no, no, no, you have to put that one first." And the other rule says "no, you have to put that one first." It's been a whole thing. I'm tired. OpenSSH supports it, so if you connect to a recent enough version of OpenSSH, that connection is post-quantum and you didn't have to do anything except update.
Okay, but you said key exchanges and digital signatures are both broken. What about the latter? Well, key exchanges are urgent because of the record-now-decrypt-later problem, but unless the physicists who are developing quantum computers also develop a time machine, they can't use the QC to go back in time and forge a signature today. So if you're verifying a signature today, I promise you it's not forged by a quantum computer. We have a lot more time to figure out post-quantum digital signatures.

But if we can, why should we not start now? Well, it's different. For key exchange, we knew what hit we had to take. You have to do a key exchange, you have to do it when you start the connection, and ML-KEM is the algorithm we have, so we're gonna use it. Signatures: we developed a lot of protocols like TLS and SSH back when it was a lot cheaper to put signatures on the wire. When you connect to a website right now, you get five signatures. We can't send you five 2KB blobs every time you connect to a website. So we are waiting, to give protocols time to evolve, to redesign things with the new trade-offs in mind, of signatures not being cheap. We are intentionally slow-rolling the digital signature side because it's both not as urgent and not as ready to deploy. We can't do the same "ta-da, it's solved for you" show because signatures are much harder to roll out.

Let's talk about another thing that I mentioned last year, which is FIPS 140. FIPS 140 is a US government regulation for how to do cryptography. It is a list of algorithms, but it's not just a list of algorithms. It's also a list of rules that the modules have to follow. What is a module? Well, a module used to be a thing you would rack. All the rules are based on the idea that it's a thing you can rack. Then the auditor can ask "what is the module's boundary?" And you're like, "this shiny metal box over here." And, you know, that works.
When people ask those questions of libraries, though, I do get a little mad every time. Like, what are the data input ports of your library? Ports. Okay. Anyway, it's an interesting thing to work with.

To comply with FIPS 140 in Go, up to now, you had to use an unsupported GOEXPERIMENT, which would replace all of the Go cryptography standard library, all of the stuff I'm excited about, with the BoringCrypto module, which is a FIPS 140 module developed by the BoringSSL folks. We love the BoringSSL folks, but that means using cgo, and we do not love cgo. It has memory safety issues, it makes cross-compilation difficult, and it's not very fast. Moreover, the list of algorithms and platforms of BoringCrypto is tailored to the needs of BoringSSL and not to the needs of the Go community, and their development cycle doesn't match ours: we don't decide when that module gets validated. Speaking of memory safety, I lied a little. Trail of Bits did find one vulnerability. They found it in Go+BoringCrypto, which was yet another reason to push away from it.

Instead, we've now got the FIPS 140-3 Go Cryptographic Module. Not only is it native Go, it's actually just a different name for the internal Go packages that all the regular Go cryptography packages use for the FIPS 140 algorithms. We just moved them into their own little bubble so that when they ask us "what is the module boundary" we can point at those packages. Then there's a runtime mode which enables some of the self-tests and slow stuff that you need for compliance. It also tells crypto/tls not to negotiate stuff that's not FIPS, but aside from that, it doesn't change any observable behavior. We managed to keep everything working exactly the same: you don't import a different package, you don't do anything different, your applications just keep working the same way. We're very happy about that.
Finally, you can at compile time select a GOFIPS140 frozen module, which is just a zip file of the source of the module as it was back when we submitted it for validation, which is a compliance requirement sometimes. By the way, that means we have to be forward compatible with future versions of Go, even for internal packages, which was a little spicy. You can read more in the upstream FIPS 140-3 docs.

You might be surprised to find out that using a FIPS 140 algorithm from a FIPS 140 module is not actually enough to be FIPS 140 compliant. The FIPS 140 module also has to be tested for that specific algorithm. What we did is we just tested them all, so you can use any FIPS 140 algorithm without worrying about whether it's tested in our module. When I say we tested them all, I mean that some of them we tested with four different names. NIST calls HKDF alternatively SP 800-56C two-step KDF, SP 800-133 Section 6.3 CKG, SP 800-108 Feedback KDF, and Implementation Guidance D.P OneStepNoCounter KDF (you don't wanna know). It has four different names for the same thing. We just tested it four times, it's on the certificate, you can use it whatever way you want and it will be compliant.

But that's not enough. Even if you use a FIPS 140 algorithm from a FIPS 140 module that was tested for that algorithm, it's still not enough, because it has to run on a platform that was tested as part of the validation. So we tested on a lot of platforms. Some of them were paid for by various Fortune 100s that had an interest in them getting tested, but some of them had no sponsors. We really wanted to solve this problem for everyone, once and for all, so Geomys just paid for all the FreeBSD, macOS, even Windows testing so that we could say "run it on whatever and it's probably going to be compliant." (Don't quote me on that.)

How did we test on that many machines? Well, you know, we have this sophisticated data center… Um, no. No, no. I got a bunch of stuff shipped to my place.
That's my NAS now. It's an Ampere Altra Q64-22, sixty-four arm64 cores, and yep, it's my NAS. Then I tested it on, you know, this sophisticated arm64 macOS testing platform. And then on the Windows one, which is my girlfriend's laptop. And then the arm one, which was my router. Apparently I own an EdgeRouter now? It's sitting in the data center which is totally not my kitchen. It was all a very serious and regimented thing, and all of it is actually recorded, in recorded sessions with the accredited laboratories, so all this is now on file with the US government.

You might or might not be surprised to hear that the easiest way to meet the FIPS 140 requirements is to not exceed them. That's annoying, and a problem of FIPS 140 in general: if you do what everybody else does, which is just clearing the bar, nobody will ask questions, so there's a strong temptation to lower security in FIPS 140 mode. We just refused to accept that. Instead, we figured out complex stratagems.

For example, for randomness, the safest thing to do is to just take randomness from the kernel every time you need it. The kernel knows if a virtual machine was just cloned and we don't, so we risk generating the same random bytes twice. But NIST will not allow that. You need to follow a bunch of standards for how the randomness is generated, and the kernel doesn't. So what we do is we do everything that NIST asks, and then every time you ask for randomness, we squirrel off, go to the kernel, get a little piece of extra entropy, and stir it into the pot before giving back the result. It's still NIST compliant because it's as strong as both the NIST and the kernel solution, but it took some significant effort to show it is compliant.

We did the same for ECDSA. ECDSA is a digital signature mechanism. We've talked about it a few other times. It's just a way to take a message and a private key and generate a signature, here (s, r).
To make a signature, you also need a random number, and that number must be used only once with the same private key. You cannot reuse it. That number is k here. Why can you not reuse it? Because if you reuse it, then you can do this fun algebra thing and, pop, the private key falls out just by smashing two signatures together. Bad, really, really bad.

How do we generate this number that must never be the same? Well, one option is we make it random. But what if your random number generator breaks and generates the same random number twice? That would leak the private key, and that would be bad. So the community came up with deterministic ECDSA. Instead of generating the nonce at random, we are going to hash the message and the private key. This is still actually a little risky though, because if there's a fault in the CPU, for example, or a bug (because, for example, you're taking the wrong inputs), you might still end up generating the same value but signing a slightly different message. How do we mitigate both of those? We do both. We take some randomness and the private key and the message, we hash them all together, and now it's really, really hard for the number to come out the same. That's called hedged ECDSA. The Go crypto library has been doing hedged ECDSA from way before it was called hedged and way before I was on the team.

Except… random ECDSA has always been FIPS. Deterministic ECDSA has been FIPS since a couple of years ago. Hedged ECDSA is technically not FIPS. We really didn't want to make our ECDSA package less secure, so we found a forgotten draft that specifies a hedged ECDSA scheme, and we proceeded to argue that actually if you read SP 800-90A Revision 1 very carefully you realize that if you claim that the private key is just the DRBG entropy plus two-thirds of the DRBG nonce, you are allowed to use it because of SP 800-57 Part 1, etc. etc. etc.
We basically just figured out a way to claim it was fine, and the lab eventually said "okay, shut up." I'm very proud of that one. If you want to read more about this, check out the announcement blog post. If you know you need commercial services for FIPS 140, here's the Geomys FIPS 140 commercial services page. If you don't know if you need them, you actually probably don't. It's fine, the standard library will probably solve this for you now.

Okay, but who cares about this FIPS 140 stuff? "Dude, we've been talking about FIPS 140 for 10 minutes and I don't care about that." Well, I care, because I spent my last year on it, and that apparently made me the top committer for the cycle to the Go repo, and that's mostly FIPS 140 stuff. I don't know how to feel about that.

There have actually been a lot of positive side effects from the FIPS 140 effort. We took care to make sure that everything we found, we would leave in a better state. For example, there are new packages that moved from x/crypto into the standard library: crypto/hkdf, crypto/pbkdf2, crypto/sha3. SHA-3 is faster and doesn't allocate anymore. HKDF has a new generic API which lets you pass in a function that returns either a concrete hash type or the hash.Hash interface, which otherwise was a little annoying. (You had to make a little closure.) I like it. We restructured crypto/aes and crypto/cipher, and in the process merged a contribution from a community member that made AES-CTR, the counter mode, between 2 and 9 times faster. That was a pretty good result. The assembly interfaces are much more consistent now.

Finally, we finished cleaning up crypto/rsa. If you remember from last year, we made the crypto/rsa sign and verify operations not use math/big and use constant time code. Now we also made key generation, validation, and pre-computation all not use math/big. That made loading keys that were serialized to JSON a lot faster, and made key generation much faster.
But how much faster? Benchmarking key generation is really hard because it's a random process: you take a random number and you check, is it prime? No. Toss. Is it prime? Nope. Toss. Is it prime? You keep doing this. If you're lucky, it's very fast. If you're unlucky, very slow. It's a geometric distribution, and if you want to average it out, you have to run for hours. Instead, I figured out a new way, by mathematically deriving the average number of pulls you are expected to make and preparing a synthetic run that performs exactly the expected mean number of checks, so that we get a representative sample to benchmark deterministically. That was a lot of fun.

Moreover, we detect more broken keys, and we did a rare backwards compatibility break to stop supporting keys smaller than 1024 bits. 1024 is already pretty small, you should be using 2048 minimum, but anything smaller than 1024 can be broken on the proverbial laptop. It's kind of silly that a production library lets you do something so insecure, and you can't tell the two apart just by looking at the code. You have to know what the size of the key is. So we just took that out. I expected people to yell at me. Nobody yelled at me. Good job, community.

Aside from adding stuff, you know that we are very into testing, and that testing is how we keep that security track record that we talked about. I have one bug in particular that is my white whale. (You might say, "Filippo, well-adjusted people don't have white whales." Well, we've learned nothing new, have we?) My white whale is this assembly bug that we found at Cloudflare before I joined the Go team. I spent an afternoon figuring out an exploit for it with Sean Devlin in Paris, while the yellow jackets set fire to cop cars outside. That's a different story. It's an assembly bug where the carry—literally the carry, like when you do a pen and paper multiplication—was just not accounted for correctly.
You can watch my talk Squeezing a Key through a Carry Bit if you are curious to learn more about it. The problem with this stuff is that it's so hard to get code coverage for it, because all the code always runs. It's just that you don't know if it always runs with that carry at zero, and if the carry was one, it'd do the wrong math. I think we've cracked it, by using mutation testing. We have a framework that tells the assembler, "hey, anywhere you see an add-with-carry, replace it with a simple add that discards the carry." Then we run the tests. If the tests still pass, the tests did not cover that carry. If that happens, we fail a meta-test and tell whoever's sending the CL, "hey, no, no, no, you gotta test that." Same for checking the case in which the carry is always set: we replace the add-with-carry with a simple add and then insert a +1. It's a little tricky. If you want to read more about it, it's in this blog post. I'm very hopeful that this will help us with all the assembly stuff.

Next, accumulated test vectors. This is a little trick that I'm very, very fond of. Say you want to test a very large space. For example, there are two inputs and they can both be 0 to 200 bytes long, and you want to test all the size combinations. That would be a lot of test vectors, right? If I checked in a megabyte of test vectors every time I wanted to do that, people eventually would yell at me. Instead what we do is run the algorithm with each size combination, take the result, and put it inside a rolling hash. Then at the end we take the hash result and we check that it comes out right. We do this with two implementations. If it comes out to the same hash, great. If it comes out not to the same hash, it doesn't help you figure out what the bug is, but it tells you there's a bug. I'll take it.

We really like reusing other people's tests. We're lazy.
The BoringSSL people have a fantastic suite of tests for TLS called BoGo, and Daniel has been doing fantastic work integrating it and making crypto/tls stricter and stricter in the process. It's now much more spec compliant on the little things, where it goes like, "no, no, no, you're not allowed to put a zero here" and so on. Then, the Let's Encrypt people have a test tool for the ACME protocol called Pebble. (Because it's a small version of their production system called Boulder! It took me a long time to figure it out and eventually I was like ooooohhh.) Finally, NIST has this X.509 interoperability test suite, which just doesn't have a good name. It's good though.

More assembly cleanups. There used to be places in assembly where—as if assembly was not complicated enough—instructions were just written down as raw machine code. Sometimes even the comment was wrong! Can you tell the comment changed in that patch? This is a thing Roland and Joel found. Now there's a test that will just yell at you if you try to commit a raw machine code instruction. We also removed all the assembly that was specifically there for speeding up stuff on CPUs that don't have AVX2. AVX2 came out in 2015, and if you want to go fast, you're probably not using a CPU generation from back then. We still run on it, just not as fast.

More landings! I'm going to speed through these ones. This is all stuff that we talked about last year and that we actually landed. Stuff like data independent timing, to tell the CPU, "no, no, I actually did mean for you to do that in constant time, goddammit." And server-side TLS Encrypted Client Hello, which is a privacy improvement. We had client side, now we have server side. crypto/rand.Read never fails. We promised that, we did that. Now, do you know how hard it is to test the failure case of something that never fails? I had to re-implement the seccomp library to tell the kernel to break the getrandom syscall to check what happens when it doesn't work.
There are tests all pointing guns at each other to make sure the fallback both works and is never hit unexpectedly. It's also much faster now, because Jason Donenfeld added the Linux getrandom vDSO. Sean Liao added rand.Text like we promised. Then more stuff like hash.Cloner, which I think makes a lot of things a little easier, and more and more and more and more. The Go 1.24 and Go 1.25 release notes are there for you.

x/crypto/ssh is also under our maintenance, and some excellent stuff happened there, too. Better tests, better error messages, better compatibility, and we're working on some v2 APIs. If you have opinions, it's time to come to those issues to talk about them!

It's been an exciting year, and I'm going to give you just two samples of things we're planning to do for the next year. One is TLS profiles. Approximately no one wants to specifically configure the fifteen different knobs of a TLS library. Approximately no one—because I know there are some people who do, and they yell at me regularly. But most people just want "hey, make it broadly compatible." "Hey, make it FIPS compliant." "Hey, make it modern." We're looking for a way to make it easy to just say what your goal is, and then we do all the configuration for you in a way that makes sense and that evolves with time. I'm excited about this one.

And maybe something with passkeys? If you run websites that authenticate users a bunch, with password hashes and maybe also with WebAuthn, find me, email us, we want feedback. We want to figure out what to build here, into the standard library.

Alright, so it's been a year of cryptography, but it's also been a year of Geomys. Geomys launched a year ago here at GopherCon. If you want an update, we went on the Fallthrough podcast to talk about it, so check that out. We are now a real company, and how you know is that we have totes: it's the equivalent of a Facebook-official relationship.
The best FIPS 140 side effect has been that we have a new maintainer. Daniel McCarney joined us to help with the FIPS effort, and then we were working so well together that Geomys decided to just take him on as a permanent maintainer on the Go crypto maintenance team. I'm very excited about that. This is all possible thanks to our clients, and if you have any questions, here are the links.

You might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected]. My work is made possible by Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them!

Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

Post-quantum cryptography is about the future. We are worried about quantum computers that might exist… 5-50 (it's a hell of a range) years from now, and that might break all of asymmetric encryption.
(Digital signatures and key exchanges.) Post-quantum cryptography runs on classical computers. It's cryptography that we can do now that resists future quantum computers. Post-quantum cryptography is fast, actually. If you were convinced that for some reason it was slow, that's a common misconception. However, post-quantum cryptography is large. Which means that we have to send a lot more bytes on the wire to get the same results.

Filippo Valsorda 4 weeks ago

Claude Code Can Debug Low-level Cryptography

Over the past few days I wrote a new Go implementation of ML-DSA, a post-quantum signature algorithm specified by NIST last summer. I livecoded it all over four days, finishing it on Thursday evening. Except… Verify was always rejecting valid signatures. I was exhausted, so I tried debugging for half an hour and then gave up, with the intention of coming back to it the next day with a fresh mind. On a whim, I figured I would let Claude Code take a shot while I read emails and resurfaced from hyperfocus. I mostly expected it to flail in some maybe-interesting way, or rule out some issues. Instead, it rapidly figured out a fairly complex low-level bug in my implementation of a relatively novel cryptography algorithm. I am sharing this because it made me realize I still don't have a good intuition for when to invoke AI tools, and because I think it's a fantastic case study for anyone who's still skeptical about their usefulness.

Full disclosure: Anthropic gave me a few months of Claude Max for free. They reached out one day and told me they were giving it away to some open source maintainers. Maybe it's a ploy to get me hooked so I'll pay for it when the free coupon expires. Maybe they hoped I'd write something like this. Maybe they are just nice. Anyway, they made no request or suggestion to write anything public about Claude Code. Now you know.

I started Claude Code v2.0.28 with Opus 4.1 and no system prompts, and gave it the following prompt (typos included):

I implemented ML-DSA in the Go standard library, and it all works except that verification always rejects the signatures. I know the signatures are right because they match the test vector. YOu can run the tests with "bin/go test crypto/internal/fips140/mldsa" You can find the code in src/crypto/internal/fips140/mldsa Look for potential reasons the signatures don't verify. ultrathink I spot-checked and w1 is different from the signing one.

To my surprise, it pinged me a few minutes later with a complete fix.
Maybe I shouldn't be surprised! Maybe it would have been clear to anyone more familiar with AI tools that this was a good AI task: a well-scoped issue with failing tests. On the other hand, this is a low-level issue in a fresh implementation of a complex, relatively novel algorithm. It figured out that I had merged two operations into a single function for use from Sign, and then reused it from Verify, where the high bits were already being produced, effectively taking the high bits of w1 twice in Verify. Looking at the log, it loaded the implementation into the context and then immediately figured it out, without any exploratory tool use! After that it wrote itself a cute little test that reimplemented half of verification to confirm the hypothesis, wrote a mediocre fix, and checked that the tests pass. I threw the fix away and refactored to take the high bits as input, and changed the type of the high bits, which is both clearer and saves a round-trip through Montgomery representation. Still, this 100% saved me a bunch of debugging time.

On Monday, I had also finished implementing signing, with failing tests. There were two bugs, which I fixed in the following couple of evenings. The first one was due to somehow computing a couple of hardcoded constants (1 and -1 in the Montgomery domain) wrong. It was very hard to find, requiring a lot of deep printfs and guesswork. Took me maybe an hour or two. The second one was easier: a value that ends up encoded in the signature was too short (32 bits instead of 32 bytes). It was relatively easy to tell because only the first four bytes of the signature were the same, and then the signature lengths were different. I figured these would be an interesting way to validate Claude's ability to help find bugs in low-level cryptography code, so I checked out the old version of the change with the bugs (yay Jujutsu!)
and kicked off a fresh Claude Code session with this prompt:

I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector it looks like it goes into an infinite loop, probably because it always rejects in the Fiat-Shamir with Aborts loop. You can run the tests with "bin/go test crypto/internal/fips140/mldsa" You can find the code in src/crypto/internal/fips140/mldsa Figure out why it loops forever, and get the tests to pass. ultrathink

It spent some time doing printf debugging and chasing down incorrect values, very similarly to how I did it, and then figured out and fixed the wrong constants. It definitely took Claude less time than it took me. Impressive.

It gave up after fixing that bug even though the tests still failed, so I started a fresh session (on the assumption that the context on the wrong constants would do more harm than good when investigating an independent bug), and gave it this prompt:

I am implementing ML-DSA in the Go standard library, and I just finished implementing signing, but running the tests against a known good test vector they don't match. You can run the tests with "bin/go test crypto/internal/fips140/mldsa" You can find the code in src/crypto/internal/fips140/mldsa Figure out what is going on. ultrathink

It took a couple of wrong paths, thought for quite a bit longer, and then found this one too. I honestly expected it to fail initially. It's interesting how Claude found the "easier" bug more difficult. My guess is that maybe the large random-looking outputs of the failing tests did not play well with its attention. The fix it proposed was updating only the allocation's length and not its capacity, but whatever, the point is finding the bug, and I'll usually want to throw away the fix and rewrite it myself anyway. Three out of three one-shot debugging hits with no help is extremely impressive.
Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to then reason about it and fix it. As ever, I wish we had better tooling for using LLMs that didn’t look like chat or autocomplete or “make me a PR.” For example, how nice would it be if every time tests fail, an LLM agent was kicked off with the task of figuring out why, and only notified us if it succeeded before we fixed it ourselves?

For more low-level cryptography bugs and implementations, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected]. I promise I almost never post about AI. Enjoy the silliest floof. Surely this will help redeem me in the eyes of folks who consider AI less of a tool and more of something to be hated or loved.

My work is made possible by Geomys, an organization of professional Go maintainers, which is funded by Smallstep, Ava Labs, Teleport, Tailscale, and Sentry. Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement.) Here are a few words from some of them!

Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology.
We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

Filippo Valsorda 1 month ago

The Geomys Standard of Care

One of the most impactful effects of professionalizing open source maintenance is that as professionals we can invest in upholding a set of standards that make our projects safer and more reliable. The same commitments and overhead that are often objected to when required of volunteers should be table stakes for professional maintainers.

I didn’t find a lot of prior art, so to compile the Geomys Standard of Care I started by surveying recent supply chain compromises to look for mitigable root causes. (By the way, you might have missed that email because it includes the name of a domain used for a phishing campaign, so it got flagged as phishing. Oops.) I also asked for feedback from experts in various areas, such as CI security, and from other Geomys maintainers.

The first draft is below, and we’ll maintain the latest version at geomys.org/standard-of-care. It covers general maintenance philosophy, ongoing stability and reliability, dependency management, account and CI security, vulnerability handling, licensing, and more. In the future, we want to look into adopting more binary transparency tools, and into doing periodic reviews of browser extensions and of authorized Gerrit and GitHub OAuth apps and tokens (just GitHub has four places 1 to look in!). We also welcome feedback on things that would be valuable to add, for security or for reliability.

We aim to maintain our projects sustainably and predictably. We are only able to do this thanks to our retainer contracts with our clients, but these commitments are offered to the whole community, not just to paying clients.

Scope. We apply this standard to projects maintained or co-maintained by Geomys. For projects where we are not the sole maintainers, we prioritize working well with the rest of the team. Geomys maintainers may also have personal projects that are not held to this standard (e.g. everything in mostly-harmless).

Code review.
If the project accepts external contributions, we review all the code provided to us. This extends to any code generated with LLMs as well.

Complexity. A major part of the role of a maintainer is saying no. We consciously limit complexity, and keep the goals and non-goals of a project in mind when considering features. (See for example the Go Cryptography Principles.)

Static analysis. We run staticcheck, by our very own @dominikh, in CI.

Stability. Once a Go package reaches v1, we maintain strict backwards compatibility within a major version, similarly to the standard library’s compatibility promise.

Ongoing maintenance. Not all projects are actively worked on at all times (e.g. some projects may be effectively finished, or we may work in batches). However, unless a project is explicitly archived or deprecated, we will address newly arising issues that make the project unsuitable for a previously working use case (e.g. compatibility with a new OS).

Dependency management. We don’t use automatic dependency version bump tools, like Dependabot. For our purposes, they only cause churn and increase the risk of supply chain attacks by adopting new module versions before the ecosystem has had time to detect attacks. (Dependabot specifically also has worrying impersonation risks, which would make for trivial social engineering attacks.) Instead, we run govulncheck on a schedule, to get high signal-to-noise ratio notifications of vulnerable dependencies that actually affect our projects; and run isolated CI jobs with the latest versions of our dependencies (i.e. running before ) to ensure we’re alerted early of breakages, so we can easily update to future security releases and so we’re aware of potential compatibility issues for our dependents.

Phishing-resistant authentication. Phishing is by far the greatest threat to our security and, transitively, to that of our users.
We acknowledge there is no amount of human carefulness that can systematically withstand targeted attacks, so we use technically phishing-resistant authentication for all services that allow impacting our projects’ users. Phishing-resistant authentication means passkeys or WebAuthn 2FA, with credentials stored in platform authenticators (e.g. iCloud Keychain), password managers (e.g. 1Password or Chrome), or hardware tokens (e.g. YubiKeys). Critical accounts that allow escalating to user impact include:

If a strict mode such as Google’s Advanced Protection Program or Apple’s Advanced Data Protection is available, we enable it. If a phishable fallback authentication or account recovery method is instead required, we configure one that is secret-based (e.g. TOTP or recovery codes) and either delete the secret or commit to never using it without asking a fellow Geomys maintainer to review the circumstances that necessitated it. TOTP can’t hurt us if we don’t use it. We never enable SMS as an authentication mechanism or as an account recovery mechanism, because SIM jacking is possible even without action on our part.

Long-lived credentials. Where possible, we avoid long-lived persistent credentials, or make them non-extractable. For example, we use git-credential-oauth instead of Gerrit cookies, and hardware-bound SSH keys with yubikey-agent or Secretive instead of personal access tokens for git pushes to GitHub. Unlike phishing-resistant authentication, we found it impractical to roll out short-lived credentials universally. Notably, we have not found a way to use the GitHub CLI without extractable long-lived credentials.

CI security. We run zizmor on our GitHub Actions workflows, and we don’t use dangerous GitHub Actions triggers that run privileged workflows with attacker-controlled contexts, such as . We run GitHub Actions workflows with read-only permissions and no secrets by default.
Workflows that have write permissions or access to secrets disable all use of caches (including indirectly through actions like ), to mitigate cache poisoning attacks. (Note that, incredibly, read-only workflows can write arbitrary cache entries, which is why this must be mitigated at cache use time.)

Third-party access. For projects maintained solely by Geomys, we avoid providing user-impacting (i.e. push or release) access to external people, and publicly disclose any exceptions. If abandoning a project, we prefer archiving it and letting a fork spawn to handing over control to external people. This way, dependents can make their own assessment of whether to trust the new maintainers. Any exceptions will be widely communicated well in advance. Under no circumstances will we release to public registration a domain, GitHub user/org, or package name that was previously assigned to a Geomys project.

Availability monitoring. We have automated uptime monitoring for critical user-facing endpoints, such as the Go import path meta pages. This also provides monitoring for critical domain expiration, preventing accidental takeovers.

Transparency logging. We subscribe to new version notifications via GopherWatch, to be alerted of unauthorized module versions published to the Go Checksum Database. We monitor Certificate Transparency logs for critical domains (e.g. the roots of our Go import paths) using tools such as Cert Spotter or Silent CT. We also set CAA records on those domains limiting issuance to the minimal set of CAs required for operation.

Vulnerability handling. We document the official vulnerability reporting mechanism of each project, we encourage coordinated vulnerability reporting, and we appreciate the work of security researchers. We honor embargoes of up to 90 days, and we do not share vulnerability details with people not involved in fixing it until they are public. (Paying clients do not get access to private vulnerability details.
This is to honor our responsibility to the various stakeholders of an open source project, and to acknowledge that often these details are not ours to share.) Once a vulnerability is made public, we ensure it is included in the Go vulnerability database with accurate credit and metadata, including a CVE number. If the documented vulnerability reporting mechanism is unresponsive, an escalation path is available by emailing security at geomys.org.

Licenses. We use permissive, well-known licenses: BSD-3-Clause, BSD-2-Clause, BSD-1-Clause, 0BSD, ISC, MIT, or (less preferably) Apache-2.0.

Disclaimer. This is not a legally binding agreement. Your use of the projects continues to be controlled by their respective licenses, and/or by your contract with Geomys, which does not include this document unless explicitly specified.

I am getting a cat (if I successfully defeat my allergies through a combination of LiveClear, SLIT, antihistamines, and HEPA filters), so obviously you are going to get a lot of cat pictures going forward. For more, you can follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected].

Projects in scope:
- the and packages in the Go standard library and the FIPS 140-3 Go Cryptographic Module (co-maintained with the rest of the Go team)
- Staticcheck
- filippo.io/edwards25519
- filippo.io/csrf
- filippo.io/keygen
- filippo.io/intermediates (externalized from the standard library)
- age and typage
- Sunlight and filippo.io/torchwood
- yubikey-agent

Critical accounts:
- All Google accounts linked to a Gerrit account
- Password manager
- Passkey sync (e.g. Apple iCloud)
- Website host
- Domain registrar
- Package registry (if applicable, although Go’s decentralized package management largely removes this attack surface)

1. https://github.com/settings/tokens and https://github.com/settings/personal-access-tokens and https://github.com/settings/apps/authorizations and https://github.com/settings/applications

Filippo Valsorda 1 month ago

A Retrospective Survey of 2024/2025 Open Source Supply Chain Compromises

Lack of memory safety is such a predominant cause of security issues that we have a responsibility as professional software engineers to robustly mitigate it in security-sensitive use cases—by using memory safe languages. Similarly, I have the growing impression that software supply chain compromises have a few predominant causes which we might have a responsibility as professional open source maintainers to robustly mitigate.

To test this impression and figure out any such mitigations, I collected all 2024/2025 open source supply chain compromises I could find, and categorized their root causes. (If you find more, do email me!) Since I am interested in mitigations we can apply as maintainers of depended-upon projects to avoid compromises, I am ignoring: intentionally malicious packages (e.g. typosquatting), issues in package managers (e.g. internal name shadowing), open source infrastructure abuse (e.g. using package registries for post-compromise exfiltration), and isolated app compromises (i.e. not software that is depended upon). Also, I am specifically interested in how an attacker got their first unauthorized access, not in what they did with it. Annoyingly, there is usually a lot more written about the latter than the former.

In no particular order, but kind of grouped.

XZ Utils. Long-term pressure campaign on the maintainer to hand over access. Root cause: control handoff. Contributing factor: non-reproducible release artifacts.

Nx S1ingularity. Shell injection in GitHub Action with trigger and unnecessary read/write permissions 1, used to extract an npm token. Root cause: pull_request_target. Contributing factors: read/write CI permissions, long-lived credential exfiltration, post-install scripts.

Shai-Hulud. Worm behavior by using compromised npm tokens to publish packages with malicious post-install scripts, and compromised GitHub tokens to publish malicious GitHub Actions workflows. Root cause: long-lived credential exfiltration.
Contributing factor: post-install scripts.

npm debug/chalk/color. Maintainer phished with an "Update 2FA Now" email. Had TOTP 2FA enabled. Root cause: phishing.

polyfill.io. Attacker purchased CDN domain name and GitHub organization. Root cause: control handoff.

MavenGate. Expired domains and changed GitHub usernames resurrected to take control of connected packages. Root causes: domain resurrection, username resurrection.

reviewdog and tj-actions/changed-files. Contributors deliberately granted automatic write access for GitHub Action repository 2. Malicious tag re-published to compromise GitHub PAT of more popular GitHub Action 3. Root cause: control handoff. Contributing factors: read/write CI permissions, long-lived credential exfiltration, mutable GitHub Actions tags.

Ultralytics. Shell injection in GitHub Action with trigger (which required read/write permissions), pivoted to publishing pipeline via GitHub Actions cache poisoning. Compromised again later using an exfiltrated PyPI token. Root cause: pull_request_target. Contributing factors: GitHub Actions cache poisoning, long-lived credential exfiltration.

Kong Ingress Controller. GitHub Action with trigger restricted to trusted users but bypassed via Dependabot impersonation 4, previously patched but still available on an old branch. GitHub PAT exfiltrated and used. Root causes: pull_request_target, Dependabot impersonation. Contributing factors: per-branch CI configuration, long-lived credential exfiltration.

Rspack. Pwn request 5 against workflow 6 in another project, leading to a GitHub classic token of a maintainer with permissions to the web-infra-dev organization 7 (kindly confirmed via email by the Rspack Team). Similar to a previously reported and fixed vulnerability 8 in the Rspack repository. Root cause: issue_comment. Contributing factor: long-lived credential exfiltration.

eslint-config-prettier. "Verify your account" 9 npm phishing. Root cause: phishing.
num2words. "Email verification" PyPI phishing. Root cause: phishing.

@solana/web3.js. A "phishing attack on the credentials for publishing npm packages." Root cause: phishing.

rustfoundation.dev. Fake compromise remediation 10 Crates.io phishing. Unclear if successful. Root cause: phishing.

React Native ARIA & gluestack-ui. "[U]nauthorized access to publishing credentials." Colorful and long Incident Report lacks any details on the "sophisticated" entry point. Presumably an exposed npm token. Root cause: long-lived credential exfiltration(?).

lottie-player. Unclear, but mitigation involved "remov[ing] all access and associated tokens/services accounts of the impacted developer." Root cause: long-lived credential exfiltration(?) or control handoff(?).

rand-user-agent. Unclear. Malicious npm versions published; the affected company seems to have deleted the project. Presumably npm token compromise. Root cause: long-lived credential exfiltration(?).

DogWifTool. GitHub token extracted from distributed binary. Root cause: long-lived credential exfiltration.

Surprising no one, the most popular confirmed initial compromise vector is phishing. It works against technical open source maintainers. It works against 2FA TOTP. It. Works. It is also very fixable. It’s 2025 and every professional open source maintainer should be using phishing-resistant authentication (passkeys or WebAuthn 2FA) on all developer accounts, and accounts upstream of them. Upstream accounts include email, password manager, passkey sync (e.g. Apple iCloud), web/DNS hosting, and domain registrar.

Some services, such as GitHub, require a phishable 2FA method along with phishing-resistant ones. In that case, the best option is to enable TOTP, and delete the secret or write it down somewhere safe and never ever use it—effectively disabling it. This does not work with SMS, since SIM jacking is possible even without action by the victim.
Actually surprisingly—to me—a number of compromises are due to, effectively, giving access to the attacker. This is a nuanced people issue. The solution is obviously “don’t do that” but that really reduces to the decades-old issue of open source maintenance sustainability. In a sense, since this analysis is aimed at professional maintainers who can afford it, control handoff is easily avoided by not doing it.

Kind of incredible that a specific feature has a top 3 spot, but projects get compromised by “pwn requests” all the time. The workflow trigger runs privileged CI with a context full of attacker-controlled data in response to pull requests. It makes a meek attempt to be safer by not checking out the attacker’s code, instead checking out the upstream target. That’s empirically not enough, with shell injection attacks causing multiple severe compromises. The zizmor static analyzer can help detect injection vulnerabilities, but it seems clear that is unsafe at any speed, and should just never be used. Other triggers that run privileged with attacker-controlled context should be avoided for the same reason. The Rspack compromise, for example, was due to checking out attacker-controlled code on an trigger if the PR receives a comment.

What are the alternatives? One option is to implement an external service in a language that can safely deal with untrusted inputs (i.e. not YAML’d shell), and use webhooks. That unfortunately requires long-lived credentials (see below). GitHub itself recommends using the unprivileged trigger followed by the trigger, but it’s unclear to me how much safer that would actually be against injection attacks. Finally, since two out of three compromises were due to shell injection, it might be safer to use a proper programming language, like JavaScript with actions/github-script, or any other language accessing the context via environment variables instead of YAML interpolation. This means not using any third-party actions, as well.
Allowlisting actors and read-only steps are not robust mitigations; see Read/write CI permissions and Dependabot impersonation below. Overall, none of the mitigations are particularly satisfactory, so the solution might be simply to eschew features that require and other privileged attacker-controlled triggers. (To be honest, I am not a fan of chatty bots on issues and PRs, so I never needed them.)

Attackers love to steal tokens. There is no universal solution, but it’s so predominant that we can consider piecemeal solutions. Long-lived credentials are only a root cause when they are accidentally exposed. Otherwise, they are a secondary compromise mechanism for lateral movement or persistence, after the attacker got privileged code execution. Mitigating the latter is somewhat less appealing because an attacker with code execution can find more creative ways to carry out an attack, but we can prune some low-hanging fruit.

Go removes the need for package registry tokens by simply not having accounts. (Instead, the go command fetches modules directly from VCS, with caching by the Go Modules Proxy and universality and immutability guaranteed by the Go Checksum Database.) In other ecosystems, Trusted Publishing replaces long-lived private tokens with short-lived OIDC tokens, although there is no way to down-scope the capabilities of an OIDC token.

GitHub Personal Access Tokens are harder to avoid for anything that’s not supported by GitHub Actions permissions. Chainguard has a third-party Security Token Service that trades OIDC tokens for short-lived tokens, and their article has a good list of cases in which PATs end up otherwise necessary. Given the risk, it might be worth giving up on non-critical features that would require powerful tokens.

Gerrit “git cookies” (which are actually just OAuth refresh tokens for the Gerrit app) can be replaced with… well, OAuth refresh tokens, but kept in memory instead of on disk, using git-credential-oauth.
They can also be stored a little more safely in the platform keychain by treating them as an HTTP password, although that’s not well documented. In the long term, it would be great to see the equivalent of Device Bound Session Credentials for developer and automated workflows.

Turns out you can just exfiltrate a token from a GitHub Actions runner to impersonate Dependabot with arbitrary PRs ??? I guess! Fine! Just don’t allowlist Dependabot. Not sure what a deeper meta-mitigation that didn’t require knowing this factoid would have been. This is also a social engineering risk, so I guess just turn off Dependabot?

Multiple ecosystems (Go and Maven, for example) are vulnerable to name takeovers, whether expired domain names or changed GitHub user/org names. The new owner of the name gets to publish updates for that package. From the point of view of the maintainer, the mitigation is just not to change GitHub names (at least without registering the old one), and to register critical domains for a long period, with expiration alerting.

Some CI compromises happened in contexts that could or should have been read-only. It sounds like giving GitHub Actions workflows only read permissions like should be a robust mitigation for any compromise of the code they run. Unfortunately, and kind of incredibly, even a read-only workflow is handed a token that can write to the cross-workflow cache for any key. This cache is then used implicitly by a number of official actions, allowing cross-workflow escalation by GitHub Actions cache poisoning. This contradicts some of GitHub’s own recommendations, and makes the existence of a setting to make GitHub Actions read-only by default more misleading than useful. The behavior does not extend to regular triggers, which are actually read-only (otherwise anyone could poison caches with a PR). GitHub simply doesn’t seem to offer a way to opt in to it. I can see no robust mitigation in the GitHub ecosystem.
I would love to be wrong, this is maddening.

Two compromises propagated by injecting npm post-install scripts, to obtain code execution as soon as a dependency was installed. This can be disabled with , which is worth doing for defense in depth. However, it’s only useful if the dependency is not going to be executed in a privileged context, e.g. to run tests in Node.js. Go, unlike most ecosystems, considers code execution during fetch or compilation to be a security vulnerability, so it has this safety margin by default.

The XZ backdoor was hidden in a release artifact that didn’t match the repository source. It would be great if that was more detectable, in the form of reproducible artifacts. The road to a fail-closed world where systems automatically detect non-reproducing artifacts is still long, though.

How supply chain attacks usually work these days is that an attacker gets the ability to publish new versions of a package, publishes a malicious version, and waits for dependents to update (maybe with the help of Dependabot) or install the latest version ex novo. Not with GitHub Actions! The recommended and most common way to refer to a GitHub Action is by its major version, which is resolved to a git tag that is expected to change arbitrarily when new versions are published. This means that an attacker can instantly compromise every dependent workflow.

This was an unforced error already in 2019, when GitHub Actions launched while Go had already shipped an immutable package system. This has been discussed many times since, and most other ecosystems have improved somewhat. A roadmap item for immutable Actions has been silent since 2022. The new immutable releases feature doesn’t apply to non-release tags, and the GitHub docs still recommend changing tags for Actions.

As maintainers, we can opt in to pinning where it’s somehow still not the default. For GitHub Actions, that means using unreadable commit hashes, which can be somewhat ameliorated with tooling.
For npm, it means using instead of .

One compromise was due to a vulnerability that was already fixed, but had persisted on an old branch. Any time we make a security improvement (including patching a vulnerable Action) on a GitHub Actions workflow, we need to remember to cherry-pick it to all branches, including stale ones. Can’t think of a good mitigation, just yet another sharp edge of GitHub Actions you need to be aware of, I suppose.

There are a number of useful mitigations, but the ones that appear to be as clearly a professional responsibility as memory safety are phishing-resistant authentication; not handing over access to attackers; and avoiding privileged attacker-controlled GitHub Actions triggers (e.g. ).

This research was part of an effort to compile a Geomys Standard of Care that amongst other things mitigates the most common security risks to the projects we are entrusted with. We will publish and implement it soon; to keep up to date, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected].

On Saturday, between 250,000 and 1,000,000 people (depending on who you believe, 0.4–1.7% of the whole population of Italy) took part in a demonstration against the genocide unfolding in Gaza. Anyway, here's a picture of the Archbasilica of San Giovanni in Laterano at the end of the march.

1. https://github.com/nrwl/nx/security/advisories/GHSA-cxm3-wv7p-598c#:~:text=20%20AM%20EDT-,Attack%20Vector,-Vulnerable%20Workflow
2. https://github.com/reviewdog/reviewdog/issues/2079
3. https://github.com/tj-actions/changed-files/issues/2464#issuecomment-2727020537
4. https://www.synacktiv.com/publications/github-actions-exploitation-dependabot
5. https://github.com/module-federation/core/pull/3324
6. https://github.com/module-federation/core/tree/c3aff14a4b9de2588122ec24cf456dc1fdd742f0/.github/workflows
7. https://github.com/web-infra-dev/rspack/issues/8767#issuecomment-2563345582
8. https://www.praetorian.com/blog/compromising-bytedances-rspack-github-actions-vulnerabilities/
9. https://github.com/prettier/eslint-config-prettier/issues/339#issuecomment-3090304490
10. https://github.com/rust-lang/crates.io/discussions/11889#discussion-8886064

Filippo Valsorda 3 months ago

Maintainers of Last Resort

Geomys is an organization of professional open source maintainers, focused on a portfolio of critical Go projects. For example, we are two thirds of the Go standard library cryptography maintainers, we provide the FIPS 140-3 validation of the upstream Go Cryptographic Module, and we fund the maintenance of x/crypto/ssh and staticcheck amongst others. Our retainer clients engage us both to get access to our expertise, and so that the critical dependencies they rely on are professionally maintained. Beyond our portfolio, we sometimes act as maintainers of last resort when critical, security-relevant Go projects go unmaintained. Recently, we stepped into this informal role on two occasions. We can professionally serve in this role, including contracting external help, thanks to the sustainable funding of our retainer agreements. Our clients benefit from our maintenance efforts, and have a direct line to highlight projects in need. bluemonday is the most popular HTML sanitizer in the Go ecosystem, used by thousands of applications and libraries to clean up user-generated markup before including it in web pages. Needless to say, it’s a security-critical, load-bearing component. In late 2023, the sole previous maintainer announced that their new professional circumstances were not compatible with volunteer OSS work, and that they were looking for responsible ways to wind it down. Geomys offered to take over maintenance instead. Over 2024, Geomys worked with the maintainer to take over the project at its original location, avoiding the disruption of a deprecation, and guaranteeing a natural path for future security updates. Since we work on Go and open source on a daily basis, the marginal load for Geomys is tiny, but there is outsized value to the community in knowing that security reports would be handled by dedicated professionals who can prioritize them appropriately.
Beyond handling security and critical issues, we are also discussing bringing on a domain subject expert on a contract basis to improve safety in edge cases and to future-proof the library further. Again, we can do that because we are sustainably funded through our retainer agreements. This was welcomed as a great outcome by the original maintainer . The existence of a maintainer of last resort is not only beneficial to the consumers of the ecosystem, but also relieves a lot of pressure on volunteer maintainers who would otherwise sometimes carry unsustainable loads out of a sense of duty. gorilla/csrf is an extremely popular Cross-Site Request Forgery protection middleware. In December 2024, Patrick O’Doherty discovered that the library was vulnerable to schemelessly same-site cross-origin request forgeries . This means could be attacked by or, even worse, . Unless HTTP Strict-Transport-Security with is used, any network attacker can control the latter and mount the attack. This was fixed publicly in January , but a new release (v1.7.3) and an advisory ( CVE-2025-24358 ) weren’t published until April. Alerted by Patrick’s finding, we looked into the library, and found a further issue that again allowed network attackers to mount CSRF attacks if the application used the option. We reported this to the project on April 18th, 2025; however, it hasn’t been acknowledged and the project appears unmaintained. (We are publicly disclosing it as the customary 90-day deadline has lapsed, and all the upgrade paths listed below are available as of yesterday, with the release of Go 1.25.) We tried reaching out to past maintainers via email and Slack to offer to take over the project, but unfortunately never heard back. Therefore, we set out to find other solutions to fill this critical CSRF-shaped hole in the ecosystem. Again, all of this is enabled by and part of the Geomys retainer contracts.
If you work at a company with a critical dependency on the Go ecosystem, consider reaching out at [email protected]. Regardless, you might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . Since we’re talking about Geomys, here’s a throwback to… last year? Was it just last year?? Anyway, we sponsored GopherCon US and set up a booth mostly to cover it with my collection of gophers and pins. Geomys is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . The two occasions in which we acted as maintainers of last resort: we took over maintenance of the popular bluemonday HTML sanitizer when the maintainer chose to move on; and we built alternative upgrade paths for the seemingly unmaintained gorilla/csrf library, by introducing a new carefully researched implementation into the standard library and creating a drop-in package replacement , after we discovered a security vulnerability in the original. First, we researched the landscape of CSRF countermeasures, and consulted with subject experts, including some of the authors of relevant Web specifications.
We found that modern browsers provide security metadata in request headers that makes it possible to reject cross-origin requests without any tokens or keys, leading to a drastically better developer experience, better security, and fewer false positives! The results of that investigation are public for other projects that may benefit from it. Second, we proposed and introduced a new CrossOriginProtection middleware in the standard library package. It is part of Go 1.25, released yesterday , and we recommend all gorilla/csrf users consider switching to it. We trust that a standard library solution will safely serve the ecosystem going forward. For applications that are not ready to update to Go 1.25, we made a nearly-identical middleware available as a Go module, at filippo.io/csrf . Finally, we made a drop-in replacement package for the whole gorilla/csrf API that uses the new countermeasures instead: filippo.io/csrf/gorilla . We tried to minimize any side-effects of the substitution, for example by returning random values in place of the now disused tokens, but we invite you to read the package docs.

Filippo Valsorda 3 months ago

Cross-Site Request Forgery

Cross-Site Request Forgery (CSRF) is a confused deputy attack where the attacker causes the browser to send a request to a target using the ambient authority of the user’s cookies or network position. 1 For example, can serve the following HTML to a victim and the browser will send a POST request to using the victim’s cookies. Essentially all applications that use cookies for authentication need to protect against CSRF. Importantly, this is not about protecting against an attacker that can make arbitrary requests 2 (as an attacker doesn’t know the user’s cookies), but about working with browsers to identify authenticated requests initiated from untrusted sources. Unlike Cross-Origin Resource Sharing (CORS) , which is about sharing responses across origins, CSRF is about accepting state-changing requests, even if the attacker will not see the response. Defending against leaks is significantly more complex and nuanced , especially in the age of Spectre. Why do browsers allow these requests in the first place? Like anything in the Web platform, primarily for legacy reasons: that’s how it used to work and changing it breaks things. Importantly, disabling these third-party cookies breaks important Single-Sign On (SSO) flows. All CSRF solutions need to support a bypass mechanism for those rare exceptions. (There are also complex intersections with cross-site tracking and privacy concerns, which are beyond the scope of this article.) To protect against CSRF, it’s important to first define what is a cross-site or cross-origin request, and which should be allowed. , , and even (depending on the definition) are all same-site but not same-origin. It’s tempting to declare the goal as ensuring requests are simply from the same site, but different origins in the same site can actually sit at very different trust levels: for example it might be much easier to get XSS into an old marketing blog than in the admin panel. 
The starkest difference in trust though is between an HTTPS and an HTTP origin, since a network attacker can serve anything it wants on the latter. This is sometimes referred to as the MitM CSRF bypass, but really it’s just a special case of a schemelessly same-site cross-origin CSRF attack. Some parts of the Web platform apply a schemeful definition of same-site, where and are not same-site: Using HTTP Strict Transport Security (HSTS) , if possible, is a potential mitigation for HTTP→HTTPS issues. There are a number of potential countermeasures to CSRF, some of which have been available only for a few years. The “classic” countermeasure is a CSRF token , a large random value submitted in the request (e.g. as a hidden ) and compared against a value stored in a cookie ( double-submit ) or in a stateful server-side session ( synchronized tokens ). Normally, double-submit is not a same-origin countermeasure, because same-site origins can set cookies on each other by “cookie tossing”. This can be mitigated with the cookie prefix , or by binding the token to the session/user with signed metadata. The former makes it impossible for the attacker to set the cookie, the latter ensures the attacker doesn’t know a valid value to set it to. Note that signing the cookies or tokens is unnecessary and ineffectual, unless it is binding the token to a user: an attacker that’s cookie tossing can otherwise obtain a valid signed pair by logging into the website themselves and then use that for the attack. This countermeasure turns a cross-origin forgery problem into a cross-origin leak problem: if the attacker can obtain a token from a cross-origin response, it can forge a valid request. The token in the HTML body should be masked as a countermeasure against the BREACH compression attack . The primary issue with CSRF tokens is that they require developers to instrument all their forms and other POST requests. 
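As a concrete illustration of binding a double-submit token to the session, here is a minimal Go sketch using only the standard library. The function names, key handling, and session IDs are hypothetical, and BREACH masking of the token in the HTML body is omitted for brevity; the point is that the token is an HMAC of the session ID, so a cookie-tossing attacker who doesn’t know the victim’s session can’t produce a valid (cookie, token) pair.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// issueToken derives a CSRF token bound to the user's session ID. An
// attacker who can set cookies on a sibling origin still can't forge a
// matching pair without knowing the victim's session.
func issueToken(key []byte, sessionID string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(sessionID))
	return hex.EncodeToString(mac.Sum(nil))
}

// validToken checks the submitted token against the session it claims to
// belong to, in constant time.
func validToken(key []byte, sessionID, token string) bool {
	return hmac.Equal([]byte(issueToken(key, sessionID)), []byte(token))
}

func main() {
	key := make([]byte, 32)
	rand.Read(key)
	tok := issueToken(key, "session-abc")
	fmt.Println(validToken(key, "session-abc", tok))  // true
	fmt.Println(validToken(key, "session-evil", tok)) // false: bound to the session
}
```

Note that a token signed but not bound to a session would be ineffectual, as discussed above: the attacker could log in themselves and reuse their own valid pair.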
Browsers send the source of a request in the Origin header, so CSRF can be mitigated by rejecting non-safe requests from other origins. The main issue is knowing the application’s own origin. One option obviously is asking the developer to configure it, but that’s friction and might not always be easy (such as for open source projects and proxied setups). The closest readily available approximation of the application’s own origin is the Host header. This has two issues: Some older (pre-2020) browsers didn’t send the Origin header for POST requests . The value can be in a variety of cases, such as due to or following cross-origin redirects. must be treated as an indication of a cross-origin request. Some privacy extensions remove the Origin header instead of setting it to . This should be considered a security vulnerability introduced by the extension, since it removes any reliable indication of a browser cross-origin request. If authentication cookies are explicitly set with the SameSite attribute Lax or Strict, they will not be sent with non-safe cross-site requests. This is, by design, not a cross-origin protection, and it can’t be fixed with the prefix (or Secure attribute), since that’s about who can set and read cookies, not about where the requests originate. (This difference is reflected in the difference between Scheme-Bound Cookies and Schemeful Same-Site .) The risk of same-site HTTP origins is still present, too, in browsers that don’t implement Schemeful Same-Site. Note that the rollout of SameSite Lax by default has mostly failed due to widespread breakage, especially in SSO flows. Some browsers now default to Lax-allowing-unsafe , while others default(ed) to None for the first two minutes after the cookie was set. These defaults are not effective CSRF countermeasures. Although CORS is not designed to protect against CSRF, “ non-simple requests ” which for example set headers that a simple couldn’t set are preflighted by an OPTIONS request. 
An application could choose to allow only non-simple requests, but that is fairly limiting precisely because “simple requests” includes all the ones produced by . To provide a reliable cross-origin signal to websites, browsers introduced Fetch metadata . In particular, the Sec-Fetch-Site header is set to / / / 3 and is now the recommended method to mitigate CSRF . The header has been available in all major browsers since 2023 (and earlier for all but Safari). One limitation is that it is only sent to “ trustworthy origins ”, i.e. HTTPS and localhost. Note that this is not about the scheme of the initiator origin, but of the target, so it is sent for HTTP→HTTPS requests, but not for HTTPS→HTTP or HTTP→HTTP requests (except localhost→localhost). If Sec-Fetch-Site is missing, a lax fallback on Origin=Host is an option, since HTTP→HTTPS requests are not a concern. In summary, to protect against CSRF, applications (or, rather, libraries and frameworks) should reject cross-origin non-safe browser requests. The most developer-friendly way to do so is using primarily Fetch metadata, which requires no extra instrumentation or configuration. Allow all GET, HEAD, or OPTIONS requests. These are safe methods, and are assumed not to change state at various layers of the stack already. If the Origin header matches an allow-list of trusted origins, allow the request. Trusted origins should be configured as full origins (e.g. ) and compared by simple equality with the header value. If the Sec-Fetch-Site header is present: if its value is or , allow the request; otherwise, reject the request. This secures all major up-to-date browsers for sites hosted on trustworthy (HTTPS or localhost) origins. If neither the Sec-Fetch-Site nor the Origin headers are present, allow the request. These requests are not from (post-2020) browsers, and can’t be affected by CSRF. If the Origin header’s host (including the port) matches the Host header, allow the request, otherwise reject it. This is either a request to an HTTP origin, or by an out-of-date browser.
The only false positives (unnecessary blocking) of this algorithm are requests to non-trustworthy (plain HTTP) origins that go through a reverse proxy that changes the Host header. That edge case can be worked around by adding the origin to the allow-list. There are no false negatives in modern browsers, but pre-2023 browsers will be vulnerable to HTTP→HTTPS requests, because the Origin fallback is scheme-agnostic. HSTS can be used to mitigate that (in post-2020 browsers), but note that out-of-date browsers are likely to have more pressing security issues. Finally, there should be a tightly scoped bypass mechanism for e.g. SSO edge cases, with the appropriate safety placards . For example, it could be route-based, or require manual tagging of requests before the CSRF middleware. Go 1.25 introduces a CrossOriginProtection middleware in which implements this algorithm . (This research was done as background for that proposal.) Thank you to Roberto Clapis for helping with this analysis, and to Patrick O’Doherty for setting in motion and testing this work. For more, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . Back to Rome photoblogging. This was taken from the municipal rose garden, which opens for a couple weeks every spring and fall. This work is made possible by Geomys , my Go open source maintenance organization, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .)
Abuse of the ambient authority of network position, often through DNS rebinding, is being addressed by Private Network Access . The rest of this post will focus on abuse of cookie authentication.  ↩ This is why API traffic generally doesn’t need to be protected against CSRF. If it looks like it’s not from a browser, it can’t be a CSRF.  ↩ means the request was directly user-initiated, e.g. a bookmark.  ↩ Cookies in general apply the schemeless definition (HTTP = HTTPS). There is a proposal to address this, Origin-Bound-Cookies (and specifically its lack of opt-out for scheme binding, which subsumes the earlier Scheme-Bound Cookies proposal), which however hasn’t shipped yet . The SameSite cookie attribute used to apply the schemeless definition (HTTP = HTTPS). Chrome changed that with Schemeful Same-Site in 2020, but Firefox and Safari never implemented it. Sec-Fetch-Site (and the HTML and Fetch specifications in general) apply the schemeful definition (HTTP ≠ HTTPS). (The two issues with the Host header mentioned above: it may be different from the browser origin if a reverse proxy is involved; and it does not include the scheme, so there is no way to know if an Origin is a cross-origin HTTP→HTTPS request or a same-origin HTTP request.)

Filippo Valsorda 4 months ago

Go Assembly Mutation Testing

While maintaining and developing the Go cryptography standard library, we often spend significantly more time on testing than on implementation. That’s good and an important part of how we achieve our excellent security track record . Ideally, this would be especially true for the least safe parts of the library. However, testing assembly cores presents unique challenges, due to their constant-time nature. This has been a long-standing issue. For Go 1.26, I am working on introducing a mutation testing framework for assembly, which will effectively act as enhanced code coverage. This will not improve tests by itself, but it will let us see what assembly code and data paths are not covered by our test suite, so we can improve it. Cryptographic assembly is sort of my “origin story” as a Go maintainer. Back in 2017, a colleague at Cloudflare found a certificate that failed to validate with Go’s crypto/x509. The bug was a mishandled carry in the amd64 assembly implementation of P-256 modular subtraction . It had escaped all testing because that carry flag had a 1 in 2³² chance of being set when operating on random inputs. Adam Langley commented that exploiting it was unlikely and “would be a cool paper” . Then Sean Devlin and I hid in a Starbucks in Paris for a whole day while the yellow jackets set fire to cop cars outside, and figured out how to turn it into a Hollywood-looking key recovery attack . That was fun, but it’s a different story . Fast forward one year, and it was now my job to stop this from happening again. Finding a robust countermeasure to this bug class has been my white whale ever since. 
“Filippo, normal, well-adjusted people don’t have white whales.” “Well, we have learned nothing new, have we?” The Assembly Policy has (hopefully) helped reduce the risk of introducing new manually-written assembly bugs, if anything because it made it harder to introduce new manually written assembly, but a fundamental problem is that we don’t know how well our assembly is tested , because code coverage doesn’t work for cryptographic assembly. Most cryptographic code has to operate in constant time, meaning it executes the same instructions regardless of the inputs, to avoid leaking secrets through timing side-channels. To achieve that, we often compute both “branches” of an operation (e.g. both and , for ), and then discard one of the results with constant-time select instructions. The problem is that if you run code coverage, you’ll see all “branches” light up, even if all tests actually discard the result of one of them. We could have other untested paths like #20040 and not know about it. At some point in 2019 I tried instrumenting binaries at runtime with DynamoRIO to capture the flags before each flag-consuming instruction, to feed a more comprehensive coverage report. It almost worked. “Almost” being dispositive. Enter mutation testing. Mutation tests modify the program, for example by turning a into a , and check that tests fail for each “mutation.” If they don’t, that line is not—effectively—tested. This is actually more accurate than regular test coverage because it doesn’t just check that code is executed, but also that the result influences the success of the tests, such that producing a different result would cause the tests to fail. It’s also a great match for constant time assembly! For example, if we turn an add-with-carry into a regular add, and tests still pass, we are not actually testing the case in which the carry is set. The next question is how to programmatically mutate assembly. 
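To make the coverage blind spot concrete, here is a sketch in Go of the pattern: a hypothetical constant-time conditional subtraction, as used for modular reduction. Both “branches” run on every call, so line coverage reports 100% even if the tests never exercise the v < m case.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// condSub returns v-m if v >= m, and v unchanged otherwise, without
// branching on the (potentially secret) values. The subtraction is
// always computed, so code coverage lights up both "branches" even if
// tests only ever hit one of them.
func condSub(v, m uint32) uint32 {
	diff := uint64(v) - uint64(m) // always computed; wraps if v < m
	borrow := int(diff >> 63)     // 1 if v < m, 0 otherwise
	// Select the original or reduced value without a data-dependent branch.
	return uint32(subtle.ConstantTimeSelect(borrow, int(v), int(uint32(diff))))
}

func main() {
	fmt.Println(condSub(5, 3), condSub(3, 5)) // 2 3
}
```

A mutation test catches what coverage can’t: force the select to always pick one input, and if the tests still pass, the other case was never actually tested.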
I was going to do it at the source level, but Russ Cox suggested modifying the assembler instead, to avoid having to deal with macros and parsing. cmd/asm assigns a virtual program counter to instructions right after parsing, before encoding them. CL 665375 1 adds a flag to print the listing at that point to standard error, and a flag that allows replacing any instruction by its program counter with one or more other instructions. Implementing it was fairly easy, reusing the parser and patching the linked list of instructions. These assembler flags can be enabled for a specific package during with . Thankfully cmd/go already knows to fold the argument into the cache key of assembler artifacts, and it even caches the stderr output, so output is available even when using a cached result. Driving these tests is relatively straightforward . First, we run to obtain the listing of potential targets. Then, for each mutation of each target instruction, we run , and make sure it fails. To speed things up, we run first with and then without only if short tests pass. Also, we first run with to ensure our mutation compiles. Finally, we need to decide which target instructions we mutate and how. Mutations turn an instruction that behaves differently based on a flag into an equivalent instruction that behaves as if the flag was always or never set. They must not change anything else, to avoid accidentally breaking the test run and causing a mutation testing false negative. In particular, we can’t use any register and we need to leave the final flags untouched. Let’s look at a few arm64 examples. ADCS adds two registers and the carry, and sets output flags. Mutating it into an instruction that ignores the carry flag is easy, we just turn it into an ADDS . To mutate in the other direction, we prepend an instruction that sets the C flag. We don’t care about smashing the other flags, because ADCS will reset them anyway.
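The driver loop described earlier can be sketched as a command-line builder. The -M flag name and its pc:instruction payload below are placeholders, since the prototype assembler interface is still in flux; the package path is just an example.

```go
package main

import (
	"fmt"
	"strings"
)

// mutationArgs builds the `go test` invocation for a single mutation of
// the given package: replace the instruction at virtual program counter
// pc with repl, then the driver checks that the tests fail. The -M flag
// name is a placeholder for the prototype assembler flag.
func mutationArgs(pkg string, pc int, repl string) []string {
	return []string{
		"go", "test", "-short", // run short tests first, full tests only if they pass
		fmt.Sprintf("-asmflags=%s=-M=%d:%s", pkg, pc, repl),
		pkg,
	}
}

func main() {
	// Mutate the instruction at virtual PC 42 into a plain ADDS,
	// dropping the carry; surviving tests mean the carry is untested.
	args := mutationArgs("crypto/internal/nistec", 42, "ADDS R1, R2, R3")
	fmt.Println(strings.Join(args, " "))
}
```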
SBCS is the equivalent subtraction instruction, and we mutate it the same way, except that SUBS behaves as if the carry (aka “no borrow”) flag was always set, so we need to unset it in the mirror mutation. ADC and SBC are the corresponding instructions that don’t set output flags. This makes things a little different because we can’t smash the flags with a prepended instruction, but on the other hand we don’t need to worry about setting them accurately. Instead of setting the carry bit beforehand, we add or subtract one from the destination afterwards. There is one more wrinkle: if one of the operands is the zero register ZR, then the equivalent ADD or SUB can’t be encoded, because if you’re not setting flags it makes no sense to add or subtract to/from zero instead of storing. In those cases we mutate into an appropriate MOVD. CSEL is a constant-time select that stores one value or another based on a flag, usually the equality or carry flag. Mutating it into MOVDs is trivial. I initially ran this on the arm64 P-256 assembly, for white whale and hardware availability reasons, and it found a number of untested instructions , including… in , goddammit. Writing tests to cover them is tedious and sometimes really hard, because 2^-32 edge cases like a P-256 field overflow deep in a function are hard to hit explicitly. That’s another indication that this assembly core should be broken up into smaller, more easily tested operations . To stay up to date on my fight with cryptographic assembly, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . It’s that time of the year again: I just attended the Italian national math team’s training retreat, where my main contributions are playing in the abstract board game tournament, moderating the night-time Werewolf games, bringing snacks, and looking at the pretty lake. 
This year I also gave an introduction to modern cryptography in Italian, walking the audience through composing primitives to build an IND-CCA2 cipher. My Go work is made possible by Geomys , my Go open source maintenance organization, which is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) This is a prototype, I expect the actual cmd/asm interface will change based on the compiler team’s feedback.  ↩

Filippo Valsorda 4 months ago

Encrypting Files with Passkeys and age

Typage ( on npm) is a TypeScript 1 implementation of the age file encryption format . It runs with Node.js, Deno, Bun, and browsers, and implements native age recipients, passphrase encryption, ASCII armoring, and supports custom recipient interfaces, like the Go implementation . However, running in the browser affords us some special capabilities, such as access to the WebAuthn API. Since version 0.2.3 , Typage supports symmetric encryption with passkeys and other WebAuthn credentials, and a companion age CLI plugin allows reusing credentials on hardware FIDO2 security keys outside the browser. Let’s have a look at how encrypting files with passkeys works, and how it’s implemented in Typage. Passkeys are synced, discoverable WebAuthn credentials. They’re a phishing-resistant standard-based authentication mechanism. Credentials can be stored in platform authenticators (such as end-to-end encrypted iCloud Keychain), in password managers (such as 1Password), or on hardware FIDO2 tokens (such as YubiKeys, although these are not synced). I am a strong believer in passkeys, especially when paired with email magic links , as a strict improvement over passwords for average users and websites. If you want to learn more about passkeys and WebAuthn I can’t recommend Adam Langley’s A Tour of WebAuthn enough. The primary functionality of a WebAuthn credential is to cryptographically sign an origin-bound challenge. That’s not very useful for encryption. However, credentials with the extension can also compute a Pseudo-Random Function while producing an “assertion” (i.e. while logging in). You can think of a PRF as a keyed hash (and indeed for security keys it’s backed by the FIDO2 extension): a given input always maps to the same output, without the secret there’s no way to compute the mapping, and there’s no way to extract the secret. Specifically, the WebAuthn PRF takes one or two inputs and returns a 32-byte output for each of them. 
That lets “relying parties” implement symmetric encryption by treating the PRF output as a key that’s only available when the credential is available. Using the PRF extension requires User Verification (i.e. PIN or biometrics). You can read more about the extension in Adam’s book . Note that there’s no secure way to do asymmetric encryption: we could use the PRF extension to encrypt a private key, but then an attacker that observes that private key once can decrypt anything encrypted to its public key in the future, without needing access to the credential. Support for the PRF extension landed in Chrome 132, macOS 15, iOS 18, and 1Password versions from July 2024 . To encrypt an age file to a new type of recipient, we need to define how the random file key is encrypted and encoded into a header stanza . Here’s a stanza that wraps the file key with an ephemeral FIDO2 PRF output. The first argument is a fixed string to recognize the stanza type. The second argument is a 128-bit nonce 2 that’s used as the PRF input. The stanza body is the ChaCha20Poly1305 encryption of the file key using a wrapping key derived from the PRF output. Each credential assertion (which requires a single User Presence check, e.g. a YubiKey touch) can compute two PRFs. This is meant for key rotation , but in our use case it’s actually a minor security issue: an attacker who compromised your system but not your credential could surreptitiously decrypt an “extra” file every time you intentionally decrypt or encrypt one. We mitigate this by using two PRF outputs to derive the wrapping key. The WebAuthn PRF inputs are composed of a domain separation prefix, a counter, and the nonce. The two 32-byte PRF outputs are concatenated and passed to HKDF-Extract-SHA-256 with as salt to derive the ChaCha20Poly1305 wrapping key. That key is used with a zero nonce (since it’s used only once) to encrypt the file key. 
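The derivation can be sketched in Go using only the standard library, since HKDF-Extract-SHA-256 is just HMAC-SHA-256 with the salt as the MAC key. The salt label below is a placeholder, not the value the typage format actually uses, and the ChaCha20Poly1305 step is only noted in a comment.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// hkdfExtract is HKDF-Extract-SHA-256: HMAC-SHA-256(salt, ikm).
func hkdfExtract(salt, ikm []byte) []byte {
	mac := hmac.New(sha256.New, salt)
	mac.Write(ikm)
	return mac.Sum(nil)
}

// wrappingKey derives the 32-byte wrapping key from the two 32-byte
// WebAuthn PRF outputs. The resulting key is then used with
// ChaCha20Poly1305 and a zero nonce (it is only used once) to encrypt
// the file key. "example-salt" is a placeholder label.
func wrappingKey(prf1, prf2 [32]byte) []byte {
	return hkdfExtract([]byte("example-salt"), append(prf1[:], prf2[:]...))
}

func main() {
	var a, b [32]byte
	a[0], b[0] = 1, 2
	key := wrappingKey(a, b)
	fmt.Println(len(key)) // 32
}
```

Using both PRF outputs in the derivation is what prevents the “extra decryption” issue described above: an attacker can’t build the wrapping key from a single surreptitious PRF evaluation.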
This age recipient format has two important properties: Now that we have a format, we need an implementation. Enter Typage 0.2.3. The WebAuthn API is pretty complex, at least in part because it started as a way to expose U2F security keys before passkeys were a thing, and grew organically over the years. However, Typage’s passkey support amounts to less than 300 lines , including a simple implementation of CTAP2’s CBOR subset . Before any encryption or decryption operation, a new passkey must be created with a call to . calls with a random to avoid overwriting existing keys, set to to ask the authenticator to store a passkey, and of course . Passkeys not generated by can also be used if they have the extension enabled. To encrypt or decrypt a file, you instantiate an or , which implement the new and interfaces. The recipient and identity implementations call with the PRF inputs to obtain the wrapping key and then parse or serialize the format we described above. Aside from the key name, the only option you might want to set is the relying party ID . This defaults to the origin of the web page (e.g. ) but can also be a parent (e.g. ). Credentials are available to subdomains of the RP ID, but not to parents. Since passkeys are usually synced, it means you can e.g. encrypt a file on macOS and then pick up your iPhone and decrypt it there, which is pretty cool. Also, you can use passkeys stored on your phone with a desktop browser thanks to the hybrid BLE protocol . It should even be possible to use the AirDrop passkey sharing mechanism to let other people decrypt files! You can store passkeys (discoverable or “resident” credentials) on recent enough FIDO2 hardware tokens (e.g. YubiKey 5). However, storage is limited and support still not universal. The alternative is for the hardware token to return all the credential’s state encrypted in the credential ID, which the client will need to give back to the token when using the credential. 
This is limiting for web logins because you need to know who the user is (to look up the credential ID in the database) before you invoke the WebAuthn API. It can also be desirable for encryption, though: decrypting files this way requires both the hardware token and the credential ID, which can serve as an additional secret key, or a second factor if you’re into factors . Rather than exposing all the layered WebAuthn nuances through the typage API, or precluding one flow, I decided to offer two profiles: by default, we’ll generate and expect discoverable passkeys, but if the option is passed, we’ll request the credential is not stored on the authenticator and ask the browser to show UI for hardware tokens. returns an age identity string that encodes the credential ID, relying party ID, and transports as CTAP2 CBOR, 4 in the format . This identity string is required for the security key flow, but can also be used as an optional hint when encrypting or decrypting using passkeys. More specifically, the data encoded in the age identity string is a CBOR Sequence of the fields listed below.

One more thing… since FIDO2 hardware tokens are easily accessible outside the browser, too, we were able to build an age CLI plugin that interoperates with typage security key identity strings: age-plugin-fido2prf . Since FIDO2 PRF only supports symmetric encryption, the identity string is used both for decryption and for encryption (with ). This was an opportunity to dogfood the age Go plugin framework , which easily turns an implementation of the Go interface into a CLI plugin usable from age or rage , abstracting away all the details of the plugin protocol . The scaffolding turning the importable Identity implementation into a plugin is just 50 lines . For more details, refer to the typage README and JSDoc annotations. To stay up to date on the development of age and its ecosystem, follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] .
On the last day of this year’s amazing CENTOPASSI motorcycle rallye, we watched the sun set over the plain below Castelluccio , and then rushed to find a place to sleep before the “engines out” time. Found an amazing residence where three cats kept us company while planning the next day. Geomys , my Go open source maintenance organization, is funded by Smallstep , Ava Labs , Teleport , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews. Ava Labs — We at Ava Labs , maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network ), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team. It started as a way for me to experiment with the JavaScript ecosystem, and the amount of time I spent setting up things that we can take for granted in Go such as testing, benchmarks, formatting, linting, and API documentation is… incredible. It took even longer because I insisted on understanding what tools were doing and using defaults rather than copying dozens of config files. The language is nice, but the tooling for library authors is maddening. 
I also have opinions on the Web Crypto APIs now. But all this is for another post.  ↩

128 bits would usually be a little tight for avoiding random collisions , but in this case we care only about never using the same PRF input with the same credential and, well, I doubt you’re getting any credential to compute more than 2⁴⁸ PRFs.  ↩

This is actually a tradeoff: it means we can’t tell the user a decryption is not going to work before asking them the PIN of the credential. I considered adding a tag like the one being considered for stanzas or like the one. The problem is that the WebAuthn API only lets us specify acceptable credential IDs upfront, there is no “is this credential ID acceptable” callback, so we’d have to put the whole credential ID in the stanza. This is undesirable both for privacy reasons, and because the credential ID (encoded in the identity string) can otherwise function as a “second factor” with security keys.  ↩

Selected mostly for ecosystem consistency and because it’s a couple hundred lines to handroll.  ↩

The two important properties of the recipient format, spelled out:

Per-file hardware binding: each file has its own PRF input(s), so you strictly need both the encrypted file and access to the credential to decrypt a file. You can’t precompute some intermediate value and use it later to decrypt arbitrary files.

Unlinkability: there is no way to tell that two files are encrypted to the same credential, or to link a file to a credential ID without being able to decrypt the file. 3

The fields of the identity string’s CBOR Sequence:

the version, always
the credential ID as a byte string
the RP ID as a text string
the transports as an array of text strings

Filippo Valsorda 4 months ago

You Should Run a Certificate Transparency Log

Hear me out. If you are an organization with some spare storage and bandwidth, or an engineer looking to justify an overprovisioned homelab , you should consider running a Certificate Transparency log. It’s cheaper, easier, and more important than you might think. Certificate Transparency (CT) is one of the technologies that underpin the security of the whole web. It keeps Certificate Authorities honest, and allows website owners to be notified of unauthorized certificate issuance. It’s a big part of how the WebPKI went from the punchline of “weakest link” jokes to the robust foundation of the security of most of digital life… in less than fifteen years! CT is an intrinsically distributed system: CAs must submit each certificate to two CT logs operated by third parties and trusted by the browsers. This list is, and has been for a couple years, uncomfortably short. There just aren’t as many independent log operators as we’d like. Operating a log right now would be an immense contribution to the security of virtually every Internet user. It also comes with the bragging rights to claim that your public key is on billions of devices. Where’s the catch? Well, until recently running a log was a pain, and expensive. I am writing this because as of a few months ago, this has changed ! Browsers now accept CT logs that implement the new Static CT API , which I designed and productionized in collaboration with Let’s Encrypt and the rest of the WebPKI community over the past year and a half. The key difference is that it makes it possible to serve the read path of a CT log exclusively through static, S3 and CDN friendly files. Moreover, the new Sunlight implementation, sponsored by Let’s Encrypt, implements the write path with minimal dependencies and requirements. It can upload the Static CT assets directly to object storage, or store them on any POSIX filesystem. 
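The "read path is just static files" point is easy to picture in code. A minimal Go sketch, assuming a hypothetical directory layout and `CT_LOG_DIR` knob (a real deployment would use Sunlight's Skylight read path, which adds caching, health checks, and metrics):

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// newReadPathHandler serves the read path of a Static CT log. Because the
// log's public API is flat static files (tiles, checkpoints, issuers),
// a stock file server is enough; Skylight exists for the production
// niceties this sketch omits.
func newReadPathHandler(dir string) http.Handler {
	return http.FileServer(http.Dir(dir))
}

func main() {
	// CT_LOG_DIR is a hypothetical knob for this sketch, not a Sunlight
	// option.
	dir := os.Getenv("CT_LOG_DIR")
	if dir == "" {
		log.Println("CT_LOG_DIR not set; nothing to serve")
		return
	}
	log.Fatal(http.ListenAndServe(":8080", newReadPathHandler(dir)))
}
```

The same files could equally be exposed as a public object storage bucket, which is why the read path offloads so well to a CDN.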
If you are curious, you can learn more in Let’s Encrypt’s retrospective , in the original Sunlight design document , or in the summarized public announcement . Geomys , my open source maintenance firm, operates a pro-bono Sunlight-backed trusted Static CT log for $10k/year , including hardware amortization, colocation, and bandwidth. I’m sure it can be done for cheaper. Ok, so what does it take to run a CT log in 2025 6 ? Static CT logs are just flat static files, which you can serve with any HTTP server 4 from disk, or expose as a public object storage bucket. That’s pretty much it! Durability is the first priority: it’s really important that you never lose data once it’s fsync’ed to disk or PUT to object storage, since your log will have signed and returned SCTs, which are promises to serve the certificates it received. This means for example that backups are useless: they would roll back the log’s state. In terms of ongoing effort, a log operator is expected to read the Google and Apple CT Log policies, monitor the [email protected] mailing list, update the log implementation from time to time, and rotate log temporal shards every year. (For example, we just stood up 2027 shards of our log.) Given the logs’ lifecycle, you should plan to stick around for at least three years. If you want to become a CT log operator, first of all… thank you! The Sunlight README was rewritten recently to get you up and running easily. Sunlight is highly specialized for Certificate Transparency and the WebPKI, and it’s designed to help you operate a healthy, useful CT log with minimal configuration. The community is eager to welcome new log operators. You can post questions, reports, and updates on the transparency.dev Slack , ct-policy mailing list , or Sunlight issue tracker . I encourage you to reach out even just to share your plans, or to ask any questions you might have before committing to running a log.
To recap, the requirements:

Servers: one. No need to make the log a distributed system, CT itself is a distributed system. If you want to offer redundancy you can run multiple logs. The uptime target is 99% 5 over three months, which allows for nearly 22h of downtime. That’s more than three motherboard failures per month.

CPU and memory: whatever, as long as it’s ECC memory . Four cores and 2 GB will do.

Bandwidth: 2 Gbps outbound peak capacity 2 (which you can offload to a CDN).

Storage: you have two options. 3 – 5 TB 1 of usable redundant filesystem space on SSDs 3 , or 3 – 5 TB 1 of S3-compatible object storage, and 200 GB of cache on SSD.

People: Google policy requires the email addresses of two representatives. The uptime target is forgiving enough that it can probably be met by a single person working during business hours.

You might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . I systematically make the mistake of reaching a beautiful spot with my motorcycle, watching the sunset, and then realizing “oh, shoot, now it’s dark!” This time, the motorcycle didn’t start, too, and it was the first ride of the season in January. Got to read A Tour of WebAuthn by Adam Langley, though, so who can say if it was good or bad.

If a six months shard is assumed to grow up to 2B entries (the biggest so far has been 1.93B), and old shards are deleted one month after they expire, Sunlight on ZFS configured like Tuscolo will need at most 2.75 TB. However, the WebPKI is always growing, and shorter-lived certificates will increase issuance rate, but will also make rotation more efficient. Provisioning 3 TB and having a plan to get to 5 TB if necessary over the next couple years would be prudent.  ↩ ↩

This is a conservative estimate of potentially necessary peak capacity. Right now the Tuscolo log produces ~50Mbps average / ~250Mbps peak, but there are relatively few monitors. RFC 6962 logs reported numbers around 1 – 2 Gbps. Static CT reduces bandwidth by almost 80%, but also makes it easier to monitor a log, which might increase demand. YMMV. Verifiable Indexes will hopefully reduce full monitor count in the future.  ↩

It might be possible to run the object storage part on HDD. The write path would probably be fine, but the read path serves a lot of files with random accesses. Maybe with a large SSD cache layer.  ↩

Or with Sunlight’s specialized HTTP read path, called Skylight, which has a bunch of nice metrics and health checks.  ↩

Yep, two nines. Availability of the write path in particular is not a big deal at all: CAs will just fallback to other logs. Availability of the read path is important to ensure timely monitoring of new entries, but it’s just a simple static HTTP server. Note that Google is planning to split the requirements between read and write endpoints, and to require higher availability on the read path.  ↩

It’s possible the requirements will grow in the future because of short-lived certificates and/or post-quantum signatures, but the ecosystem is very aware of the potential burden on CT log operators, and there are a number of proposals to mitigate it, such as Merkle Tree Certificates and Verifiable Indexes . I am optimistic this will be solved, but even if it won’t you can always turn your log read-only without disrupting the ecosystem, should it get too large.  ↩

Filippo Valsorda 11 months ago

Benchmarking RSA Key Generation

RSA key generation is both conceptually simple, and one of the worst implementation tasks of the field of cryptography engineering. Even benchmarking it is tricky, and involves some math: here’s how we generated a stable but representative “average case” instead of using the ordinary statistical approach. Say you want to generate a 2048-bit RSA key. The idea is that you generate random 1024-bit numbers until you find two that are prime, you call them p and q , and compute N = p × q and d = 65537⁻¹ mod φ(N) 1 (and then some more stuff to make operations faster, but you could stop there). The computation of d is where the RSA magic lies, but today we are focusing on the first part. There is almost nothing special to selecting prime candidates. You draw an appropriately sized random number from a CSPRNG, and to avoid wasting time, you set the least significant bit and the two most significant bits: large even numbers are not prime, and setting the top two guarantees N won’t come out too small. Checking if a number x is prime is generally done with the Miller-Rabin test 2 , a probabilistic algorithm where you pick a “base” and use it to run some computations on x . It will either conclusively prove x is composite (i.e. not prime), or fail to do so. Figuring out how many Miller-Rabin tests you need to run is surprisingly difficult: initially you will learn the probability of a test failing for a composite is 1/4, which suggests you need 40 rounds to reach 2⁻⁸⁰; then you learn that’s only the upper bound for worst-case values of x , 3 while random values have a much much lower chance of failure; eventually you also realize that it doesn’t matter that much because you only run all the iterations on the prime, while most composites are rejected in the first iteration. Anyway, BoringSSL has a table and we’ll want 5 Miller-Rabin tests with random bases for a 1024-bit prime. 
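The candidate selection step above is short enough to sketch. This is illustrative, not the actual crypto/rsa code: draw random bytes, force the number odd, and pin the two top bits. Go's `ProbablyPrime(n)` conveniently runs n Miller-Rabin rounds (plus a Baillie-PSW-style Lucas test for large inputs).

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// candidate draws a random bits-sized number and pins the bits described
// above: the least significant bit (large even numbers are not prime) and
// the two most significant bits (so N = p × q won't come out too small).
func candidate(bits int) *big.Int {
	buf := make([]byte, bits/8)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	x := new(big.Int).SetBytes(buf)
	x.SetBit(x, 0, 1)      // force odd
	x.SetBit(x, bits-1, 1) // top bit
	x.SetBit(x, bits-2, 1) // second-from-top bit
	return x
}

func main() {
	p := candidate(1024)
	// 5 rounds matches the BoringSSL table mentioned above for random
	// 1024-bit candidates.
	fmt.Println(p.BitLen(), p.ProbablyPrime(5))
}
```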
Miller-Rabin is kinda slow though, and most numbers have small divisors, so it’s usually more efficient to quickly reject those by doing “trial divisions” or a GCD with the first handful of primes. The first few dozens are usually a major win, but using more and more primes has diminishing returns. There are a million and one things that can go wrong, but interestingly enough you have to go out of your way to get them wrong: if generating large candidates fully at random, all those cases have cryptographically negligible chance. To recap, to generate an RSA key you generate two primes. To generate a prime you pick random numbers, try to rule them out with trial divisions, and then do a few Miller-Rabin tests on them. Now, how are we supposed to benchmark that? Luck will drastically affect runtime: you’re essentially benchmarking a lottery. While debugging a performance regression, Russ Cox ran hundreds of measurements and still got some noisy and in some places suspect results. It’s also not fast enough that you can run millions of measurements and let things average out. One might be tempted to normalize the measurements by dividing the runtime by the number of candidates tested, but this unevenly dilutes all the final computations, and is still perturbed by how many candidates are caught by trial division and how many proceed to the Miller-Rabin tests. Similarly, benchmarking Miller-Rabin in isolation ignores the final computations, and doesn’t measure the impact of trial divisions. What we can do is use math to figure out what an average representative sequence of candidates looks like and benchmark that. Since the key generation process is repeatable, 4 we can pre-generate a golden sequence of candidates, and even share it across implementations to benchmark apples to apples. First, we need to figure out how many composites we should expect on average before each prime.
The prime-counting function approximation tells us there are Li(x) primes less than x , which works out 5 to one prime every 354 odd integers of 1024 bits. Then, we normalize the small divisors of the composites. A random number has a 1/p chance of being divisible by p , and based on that we can calculate how many composites divisible by the first n primes we’d expect to encounter before a prime. For example, we’d expect 33% of numbers to be divisible by 3, 46% to be divisible by 3 or 5, 69% of numbers to be divisible by one of the first 10 primes, 80% to be divisible by one of the first 50 primes, and so on . Flipping that around, we make 118 of our 353 composites divisible by 3, 47 divisible by 5 but not by 3, 27 divisible by 7 but not by 3 or 5, and so on. This will make the number of successful trial divisions representative, and will even let us do comparative benchmarks between different trial division thresholds without regenerating the inputs. Beyond setting the top and bottom bits like keygen will , we also unset the second-least significant bit and set the third-least significant bit of each candidate to normalize the number of iterations of the inner loop of Miller-Rabin, which depends on the trailing zeroes of x-1 . We don’t need to worry about composites failing Miller-Rabin tests: if 5 tests are enough to get to 2⁻¹¹² then one test fails with at most 2⁻²² chance, which is not cryptographically negligible but will not show up in benchmarks. Similarly, we don’t need to worry about e not being invertible modulo φ(N) : we use 65537 as e , which is prime, so only 1/65537 numbers aren’t coprime with it. The result is remarkably stable and should be representative both in terms of absolute runtime and in terms of CPU time spent in different functions, allowing meaningful profiling. Generating 20 random average traces and benchmarking them yields variance of less than 1%. You can see it in use in this Go standard library code review . 
The script to generate the traces, as well as ten ready to use traces are available in CCTV and you’re welcome to use them to benchmark your implementations! If you got this far, you might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . One day a friend was driving me to the SFO airport from Redwood Park and we were late. Like, flight begins boarding in a few minutes late. But then we came up to this view, and had to stop to take it in. The people in the other car had set out a little camping chair to watch the sun set over the clouds below. I have an incredible video of driving down into the clouds. Made the flight! My maintenance work is funded by the awesome Geomys clients: Interchain , Smallstep , Ava Labs , Teleport , SandboxAQ , Charm , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them!
SandboxAQ — SandboxAQ ’s AQtive Guard is a unified cryptographic management software platform that helps protect sensitive data and ensures compliance with authorities and customers. It provides a full range of capabilities to achieve cryptographic agility, acting as an essential cryptography inventory and data aggregation platform that applies current and future standardization organizations mandates. AQtive Guard automatically analyzes and reports on your cryptographic security posture and policy management, enabling your team to deploy and enforce new protocols, including quantum-resistant cryptography, without re-writing code or modifying your IT infrastructure. Charm — If you’re a terminal lover, join the club. Charm builds tools and libraries for the command line. Everything from styling terminal apps with Lip Gloss to making your shell scripts interactive with Gum . Charm builds libraries in Go to enhance CLI applications while building with these libraries to deliver CLI and TUI-based apps. Or, if you want to make your life harder and your code more complex for no practical benefit, you can use λ(N) instead of φ(N) , but that’s for a different rant.  ↩ There is also the Lucas test, and doing both a round of Miller-Rabin with base 2 and a Lucas test is called a Baillie–PSW. There are no known composites that pass the Baillie–PSW test, which sounds great, but the Lucas test is a major pain to implement.  ↩ In an adversarial setting, you also need to worry about the attacker forcing or adapting to your selection of bases. The amazingly-named Prime and Prejudice: Primality Testing Under Adversarial Conditions by Albrecht et al. pulls a number of fun tricks, but the main one boils down to the observation that if you hardcode the bases or generate them from x , they are not random.  
↩ It’s not strictly speaking deterministic, because the tests are randomized, but the chance of coming to a different conclusion is cryptographically negligible, and even the chance of major deviations in runtime is very small, as we will see.  ↩ I did a quick Monte Carlo simulation to check this was correct, and it was really fun to see the value swing and converge to the expected value. Math!  ↩

Filippo Valsorda 11 months ago

frood, an Alpine initramfs NAS

My NAS, frood , has a bit of a weird setup. It’s just one big initramfs containing a whole Alpine Linux system. It’s delightful and I am not sure why it’s not more common. If this already sounds appealing, you can skip to the “How it works” section below. I’ve always liked running systems from memory: it’s fast and prevents wear on the system storage device, which is often some janky SD card, because the good drives are dedicated to the ZFS pool. However, you immediately have the problem of how to persist configuration changes. Alpine’s answer to this is “ diskless mode ” where any customization is kept in an overlay file. After boot, the stock system looks for a file matching in all available filesystems, applies it, and then installs any missing apk packages from a local cache. The first problem with that is complexity: the tool to generate and manage the apkovl, lbu(1) , is pretty good but that process has a lot of moving parts. Find the apkovl, apply it, mount the filesystems in the new fstab, install the missing apks, resume the boot process. Over the past year, I had this break multiple times, either because it couldn’t find the filesystem anymore or because the apks did not get installed. The boot process depends on the package manager! The second problem is that I would really like the state of the system to be tracked in git. Graham Christensen has a very good pitch for declarative or immutable systems in “ Erase your darlings ”. I erase my systems at every boot. Over time, a system collects state on its root partition. This state lives in assorted directories like and , and represents every under-documented or out-of-order step in bringing up the services. “Right, run .” These small, inconsequential “oh, oops” steps are the pieces that get lost and don’t appear in your runbooks. 
“Just download ca-certificates to … to fix …” Each of these quick fixes leaves you doomed to repeat history in three years when you’re finally doing that dreaded RHEL 7 to RHEL 8 upgrade. “Oh, touch or the l2tp tunnel won’t work.” I used to solve that by making (most) changes via Ansible, but then I had a multi-layer situation where I needed to make a change in Ansible, then deploy it, then save it with lbu to the apkovl.

There are of course many alternatives for declarative systems: from NixOS (which just doesn’t sound fun) to gokrazy (which is not quite ready to ship ZFS) to embedded toolchains like buildroot or the newer u-root. Thing is though, I really like Alpine: a simple, well-packaged, lightweight, GNU-less Linux distribution. What I don’t like are its init and persistence mechanisms.

When it boots, Linux expects an “initramfs” image. It’s a simple cpio archive of the files that make up the very first root filesystem at boot. Usually the job of this system is to load enough modules to mount the real rootfs and pivot into it. Nothing stops us from putting the entire system in it, though! Who needs a rootfs?

The starting point is alpine-make-rootfs, which is a short (~500 lines) script meant to build a container image. It’s really 90% of what we need. alpine-make-rootfs will copy the files from the directory, install the packages from the file, and run the script in a chroot. Then, we extract the boot directory and package the rest into an initramfs archive. That’s truly very nearly it! It’s impressive how Alpine lends itself to this with practically no hacks.

The packages we install are the usual stuff you’d install on a server. Only a few are noteworthy. The script is also nothing special. We just need to link , set up the run-levels, and set the root password. (Yes, that’s my actual password hash. No, you won’t break it.) In practice I set up a few more services here, but they are not needed to run the system.
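To make the packaging step concrete, here is a hedged sketch of the final archiving stage in shell. This is not the repository’s actual script: `sysroot/` is a stand-in for the tree produced by alpine-make-rootfs (minus /boot, which is extracted and kept next to the bootloader), and the demo file contents are made up.

```shell
#!/bin/sh
# Sketch: turn a built root tree into the gzipped newc cpio archive
# that the kernel accepts as an initramfs.
set -eu

# Stand-in for the output of alpine-make-rootfs (demo content only).
mkdir -p sysroot/etc
echo frood > sysroot/etc/hostname

if command -v cpio >/dev/null 2>&1; then
    # "newc" is the archive format the kernel's initramfs loader expects.
    (cd sysroot && find . | cpio -o -H newc 2>/dev/null) | gzip -9 > initramfs.gz
fi
```

In the real setup this runs after alpine-make-rootfs has populated the tree; the resulting initramfs.gz is what the bootloader hands to the kernel alongside vmlinuz.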
This is just where you declaratively specify how the system is configured. The root skeleton is similarly system-specific, and it’s so nice to be able to drop files into the image just by creating them. For example, if I want something to run at boot, I just add a file to .

A few noteworthy files in the skeleton. makes the power button work with openrc-init. and and get the network to work. and and for obvious reasons. avoids generating non-Ed25519 host keys. Finally, a bit of persistence for the two things that truly can’t do without it: the RNG seed (arguably not necessary with hardware randomness) and Tailscale (which really doesn’t know how to run without persistence, alas). Rigorously UUID mounted.

Here’s something beautiful about this setup: you can meaningfully test it in qemu by just pointing it at the kernel and initramfs. Even works emulated on my arm64 M2. This includes a persistence device that I formatted with the same UUID as the production one. 1 Since Tailscale configuration is in there, the qemu image comes up as a different Tailscale device, and I can SSH into it separately.

Installing or updating the bootloader is done from the system itself with . We have three boot entries: regular, old, and new. When deploying a new version of the system, we rsync it over, and then use to select it for the next boot. If the machine comes up cleanly, then we move the regular image to old, and new to regular. Otherwise, another reboot rolls it back.

I wanted a simple service to get the status of the system at a glance. There are a million ways to do this, but I chose to write a small Go server. It’s not needed to make this system work, but I am including it to show how easy it is to add a service. Before the alpine-make-rootfs invocation, I added a couple lines to build all Go binaries in a local module into . Note that even the Go toolchain is selected declaratively from the thanks to . Then I created . And finally I added one line to . That’s it.
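To show the shape of that last step, here is a sketch of what an OpenRC service file for such a status server could look like. Every name here (the service name, the binary path, the dependencies) is my assumption for illustration, not the repository’s actual file.

```shell
#!/sbin/openrc-run
# Hypothetical /etc/init.d/status entry for the little Go status server.
# Dropping a file like this into the root skeleton is all it takes to get
# the service into the image; a runlevel symlink then enables it at boot.

command="/usr/local/bin/status"
command_background=true
pidfile="/run/status.pid"

depend() {
    need net
    use tailscale   # it binds the Tailscale IP, so start after tailscaled if present
}
```

The nice part is that, since the image is just files, “installing” the service is a matter of committing this file and the runlevel symlink to the skeleton in git.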
The Go server listens on port 80 on the Tailscale IP, and serves the output of scripts I put in . The entire setup is open source, in my mostly-harmless repository . You might be interested in how I made ZFS imports work , which is not covered above. I have not made it into a reusable project partially because there is so little to it. Adding hooks to configure things would easily double its size. I encourage you to just fork it if you’d like. One thing I haven’t solved yet is how to inject secrets. For now they are just ’d. Maybe I’ll plug in a YubiKey and use to decrypt them, and for the host key. Or maybe this board has a TPM and I can use the simplicity of this system to get a full Secure Boot chain that unlocks TPM keys. That’d be fun. If you got this far, you might also want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @[email protected] . The natural pools of Porto Moniz, in Madeira. They’re publicly accessible, made of volcanic rock, and filled by the ocean waves that crash spectacularly against them. I was not doing great that day, but it was an excellent place to not do great at. Madeira is pretty cool. 2 Also one of the trickiest crosswind landings. My maintenance work is funded by the awesome Geomys clients: Interchain , Smallstep , Ava Labs , Teleport , SandboxAQ , Charm , Tailscale , and Sentry . Through our retainer contracts they ensure the sustainability and reliability of our open source maintenance work and get a direct line to my expertise and that of the other Geomys maintainers. (Learn more in the Geomys announcement .) Here are a few words from some of them! Teleport — For the past five years, attacks and compromises have been shifting from traditional malware and security breaches to identifying and compromising valid user accounts and credentials with social engineering, credential theft, or phishing. 
Teleport Identity is designed to eliminate weak access patterns through access monitoring, minimize attack surface with access requests, and purge unused permissions via mandatory access reviews.

Ava Labs — We at Ava Labs, maintainer of AvalancheGo (the most widely used client for interacting with the Avalanche Network), believe the sustainable maintenance and development of open source cryptographic protocols is critical to the broad adoption of blockchain technology. We are proud to support this necessary and impactful work through our ongoing sponsorship of Filippo and his team.

SandboxAQ — SandboxAQ’s AQtive Guard is a unified cryptographic management software platform that helps protect sensitive data and ensures compliance with authorities and customers. It provides a full range of capabilities to achieve cryptographic agility, acting as an essential cryptography inventory and data aggregation platform that applies current and future standardization organizations’ mandates. AQtive Guard automatically analyzes and reports on your cryptographic security posture and policy management, enabling your team to deploy and enforce new protocols, including quantum-resistant cryptography, without re-writing code or modifying your IT infrastructure.

Charm — If you’re a terminal lover, join the club. Charm builds tools and libraries for the command line. Everything from styling terminal apps with Lip Gloss to making your shell scripts interactive with Gum. Charm builds libraries in Go to enhance CLI applications while building with these libraries to deliver CLI and TUI-based apps.

I am not paid by the Madeira Dept. of Tourism, I swear.

As long as the bootloader can find the kernel and initramfs, the machine comes up cleanly. A/B deployments and rollbacks are just a matter of choosing a different boot option. The system is defined declaratively in the git repo that builds the initramfs.
Importantly to me, it’s not defined in some complex DSL: if I want a file to exist at I put it in , and the rest is done by a few hundred lines of scripts I can (and have) read. Configuring it doesn’t look any different than configuring any regular Alpine system. I can test the next deploy with a qemu oneliner. There are very very few moving parts.

alpine-base is the metapackage that installs apk, busybox, openrc, and a few config files.

linux-lts is the kernel, along with its modules. I considered thinning down the modules to only the ones I needed, but it’s ultimately a lot of hacks just to save a couple hundred MB. Note there is no modloop! The modules are always available.

linux-firmware-i915 is the i915 folder of Linux firmware. Need to install at least one package providing (including ) or gets installed, which installs them all.

intel-ucode is the microcode update. It installs a file in that can be used as a pre-initramfs. This is in fact easier to set up than on bigger systems.

syslinux is the bootloader. Way simpler than GRUB, it installs in the filesystem partition, and then boots the kernel from that partition. This closes the loop: as long as we boot the right partition, there is no way for anything but our system to load. Nothing in the boot process needs to discover or even give a name to a filesystem.

openrc-init is the init. Alpine doesn’t actually use OpenRC’s init, it uses the one from busybox, but I found OpenRC’s easier to set up. Note though that it doesn’t work with busybox’s shutdown/reboot/poweroff commands so you need to use .

agetty if you plan to ever connect a keyboard and screen.
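For a concrete picture of the regular/old/new scheme that syslinux enables, the boot menu could look something like this extlinux.conf sketch. The labels and file names are my guesses for illustration, not the actual config from the repo.

```
DEFAULT frood
PROMPT 0
TIMEOUT 30

# Regular entry: the currently blessed system.
LABEL frood
    LINUX /boot/vmlinuz-lts
    INITRD /boot/intel-ucode.img,/boot/initramfs-lts

# Previous known-good system, for rollbacks.
LABEL frood-old
    LINUX /boot/vmlinuz-lts.old
    INITRD /boot/intel-ucode.img,/boot/initramfs-lts.old

# Freshly deployed system, selected once for the next boot.
LABEL frood-new
    LINUX /boot/vmlinuz-lts.new
    INITRD /boot/intel-ucode.img,/boot/initramfs-lts.new
```

Note the comma-separated INITRD list: syslinux concatenates the images, which is how the intel-ucode file gets loaded as a pre-initramfs ahead of the system archive.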
