Posts in Testing (20 found)

Coverage

Sometimes the question arises: which tests trigger this code here? Maybe I've found a block of code that looks like it can't be hit, but that's hard to prove. Or I want to answer the age-old question of which subset of quick tests might be useful to run when the full test suite is slow. So, run each test with coverage by itself. Then, instead of merging all the coverage data, find which tests cover the line in question. Oddly enough, though some of the Java tools (e.g., Clover) support per-test coverage, the tools here are in general somewhat lacking. genhtml, part of the LCOV suite, supports a TN ("test name") marker, but only displays the per-test data at a per-file level. This is the kind of thing where, in 2025, you can ask a coding agent to vibe-code or vibe-modify a generator, and it'll work fine. I have not found the equivalent of Profilerpedia for coverage file formats, but the lowest common denominator seems to be LCOV. The file format is described at geninfo(1). Most language ecosystems can either produce LCOV output directly or have pre-existing conversion tools.
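As a rough illustration of the "find which tests cover the line" step, here is a minimal Go sketch that scans LCOV tracefile text and lists the tests that hit a given line. It assumes one tracefile section per test, each introduced by a TN record; the function name `testsCovering`, the sample data, and the paths are made up for this example and are not part of any existing tool.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// testsCovering scans LCOV tracefile text and returns the sorted names of
// tests (TN records) that executed the given line of the given source file.
func testsCovering(tracefile, file string, line int) []string {
	seen := map[string]bool{}
	var test, source string
	sc := bufio.NewScanner(strings.NewReader(tracefile))
	for sc.Scan() {
		rec := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(rec, "TN:"):
			test = strings.TrimPrefix(rec, "TN:")
		case strings.HasPrefix(rec, "SF:"):
			source = strings.TrimPrefix(rec, "SF:")
		case strings.HasPrefix(rec, "DA:"):
			// Record layout: DA:<line>,<hits>[,<checksum>]
			parts := strings.Split(strings.TrimPrefix(rec, "DA:"), ",")
			if len(parts) < 2 || source != file {
				continue
			}
			n, err1 := strconv.Atoi(parts[0])
			hits, err2 := strconv.Atoi(parts[1])
			if err1 == nil && err2 == nil && n == line && hits > 0 {
				seen[test] = true
			}
		case rec == "end_of_record":
			source = ""
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// sample is hypothetical per-test coverage data, concatenated into one file.
const sample = `TN:test_parse
SF:/src/parser.go
DA:10,3
DA:11,0
end_of_record
TN:test_render
SF:/src/parser.go
DA:10,0
DA:11,2
end_of_record
`

func main() {
	fmt.Println(testsCovering(sample, "/src/parser.go", 11))
}
```

Concatenating the per-test tracefiles, rather than merging them with a tool that sums hit counts, is what keeps the test names attached to each line.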

Anton Zhiyanov 6 days ago

Gist of Go: Concurrency testing

This is a chapter from my book on Go concurrency , which teaches the topic from the ground up through interactive examples. Testing concurrent programs is a lot like testing single-task programs. If the code is well-designed, you can test the state of a concurrent program with standard tools like channels, wait groups, and other abstractions built on top of them. But if you've made it so far, you know that concurrency is never that easy. In this chapter, we'll go over common testing problems and the solutions that Go offers. Waiting for goroutines • Checking channels • Checking for leaks • Durable blocking • Instant waiting • Time inside the bubble • Thoughts on time 1  ✎ • Thoughts on time 2  ✎ • Checking for cleanup • Bubble rules • Keep it up Let's say we want to test this function: Calculations run asynchronously in a separate goroutine. However, the function returns a result channel, so this isn't a problem: At point ⓧ, the test is guaranteed to wait for the inner goroutine to finish. The rest of the test code doesn't need to know anything about how concurrency works inside the function. Overall, the test isn't any more complicated than if were synchronous. But we're lucky that returns a channel. What if it doesn't? Let's say the function looks like this: We write a simple test and run it: The assertion fails because at point ⓧ, we didn't wait for the inner goroutine to finish. In other words, we didn't synchronize the and goroutines. That's why still has its initial value (0) when we do the check. We can add a short delay with : The test is now passing. But using to sync goroutines isn't a great idea, even in tests. We don't want to set a custom delay for every function we're testing. Also, the function's execution time may be different on the local machine compared to a CI server. If we use a longer delay just to be safe, the tests will end up taking too long to run. 
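To make the first case above concrete, here is a minimal sketch of a function that computes in a separate goroutine but returns a result channel. The name `calcAsync` and its signature are assumptions made for this illustration, not the chapter's own code. Receiving from the channel is the synchronization point, so no delay is needed.

```go
package main

import "fmt"

// calcAsync is a hypothetical stand-in for the function under test: it does
// its work in a separate goroutine and delivers the result on a channel.
func calcAsync(a, b int) <-chan int {
	out := make(chan int, 1)
	go func() {
		out <- a + b
	}()
	return out
}

func main() {
	// The receive blocks until the inner goroutine has sent its result,
	// so the caller never observes a half-finished calculation.
	got := <-calcAsync(2, 3)
	fmt.Println(got)
}
```

A test built on this shape only has to receive from the channel and compare the value, which is why it looks no more complicated than a test for a synchronous function.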
Sometimes you can't avoid using in tests, but since Go 1.25, the package has made these cases much less common. Let's see how it works. The package has a lot going on under the hood, but its public API is very simple: The function creates an isolated bubble where you can control time to some extent. Any new goroutines started inside this bubble become part of the bubble. So, if we wrap the test code with , everything will run inside the bubble — the test code, the function we're testing, and its goroutine. At point ⓧ, we want to wait for the goroutine to finish. The function comes to the rescue! It blocks the calling goroutine until all other goroutines in the bubble are finished. (It's actually a bit more complicated than that, but we'll talk about it later.) In our case, there's only one other goroutine (the inner goroutine), so will pause until it finishes, and then the test will move on. Now the test passes instantly. That's better! ✎ Exercise: Wait until done Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue. As we've seen, you can use to wait for the tested goroutine to finish, and then check the state of the data you are interested in. You can also use it to check the state of channels. Let's say there's a function that generates N numbers like 11, 22, 33, and so on: And a simple test: Set N=2, get the first number from the generator's output channel, then get the second number. The test passed, so the function works correctly. But does it really? Let's use in "production": Panic! We forgot to close the channel when exiting the inner goroutine, so the for-range loop waiting on that channel got stuck. Let's fix the code: And add a test for the channel state: The test is still failing, even though we're now closing the channel when the goroutine exits. 
This is a familiar problem: at point ⓧ, we didn't wait for the inner goroutine to finish. So when we check the channel, it hasn't closed yet. That's why the test fails. We can delay the check using : But it's better to use : At point ⓧ, blocks the test until the only other goroutine (the inner goroutine) finishes. Once the goroutine has exited, the channel is already closed. So, in the select statement, the case triggers with set to , allowing the test to pass. As you can see, the package helped us avoid delays in the test, and the test itself didn't get much more complicated. As we've seen, you can use to wait for the tested goroutine to finish, and then check the state of the data or channels. You can also use it to detect goroutine leaks. Let's say there's a function that runs the given functions concurrently and sends their results to an output channel: And a simple test: Send three functions to be executed, get the first result from the output channel, and check it. The test passed, so the function works correctly. But does it really? Let's run three times, passing three functions each time: After 50 ms — when all the functions should definitely have finished — there are still 9 running goroutines ( ). In other words, all the goroutines are stuck. The reason is that the channel is unbuffered. If the client doesn't read from it, or doesn't read all the results, the goroutines inside get blocked when they try to send the result of to . Let's fix this by adding a buffer of the right size to the channel: Then add a test to check the number of goroutines: The test is still failing, even though the channel is now buffered, and the goroutines shouldn't block on sending to it. This is a familiar problem: at point ⓧ, we didn't wait for the running goroutines to finish. So is greater than zero, which makes the test fail. We can delay the check using (not recommended), or use a third-party package like goleak (a better option): The test passes now. 
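The buffered-channel fix and a polling leak check can be sketched in plain Go as follows. All names here (`gather`, `settled`, `leakFree`) are hypothetical, and the retry loop is a hand-rolled approximation of the "check, pause, check again" idea, not goleak's actual API.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gather is a hypothetical stand-in for the function described above: it runs
// each function in its own goroutine and sends the results to one channel.
// The buffer has one slot per function, so senders never block even if the
// caller reads only some of the results.
func gather(funcs ...func() int) <-chan int {
	out := make(chan int, len(funcs))
	for _, f := range funcs {
		go func(f func() int) { out <- f() }(f)
	}
	return out
}

// settled polls the goroutine count, pausing a little longer after each
// check, until the count drops back to the baseline or attempts run out.
func settled(baseline, attempts int) bool {
	delay := time.Microsecond
	for i := 0; i < attempts; i++ {
		if runtime.NumGoroutine() <= baseline {
			return true
		}
		time.Sleep(delay)
		delay *= 2
		if delay > 100*time.Millisecond {
			delay = 100 * time.Millisecond
		}
	}
	return runtime.NumGoroutine() <= baseline
}

// leakFree runs fn and reports whether the goroutine count settles back
// to where it was before fn started.
func leakFree(fn func()) bool {
	baseline := runtime.NumGoroutine()
	fn()
	return settled(baseline, 20)
}

func main() {
	ok := leakFree(func() {
		for i := 0; i < 3; i++ {
			out := gather(
				func() int { return 11 },
				func() int { return 22 },
				func() int { return 33 },
			)
			<-out // read only the first result, as in the simple test
		}
	})
	fmt.Println("leak-free:", ok)
}
```

With an unbuffered channel in `gather`, the unread senders would stay blocked and `leakFree` would report false after the retries are exhausted.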
By the way, goleak also uses time.Sleep internally, but it does so much more efficiently. It tries up to 20 times, with the wait time between checks increasing exponentially, starting at 1 microsecond and going up to 100 milliseconds. This way, the test runs almost instantly. Even better, we can check for leaks without any third-party packages by using synctest: Earlier, I said that synctest.Wait blocks the calling goroutine until all other goroutines finish. Actually, it's a bit more complicated. synctest.Wait blocks until all other goroutines either finish or become durably blocked. We'll talk about "durably" later. For now, let's focus on "become blocked." Let's temporarily remove the buffer from the channel and check the test results: Here's what happens: Next, synctest.Test comes into play. It not only starts the bubble goroutine, but also tries to wait for all child goroutines to finish before it returns. If synctest.Test sees that some goroutines are stuck (in our case, all 9 are blocked trying to send to the channel), it panics: main bubble goroutine has exited but blocked goroutines remain So, we found the leak without using time.Sleep or goleak, thanks to the useful features of synctest.Wait and synctest.Test: Now let's make the channel buffered and run the test again: As we've found, synctest.Wait blocks until all goroutines in the bubble (except the one that called it) have either finished or are durably blocked. Let's figure out what "durably blocked" means. For synctest, a goroutine inside a bubble is considered durably blocked if it is blocked by any of the following operations: Other blocking operations are not considered durable, and synctest.Wait ignores them. For example: The distinction between "durable" and other types of blocks is just an implementation detail of the synctest package. It's not a fundamental property of the blocking operations themselves. In real-world applications, this distinction doesn't exist, and "durable" blocks are neither better nor worse than any others. Let's look at an example. 
Let's say there's a type that performs some asynchronous computation: Our goal is to write a test that checks the result while the calculation is still running . Let's see how the test changes depending on how is implemented (except for the version — we'll cover that one a bit later). Let's say is implemented using a done channel: Naive test: The check fails because when is called, the goroutine in hasn't set yet. Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on reading from the channel. This channel is created inside the bubble, so the block is durable. The call in the test returns as soon as happens, and we get the current value of . Let's say is implemented using select: Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on a select statement. Both channels used in the select ( and ) are created inside the bubble, so the block is durable. The call in the test returns as soon as happens, and we get the current value of . Let's say is implemented using a wait group: Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the wait group's call. The group's method was called inside the bubble, so this is a durable block. The call in the test returns as soon as happens, and we get the current value of . Let's say is implemented using a condition variable: Let's use to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the condition variable's call. This is a durable block. The call returns as soon as happens, and we get the current value of . Let's say is implemented using a mutex: Let's try using to wait until the goroutine is blocked at point ⓧ: In ⓧ, the goroutine is blocked on the mutex's call. doesn't consider blocking on a mutex to be durable. The call ignores the block and never returns. The test hangs and only fails when the overall timeout is reached. 
You might be wondering why the authors didn't consider blocking on mutexes to be durable. There are a couple of reasons: ⌘ ⌘ ⌘ Let's go back to the original question: how does the test change depending on how is implemented? It doesn't change at all. We used the exact same test code every time: If your program uses durably blocking operations, always works the same way: Very convenient! ✎ Exercise: Blocking queue Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue. Inside the bubble, time works differently. Instead of using a regular wall clock, the bubble uses a fake clock that can jump forward to any point in the future. This can be quite handy when testing time-sensitive code. Let's say we want to test this function: The positive scenario is straightforward: send a value to the channel, call the function, and check the result: The negative scenario, where the function times out, is also pretty straightforward. But the test takes the full three seconds to complete: We're actually lucky the timeout is only three seconds. It could have been as long as sixty! To make the test run instantly, let's wrap it in : Note that there is no call here, and the only goroutine in the bubble (the root one) gets durably blocked on a select statement in . Here's what happens next: Thanks to the fake clock, the test runs instantly instead of taking three seconds like it would with the "naive" approach. You might have noticed that quite a few circumstances coincided here: We'll look at the alternatives soon, but first, here's a quick exercise. ✎ Exercise: Wait, repeat Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . 
If you are okay with just theory for now, let's continue. The fake clock in synctest.Test can be tricky. It moves forward only if: ➊ all goroutines in the bubble are durably blocked; ➋ there's a future moment when at least one goroutine will unblock; and ➌ synctest.Wait isn't running. Let's look at the alternatives. I'll say right away, this isn't an easy topic. But when has time travel ever been easy? :) Here's the function we're testing: Let's run it in a separate goroutine, so there will be two goroutines in the bubble: synctest.Test panicked because the root bubble goroutine finished while the function's goroutine was still blocked on a select. Reason: synctest.Test only advances the clock if all goroutines are blocked, including the root bubble goroutine. How to fix: Use time.Sleep to make sure the root goroutine is also durably blocked. Now all three conditions are met again (all goroutines are durably blocked; the moment of future unblocking is known; there is no call to synctest.Wait). The fake clock moves forward 3 seconds, which unblocks the function's goroutine. That goroutine finishes, leaving only the root one, which is still blocked on time.Sleep. The clock moves forward another 2 seconds, unblocking the root goroutine. The assertion passes, and the test completes successfully. But if we run the test with the race detector enabled (using the -race flag), it reports a data race on the shared variable: Logically, using time.Sleep in the root goroutine doesn't guarantee that the function's goroutine (which writes to the variable) will finish before the root goroutine reads from it. That's why the race detector reports a problem. Technically, the test passes because of how synctest is implemented, but the race still exists in the code. The right way to handle this is to call synctest.Wait after time.Sleep: Calling synctest.Wait ensures that the function's goroutine finishes before the root goroutine reads the variable, so there's no data race anymore. Here's the function we're testing: Let's replace time.Sleep in the root goroutine with synctest.Wait: synctest.Test panicked because the root bubble goroutine finished while the function's goroutine was still blocked on a select. 
Reason: synctest.Test only advances the clock if there is no active synctest.Wait running. If all bubble goroutines are durably blocked but a synctest.Wait is running, synctest.Test won't advance the clock. Instead, it will simply finish the synctest.Wait call and return control to the goroutine that called it (in this case, the root bubble goroutine). How to fix: don't use synctest.Wait. Let's update the function to use context cancellation instead of a timer: We won't cancel the context in the test: synctest.Test panicked because all goroutines in the bubble are hopelessly blocked. Reason: synctest.Test only advances the clock if it knows how much to advance it. In this case, there is no future moment that would unblock the select in the function. How to fix: Manually unblock the goroutine and call synctest.Wait to wait for it to finish. Now, canceling the context unblocks the select in the function, while synctest.Wait makes sure the goroutine finishes before the test checks the results. Let's update the function to lock the mutex before doing any calculations: In the test, we'll lock the mutex before calling the function, so it will block: The test failed because it hit the overall test timeout. Reason: synctest.Test only works with durable blocks. Blocking on a mutex lock isn't considered durable, so the bubble can't do anything about it, even though the sleeping inner goroutine would have unlocked the mutex in 10 ms if the bubble had used the wall clock. How to fix: Don't use synctest. Now the mutex unlocks after 10 milliseconds (wall clock), the function finishes successfully, and the check passes. The clock inside the bubble won't move forward if: ✎ Exercise: Asynchronous repeater Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises, which is why I recommend getting it. If you are okay with just theory for now, let's continue. Let's practice understanding time in the bubble with some thinking exercises. Try to solve the problem in your head before using the playground. Here's a function that performs synchronous work: And a test for it: What is the test missing at point ⓧ? 
✓ Thoughts on time 1 There's only one goroutine in the test, so when gets blocked by , the time in the bubble jumps forward by 3 seconds. Then sets to and finishes. Finally, the test checks and passes successfully. No need to add anything. Let's keep practicing our understanding of time in the bubble with some thinking exercises. Try to solve the problem in your head before using the playground. Here's a function that performs asynchronous work: And a test for it: What is the test missing at point ⓧ? ✓ Thoughts on time 2 Let's go over the options. ✘ synctest.Wait This won't help because returns as soon as inside is called. The check fails, and panics with the error: "main bubble goroutine has exited but blocked goroutines remain". ✘ time.Sleep Because of the call in the root goroutine, the wait inside in is already over by the time is checked. However, there's no guarantee that has run yet. That's why the test might pass or might fail. ✘ synctest.Wait, then time.Sleep This option is basically the same as just using , because returns before the in even starts. The test might pass or might fail. ✓ time.Sleep, then synctest.Wait This is the correct answer: Since the root goroutine isn't blocked, it checks while the goroutine is blocked by the call. The check fails, and panics with the message: "main bubble goroutine has exited but blocked goroutines remain". Sometimes you need to test objects that use resources and should be able to release them. For example, this could be a server that, when started, creates a pool of network connections, connects to a database, and writes file caches. When stopped, it should clean all this up. Let's see how we can make sure everything is properly stopped in the tests. We're going to test this server: Let's say we wrote a basic functional test: The test passes, but does that really mean the server stopped when we called ? Not necessarily. 
For example, here's a buggy implementation where our test would still pass: As you can see, the author simply forgot to stop the server here. To detect the problem, we can wrap the test in and see it panic: The server ignores the call and doesn't stop the goroutine running inside . Because of this, the goroutine gets blocked while writing to the channel. When finishes, it detects the blocked goroutine and panics. Let's fix the server code (to keep things simple, we won't support multiple or calls): Now the test passes. Here's how it works: Instead of using to stop something, it's common to use the method. It registers a function that will run when the test finishes: Functions registered with run in last-in, first-out (LIFO) order, after all deferred functions have executed. In the test above, there's not much difference between using and . But the difference becomes important if we move the server setup into a separate helper function, so we don't have to repeat the setup code in different tests: The approach doesn't work because it calls when returns — before the test assertions run: The approach works because it calls when has finished — after all the assertions have already run: Sometimes, a context ( ) is used to stop the server instead of a separate method. In that case, our server interface might look like this: Now we don't even need to use or to check whether the server stops when the context is canceled. Just pass as the context: returns a context that is automatically created when the test starts and is automatically canceled when the test finishes. Here's how it works: To check for stopping via a method or function, use or . To check for cancellation or stopping via context, use . Inside a bubble, returns a context whose channel is associated with the bubble. The context is automatically canceled when ends. Functions registered with inside the bubble run just before finishes. Let's go over the rules for living in the bubble. 
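Before moving on to the rules, here is a minimal plain-Go sketch of the fixed server idea discussed above: the worker goroutine selects on a done channel, and stopping closes that channel. The `server` type, its fields, and the ticker-based work loop are assumptions made for this illustration; the chapter's own server may differ. Like the chapter's version, it does not support multiple start or stop calls.

```go
package main

import (
	"fmt"
	"time"
)

// server is a hypothetical stand-in for the server under test.
type server struct {
	done    chan struct{} // closed by Stop to signal the worker
	stopped chan struct{} // closed by the worker when it exits
}

// Start launches the worker goroutine.
func (s *server) Start() {
	s.done = make(chan struct{})
	s.stopped = make(chan struct{})
	go func() {
		defer close(s.stopped)
		ticker := time.NewTicker(time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-s.done:
				return // stop requested: leave the loop and exit
			case <-ticker.C:
				// periodic work would go here
			}
		}
	}()
}

// Stop signals the worker and waits until it has actually exited.
func (s *server) Stop() {
	close(s.done)
	<-s.stopped
}

func main() {
	s := &server{}
	s.Start()
	s.Stop()
	select {
	case <-s.stopped:
		fmt.Println("worker exited")
	default:
		fmt.Println("worker still running")
	}
}
```

A buggy Stop that returns without closing `done` would leave the worker looping forever, which is the kind of leftover goroutine the bubble is described as catching.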
The following operations durably block a goroutine: The limitations are quite logical, and you probably won't run into them. Don't create channels or objects that contain channels (like tickers or timers) outside the bubble. Otherwise, the bubble won't be able to manage them, and the test will hang: Don't access synchronization primitives associated with a bubble from outside the bubble: Don't call , , or inside a bubble: Don't call inside the bubble: Don't call from outside the bubble: Don't call concurrently from multiple goroutines: ✎ Exercise: Testing a pipeline Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . If you are okay with just theory for now, let's continue. The package is a complicated beast. But now that you've studied it, you can test concurrent programs no matter what synchronization tools they use—channels, selects, wait groups, timers or tickers, or even . In the next chapter, we'll talk about concurrency internals (coming soon). Pre-order for $10   or read online Three calls to start 9 goroutines. The call to blocks the root bubble goroutine ( ). One of the goroutines finishes its work, tries to write to , and gets blocked (because no one is reading from ). The same thing happens to the other 8 goroutines. sees that all the child goroutines in the bubble are blocked, so it unblocks the root goroutine. The root goroutine finishes. unblocks as soon as all other goroutines are durably blocked. panics when finished if there are still blocked goroutines left in the bubble. Sending to or receiving from a channel created within the bubble. A select statement where every case is a channel created within the bubble. Calling if all calls were made inside the bubble. Sending to or receiving from a channel created outside the bubble. Calling or . 
I/O operations (like reading a file from disk or waiting for a network response). System calls and cgo calls. Mutexes are usually used to protect shared state, not to coordinate goroutines (the example above is completely unrealistic). In tests, you usually don't need to pause before locking a mutex to check something. Mutex locks are usually held for a very short time, and mutexes themselves need to be as fast as possible. Adding extra logic to support could slow them down in normal (non-test) situations. It waits until all other goroutines in the bubble are blocked. Then, it unblocks the goroutine that called it. The bubble checks if the goroutine can be unblocked by waiting. In our case, it can — we just need to wait 3 seconds. The bubble's clock instantly jumps forward 3 seconds. The select in chooses the timeout case, and the function returns . The test assertions for and both pass successfully. There's no call. There's only one goroutine. The goroutine is durably blocked. It will be unblocked at certain point in the future. There are any goroutines that aren't durably blocked. It's unclear how much time to advance. is running. Because of the call in the root goroutine, the wait inside in is already over by the time is checked. Because of the call, the goroutine is guaranteed to finish (and hence to call ) before is checked. The main test code runs. Before the test finishes, the deferred is called. In the server goroutine, the case in the select statement triggers, and the goroutine ends. sees that there are no blocked goroutines and finishes without panicking. The main test code runs. Before the test finishes, the context is automatically canceled. The server goroutine stops (as long as the server is implemented correctly and checks for context cancellation). sees that there are no blocked goroutines and finishes without panicking. A bubble is created by calling . Each call creates a separate bubble. Goroutines started inside the bubble become part of it. 
The bubble can only manage durable blocks. Other types of blocks are invisible to it. If all goroutines in the bubble are durably blocked with no way to unblock them (such as by advancing the clock or returning from a call), panics. When finishes, it tries to wait for all child goroutines to complete. However, if even a single goroutine is durably blocked, panics. Calling returns a context whose channel is associated with the bubble. Functions registered with run inside the bubble, immediately before returns. Calling in a bubble blocks the goroutine that called it. returns when all other goroutines in the bubble are durably blocked. returns when all other goroutines in the bubble have finished. The bubble uses a fake clock (starting at 2000-01-01 00:00:00 UTC). Time in the bubble only moves forward if all goroutines are durably blocked. Time advances by the smallest amount needed to unblock at least one goroutine. If the bubble has to choose between moving time forward or returning from a running , it returns from . A blocking send or receive on a channel created within the bubble. A blocking select statement where every case is a channel created within the bubble. Calling if all calls were made inside the bubble.

Sean Goedecke 1 week ago

Why it takes months to tell if new AI models are good

Nobody knows how to tell if current-generation models are any good . When GPT-5 launched, the overall mood was very negative, and the consensus was that it wasn’t a strong model. But three months later it turns out that GPT-5 (and its derivative GPT-5-Codex) is a very strong model for agentic work 1 : enough to break Anthropic’s monopoly on agentic coding models. In fact, GPT-5-Codex is my preferred model for agentic coding. It’s slower than Claude Sonnet 4.5, but in my experience it gets more hard problems correct. Why did it take months for me to figure this out? The textbook solution for this problem is evals - datasets of test cases that models can be scored against - but evals are largely unreliable . Many models score very well on evals but turn out to be useless in practice. There are a couple of reasons for this. First, it’s just really hard to write useful evals for real-world problems , since real-world problems require an enormous amount of context. Can’t you take previous real-world problems and put them in your evals - for instance, by testing models on already-solved open-source issues? You can, but you run into two difficulties: Another problem is that evals are a target for AI companies . How well Anthropic or OpenAI’s new models perform on evals has a direct effect on the stock price of those companies. It’d be naive to think that they don’t make some kind of effort to do well on evals: if not by directly training on public eval data 2 , then by training on data that’s close enough to eval data to produce strong results. I’m fairly confident that big AI companies will not release a model unless they can point to a set of evals that their model does better than competitors. So you can’t trust that strong evals will mean a strong model, because every single new model is released with strong evals. If you can’t rely on evals to tell you if a new model is good, what can you rely on? 
For most people, the answer is the “vibe check”: interacting with the model themselves and making their own judgement. Often people use a set of their own pet questions, which are typically questions that other LLMs get wrong (say, word puzzles). Trick questions can be useful, but plenty of strong models struggle with specific trick questions for some reason. My sense is also that current models are too strong for obvious word puzzles. You used to be able to trip up models with straightforward questions like “If I put a ball in a box, then put the box in my pocket, where is the ball?” Now you have to be more devious, which gives less signal about how strong the model is. Sometimes people use artistic prompts. Simon Willison famously asks new models to produce a SVG of a pelican riding a bicycle. It’s now a common Twitter practice to post side-by-side “I asked two models to build an object in Minecraft” screenshots. This is cool - you can see at a glance that bigger models produce better images - but at some point it becomes difficult to draw conclusions from the images. If Claude Sonnet 4.5 puts the pelican’s feet on the pedals correctly, but GPT-5.1 adds spokes to the wheels, which model is better? Finally, many people rely on pure vibes: the intangible sense you get after using a model about whether it’s good or not. This is sometimes described as “big model smell”. I am fairly agnostic about people’s ability to determine model capability from vibes alone. It seems like something humans might be able to do, but also like something that would be very easy to fool yourself about. For instance, I would struggle to judge a model with the conversational style of GPT-4o as very smart, but there’s nothing in principle that would prevent that. Of course, for people who engage in intellectually challenging pursuits, there’s an easy (if slow) way to evaluate model capability: just give it the problems you’re grappling with and see how it does. 
I often ask a strong agentic coding model to do a task I’m working on in parallel with my own efforts. If the model fails, it doesn’t slow me down much; if it succeeds, it catches something I don’t, or at least gives me a useful second opinion. The problem with this approach is that it takes a fair amount of time and effort to judge if a new model is any good, because you have to actually do the work : if you’re not engaging with the problem yourself, you will have no idea if the model’s solution is any good or not. So testing out a new model can be risky. If it’s no good, you’ve wasted a fair amount of time and effort! I’m currently trying to decide whether to invest this effort into testing out Gemini 3 Pro or GPT-5.1-Codex - right now I’m still using GPT-5-Codex for most tasks, or Claude Sonnet 4.5 on some simpler problems. Each new model release reignites the debate over whether AI progress is stagnating. The most prominent example is Gary Marcus, who has written that GPT-4 , GPT-4o , Claude 3.5 Sonnet , GPT-5 and DeepSeek all prove that AI progress has hit a wall. But almost everyone who writes about AI seems to be interested in the topic. Each new model launch is watched to see if this is the end of the bubble, or if LLMs will continue to get more capable. The reason this debate never ends is that there’s no reliable way to tell if an AI model is good . Suppose that base AI models were getting linearly smarter (i.e. that GPT-5 really was as far above GPT-4 as GPT-4 was above GPT-3.5, and so on). Would we actually be able to tell? When you’re talking to someone who’s less smart than you 3 , it’s very clear. You can see them failing to follow points you’re making, or they just straight up spend time visibly confused and contradicting themselves. But when you’re talking to someone smarter than you, it’s far from clear (to you) what’s going on. You can sometimes feel that you’re confused by what they say, but that doesn’t necessarily mean they’re smarter. 
It could be that they’re just talking nonsense. And smarter people won’t confuse you all the time - only when they fail to pitch their communication at your level. Talking with AI models is like that. GPT-3.5 was very clearly less smart than most of the humans who talked to it. It was mainly impressive that it was able to carry on a conversation at all. GPT-4 was probably on par with the average human (or a little better) in its strongest domains. GPT-5 (at least in thinking mode) is smarter than the average human across most domains, I believe. Suppose we had no objective way of measuring chess ability. Would I be able to tell if computer chess engines were continuing to get better? I’d certainly be impressed when the chess engines went from laughably bad to beating me every time. But I’m not particularly good at chess. I would lose to chess engines from the early 1980s . It would thus seem to me as if chess engine progress had stalled out, when in fact modern chess engines have double the rating of chess engines from the 1980s. I acknowledge that “the model is now at least partly smarter than you” is an underwhelming explanation for why AI models don’t appear to be rapidly getting better. It’s easy to point to cases where even strong models fall over. But it’s worth pointing out that if models were getting consistently smarter, this is what it would look like : rapid subjective improvement as the models go from less intelligent than you to on par with you, and then an immediate plateau as the models surpass you and you become unable to tell how smart they are. By “agentic work” I mean “LLM with tools that runs in a loop”, like Copilot Agent Mode, Claude Code, and Codex. I haven’t yet tried GPT-5.1-Codex enough to have a strong opinion. If you train a model on the actual eval dataset itself, it will get very good at answering those specific questions, even if it’s not good at answering those kinds of questions. 
This is often called “benchmaxxing”: prioritizing evals and benchmarks over actual capability. I want to bracket the question of whether “smart” is a broad category, or how exactly to define it. I’m talking specifically about the way GPT-4 is smarter than GPT-3.5 - even if we can’t define exactly how, we know that’s a real thing. Open-source coding is often meaningfully different from the majority of programming work. For more on this, see my comments in METR’S AI productivity study is really good, where I discuss an AI-productivity study that was done on open-source codebases. You’re still only covering a tiny subset of all programming work. For instance, the well-known SWE-Bench set of coding evals are just in Python. A model might be really good at Python but struggle with other languages. Nobody knows how good a model is when it’s launched. Even the AI lab who built it are only guessing and hoping it’ll turn out to be effective for real-world use cases. Evals are mostly marketing tools. It’s hard to figure out how good the eval is, or if the model is being “taught to the test”. If you’re trying to judge models from their public evals you’re fighting against the billions of dollars of effort going into gaming the system. Vibe checks don’t test the kind of skills that are useful for real work, but testing a model by using it to do real work takes a lot of time. You can’t figure out if a brand new model is good that way. Because of all this, it’s very hard to tell if AI progress is stagnating or not. Are the models getting better? Are they any good right now? Compounding that problem, it’s hard to judge between two models that are both smarter than you (in a particular domain). If the models do keep getting better, we might expect it to feel like they’re plateauing, because once they get better than us we’ll stop seeing evidence of improvement.
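The chess comparison in this post can be made a little more concrete with the standard Elo expected-score formula. This is only a sketch: the three ratings below are hypothetical round numbers picked to match the "double the rating" framing, not measured values for any real player or engine.

```python
def expected_score(player_rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score per game (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - player_rating) / 400.0))


# Hypothetical ratings, chosen only to illustrate the shape of the argument.
CLUB_PLAYER = 1200
ENGINE_1980S = 1800
ENGINE_MODERN = 2 * ENGINE_1980S  # "double the rating"

for name, rating in [("1980s engine", ENGINE_1980S), ("modern engine", ENGINE_MODERN)]:
    score = expected_score(CLUB_PLAYER, rating)
    print(f"vs {name}: expected score per game = {score:.6f}")
```

Under these assumed numbers both expected scores are close to zero, so from the weaker player's seat the two engines are hard to tell apart by results alone, even though the gap between the engines themselves is enormous.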

0 views
Schneems 1 weeks ago

Disallow code usage with a custom `clippy.toml`

I recently discovered that adding a `clippy.toml` file to the root of a Rust project gives the ability to disallow a method or a type when running `cargo clippy`. This has been really useful. I want to share two quick ways that I’ve used it: enhancing `std::fs` calls via `fs_err`, and protecting CWD thread safety in tests. Update: you can also use this technique to disallow unwrap()! There’s also which you use by adding to your . I use the fs_err crate in my projects, which provides the same filesystem API as `std::fs` but with one crucial difference: error messages it produces have the name of the file you’re trying to modify. Recently, while I was skimming the issues, someone mentioned using clippy.toml to deny `std::fs` usage. I thought the idea was neat, so I tried it in my projects, and it worked like a charm. With this in the `clippy.toml` file: Someone running `cargo clippy` will get an error: Running `cargo clippy --fix` will now automatically update the code. Neat! Why was I skimming issues in the first place? I suggested adding a feature to allow enhancing errors with debugging information, so instead of: The message could contain a lot more info: To implement that functionality, I wrote path_facts, a library that provides facts about your filesystem (for debugging purposes). And since the core value of the library is around producing good-looking output, I wanted snapshot tests that covered all my main branches. This includes content from both relative and absolute paths. A naive implementation might look like this: In the above code, the test changes the current working directory to a temp dir where it is then free to make modifications on disk. But, since Rust uses a multi-threaded test runner and `std::env::set_current_dir` affects the whole process, this approach is not safe ☠️. There are a lot of different ways to approach the fix, like using cargo-nextest, which executes all tests in their own process (where changing the CWD is safe). Though this doesn’t prevent someone from running `cargo test` accidentally.
There are other crates that use macros to force non-concurrent test execution, but they require you to remember to tag the appropriate tests. I wanted something lightweight that was hard to mess up, so I turned to `clippy.toml` to fail if anyone used `std::env::set_current_dir` for any reason: Then I wrote a custom type that used a mutex to guarantee that only one test body was executing at a time: You might call my end solution hacky (this hedge statement brought to you by too many years of being ONLINE), but it prevents anyone (including future-me) from writing an accidentally thread-unsafe test: Those are only two quick examples showing how to use clippy.toml to enhance a common API, and how to safeguard against incorrect usage. There’s plenty more you can do with that file, including: You wouldn’t want to use this technique of annotating your project with `clippy.toml` if the thing you’re trying to prevent would be actively malicious for the system if it executes, since rules won’t block your . You’ll also need to make sure to run `cargo clippy` in your CI so some usage doesn’t accidentally slip through. And that clippy lint work has paid off: my latest PR to was merged and deployed in version , and you can use it to speed up your development debugging by turning on the feature: Clip cautiously, my friends.

0 views
devansh 3 weeks ago

AI pentest scoping playbook

Disclosure: Certain sections of this content were grammatically refined/updated using AI assistance, as English is not my first language. Organizations are throwing money at "AI red teams" who run a few prompt injection tests, declare victory, and cash checks. Security consultants are repackaging traditional pentest methodologies with "AI" slapped on top, hoping nobody notices they're missing 80% of the actual attack surface. And worst of all, the people building AI systems, the ones who should know better, are scoping engagements like they're testing a CRUD app from 2015. This guide/playbook exists because the current state of AI security testing is dangerously inadequate. The attack surface is massive. The risks are novel. The methodologies are immature. And the consequences of getting it wrong are catastrophic. These are my personal views, informed by professional experience but not representative of my employer. What follows is what I wish every CISO, security lead, and AI team lead understood before they scoped their next AI security engagement. Traditional web application pentests follow predictable patterns. You scope endpoints, define authentication boundaries, exclude production databases, and unleash testers to find SQL injection and XSS. The attack surface is finite, the vulnerabilities are catalogued, and the methodologies are mature. AI systems break all of that. First, the system output is non-deterministic . You can't write a test case that says "given input X, expect output Y" because the model might generate something completely different next time. This makes reproducibility, the foundation of security testing, fundamentally harder. Second, the attack surface is layered and interconnected . You're not just testing an application. 
You're testing a model (which might be proprietary and black-box), a data pipeline (which might include RAG, vector stores, and real-time retrieval), integration points (APIs, plugins, browser tools), and the infrastructure underneath (cloud services, containers, orchestration). Third, novel attack classes exist that don't map to traditional vuln categories . Prompt injection isn't XSS. Data poisoning isn't SQL injection. Model extraction isn't credential theft. Jailbreaks don't fit CVE taxonomy. The OWASP Top 10 doesn't cover this. Fourth, you might not control the model . If you're using OpenAI's API or Anthropic's Claude, you can't test the training pipeline, you can't audit the weights, and you can't verify alignment. Your scope is limited to what the API exposes, which means you're testing a black box with unknown internals. Fifth, AI systems are probabilistic, data-dependent, and constantly evolving . A model that's safe today might become unsafe after fine-tuning. A RAG system that's secure with Dataset A might leak PII when Dataset B is added. An autonomous agent that behaves correctly in testing might go rogue in production when it encounters edge cases. This isn't incrementally harder than web pentesting. It's just fundamentally different. And if your scope document looks like a web app pentest with "LLM" find-and-replaced in, you're going to miss everything that matters. Before you can scope an AI security engagement, you need to understand what you're actually testing. And most organizations don't. Here's the stack: This is the thing everyone focuses on because it's the most visible. But "the model" isn't monolithic. Base model : Is it GPT-4? Claude? Llama 3? Mistral? A custom model you trained from scratch? Each has different vulnerabilities, different safety mechanisms, different failure modes. Fine-tuning : Have you fine-tuned the base model on your own data? Fine-tuning can break safety alignment. It can introduce backdoors. 
It can memorize training data and leak it during inference. If you've fine-tuned, that's in scope. Instruction tuning : Have you applied instruction-tuning or RLHF to shape model behavior? That's another attack surface. Adversaries can craft inputs that reverse your alignment work. Multi-model orchestration : Are you running multiple models and aggregating outputs? That introduces new failure modes. What happens when Model A says "yes" and Model B says "no"? How do you handle consensus? Can an adversary exploit disagreements? Model serving infrastructure : How is the model deployed? Is it an API? A container? Serverless functions? On-prem hardware? Each deployment model has different security characteristics. AI systems don't just run models. They feed data into models. And that data pipeline is massive attack surface. Training data : Where did the training data come from? Who curated it? How was it cleaned? Is it public? Proprietary? Scraped? Licensed? Can an adversary poison the training data? RAG (Retrieval-Augmented Generation) : Are you using RAG to ground model outputs in retrieved documents? That's adding an entire data retrieval system to your attack surface. Can an adversary inject malicious documents into your knowledge base? Can they manipulate retrieval to leak sensitive docs? Can they poison the vector embeddings? Vector databases : If you're using RAG, you're running a vector database (Pinecone, Weaviate, Chroma, etc.). That's infrastructure. That has vulnerabilities. That's in scope. Real-time data ingestion : Are you pulling live data from APIs, databases, or user uploads? Each data source is a potential injection point. Data preprocessing : How are inputs sanitized before hitting the model? Are you stripping dangerous characters? Validating formats? Filtering content? Attackers will test every preprocessing step for bypasses. Models don't exist in isolation. They're integrated into applications. And those integration points are attack surface. 
APIs : How do users interact with the model? REST APIs? GraphQL? WebSockets? Each has different attack vectors. Authentication and authorization : Who can access the model? How are permissions enforced? Can an adversary escalate privileges? Rate limiting : Can an adversary send 10,000 requests per second? Can they DoS your model? Can they extract the entire training dataset via repeated queries? Logging and monitoring : Are you logging inputs and outputs? If yes, are you protecting those logs from unauthorized access? Logs containing sensitive user queries are PII. Plugins and tool use : Can the model call external APIs? Execute code? Browse the web? Use tools? Every plugin is an attack vector. If your model can execute Python, an adversary will try to get it to run . Multi-turn conversations : Do users have multi-turn dialogues with the model? Multi-turn interactions create new attack surfaces because adversaries can condition the model over multiple turns, bypassing safety mechanisms gradually. If you've built agentic systems, AI that can plan, reason, use tools, and take actions autonomously, you've added an entirely new dimension of attack surface. Tool access : What tools can the agent use? File system access? Database queries? API calls? Browser automation? The more powerful the tools, the higher the risk. Planning and reasoning : How does the agent decide what actions to take? Can an adversary manipulate the planning process? Can they inject malicious goals? Memory systems : Do agents have persistent memory? Can adversaries poison that memory? Can they extract sensitive information from memory? Multi-agent coordination : Are you running multiple agents that coordinate? Can adversaries exploit coordination protocols? Can they cause agents to turn on each other or collude against safety mechanisms? Escalation paths : Can an agent escalate privileges? Can it access resources it shouldn't? Can it spawn new agents? AI systems run on infrastructure.
That infrastructure has traditional security vulnerabilities that still matter. Cloud services : Are you running on AWS, Azure, GCP? Are your S3 buckets public? Are your IAM roles overly permissive? Are your API keys hardcoded in repos? Containers and orchestration : Are you using Docker, Kubernetes? Are your container images vulnerable? Are your registries exposed? Are your secrets managed properly? CI/CD pipelines : How do you deploy model updates? Can an adversary inject malicious code into your pipeline? Dependencies : Are you using vulnerable Python libraries? Compromised npm packages? Poisoned PyPI distributions? Secrets management : Where are your API keys, database credentials, and model weights stored? Are they in environment variables? Config files? Secret managers? How much of that did you include in your last AI security scope document? If the answer is "less than 60%", your scope is inadequate. And you're going to get breached by someone who understands the full attack surface. The OWASP Top 10 for LLM Applications is the closest thing we have to a standardized framework for AI security testing. If you're scoping an AI engagement and you haven't mapped every item in this list to your test plan, you're doing it wrong. Here's the 2025 version: That's your baseline. But if you stop there, you're missing half the attack surface. The OWASP LLM Top 10 is valuable, but it's not comprehensive. Here's what's missing: Safety ≠ security . But unsafe AI systems cause real harm, and that's in scope for red teaming. Alignment failures : Can the model be made to behave in ways that violate its stated values? Constitutional AI bypass : If you're using constitutional AI techniques (like Anthropic's Claude), can adversaries bypass the constitution? Bias amplification : Does the model exhibit or amplify demographic biases? This isn't just an ethics issue—it's a legal risk under GDPR, EEOC, and other regulations. 
Harmful content generation : Can the model be tricked into generating illegal, dangerous, or abusive content? Deceptive behavior : Can the model lie, manipulate, or deceive users? Traditional adversarial ML attacks apply to AI systems. Evasion attacks : Can adversaries craft inputs that cause misclassification? Model inversion : Can adversaries reconstruct training data from model outputs? Model extraction : Can adversaries steal model weights through repeated queries? Membership inference : Can adversaries determine if specific data was in the training set? Backdoor attacks : Does the model have hidden backdoors that trigger on specific inputs? If your AI system handles multiple modalities (text, images, audio, video), you have additional attack surface. Cross-modal injection : Attackers embed malicious instructions in images that the vision-language model follows. Image perturbation attacks : Small pixel changes invisible to humans cause model failures. Audio adversarial examples : Audio inputs crafted to cause misclassification. Typographic attacks : Adversarial text rendered as images to bypass filters. Multi-turn multimodal jailbreaks : Combining text and images across multiple turns to bypass safety. AI systems must comply with GDPR, HIPAA, CCPA, and other regulations. PII handling : Does the model process, store, or leak personally identifiable information? Right to explanation : Can users get explanations for automated decisions (GDPR Article 22)? Data retention : How long is data retained? Can users request deletion? Cross-border data transfers : Does the model send data across jurisdictions? Before you write your scope document, answer every single one of these questions. If you can't answer them, you don't understand your system well enough to scope a meaningful AI security engagement. If you can answer all these questions, you're ready to scope. If you can't, you're not. 
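The "answer every question before you scope" rule can be tracked mechanically. What follows is a minimal sketch, not part of the playbook itself: the `ScopingQuestion` type, the `area` labels, and the sample answers are all hypothetical, and only the question wording is taken from the checklist.

```python
from dataclasses import dataclass


@dataclass
class ScopingQuestion:
    area: str         # hypothetical grouping label, e.g. "model" or "agents"
    text: str         # the question, as worded in the checklist
    answer: str = ""  # an empty or whitespace-only answer counts as unanswered


def unanswered(questions):
    """Return the questions that still have no answer."""
    return [q for q in questions if not q.answer.strip()]


def ready_to_scope(questions):
    """True only when there is at least one question and none are unanswered."""
    return bool(questions) and not unanswered(questions)


# Sample answers are made up for illustration.
checklist = [
    ScopingQuestion("model", "Have you fine-tuned the base model? On what data?",
                    answer="Yes, on internal support tickets"),
    ScopingQuestion("data", "Are you using RAG (Retrieval-Augmented Generation)?",
                    answer="Yes"),
    ScopingQuestion("agents", "Can agents spawn other agents?"),
]

for gap in unanswered(checklist):
    print(f"[{gap.area}] still open: {gap.text}")
print("ready to scope:", ready_to_scope(checklist))
```

Keeping the open questions visible per area makes it easier to see which system owner still needs to be asked before the scope document is drafted.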
Your AI pentest/engagement scope document needs to be more detailed than a traditional pentest scope. Here's the structure: What we're testing : One-paragraph description of the AI system. Why we're testing : Business objectives (compliance, pre-launch validation, continuous assurance, incident response). Key risks : Top 3-5 risks that drive the engagement. Success criteria : What does "passing" look like? Architectural diagram : Include everything—model, data pipelines, APIs, infrastructure, third-party services. Component inventory : List every testable component with owner, version, and deployment environment. Data flows : Document how data moves through the system, from user input to model output to downstream consumers. Trust boundaries : Identify where data crosses trust boundaries (user → app, app → model, model → tools, tools → external APIs). Be exhaustive. List: For each component, specify: Map every OWASP LLM Top 10 item to specific test cases. Example: LLM01 - Prompt Injection : Include specific threat scenarios: Explicitly list what's NOT being tested: Tools : List specific tools testers will use: Techniques : Test phases : Authorization : All testing must be explicitly authorized in writing. Include names, signatures, dates. Ethical boundaries : No attempts at physical harm, financial fraud, illegal content generation (unless explicitly scoped for red teaming). Disclosure : Critical findings must be disclosed immediately via designated channel (email, Slack, phone). Standard findings can wait for formal report. Data handling : Testers must not exfiltrate user data, training data, or model weights except as explicitly authorized for demonstration purposes. All test data must be destroyed post-engagement. Legal compliance : Testing must comply with all applicable laws and regulations. If testing involves accessing user data, appropriate legal review must be completed. 
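The instruction to map every OWASP LLM Top 10 item to specific test cases lends itself to an automated coverage check. This sketch assumes the items are identified as LLM01 through LLM10 (the text names LLM01 as Prompt Injection). The `PlannedTest` type, the component names, and the test descriptions are hypothetical.

```python
from dataclasses import dataclass

# Item identifiers only: LLM01 through LLM10.
OWASP_LLM_IDS = [f"LLM{n:02d}" for n in range(1, 11)]


@dataclass(frozen=True)
class PlannedTest:
    owasp_id: str     # e.g. "LLM01"
    component: str    # entry from the component inventory
    description: str  # what the tester will attempt


def unmapped_items(plan):
    """OWASP item IDs that have no planned test case yet."""
    covered = {t.owasp_id for t in plan}
    return [item for item in OWASP_LLM_IDS if item not in covered]


def unknown_ids(plan):
    """IDs used in the plan that fall outside LLM01..LLM10 (likely typos)."""
    return sorted({t.owasp_id for t in plan} - set(OWASP_LLM_IDS))


# Hypothetical plan fragment covering only prompt injection so far.
plan = [
    PlannedTest("LLM01", "chat API", "Direct prompt injection in the user message"),
    PlannedTest("LLM01", "RAG ingestion", "Instructions hidden in an uploaded document"),
]

print("unmapped:", unmapped_items(plan))
print("unknown:", unknown_ids(plan))
```

A check like this could run whenever the scope document changes, so that an item silently dropped from the plan shows up as a gap rather than as an untested area.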
Technical report : Detailed findings with severity ratings, reproduction steps, evidence (screenshots, logs, payloads), and remediation guidance. Executive summary : Business-focused summary of key risks and recommendations. Threat model : Updated threat model based on findings. Retest availability : Will testers be available for retest after fixes? Timeline : Start date, end date, report delivery date, retest window. Key contacts : That's your scope document. It should be 10-20 pages. If it's shorter, you're missing things. Here's what I see organizations get wrong: Mistake 1: Scoping only the application layer, not the model You test the web app that wraps the LLM, but you don't test the LLM itself. You find XSS and broken authz, but you miss prompt injection, jailbreaks, and data extraction. Fix : Scope the full stack: app, model, data pipelines, infrastructure. Mistake 2: Treating the model as a black box when you control it If you fine-tuned the model, you have access to training data and weights. Test for data poisoning, backdoors, and alignment failures. Don't just test the API. Fix : If you control any part of the model lifecycle (training, fine-tuning, deployment), include that in scope. Mistake 3: Ignoring RAG and vector databases You test the LLM, but you don't test the document store. Adversaries inject malicious documents, manipulate retrieval, and poison embeddings, and you never saw it coming. Fix : If you're using RAG, the vector database and document ingestion pipeline are in scope. Mistake 4: Not testing multi-turn interactions You test single-shot prompts, but adversaries condition the model over 10 turns to bypass refusal mechanisms. You missed the attack entirely. Fix : Test multi-turn dialogues explicitly. Test conversation history isolation. Test memory poisoning. Mistake 5: Assuming third-party models are safe You're using OpenAI's API, so you assume it's secure.
But you're passing user PII in prompts, you're not validating outputs before execution, and you haven't considered what happens if OpenAI's safety mechanisms fail. Fix : Even with third-party models, test your integration. Test input/output handling. Test failure modes. Mistake 6: Not including AI safety in security scope You test for technical vulnerabilities but ignore alignment failures, bias amplification, and harmful content generation. Then your model generates racist outputs or dangerous instructions, and you're in the news. Fix : AI safety is part of AI security. Include alignment testing, bias audits, and harm reduction validation. Mistake 7: Underestimating autonomous agent risks You test the LLM, but your agent can execute code, call APIs, and access databases. An adversary hijacks the agent, and it deletes production data or exfiltrates secrets. Fix : Autonomous agents are their own attack surface. Test tool permissions, privilege escalation, and agent behavior boundaries. Mistake 8: Not planning for continuous testing You do one pentest before launch, then never test again. But you're fine-tuning weekly, adding new plugins monthly, and updating RAG documents daily. Your attack surface is constantly changing. Fix : Scope for continuous red teaming, not one-time assessment. Organizations hire expensive consultants to run a few prompt injection tests, declare the system "secure," and ship to production. Then they get breached six months later when someone figures out a multi-turn jailbreak or poisons the RAG document store. The problem isn't that the testers are bad. The problem is that the scopes are inadequate . You can't find what you're not looking for. If your scope doesn't include RAG poisoning, testers won't test for it. If your scope doesn't include membership inference, testers won't test for it. If your scope doesn't include agent privilege escalation, testers won't test for it. And attackers will. 
The asymmetry is brutal: you have to defend every attack vector. Attackers only need to find one that works. So when you scope your next AI security engagement, ask yourself: "If I were attacking this system, what would I target?" Then make sure every single one of those things is in your scope document. Because if it's not in scope, it's not getting tested. And if it's not getting tested, it's going to get exploited. Traditional pentests are point-in-time assessments. You test, you report, you fix, you're done. That doesn't work for AI systems. AI systems evolve constantly: Every change introduces new attack surface. And if you're only testing once a year, you're accumulating risk for 364 days. You need continuous red teaming. Here's how to build it: Use tools like Promptfoo, Garak, and PyRIT to run automated adversarial testing on every model update. Integrate tests into CI/CD pipelines so every deployment is validated before production. Set up continuous monitoring for: Quarterly or bi-annually, bring in expert red teams for comprehensive testing beyond what automation can catch. Focus deep assessments on: Train your own security team on AI-specific attack techniques. Develop internal playbooks for: Every quarter, revisit your threat model: Update your testing roadmap based on evolving threats. Scoping AI security engagements is harder than traditional pentests because the attack surface is larger, the risks are novel, and the methodologies are still maturing. But it's not impossible. You need to: If you do this right, you'll find vulnerabilities before attackers do. If you do it wrong, you'll end up in the news explaining why your AI leaked training data, generated harmful content, or got hijacked by adversaries.
What base model are you using (GPT-4, Claude, Llama, Mistral, custom)? Is the model proprietary (OpenAI API) or open-source?
Have you fine-tuned the base model? On what data? Have you applied instruction tuning, RLHF, or other alignment techniques? How is the model deployed (API, on-prem, container, serverless)? Do you have access to model weights? Can testers query the model directly, or only through your application? Are there rate limits? What are they? What's the model's context window size? Does the model support function calling or tool use? Is the model multimodal (vision, audio, text)? Are you using multiple models in ensemble or orchestration? Where did training data come from (public, proprietary, scraped, licensed)? Was training data curated or filtered? How? Is training data in scope for poisoning tests? Are you using RAG (Retrieval-Augmented Generation)? If RAG: What's the document store (vector DB, traditional DB, file system)? If RAG: How are documents ingested? Who controls ingestion? If RAG: Can testers inject malicious documents? If RAG: How is retrieval indexed and searched? Do you pull real-time data from external sources (APIs, databases)? How is input data preprocessed and sanitized? Is user conversation history stored? Where? For how long? Can users access other users' data? How do users interact with the model (web app, API, chat interface, mobile app)? What authentication mechanisms are used (OAuth, API keys, session tokens)? What authorization model is used (RBAC, ABAC, none)? Are there different user roles with different permissions? Is there rate limiting? At what levels (user, IP, API key)? Are inputs and outputs logged? Where? Who has access to logs? Are logs encrypted at rest and in transit? How are errors handled? Are error messages exposed to users? Are there webhooks or callbacks that the model can trigger? Can the model call external APIs? Which ones? Can the model execute code? In what environment? Can the model browse the web? Can the model read/write files? Can the model access databases? What permissions do plugins have? 
How are plugin outputs validated before use? Can users add custom plugins? Are plugin interactions logged? Do you have autonomous agents that plan and execute multi-step tasks? What tools can agents use? Can agents spawn other agents? Do agents have persistent memory? Where is it stored? How are agent goals and constraints defined? Can agents access sensitive resources (DBs, APIs, filesystems)? Can agents escalate privileges? Are there kill-switches or circuit breakers for agents? How is agent behavior monitored? What cloud provider(s) are you using (AWS, Azure, GCP, on-prem)? Are you using containers (Docker)? Orchestration (Kubernetes)? Where are model weights stored? Who has access? Where are API keys and secrets stored? Are secrets in environment variables, config files, or secret managers? How are dependencies managed (pip, npm, Docker images)? Have you scanned dependencies for known vulnerabilities? How are model updates deployed? What's the CI/CD pipeline? Who can deploy model updates? Are there staging environments separate from production? What safety mechanisms are in place (content filters, refusal training, constitutional AI)? Have you red-teamed for jailbreaks? Have you tested for bias across demographic groups? Have you tested for harmful content generation? Do you have human-in-the-loop review for sensitive outputs? What's your incident response plan if the model behaves unsafely? Can testers attempt to jailbreak the model? Can testers attempt prompt injection? Can testers attempt data extraction (training data, PII)? Can testers attempt model extraction or inversion? Can testers attempt DoS or resource exhaustion? Can testers poison training data (if applicable)? Can testers test multi-turn conversations? Can testers test RAG document injection? Can testers test plugin abuse? Can testers test agent privilege escalation? Are there any topics, content types, or test methods that are forbidden? 
What's the escalation process if critical issues are found during testing? What regulations apply (GDPR, HIPAA, CCPA, FTC, EU AI Act)? Do you process PII? What types? Do you have data processing agreements with model providers? Do you have the legal right to test this system? Are there export control restrictions on the model or data? What are the disclosure requirements for findings? What's the confidentiality agreement for testers? Model(s) : Exact model names, versions, access methods APIs : All endpoints with authentication requirements Data stores : Databases, vector stores, file systems, caches Integrations : Every third-party service, plugin, tool Infrastructure : Cloud accounts, containers, orchestration Applications : Web apps, mobile apps, admin panels Access credentials testers will use Environments (dev, staging, prod) that are in scope Testing windows (if limited) Rate limits or usage restrictions Test direct instruction override Test indirect injection via RAG documents Test multi-turn conditioning Test system prompt extraction Test jailbreak techniques (roleplay, hypotheticals, encoding) Test cross-turn memory poisoning "Can an attacker leak other users' conversation history?" "Can an attacker extract training data containing PII?" "Can an attacker bypass content filters to generate harmful instructions?" 
Production environments (if testing only staging) Physical security Social engineering of employees Third-party SaaS providers we don't control Specific attack types (if any are prohibited) Manual testing Promptfoo for LLM fuzzing Garak for red teaming PyRIT for adversarial prompting ART (Adversarial Robustness Toolbox) for ML attacks Custom scripts for specific attack vectors Traditional tools (Burp Suite, Caido, Nuclei) for infrastructure Prompt injection testing Jailbreak attempts Data extraction attacks Model inversion Membership inference Evasion attacks RAG poisoning Plugin abuse Agent privilege escalation Infrastructure scanning Reconnaissance and threat modeling Automated vulnerability scanning Manual testing of high-risk areas Exploitation and impact validation Reporting and remediation guidance Engagement lead (security team) Technical point of contact (AI team) Escalation contact (for critical findings) Legal contact (for questions on scope) Models get fine-tuned RAG document stores get updated New plugins get added Agents gain new capabilities Infrastructure changes Prompt injection attempts Jailbreak successes Data extraction queries Unusual tool usage patterns Agent behavior anomalies Novel attack vectors that tools don't cover Complex multi-step exploitation chains Social engineering combined with technical attacks Agent hijacking and multi-agent exploits Prompt injection testing Jailbreak methodology RAG poisoning Agent security testing What new attacks have been published? What new capabilities have you added? What new integrations are in place? What new risks does the threat landscape present? Understand the full stack : model, data pipelines, application, infrastructure, agents, everything. Map every attack vector : OWASP LLM Top 10 is your baseline, not your ceiling. Answer scoping questions (mentioned above) : If you can't answer them, you don't understand your system. Write detailed scope documents : 10-20 pages, not 2 pages. 
Use the right tools : Promptfoo, Garak, ART, LIME, SHAP—not just Burp Suite. Test continuously : Not once, but ongoing. Avoid common mistakes : Don't ignore RAG, don't underestimate agents, don't skip AI safety.

sunshowers 3 weeks ago

`SocketAddrV6` is not roundtrip serializable

A few weeks ago at Oxide , we encountered a bug where a particular, somewhat large, data structure was erroring on serialization to JSON via . The problem was that JSON only supports map keys that are strings or numbers, and the data structure had an infrequently-populated map with keys that were more complex than that 1 . We fixed the bug, but a concern still remained: what if some other map that was empty most of the time had a complex key in it? The easiest way to guard against this is by generating random instances of the data structure and attempting to serialize them, checking that this operation doesn’t panic. The most straightforward way to do this is with property-based testing , where you define: Modern property-based testing frameworks like , which we use at Oxide, combine these two algorithms into a single strategy , through a technique known as integrated shrinking . (For a more detailed overview, see my monad tutorial , where I talk about the undesirable performance characteristics of monadic composition when it comes to integrated shrinking.) The library has a notion of a canonical strategy for a type, expressed via the trait . The easiest way to define instances for large, complex types is to use a derive macro . Annotate your type with the macro: As long as all the fields have defined for them—and the library defines the trait for most types in the standard library—your type has a working random generator and shrinker associated with it. It’s pretty neat! I put together an implementation for our very complex type, then wrote a property-based test to ensure that it serializes properly: And, running it: The test passed! But while we’re here, surely we should also be able to deserialize a , and then ensure that we get the same value back, right? We’ve already done the hard part, so let’s go ahead and add this test: The roundtrip test failed! Why in the world did the test fail? 
My first idea was to try and do a textual diff of the outputs of the two data structures. In this case, I tried out the library, with something like: And the output I got was: There’s nothing in the output! No or as would typically be printed. It’s as if there wasn’t a difference at all, and yet the assertion failing indicated the before and after values just weren’t the same. We have one clue to go by: the integrated shrinking algorithm in tries to shrink maps down to empty ones. But it looks like the map is non-empty . This means that something in either the key or the value was suspicious. A is defined as: Most of these types were pretty simple. The only one that looked even remotely suspicious was the , which ostensibly represents an IPv6 address plus a port number. What’s going on with the ? Does the implementation for it do something weird? Well, let’s look at it : Like a lot of abstracted-out library code it looks a bit strange, but at its core it seems to be simple enough: The is self-explanatory, and the is probably the port number. But what are these last two values? Let’s look at the constructor : What in the world are these two and values? They look mighty suspicious. A thing that caught my eye was the “Textual representation” section of the , which defined the representation as: Note what’s missing from this representation: the field! We finally have a theory for what’s going on: Why did this not show up in the textual diff of the values? For most types in Rust, the representation breaks out all the fields and their values. But for , the implementation (quite reasonably) forwards to the implementation . So the field is completely hidden, and the only way to look at it is through the method . Whoops. How can we test this theory? The easiest way is to generate random values of where is always set to zero, and see if that passes our roundtrip tests. The ecosystem has pretty good support for generating and using this kind of non-canonical strategy. 
Let’s try it out: Pretty straightforward, and similar to how lets you provide custom implementations through . Let’s test it out again: All right, looks like our theory is confirmed! We can now merrily be on our way… right? This little adventure left us with more questions than answers, though: The best place to start looking is in the IETF Request for Comments (RFCs) 2 that specify IPv6. The Rust documentation for helpfully links to RFC 2460, section 6 and section 7 . The field is actually a combination of two fields that are part of every IPv6 packet: Section 6 of the RFC says: Flow Labels The 20-bit Flow Label field in the IPv6 header may be used by a source to label sequences of packets for which it requests special handling by the IPv6 routers, such as non-default quality of service or “real-time” service. This aspect of IPv6 is, at the time of writing, still experimental and subject to change as the requirements for flow support in the Internet become clearer. […] And section 7: Traffic Classes The 8-bit Traffic Class field in the IPv6 header is available for use by originating nodes and/or forwarding routers to identify and distinguish between different classes or priorities of IPv6 packets. At the point in time at which this specification is being written, there are a number of experiments underway in the use of the IPv4 Type of Service and/or Precedence bits to provide various forms of “differentiated service” for IP packets […]. Let’s look at the Traffic Class field first. This field is similar to IPv4’s differentiated services code point (DSCP) , and is meant to provide quality of service (QoS) over the network. (For example, prioritizing low-latency gaming and video conferencing packets over bulk downloads.) The DSCP field in IPv4 is not part of a , but the Traffic Class—through the field—is part of a . Why is that the case? Rust’s definition of mirrors the defined by RFC 2553, section 3.3 : Similarly, Rust’s mirrors the struct. 
There isn’t a similar RFC for ; the de facto standard is Berkeley sockets , designed in 1983. The Linux man page for defines it as: So , which includes the Traffic Class, is part of , but the very similar DSCP field is not part of . Why? I’m not entirely sure about this, but here’s an attempt to reconstruct a history: (Even if could be extended to have this field, would it be a good idea to do so? Put a pin in this for now.) RFC 2460 says that the Flow Label is “experimental and subject to change”. The RFC was written back in 1998, over a quarter-century ago—has anyone found a use for it since then? RFC 6437 , published in 2011, attempts to specify semantics for IPv6 Flow Labels. Section 2 of the RFC says: The 20-bit Flow Label field in the IPv6 header [RFC2460] is used by a node to label packets of a flow. […] Packet classifiers can use the triplet of Flow Label, Source Address, and Destination Address fields to identify the flow to which a particular packet belongs. The RFC says that Flow Labels can potentially be used by routers for load balancing, where they can use the triplet source address, destination address, flow label to figure out that a series of packets are all associated with each other. But this is an internal implementation detail generated by the source program, and not something IPv6 users copy/pasting an address generally have to think about. So it makes sense that it isn’t part of the textual representation. RFC 6294 surveys Flow Label use cases, and some of the ones mentioned are: But this Stack Exchange answer by Andrei Korshikov says: Nowadays […] there [are] no clear advantages of additional 20-bit QoS field over existent Traffic Class (Differentiated Class of Service) field. So “Flow Label” is still waiting for its meaningful usage. In my view, putting in was an understandable choice given the optimism around QoS in 1998, but it was a bit of a mistake in hindsight. 
The Flow Label field never found widespread adoption, and the Traffic Class field is more of an application-level concern. In general, I think there should be a separation between types that are losslessly serializable and types that are not, and violates this expectation. Making the Traffic Class (QoS) a socket option, like in IPv4, avoids these serialization issues. What about the other additional field, ? What does it mean, and why does it not have to be zeroed out? The documentation for a says that in its textual representation, the scope identifier is included after the IPv6 address and a character, within square brackets. So, for example, the following code sample: prints out . What does this field mean? The reason exists has to do with link-local addressing . Imagine you connect two computers directly to each other via, say, an Ethernet cable. There isn’t a central server telling the computers which addresses to use, or anything similar—in this situation, how can the two computers talk to each other? To address this issue, OS vendors came up with the idea to just assign random addresses on each end of the link. The behavior is defined in RFC 3927, section 2.1 : When a host wishes to configure an IPv4 Link-Local address, it selects an address using a pseudo-random number generator with a uniform distribution in the range from 169.254.1.0 to 169.254.254.255 inclusive. (You might have seen these 169.254 addresses on your home computers if your router is down. Those are link-local addresses.) Sounds simple enough, right? But there is a pretty big problem with this approach: what if a computer has more than one interface on which a link-local address has been established? When a program tries to send some data over the network, the computer has to know which interface to send the data out on. But with multiple link-local interfaces, the outbound one becomes ambiguous. 
This is described in section 6.3 of the RFC: Address Ambiguity Application software run on a multi-homed host that supports IPv4 Link-Local address configuration on more than one interface may fail. This is because application software assumes that an IPv4 address is unambiguous, that it can refer to only one host. IPv4 Link-Local addresses are unique only on a single link. A host attached to multiple links can easily encounter a situation where the same address is present on more than one interface, or first on one interface, later on another; in any case associated with more than one host. […] The IPv6 protocol designers took this lesson to heart. Every time an IPv6-capable computer connects to a network, it establishes a link-local address starting with . (You should be able to see this address via on Linux, or your OS’s equivalent.) But if you’re connected to multiple networks, all of them will have addresses beginning with . Now if an application wants to establish a connection to a computer in this range, how can it tell the OS which interface to use? That’s exactly where comes in: it allows the to specify which network interface to use. Each interface has an index associated with it, which you can see on Linux with . When I run that command, I see: The , , and listed here are all the indexes that can be used as the scope ID. Let’s try pinging our address: Aha! The warning tells us that for a link-local address, the scope ID needs to be specified. Let’s try that using the syntax: Success! What if we try a different scope ID? This makes sense: the address is only valid for scope ID 2 (the interface). When we told to use a different scope, 3, the address was no longer reachable. This neatly solves the 169.254 problem with IPv4 addresses. Since scope IDs can help disambiguate the interface on which a connection ought to be made, it does make sense to include this field in , as well as in its textual representation. 
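The difference between the two extra fields can be checked with nothing but the standard library. The following is a minimal sketch rather than the test from the post: the helper name `text_roundtrip`, the address `fe80::1`, the port, and the flowinfo value are all made up for illustration. It formats a `SocketAddrV6` to text and parses it back, once with only a scope ID set and once with a non-zero flowinfo as well.

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

/// Formats an address via `Display` and parses the text back
/// (hypothetical helper, standing in for a serialize/deserialize pair).
fn text_roundtrip(addr: &SocketAddrV6) -> SocketAddrV6 {
    addr.to_string()
        .parse()
        .expect("Display output of a SocketAddrV6 should parse back")
}

fn main() {
    // An illustrative link-local address; any IPv6 address would do.
    let ip = Ipv6Addr::new(0xfe80, 0, 0, 0, 0, 0, 0, 1);

    // A non-zero scope ID is written after `%`, so it survives the trip.
    let scoped = SocketAddrV6::new(ip, 8080, 0, 2);
    println!("scoped:  {scoped}");
    assert_eq!(text_roundtrip(&scoped), scoped);

    // flowinfo has no place in the textual form, so it comes back as zero.
    let flowed = SocketAddrV6::new(ip, 8080, 0x1234, 2);
    let back = text_roundtrip(&flowed);
    println!("flowed:  {flowed}");
    println!("flowinfo before = {}, after = {}", flowed.flowinfo(), back.flowinfo());

    // Same text, different values: this is why a textual diff shows nothing.
    assert_eq!(flowed.to_string(), scoped.to_string());
    assert_ne!(back, flowed);
}
```

Both values print identically, yet only the one with a zero flowinfo compares equal after the roundtrip, which matches the behavior the property-based test stumbled on.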
The keen-eyed among you may have noticed that the commands above printed out an alternate representation: . The at the end is the network interface that corresponds to the numeric scope ID. Many programs can handle this representation, but Rust’s can’t. Another thing you might have noticed is that the scope ID only makes sense on a particular computer. A scope ID such as means different things on different computers. So the scope ID is roundtrip serializable, but not portable across machines. In this post we started off by looking at a somewhat strange inconsistency and ended up deep in the IPv6 specification. In our case, the instances were always for internal services talking to each other without any QoS considerations, so was always zero. Given that knowledge, we were okay adjusting the property-based tests to always generate instances where was set to zero. ( Here’s the PR as landed .) Still, it raises questions: Should we wrap in a newtype that enforces this constraint? Should provide a non-standard alternate serializer that also includes the field? Should not forward to when hides fields? Should Rust have had separate types from the start? (Probably too late now.) And should Berkeley sockets not have included at all, given that it makes the type impossible to represent as text without loss? The lesson it really drives home for me is how important the principle of least surprise can be. Both and have lossless textual representations, and does as well. By analogy it would seem like would, too, and yet it does not! IPv6 learned so much from IPv4’s mistakes, and yet its designers couldn’t help but make some mistakes of their own. This makes sense: the designers could only see the problems they were solving then, just as we can only see those we’re solving now—and just as we encounter problems with their solutions, future generations will encounter problems with ours. Thanks to Fiona , and several of my colleagues at Oxide, for reviewing drafts of this post. 
Discuss on Hacker News and Lobsters . This is why our Rust map crate where keys can borrow from values, , serializes its maps as lists or sequences.  ↩︎ The Requests for Discussion we use at Oxide are inspired by RFCs, though we use a slightly different term (RFD) to convey the fact that our documents are less set in stone than IETF RFCs are.  ↩︎ The two fields sum up to 28 bits, and the field is a , so there’s four bits remaining. I couldn’t find documentation for these four bits anywhere—they appear to be unused padding in the . If you know about these bits, please let me know!  ↩︎ a way to generate random instances of a particular type, and given a failing input, a way to shrink it down to a minimal failing value. generate four values: an , a , a , and another then pass them in to . A left square bracket ( ) The textual representation of an IPv6 address Optionally , a percent sign ( ) followed by the scope identifier encoded as a decimal integer A right square bracket ( ) A colon ( ) The port, encoded as a decimal integer. generated a with a non-zero field. When we went to serialize this field as JSON, we used the textual representation, which dropped the field. When we deserialized it, the field was set to zero. As a result, the before and after values were no longer equal. What does this field mean? A is just an plus a port ; why is a different? Why is the not part of the textual representation? , , and are all roundtrip serializable. Why is not? Also: what is the field? a 20-bit Flow Label, and an 8-bit Traffic Class 3 . QoS was not originally part of the 1980s Berkeley sockets specification. DSCP came about much later ( RFC 2474 , 1998). Because C structs do not provide encapsulation, the definition was set in stone and couldn’t be changed. So instead, the DSCP field is set as an option on the socket, via . By the time IPv6 came around, it was pretty clear that QoS was important, so the Traffic Class was baked into the struct. 
as a pseudo-random value that can be used as part of a hash key for load balancing, or as extra QoS bits on top of the 8 bits provided by the Traffic Class field.

Alex Molas 1 months ago

Bayesian A/B testing is not immune to peeking

Introduction

Over the last few months at RevenueCat I’ve been building a statistical framework to flag when an A/B test has reached statistical significance. I went through the usual literature, including Evan Miller’s posts. In his well-known “How Not to Run an A/B Test” there’s a claim that with Bayesian experiment design you can stop at any time and still make valid inferences, and that you don’t need a fixed sample size to get a valid result. I’ve read this claim in other posts too. The impression I got is that you can peek as often as you want, stop the moment the posterior clears a threshold (e.g. $P(A>B) > 0.95$), and you won’t inflate false positives. This is not correct. If you’re an expert in Bayesian statistics this is probably obvious, but it wasn’t for me. So I decided to run some simulations to see what really happens, and I’m sharing the results here in case they are useful for others.

Grumpy Gamer 1 months ago

Death By Testing

The following is a guest post by Robert Megone, the lead tester on Death by Scrolling. Wish List today! Most of the games I’ve worked on over the years have been slow and deliberate. Narrative-driven adventures like Return to Monkey Island, Thimbleweed Park, Broken Sword 5 and The Darkside Detective invite you to take your time to sit with every line of dialogue, carefully piece together puzzles, and explore the world at your own pace. That kind of work has always suited me. It gives you space to be meticulous, to catch the tiniest continuity errors or subtle logic gaps before anyone else does. Death by Scrolling is the complete opposite of that. Unlike an adventure game, Death by Scrolling never gives you a moment to breathe. The screen scrolls constantly and the player must run constantly. As soon as you think you’ve found your footing the game throws something new at you. That’s part of its charm but it’s also what makes testing it such a unique challenge. This is a game where players can dramatically alter or buff their stats in a whole range of ways, upgrading movement speed, stacking power-ups, boosting damage and more. Each player’s setup can be wildly different from the next, which makes it impossible to test just “one version” of the game. Instead, I’ve had to test against what feels like a moving target. Sometimes literally. The moment I think I’ve pinned down a bug, the next run will throw me something entirely different, a new combination of buffs or modifiers that bends the rules in unexpected ways and reveals a completely new edge case to investigate. Because of that unpredictability, Death by Scrolling demands heavy, sustained testing. There’s little room for error. A single overlooked bug can completely break the flow and ruin a run. Where adventure games give you space to be methodical, this one forces you to be reactive to stay just as alert and fast-moving as the game itself. Thankfully, I haven’t been tackling this beast alone. 
We’ve had a host of fantastic playtesters who’ve been instrumental in helping us track down some of the more elusive issues. Their feedback has been invaluable not just in surfacing bugs, but in showing us how different players approach the game, what kinds of setups they gravitate towards, and where the game balance can wobble. It’s been a collaborative effort, and the game is so much stronger for it. With so many different level prefabs (used to randomly generate the levels that you traverse in the game), ensuring full coverage across every biome, enemy type, and powerup combo quickly became a task that I was battling to manage through manual testing alone, despite having the assistance of so many wonderful playtesters. That’s where TesterTron3000™ came in. TesterTron3000™ is a name that may be familiar to some, it’s been a faithful helper that’s proved its worth time and time again, across Thimbleweed Park and Return to Monkey Island, albeit by name only. The underlying functionality is vastly different in this game. TesterTron3000™ is a script that mimics player input, automatically testing many of the core game mechanics while blasting through level after level at an impressive speed. It has been most helpful in uncovering gameplay blockers and crash bugs that could take many hours to find through manual play. It’s been especially useful during build sign off. While I’m testing one platform, TesterTron3000™ can be burning through hundreds of levels on Mac, Windows, or Steam Deck/Linux, this gives us wider test coverage and extra confidence in the stability of a build. For me, Death by Scrolling has been a complete change of pace from my usual work. It’s chaotic, unpredictable, and relentless in all the best ways. This has tested not just the game, but my own ability to adapt, to keep up with something that never stands still long enough to let you catch your breath. And honestly? I’ve loved every minute of it. – Robert Megone

Farid Zakaria 1 months ago

Fuzzing for fun and profit

I recently watched a keynote by Will Wilson on fuzzing – Fuzzing’25 Keynote . The talk is excellent, and one main highlight is that we have at our disposal the capability to “fuzz” our software today, and yet we do not. While I’ve seen the power of QuickCheck-like tools for property-based testing, I had never used fuzzing over an application as a whole, specifically American Fuzzy Lop . I was intrigued to add this skill to my toolbelt and maybe apply it to CppNix . As with everything else, I need to learn things from first principles . I would like to create a scenario with a known failure and see how AFL discovers it. To get started, let’s first make sure we have access to AFL via Nix . We will be using AFL++ , the daughter of AFL that incorporates newer updates and features. How does AFL work? 🤔 AFL will feed your program various inputs to try and cause a crash! 💥 In order to generate better inputs, you compile your code with a variant of or distributed by AFL, which will insert special instructions to keep track of branch coverage as it creates various test cases. Let’s create a program that crashes when given the input . We leverage a so that the compiler does not optimize the multiple instructions together. We can now compile our code with to get the instrumented binary . AFL needs to be given some sample inputs. Let’s feed it the simplest starter seed possible – an empty file! Now we simply run , and the magic happens . ✨ A really nice TUI appears that informs you of various statistics of the running fuzzer, and importantly whether any crashes have been found – ! The output directory contains all the saved information, including the input that caused the crashes. Let’s inspect it! AFL was successfully able to find our code-word, , that caused the crash. It is important to note that while it found the failure case rather quickly for my simple program, for large programs it can take a long time to explore the complete state space. 
Companies such as Google continuously run fuzzers such as AFL on well-known open source projects to help detect failures.
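To build some intuition for why branch coverage makes the search so much faster than blind guessing, here is a toy sketch in Rust. It is not how AFL is implemented and it is not the program from this post: the code-word `fuzz`, the function names, and the "depth" measure standing in for coverage are all assumptions made for illustration. The idea is only that an input reaching one more nested comparison is worth keeping as the next starting point.

```rust
/// Hypothetical code-word; the post's actual crashing input is not shown here.
const MAGIC: &[u8] = b"fuzz";

/// Stand-in for coverage: how many of the nested byte checks the input passes.
fn depth_reached(input: &[u8]) -> usize {
    input.iter().zip(MAGIC).take_while(|(a, b)| a == b).count()
}

/// The toy target "crashes" once every nested check has passed.
fn crashes(input: &[u8]) -> bool {
    depth_reached(input) == MAGIC.len()
}

/// Starts from an empty seed and keeps any candidate that reaches a deeper
/// check. Returns the crashing input and the number of target executions.
fn guided_search() -> (Vec<u8>, usize) {
    let mut best: Vec<u8> = Vec::new();
    let mut executions = 0usize;
    while !crashes(&best) {
        let depth = depth_reached(&best);
        let mut improved = false;
        for byte in 0..=u8::MAX {
            let mut candidate = best.clone();
            candidate.push(byte);
            executions += 1;
            if depth_reached(&candidate) > depth {
                // New "coverage": keep this input and mutate from it next.
                best = candidate;
                improved = true;
                break;
            }
        }
        if !improved {
            break; // Defensive guard; not expected for this toy target.
        }
    }
    (best, executions)
}

fn main() {
    let (found, executions) = guided_search();
    println!(
        "found {:?} after {} executions",
        String::from_utf8_lossy(&found),
        executions
    );
    // Guessing all four bytes at once has 256^4 possibilities to try.
    println!("blind search space: {}", 256u64.pow(4));
}
```

With feedback after every byte, the search needs at most 256 attempts per position, whereas guessing the whole word blindly faces the full 256^4 space. Real fuzzers use far richer mutation and scheduling strategies, so treat this only as a mental model.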

./techtipsy 1 months ago

Testing two 18 TB white label SATA hard drives from datablocks.dev

This post is NOT sponsored, the products were bought with my hard-earned money. I’ve been running a full SSD storage setup for a few years in my home server and I’ve been happy with it, except for the storage anxiety that I get with running small pools of fast storage, which is why I started looking at how the hard drive market is doing. Half of tech YouTube has been sponsored by companies like ServerPartDeals, so they were one of the first places I looked at, but they seem to only operate within the US and the shipping+taxes destroy any price advantages from ordering there to Estonia (which is in Europe). At some point I stumbled upon datablocks.dev , which seems to operate within a similar niche, but in Europe and on a much smaller scale. What caught my eye were their white label hard drive offerings. Their website has a good explanation of the differences between recertified and white label hard drives. In short: white label drives have no branding, have no or a very low number of power-on hours, may have small scratches or dents, but are in all other aspects completely functional and usable. White label drives also have a price advantage compared to branded recertified drives. Here’s one example with 18 TB drives: the recertified one is 16.7% more expensive compared to the white label one, and the only obvious difference seems to be the sticker on the drive. I highly suspect that the white label one is also manufactured by Seagate based on the physical similarities. I took some time to think things over and compared the pricing of various drives. The drives were all competitively priced relative to each other, with the price per terabyte hovering around 13 EUR/TB, so it didn’t matter much which drive size you picked, you’d still get a pretty solid deal. It was also a better deal compared to using a WD Elements/My Book drive of the same size. I decided to go with two 18 TB hard drives. 
I considered buying the 20 TB or 22 TB capacities, but decided to go with 18 TB because it’s the largest single hard drive that I can easily and quickly buy a replacement for in the form of a WD Elements/My Book drive. The stock on is quite volatile, the drives are in stock when new batches arrive, but they can also quickly go out of stock. I saw this live with the 22 TB hard drives: one day there are 35 left, the next day there can be 7 left, and then only one lone drive. At the time of writing, the 18 TB model that I bought is out of stock, so my choice to go with a slightly smaller but more easily replaceable one is validated. Those who have followed my blog for a while will know that I’m a huge fan of all-SSD server builds, especially this one by Jeff Geerling that I still consider building from time to time. If I dislike noise, higher power usage and slower performance, then why did I get the hard drives? It’s simple, really: I now have an actual closet that I can stash my home server in, meaning that noise isn’t that big of a worry, and as long as my home server takes about the same amount of power as my refrigerator or dishwasher, then that’s fine. SSD prices still haven’t gone down as much as I’ve hoped over the years, so the all-SSD build ideas that I have are way outside my budget. The drives arrived in a reasonable time window. The packaging was adequate, although I was slightly concerned with the cardboard box showing signs of something hitting it hard. The drives were packaged within sealed antistatic bags, and with ample bubble wrap surrounding them. Just as described, the drives did have slight scratches and very minor dents in them, but in all other aspects they looked like new. Before putting them to use, I formatted the drives using . It took a full 24 hours to do a full drive write. The write performance peaked at 275 MB/s and slowed down to 123 MB/s at the end, which is expected. 
1 I also had to choose a larger block size for because otherwise it could not handle the drive, resulting in the command being . I unfortunately did not save the SMART data from the time I received the drives, but the contents were as expected, there were no more than a few power on hours and other metrics were OK. Keep in mind that it’s also possible to reset SMART data on a drive so this information cannot be taken at face value. The drives are noisy, as expected. They run at 7200 RPM and do the usual clicks and clacks that a normal hard drive does. If this bothers you, use foam to fix it. The soft side of a sponge can work just as well. With these drives I’ve now followed my own advice and tiered my storage: two 1 TB SSD-s for the things that benefit from good speed and latency (databases, containers), and 18 TB hard drives for bulk storage, backups and less frequently used data. Coming from an all-SSD build, I expected the performance to drop in day-to-day operations, but in most cases I cannot tell a difference. My family photos load just fine, media plays back well, and backups take slightly longer, which isn’t noticeable due to them running during the night. Only when I look at the Prometheus node exporter graphs do I notice that sometimes the server is waiting behind the disks a bit more due to higher . The power usage did shoot up as a result, roughly 10-20 W. Not ideal, but my whole networking and home server setup is idling at below 45 W, and I’ve had less efficient home servers in the past, so it’s not that big of a deal. In this configuration, the drives run quite cool. During formatting on a hot day, I saw them go up to a maximum of 51°C, but in general use they sit at around 38-42°C. Overall, I’m reasonably happy with the drives. I expect these to last me at least 5 years, and I’m probably going to switch one of the drives out a bit sooner to reduce the risk of a full drive pool failure. They’ve made it the first 50 days, so that’s good! 
Oh, and here’s the output for the disks after running them for about two months: hard drives are expected to be slower at the end of the drive because of their design, the platter rotates at 7200 RPM but the end of the drive is located at the inner tracks of the platter, near the center of the spindle, which results in a slower effective speed. Math is cool! ↩︎
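The geometry in that footnote can be checked with a little arithmetic. This is only a rough sketch, assuming the number of bits per millimetre of track is about constant across the platter (a simplification, not something measured in the post), so that sequential throughput scales with the radius of the track under the head. The function name is hypothetical.

```python
import math

RPM = 7200  # spindle speed stated in the post


def linear_speed_m_per_s(radius_mm: float, rpm: float = RPM) -> float:
    """Speed of the track passing under the head at a given radius."""
    revolutions_per_second = rpm / 60
    return 2 * math.pi * (radius_mm / 1000) * revolutions_per_second


# Throughput figures reported in the post (MB/s).
outer_mb_s = 275
inner_mb_s = 123

# Under the constant-density assumption, throughput is proportional to linear
# speed, and linear speed is proportional to radius. The measured throughput
# ratio is therefore an estimate of the outer/inner track radius ratio.
radius_ratio = outer_mb_s / inner_mb_s
print(f"estimated outer/inner radius ratio: {radius_ratio:.2f}")
```

In other words, the outermost tracks would sit a bit more than twice as far from the spindle as the innermost ones. The real platter dimensions are not given in the post, so treat this as a sanity check rather than a measurement.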

0 views
Harper Reed 2 months ago

Note #288

We gave our AI coding agents access to social media. They immediately started posting. A lot. Then we tested their performance. Turns out agents with Twitter solve problems faster than agents without it. harper.blog/2025/09/3…

0 views
Sean Goedecke 2 months ago

AI coding agents rely too much on fallbacks

One frustrating pattern I’ve noticed in AI agents - at least in Claude Code, Codex and Copilot - is building automatic fallbacks. Suppose you ask Codex to build a system to automatically group pages in a wiki by topic. (This isn’t hypothetical, I just did this for EndlessWiki). You’ll probably want to use something like the Louvain method to identify clusters. But if you task an AI agent with building something like that, it usually will go one step further, and build a fallback: a separate, simpler code path if the Louvain method fails (say, grouping page slugs alphabetically). If you’re not careful, you might not even know if the Louvain method is working, or if you’re just seeing the fallback behavior. In my experience, AI agents will do this constantly. If you’re building an app that makes an AI inference request, the generated code will likely fall back to some hard-coded response if the inference request fails. If you’re using an agent to pull structured data from some API, the agent may silently fall back to placeholder data for part of it. If you’re writing some kind of clever spam detector, the agent will want to fall back to a basic keyword check if your clever approach doesn’t work. This is particularly frustrating for the main kind of work that AI agents are useful for: prototyping new ideas. If you’re using AI agents to make real production changes to an existing app, fallbacks are annoying but can be easily stripped out before you submit the pull request. But if you’re using AI agents to test out a new approach, you’re typically not checking the code line-by-line. The usual workflow is to ask the agent to try an approach, then benchmark or fiddle with the result, and so on. If your benchmark or testing doesn’t know whether it’s hitting the real code or some toy fallback, you can’t be confident that you’re actually evaluating your latest idea. I don’t think this behavior is deliberate. 
My best guess is that it’s a reinforcement learning artifact: code with fallbacks is more likely to succeed, so during training the models are learning to include fallback 1 . If I’m wrong and it’s part of the hidden system prompt (or a deliberate choice), I think it’s a big mistake. When you ask an AI agent to implement a particular algorithm, it should implement that algorithm. In researching this post, I saw this r/cursor thread where people are complaining about this exact problem (and also attributing it to RL). Supposedly you can prompt around it, if you repeat “DO NOT WRITE FALLBACK CODE” several times.
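To make the pattern concrete, here is a minimal Python sketch. Every name in it is hypothetical, and the clustering step is a stub that always fails rather than a real Louvain implementation; it only illustrates how a fallback hides the failure from whatever is benchmarking the result.

```python
from typing import Callable, Dict, List

Clusterer = Callable[[List[str]], Dict[str, int]]


def alphabetical_groups(slugs: List[str]) -> Dict[str, int]:
    """Toy fallback: group pages by the first letter of their slug."""
    letters = sorted({s[0] for s in slugs})
    return {s: letters.index(s[0]) for s in slugs}


def group_with_silent_fallback(slugs: List[str], cluster: Clusterer) -> Dict[str, int]:
    """The pattern agents tend to write: any failure is swallowed."""
    try:
        return cluster(slugs)
    except Exception:
        # The caller gets a plausible-looking result and cannot tell
        # that the real algorithm never ran.
        return alphabetical_groups(slugs)


def group_or_fail(slugs: List[str], cluster: Clusterer) -> Dict[str, int]:
    """What you want while prototyping: a failure surfaces immediately."""
    return cluster(slugs)


def broken_clusterer(slugs: List[str]) -> Dict[str, int]:
    """Stand-in for a real clustering step (e.g. Louvain) that has a bug."""
    raise RuntimeError("clustering failed")


slugs = ["apple", "avocado", "banana"]
print(group_with_silent_fallback(slugs, broken_clusterer))
```

With the first function, a benchmark would happily score the alphabetical grouping as if it were the clustering result. With the second, the same bug stops the run, which is the behavior an evaluation of a new idea needs.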

1 views
Max Bernstein 2 months ago

Walking around the compiler

Walking around outside is good for you. [ citation needed ] A nice amble through the trees can quiet inner turbulence and make complex engineering problems disappear. Vicki Boykis wrote a post, Walking around the app , about a more proverbial stroll. In it, she talks about constantly using your production application’s interface to make sure the whole thing is cohesively designed with few rough edges. She also talks about walking around other parts of the implementation of the application, fixing inconsistencies, complex machinery, and broken builds. Kind of like picking up someone else’s trash on your hike. That’s awesome and universally good advice for pretty much every software project. It got me thinking about how I walk around the compiler. There’s a certain class of software project that transforms data—compression libraries, compilers, search engines—for which there’s another layer of “walking around” you can do. You have the code, yes, but you also have non-trivial output . By non-trivial, I mean an output that scales along some quality axis instead of something semi-regular like a JSON response. For compression, it’s size. For compilers, it’s generated code. You probably already have some generated cases checked into your codebase as tests. That’s awesome. I think golden tests are fantastic for correctness and for people to help understand. But this isolated understanding may not scale to more complex examples. How does your compiler handle, for example, switch-case statements in loops? Does it do the jump threading you expect it to? Maybe you’re sitting there idly wondering while you eat a cookie, but maybe that thought would only have occurred to you while you were scrolling through the optimizer. Say you are CF Bolz-Tereick and you are paging through PyPy IR. 
You notice some IR that looks like: “Huh”, you say to yourself, “surely the optimizer can reason that running on the result of is redundant!” But some quirk in your optimizer means that it does not. Maybe it used to work, or maybe it never did. But this little stroll revealed a bug with a quick fix (adding a new peephole optimization function): Now, thankfully, your IR looks much better: and you can check this in as a tidy test case: Fun fact: this was my first exposure to the PyPy project. CF walked me through fixing this bug 1 live at ECOOP 2022! I had a great time. If checking (and, later, testing) your assumptions is tricky, this may be a sign that your library does not expose enough of its internal state to developers. This may present a usability impediment that prevents you from immediately checking your assumptions or suspicions. For an excellent source of inspiration, see Kate’s tweets about program internals . Even if it does provide a flag like to print to the console, maybe this is hard to run from a phone 2 or a friend’s computer. For that, you may want friendlier tools . The right kind of tool invites exploration. Matthew Godbolt built the first friendly compiler explorer tool I used, the Compiler Explorer (“Godbolt”). It allows inputting programs into your web browser in many different languages and immediately seeing the compiled result. It will even execute your programs, within reason. This is a powerful tool: This combination lowers the barrier to check things tremendously . Now, sometimes you want the reverse: a Compiler Explorer -like thing in your terminal or editor so you don’t have to break flow. I unfortunately have not found a comparable tool. In addition to the immediate effects of being able to spot-check certain inputs and outputs, continued use of these tools builds long-term intuition about the behavior of the compiler. It builds mechanical sympathy . 
I haven’t written a lot about mechanical sympathy other than my grad school statement of purpose (PDF) and a few brief internet posts, so I will leave you with that for now. Your compiler likely compiles some applications and you can likely get access to the IR for the functions in that application. Scroll through every function’s optimized IR. If there are too many, maybe the top N functions’ IRs. See what can be improved. Maybe you will see some unexpected patterns. Even if you don’t notice anything in May, that could shift by August because of compiler advancements or a cool paper that you read in the intervening months. One time I found a bizarre reference counting bug that was causing copy-on-write and potential memory issues by noticing that some objects that should have been marked “immortal” in the IR were actually being refcounted. The bug was not in the compiler, but far away in application setup code—and yet it was visible in the IR. My conclusion is similar to Vicki’s. Put some love into your tools. Your colleagues will notice. Your users will notice. It might even improve your mood. Thank you to CF for feedback on the post. The actual fix that checks for and rewrites to .  ↩ Just make sure to log off and touch grass.  ↩

0 views
Justin Duke 2 months ago

Another reason our pytest suite is slow

I wrote two days ago about how our pytest suite was slow, and how we could speed it up by blessing a suite-wide fixture that was scoped to . This was true. But, like a one-year-old with a hammer, I found myself so gratified by the act of swinging that I found myself also trying to pinpoint another performance issue: why does it take so long to run a single smoke test

0 views
Justin Duke 2 months ago

Why our pytest suite is slow

The speed of Buttondown's pytest suite (which I've written about here , here , and here ) is a bit of a scissor for my friends and colleagues: depending on who you ask, it is (at around three minutes when parallelized on Blacksmith) either quite fast given its robustness or unfathomably slow

0 views
Uros Popovic 2 months ago

Custom CPU simulation and testing

Walkthrough for how the Mrav CPU project handles RTL simulation and other testing aspects.

1 views
Jampa.dev 2 months ago

Using Claude Code SDK to Reduce E2E Test Time by 84%

End-to-end (E2E) tests sit at the top of the test pyramid because they're slow, fragile, and expensive. But they're also the only tests that verify that complete user workflows actually work across systems. Due to time constraints, most teams run E2E nightly to avoid CI bottlenecks. However, this means bugs can slip through to production and be harder to fix because there are so many changes to isolate the root cause. But what if we could run only the relevant E2E tests for specific code changes of a PR? Instead of waiting hours for the entire suite, we could get the results in under 10 minutes, catch bugs before they ship, and keep our master branch always clean. The first logical step toward running only relevant tests would be using glob patterns. We tell the system to test what changed by matching file paths. Here's how a typical could work: But globs are very limited. They require constant maintenance as the codebase evolves. Every new feature would require updating the glob patterns file. More importantly, they cast too wide a net. A change to might need to trigger every E2E test that involves any page with a button interaction, depending on how deep the change is. So, how can we determine which E2E tests should run for a given PR with both coverage and precision? We need coverage because missing a critical test could let bugs slip through to production. But we also need precision because running tests that will obviously pass just wastes time and resources. The naive approach might be to dump the entire repository and changes into an LLM and ask it to figure out which tests are relevant. But this completely falls apart in practice. Repositories can easily contain millions of tokens worth of code, which makes this approach impossible for any AI model. Claude Code takes a fundamentally different approach because of one key differentiator: tool calls. 
Instead of trying to process your entire codebase, Claude Code strategically examines specific files, searches for patterns, traces dependencies, and incrementally builds up an understanding of your changes. So here's the hypothesis: If I see a PR, I will know which E2E tests it should run because I know the codebase. The question is: Can Claude Code replicate my human intuition by searching for it? Let's build and find out. For the E2E selection to be successful, Claude needs to know what I know: the PR modifications, the E2E tests, and the codebase structure. We need to glue all three together in a well-crafted prompt. This is perhaps the easiest piece - we can leverage git to get exactly what we need. We start with the basic command: This gives us the changes of a branch, but we can do much better. First, we want git to be less verbose, so we add to focus on the actual code changes rather than whitespace noise. We also don't care about deleted files since we'll need to remove references in existing files anyway (unless we don't care about those tests), so we add to exclude deleted files and focus on (A)dded, (C)opied, (M)odified, and (R)enamed files. Finally, we need some strategic excludes because there are generally large files in PRs like that would blow up our token count. We add to keep things manageable. Putting it all together: The result is a clean diff showing the actual code modifications: We could hardcode a list of test files in our prompt, but that violates the single source of truth principle. We already maintain this list for our daily benchmarks, so let's reuse it. For example, if the test configuration lives in a WebdriverIO ( ) config file, we can extract it programmatically: This script dynamically reads the file and outputs our exact test suite configuration: The prompt needs to be precise about what we want. We start by setting clear expectations: The key phrase here is "think deep" . 
This tells Claude Code not to be lazy with its analysis (while spending more thinking tokens). Without it, the output was very inconsistent. I used to joke that without it, Claude runs in “engineering manager mode” by delegating the work. Next, we set boundaries: The "only run tests listed" constraint was added because Claude was being "too smart," finding work-in-progress spec files and scheduling them to run. We added the last piece because it is better to run more specs than leave a test out. I initially asked for JSON output , and since I didn't want Claude's judgment to be a black box, I requested two keys: the list of tests to run and an explanation . This makes it easy to benchmark whether the reasoning is sound. I initially tried using JSON mode and asking Claude to output only JSON: But Claude has strong internal system instructions and couldn't stop adding commentary . I initially fixed this with a regex JSON parser to remove the commentaries, but when you use regex to solve a problem, you get two problems. But then I realized: Claude Code is used to write files, duh So instead of fighting with JSON mode and regex, I asked: Works every time! The final pipeline combines everything with what might be the ugliest bash command known to humankind: The result command is piped to Claude: We add So it can write our file. By the way , you should never use which gives all permissions, including . I am surprised by how many people are taught to do this . If we did add this flag, someone could write in the prompt file and instruct Claude to read our environment variables and send them to a URL using Fetch(). Since the CI runs on a PR open, not a merge, this would be similar to a “0-click” exploit. I won't lie - this exceeded my expectations. We used to run all core tests, which took 44 minutes (and now it would take us more than 2 hours, since we keep adding tests). Most PRs complete E2E testing in less than 7 minutes, even for larger changes. 
Even if it performed worse, it would still be an incredible success because our system has so many complexities that other types of tests (unit and integration) are nowhere near as effective as E2E. The solution scales well because adding E2E test names consumes few tokens, and PR changes are mostly constant. Claude doesn't read all test files: it focuses on the ones with semantic naming and explores modified file patterns, which is surprisingly effective. Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry. How much does it cost? Without getting into sensitive details, the solution costs about $30 per contributor per month. Despite the steep price, it actually saves money on mobile device farm runners. And I expect these costs will drop as models become cheaper. Overall, we're saving money, developer time, and preventing bugs that would make it to production. So it's a win-win-win!
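For comparison, the glob-based selection described earlier, before the Claude Code approach, can be sketched roughly like this. It is a minimal Python illustration and not the configuration from the post: the paths, spec names, and the `select_specs` helper are all hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Set

# Hypothetical mapping from source-path globs to E2E spec files.
GLOB_TO_SPECS: Dict[str, List[str]] = {
    "src/checkout/*": ["e2e/checkout.spec.ts"],
    "src/auth/*": ["e2e/login.spec.ts", "e2e/signup.spec.ts"],
    # Shared components cast a wide net: every spec that renders a button.
    "src/components/*": [
        "e2e/checkout.spec.ts",
        "e2e/login.spec.ts",
        "e2e/signup.spec.ts",
    ],
}


def select_specs(changed_files: List[str]) -> Set[str]:
    """Return every spec whose glob matches at least one changed file."""
    selected: Set[str] = set()
    for path in changed_files:
        for pattern, specs in GLOB_TO_SPECS.items():
            # fnmatchcase's "*" also matches "/" so nested paths are covered.
            if fnmatchcase(path, pattern):
                selected.update(specs)
    return selected


print(sorted(select_specs(["src/checkout/cart/total.ts"])))
print(sorted(select_specs(["src/components/Button.tsx"])))
```

The second call shows the precision problem: a small change to a shared component selects every spec in the table, and the table itself has to be edited by hand whenever a feature is added.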

0 views
Grumpy Gamer 2 months ago

TesterTron3000

Have I mentioned that you should Wish List Death by Scrolling now, before you finish reading this? Here is the code that runs TesterTron3000 in Death by Scrolling. There is some code not listed that does set up, but the following runs the level. It’s written in Dinky, a custom language I wrote for Delores based on what we used for Thimbleweed Park and then used in Return to Monkey Island. TesterTron3000 is as dumb as a box of rocks, but in some ways that’s what makes it fun to watch. Before we get into code, here is another sample run. It’s not the best code I’ve written but far from the worst and it gets the job done. TesterTron3000 has run for over 48 hours and not found a serious bug, so I’m happy. Source code follows, you’ve been warned…

0 views
Loren Stewart 2 months ago

Production-Ready Astro Middleware: Dependency Injection, Testing, and Performance

Master production-ready Astro middleware with dependency injection, testing strategies, and caching for enterprise applications.

0 views