Posts in Testing (20 found)
Harper Reed 1 week ago

Note #288

We gave our AI coding agents access to social media. They immediately started posting. A lot. Then we tested their performance. Turns out agents with Twitter solve problems faster than agents without it. harper.blog/2025/09/3…

0 views
Sean Goedecke 2 weeks ago

AI coding agents rely too much on fallbacks

One frustrating pattern I've noticed in AI agents - at least in Claude Code, Codex and Copilot - is building automatic fallbacks. Suppose you ask Codex to build a system to automatically group pages in a wiki by topic. (This isn't hypothetical, I just did this for EndlessWiki.) You'll probably want to use something like the Louvain method to identify clusters. But if you task an AI agent with building something like that, it usually will go one step further and build a fallback: a separate, simpler code path if the Louvain method fails (say, grouping page slugs alphabetically). If you're not careful, you might not even know if the Louvain method is working, or if you're just seeing the fallback behavior.

In my experience, AI agents will do this constantly. If you're building an app that makes an AI inference request, the generated code will likely fall back to some hard-coded response if the inference request fails. If you're using an agent to pull structured data from some API, the agent may silently fall back to placeholder data for part of it. If you're writing some kind of clever spam detector, the agent will want to fall back to a basic keyword check if your clever approach doesn't work.

This is particularly frustrating for the main kind of work that AI agents are useful for: prototyping new ideas. If you're using AI agents to make real production changes to an existing app, fallbacks are annoying but can be easily stripped out before you submit the pull request. But if you're using AI agents to test out a new approach, you're typically not checking the code line-by-line. The usual workflow is to ask the agent to try an approach, then benchmark or fiddle with the result, and so on. If your benchmark or testing doesn't know whether it's hitting the real code or some toy fallback, you can't be confident that you're actually evaluating your latest idea.

I don't think this behavior is deliberate. My best guess is that it's a reinforcement learning artifact: code with fallbacks is more likely to succeed, so during training the models learn to include fallbacks 1 . If I'm wrong and it's part of the hidden system prompt (or a deliberate choice), I think it's a big mistake. When you ask an AI agent to implement a particular algorithm, it should implement that algorithm. In researching this post, I saw this r/cursor thread where people are complaining about this exact problem (and also attributing it to RL). Supposedly you can prompt around it, if you repeat "DO NOT WRITE FALLBACK CODE" several times.
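To make the pattern concrete, here is a minimal Go sketch of the silent-fallback shape described above, with hypothetical names (GroupPages, clusterByLouvain, groupAlphabetically). It illustrates the anti-pattern and a stricter alternative; it is not code from the post:

```go
package wiki

import (
	"errors"
	"fmt"
)

// Page is a hypothetical wiki page with a slug and outgoing links.
type Page struct {
	Slug  string
	Links []string
}

// clusterByLouvain stands in for the real community-detection step;
// the actual algorithm is omitted in this sketch.
func clusterByLouvain(pages []Page) (map[string][]Page, error) {
	return nil, errors.New("not implemented")
}

// groupAlphabetically is the kind of toy fallback an agent tends to add:
// it "works", but it answers a different question.
func groupAlphabetically(pages []Page) map[string][]Page {
	groups := make(map[string][]Page)
	for _, p := range pages {
		key := "misc"
		if len(p.Slug) > 0 {
			key = string(p.Slug[0])
		}
		groups[key] = append(groups[key], p)
	}
	return groups
}

// GroupPages shows the anti-pattern: a silent fallback hides whether the
// real algorithm ever ran.
func GroupPages(pages []Page) map[string][]Page {
	clusters, err := clusterByLouvain(pages)
	if err != nil {
		return groupAlphabetically(pages) // caller can't tell this happened
	}
	return clusters
}

// GroupPagesStrict surfaces the failure instead of masking it.
func GroupPagesStrict(pages []Page) (map[string][]Page, error) {
	clusters, err := clusterByLouvain(pages)
	if err != nil {
		return nil, fmt.Errorf("louvain clustering failed: %w", err)
	}
	return clusters, nil
}
```

A benchmark run against GroupPagesStrict fails loudly when the real path breaks, which is exactly the property the silent fallback destroys.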

1 view
Max Bernstein 3 weeks ago

Walking around the compiler

Walking around outside is good for you. [citation needed] A nice amble through the trees can quiet inner turbulence and make complex engineering problems disappear. Vicki Boykis wrote a post, Walking around the app, about a more proverbial stroll. In it, she talks about constantly using your production application's interface to make sure the whole thing is cohesively designed with few rough edges. She also talks about walking around other parts of the implementation of the application, fixing inconsistencies, complex machinery, and broken builds. Kind of like picking up someone else's trash on your hike. That's awesome and universally good advice for pretty much every software project. It got me thinking about how I walk around the compiler.

There's a certain class of software project that transforms data—compression libraries, compilers, search engines—for which there's another layer of "walking around" you can do. You have the code, yes, but you also have non-trivial output. By non-trivial, I mean an output that scales along some quality axis instead of something semi-regular like a JSON response. For compression, it's size. For compilers, it's generated code.

You probably already have some generated cases checked into your codebase as tests. That's awesome. I think golden tests are fantastic for correctness and for helping people understand. But this isolated understanding may not scale to more complex examples. How does your compiler handle, for example, switch-case statements in loops? Does it do the jump threading you expect it to? Maybe you're sitting there idly wondering while you eat a cookie, but maybe that thought would only have occurred to you while you were scrolling through the optimizer.

Say you are CF Bolz-Tereick and you are paging through PyPy IR. You notice some IR that looks like: "Huh", you say to yourself, "surely the optimizer can reason that running on the result of is redundant!" But some quirk in your optimizer means that it does not. Maybe it used to work, or maybe it never did. But this little stroll revealed a bug with a quick fix (adding a new peephole optimization function): Now, thankfully, your IR looks much better: and you can check this in as a tidy test case: Fun fact: this was my first exposure to the PyPy project. CF walked me through fixing this bug 1 live at ECOOP 2022! I had a great time.

If checking (and, later, testing) your assumptions is tricky, this may be a sign that your library does not expose enough of its internal state to developers. This may present a usability impediment that prevents you from immediately checking your assumptions or suspicions. For an excellent source of inspiration, see Kate's tweets about program internals. Even if it does provide a flag like to print to the console, maybe this is hard to run from a phone 2 or a friend's computer. For that, you may want friendlier tools.

The right kind of tool invites exploration. Matthew Godbolt built the first friendly compiler explorer tool I used, the Compiler Explorer ("Godbolt"). It allows inputting programs into your web browser in many different languages and immediately seeing the compiled result. It will even execute your programs, within reason. This is a powerful tool: This combination lowers the barrier to check things tremendously. Now, sometimes you want the reverse: a Compiler Explorer-like thing in your terminal or editor so you don't have to break flow. I unfortunately have not found a comparable tool.

In addition to the immediate effects of being able to spot-check certain inputs and outputs, continued use of these tools builds long-term intuition about the behavior of the compiler. It builds mechanical sympathy. I haven't written a lot about mechanical sympathy other than my grad school statement of purpose (PDF) and a few brief internet posts, so I will leave you with that for now.

Your compiler likely compiles some applications and you can likely get access to the IR for the functions in that application. Scroll through every function's optimized IR. If there are too many, maybe the top N functions' IRs. See what can be improved. Maybe you will see some unexpected patterns. Even if you don't notice anything in May, that could shift by August because of compiler advancements or a cool paper that you read in the intervening months. One time I found a bizarre reference counting bug that was causing copy-on-write and potential memory issues by noticing that some objects that should have been marked "immortal" in the IR were actually being refcounted. The bug was not in the compiler, but far away in application setup code—and yet it was visible in the IR.

My conclusion is similar to Vicki's. Put some love into your tools. Your colleagues will notice. Your users will notice. It might even improve your mood. Thank you to CF for feedback on the post.

The actual fix that checks for and rewrites to .  ↩

Just make sure to log off and touch grass.  ↩
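The IR snippets don't survive this excerpt, but the shape of a peephole fix is easy to sketch. The ops and the rewrite rule below are invented for illustration (PyPy's actual IR and the actual fix differ); the point is the scan-match-rewrite structure and how small such a fix tends to be, with a golden test then pinning the improved IR down so the discovery doesn't regress:

```go
package peephole

// Instr is a toy SSA-style instruction: a named op applied to the results
// of earlier instructions (referenced by index).
type Instr struct {
	Op   string
	Args []int
}

// Optimize applies one illustrative peephole rule: "is_true" applied to an
// op that already produces a boolean is redundant, so it can be rewritten
// to a copy of its argument. The rule is made up for this sketch.
func Optimize(prog []Instr) []Instr {
	out := make([]Instr, len(prog))
	copy(out, prog)
	for i, ins := range out {
		if ins.Op == "is_true" && len(ins.Args) == 1 {
			src := out[ins.Args[0]]
			if src.Op == "int_eq" || src.Op == "int_lt" {
				// Comparisons already yield a boolean; drop the re-check.
				out[i] = Instr{Op: "copy", Args: []int{ins.Args[0]}}
			}
		}
	}
	return out
}
```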

0 views
Justin Duke 4 weeks ago

Another reason our pytest suite is slow

I wrote two days ago about how our pytest suite was slow, and how we could speed it up by blessing a suite-wide fixture that was scoped to . This was true. But, like a one-year-old with a hammer, I found myself so gratified by the act of swinging that I went looking for another performance issue: why does it take so long to run a single smoke test?

0 views
Justin Duke 1 month ago

Why our pytest suite is slow

The speed of Buttondown's pytest suite (which I've written about here , here , and here ) is a bit of a scissor for my friends and colleagues: depending on who you ask, it is (at around three minutes when parallelized on Blacksmith) either quite fast given its robustness or unfathomably slow

0 views
Uros Popovic 1 month ago

Custom CPU simulation and testing

Walkthrough for how the Mrav CPU project handles RTL simulation and other testing aspects.

1 view
Grumpy Gamer 1 month ago

TesterTron3000

Have I mentioned that you should Wish List Death by Scrolling now, before you finish reading this? Here is the code that runs TesterTron3000 in Death by Scrolling. There is some code not listed that does setup, but the following runs the level. It's written in Dinky, a custom language I wrote for Delores based on what we used for Thimbleweed Park and then used in Return to Monkey Island. TesterTron3000 is as dumb as a box of rocks, but in some ways that's what makes it fun to watch. Before we get into code, here is another sample run. It's not the best code I've written but far from the worst, and it gets the job done. TesterTron3000 has run for over 48 hours and not found a serious bug, so I'm happy. Source code follows, you've been warned…

0 views
Loren Stewart 1 month ago

Production-Ready Astro Middleware: Dependency Injection, Testing, and Performance

Master production-ready Astro middleware with dependency injection, testing strategies, and caching for enterprise applications.

0 views
Grumpy Gamer 1 month ago

Death by TesterTron3000

I first created TesterTron3000 during Thimbleweed Park (hence the name). It was a simple automated tester that randomly clicked on the screen. It couldn't play the game because it had no knowledge of inventory or puzzles. It did find the odd errors, but was of little real value.

Fast forward to the futuristic year of 2025: I'm working on Death by Scrolling and need a new automated gameplay tester. Death by Scrolling isn't an adventure game, so it needs a whole new program to test with, but I like the name so I kept it.

TesterTron3000 is pretty simple: it just runs through the level, looks for power-ups, and if its health is low, it looks for hearts. It's not rocket science. I could make it a lot smarter, but what I really need is a tool that stress tests the game, so smarts are low on the list. I can leave it running overnight and it plays thousands of levels, and in the morning I see if any errors occurred or if there are memory leaks. None so far. It's been a great tool for consoles because we can only do limited testing before sending it to outside testers due to limited dev kits 1 , so running TesterTron3000 on it for 24 hours is good peace of mind. There are little animation glitches because it's not running through the normal controller code, and I've spotted some missing sfx.

TesterTron3000 is written 100% in Dinky and is only about 100 lines of code. I might ship it with the final game as an attract mode, but it's kind of buggy, has bad path finding, and is really stupid, so I worry players would fixate on what it's not doing right. It's not a tool for playing the game with any degree of skill; it's a stress tester and a dev tool. But it's fun to watch.

Something a lot of people don't know is that for consoles you need special dev kits to test with; it's not like the PC (or even the SteamDeck) where you can use any device. You have to buy (often very expensive) special dev kits, even just to test. It's really annoying.  ↩︎
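The real thing is about 100 lines of Dinky driving the actual engine. Purely as an illustration of the "dumb as a box of rocks" loop described above, a Go sketch with stand-in game hooks might look like this:

```go
package main

import (
	"log"
	"math/rand"
)

// Game and its methods are stand-ins for a game API; the real
// TesterTron3000 is written in Dinky and drives the actual engine.
type Game struct{ Health, Level int }

func (g *Game) lowHealth() bool       { return g.Health < 25 }
func (g *Game) moveToward(what string) { /* pathfind toward the named target */ }
func (g *Game) step() error           { g.Level++; return nil } // advance one tick/level

func main() {
	g := &Game{Health: 100, Level: 1}
	for i := 0; i < 100000; i++ { // run overnight: thousands of levels
		switch {
		case g.lowHealth():
			g.moveToward("heart") // heal first when health is low
		case rand.Intn(2) == 0:
			g.moveToward("powerup") // otherwise grab power-ups...
		default:
			g.moveToward("exit") // ...or just keep scrolling forward
		}
		if err := g.step(); err != nil {
			log.Printf("level %d: %v", g.Level, err) // the payoff: errors logged overnight
		}
	}
}
```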

0 views
underlap 1 month ago

Developer's block

Writer's block is the paralysis induced by a blank page, but software developers experience a similar block, and it can even get worse over time. Sometimes a good analogy is that your wheels are spinning and you need to gain traction. Let's look at the different kinds of developer's block, what causes them, and how to get unblocked.

You want to write great code. In fact, most developers want each of their coding projects to be their best ever. That means different things to different people, but if you apply all of the following practices from the start, you'll soon get blocked.

Once you buy into the benefits of testing, you'll want to include decent unit and integration test suites in your code. Of course, at least in the longer term, a decent test suite helps maintain velocity. Right? You might also want to include some fuzz testing, to exercise edge cases you haven't thought of. When you've realised how useful good documentation is, you'll want a good README or user guide and probably some other documentation on how to contribute to or maintain the code. You might want to document community standards too, just in case. Then there are specific coding practices that you have learned, such as good naming, modularity, and the creation and use of reusable libraries. You'll want to stick to those, even if they need a bit more effort up front. You may have favourite programming languages that will influence your choice of language and tooling, regardless of what would actually make the job in hand easier to complete. For example, if you're working on open source, you may prefer an open source programming language, build tools, and editor or IDE. Then you will probably want to use version control and write good commit logs. How could you not? You'll then want to set up CI to run the test suite automatically. You may want to set up cross-compilation so you can support multiple operating systems. You may want to stick to a standard coding style and enforce that with automation in your preferred editor or IDE and maybe a check in CI. You'll want a consistent error-handling approach and decent diagnostics so it's easy to debug the code. If the code involves concurrency, you'll want to put in extra effort to make sure your code is free from data races, deadlocks, and livelocks. All these practices are valuable, but sometimes they just mount up until you're blocked.

Another kind of developer's block occurs later on in a project. Either you are new to the project and you just feel overwhelmed, or you've been working on the project for a while, but you run out of steam and get stuck. The causes in these two cases are different. Feeling overwhelmed is often due to trying to rush the process of gaining understanding. Nobody comes to a new codebase and instantly understands it. Another issue with a new codebase is unfamiliarity with the implementation language or the conventions in the way the language is used. Running out of steam may be due to overwork or a lack of motivation.

You have to find a way in. Sometimes trying the code out as a user gives you a better idea of what it's all about. Sometimes you need to read the docs or tests to get an idea of the externals. Eventually, you can start looking at the source code and building up a mental model of how it all fits together to achieve its purpose. If there are other people working on the project, don't be afraid to ask questions. [1] Sometimes a newcomer's naive questions help others to understand something they took for granted.

If you're new to the implementation language of a project, take some time to learn the basics. Maybe you're fluent in another language, but that doesn't mean you can instantly pick up a new language. When you come across a confusing language feature, take the opportunity to go and learn about the feature. Remember the dictum "If you think education is expensive, try ignorance".

It's important to take regular breaks and holidays, but sometimes you're mentally exhausted after finishing one or more major features. This is the time to take stock and ease off a little. Perhaps do some small tasks, sometimes known as "chores", which are less mentally taxing but nevertheless worthwhile. Maybe take time to pay off some technical debt. Pick a small feature or bug and implement it with the minimum effort. Circle back round to improve the tests, docs, etc.

Rather than implementing all your best practices at the start of a project, see if there are some which can wait a while until you've gained some traction. Sometimes you need to do a quick prototype, sometimes called a "spike", in which case just hack together something that just about solves the problem. Concern yourself only with the happy path. Write just enough tests to help you gain traction. Then keep the prototype on a branch and circle back round and implement the thing properly with decent tests and docs. It's ok to refer to the prototype to remind yourself how you did some things, [2] but don't copy the code wholesale, otherwise you'll be repaying the technical debt for ages. If you're trying to learn about a dependency, it's sometimes easier to write a quick prototype of using the dependency, possibly in an empty repository, or even not under version control at all if it's really quick.

Don't polish your docs prematurely. Keep the format simple and check it in alongside the code. Capture why you did things a particular way. Provide basic usage instructions, but don't do too much polishing until you start to gain users.

I think Michael A. Jackson summed this up best:

Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.

So don't optimise unless there is a genuine problem - most code performs perfectly well if you write it so a human being can understand it. If you write it that way, you have some chance of being able to optimise it if you need to. In that case, do some profiling to find out where the bottlenecks are and then attack the worst bottleneck first. After any significant changes, and if the problem still remains, re-do the profiling.

The code might be a little half-baked, with known issues (hopefully in an issue tracker), but don't let this hold you back from releasing. This will give you a better feeling of progress. You could even get valuable early feedback from users or other developers.

You may be held up by a problem in a dependency, such as poor documentation. It is tempting to start filling in the missing docs, but try to resist that temptation. Better to make minimal personal notes for now and, after you've made good progress, consider scheduling time to contribute some docs to the dependency. Similarly, if your tooling doesn't work quite right, just try to get something that works, even if it involves workarounds or missing out on some function. Fixing tooling can be another time sink you can do without.

Are you prone to developer's block? If so, what are your tips for getting unblocked? I'd love to hear about them.

Some interesting comments came up on Hacker News, including a link to a post on test harnesses.

But try to ask questions the smart way. ↩︎

I've found git worktree useful for referring to a branch containing a prototype. This lets you check the branch out into a separate directory and open it alongside your development branch in your editor or IDE. ↩︎

0 views
underlap 1 month ago

Software convergence

The fact that such limits turn out to be members of the semantic domain is one of the pleasing results of denotational semantics. That kind of convergence is all very well, but it's not what I had in mind. I was more interested in code which converges, to some kind of limit, as it is developed over time. The limit could be a specification of some kind, probably formal. But how would we measure the distance of code from the specification? How about the number of tests passing?

This seems to make two assumptions:

1. Each test really does reflect part of the specification.
2. The more distinct tests there are, the more closely the whole set of tests would reflect the specification.

The second assumption, as stated, is clearly false unless the notion of "distinct tests" is firmed up. Perhaps we could define two tests to be distinct if it is possible to write a piece of code which passes one of the tests, but not the other. There's still a gap. It's possible to write many tests, but still not test some part of the specification. Let's assume we can always discover untested gaps and fill them in with more tests.

With this notion of a potentially growing series of tests, how would we actually go about developing convergent software? The key is deciding which tests should pass. This can be done en masse, a classic example being when there is a Compliance Test Suite (CTS) that needs to pass. In that case, the number/percentage of tests of the CTS passing is a good measure of the convergence of the code to the CTS requirements. But often, especially with an agile development process, the full set of tests is not known ahead of time. So the approach there is to spot an untested gap, write some (failing) tests to cover the gap, make those (and any previously existing) tests pass, and then look for another gap, and so on. The number of passing tests should increase monotonically, but unfortunately, there is no concept of "done", like there is when a CTS is available. Essentially, with an agile process, there could be many possible specifications, and the process of making more tests pass simply reduces the number of possible specifications remaining.

I'm still mulling over the notion of software convergence. I'm interested in any ideas you may have. One nice property of convergent software should be that releases are backward compatible. Or, I suppose, if tests are changed so that backward incompatible behaviour is introduced, that's the time to bump the major version of the next release and warn the users.

I'm grateful to some good friends for giving me tips on LaTeX markup. [2] In particular, \varepsilon produces a "curly" epsilon: ε.

Tom M. Apostol, "Mathematical Analysis", 2nd ed., 1977, Addison-Wesley. ↩︎

I'm actually using KaTeX, but it's very similar to LaTeX. ↩︎
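One way to pin the "distance from the specification" idea down, offered here as a possible formalisation rather than anything from the post:

```latex
% A possible formalisation (an assumption, not the post's): let T_n be the
% set of pairwise-distinct tests written by step n, and c_n the code at
% step n. Count passing and failing tests:
\[
  p_n = \bigl|\{\, t \in T_n : c_n \text{ passes } t \,\}\bigr|,
  \qquad
  d_n = \bigl|\{\, t \in T_n : c_n \text{ fails } t \,\}\bigr|
\]
% The post's monotonicity claim is p_{n+1} \ge p_n as gaps are found and
% filled. With a fixed Compliance Test Suite, T_n is constant and "done"
% means d_n = 0.
```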

0 views
Den Odell 1 month ago

Code Reviews That Actually Improve Frontend Quality

Most frontend reviews pass quickly. Linting's clean, TypeScript's happy, nothing looks broken. And yet: a modal won't close, a button's unreachable, an API call fails silently. The code was fine. The product wasn't. We say we care about frontend quality. But most reviews never look at the thing users actually touch.

A good frontend review isn't about nitpicking syntax or spotting clever abstractions. It's about seeing what this code becomes in production. How it behaves. What it breaks. What it forgets. If you want to catch those bugs, you need to look beyond the diff. Here's what matters most, and how to catch these issues before they ship.

When reviewing, start with the obvious question: what happens if something goes wrong? If the API fails, the user is offline, or a third-party script hangs, if the response is empty, slow, or malformed, will the UI recover? Will the user even know? If there's no loading state, no error fallback, no retry logic, the answer is probably no. And by the time it shows up in a bug report, the damage is already done.

Once you've handled system failures, think about how real people interact with this code. Does reach every element it should? Does close the modal? Does keyboard focus land somewhere useful after a dialog opens? A lot of code passes review because it works for the developer who wrote it. The real test is what happens on someone else's device, with someone else's habits, expectations, and constraints.

Performance bugs hide in plain sight. Watch out for nested loops that create quadratic time complexity: fine on 10 items, disastrous on 10,000. Recalculating values on every render is also a performance hit waiting to happen. And a one-line import that drags in 100KB of unused helpers? If you miss it now, Lighthouse will flag it later. The worst performance bugs rarely look ugly. They just feel slow. And by then, they've shipped.

State problems don't always raise alarms. But when side effects run more than they should, when event listeners stick around too long, when flags toggle in the wrong order, things go wrong. Quietly. Indirectly. Sometimes only after the next deploy. If you don't trace through what actually happens when the component (or view) initializes, updates, or gets torn down, you won't catch it.

Same goes for accessibility. Watch out for missing labels, skipped headings, broken focus traps, and no live announcements when something changes, like a toast message appearing without a screen reader ever announcing it. No one's writing maliciously; they're just not thinking about how it works without a pointer. You don't need to be an accessibility expert to catch these basics. The fixes aren't hard. The hard part is noticing.

And sometimes, the problem isn't what's broken. It's what's missing. Watch out for missing empty states, no message when a list is still loading, and no indication that an action succeeded or failed. The developer knows what's going on. The user just sees a blank screen.

Other times, the issue is complexity. The component fetches data, transforms it, renders markup, triggers side effects, handles errors, and logs analytics, all in one file. It's not technically wrong. But it's brittle. And no one will refactor it once it's merged. Call it out before it calcifies. Same with naming. A function called might sound harmless, until you realize it toggles login state, starts a network request, and navigates the user to a new route. That's not a click handler. It's a full user flow in disguise.

Reviews are the last chance to notice that sort of thing before it disappears behind good formatting and familiar patterns.

A good review finds problems. A great review gets them fixed without putting anyone on the defensive. Keep the focus on the code, not the coder. "This component re-renders on every keystroke" lands better than "You didn't memoize this." Explain why it matters. "This will slow down typing in large forms" is clearer than "This is inefficient." And when you point something out, give the next step. "Consider using here" is a path forward. "This is wrong" is a dead end. Call out what's done well. A quick "Nice job handling the loading state" makes the rest easier to hear. If the author feels attacked, they'll tune out. And the bug will still be there.

What journey is this code part of? What's the user trying to do here? Does this change make that experience faster, clearer, or more resilient? If you can't answer that, open the app. Click through it. Break it. Slow it down. Better yet, make it effortless. Spin up a temporary, production-like copy of the app for every pull request. Now anyone, not just the reviewer, can click around, break things, and see the change in context before it merges. Tools like Vercel Preview Deployments, Netlify Deploy Previews, GitHub Codespaces, or Heroku Review Apps make this almost effortless.

Catch them here, and they never make it to production. Miss them, and your users will find them for you. The real bugs aren't in the code; they're in the product, waiting in your next pull request.
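To make the quadratic-loop point above concrete: the post's examples are frontend TypeScript/JavaScript, but the shape is language-neutral. A sketch with hypothetical Item and User records, first the nested scan, then the usual fix of indexing by key before the hot loop:

```go
package review

// Item and User are hypothetical records joined in a render path.
type Item struct{ OwnerID string }
type User struct{ ID, Name string }

// NamesQuadratic is the shape that hides in reviews: for every item, scan
// every user. Fine on 10 items, disastrous on 10,000 (O(n*m)).
func NamesQuadratic(items []Item, users []User) []string {
	names := make([]string, 0, len(items))
	for _, it := range items {
		for _, u := range users {
			if u.ID == it.OwnerID {
				names = append(names, u.Name)
				break
			}
		}
	}
	return names
}

// NamesIndexed builds a lookup table in one pass, then does O(1) per item.
func NamesIndexed(items []Item, users []User) []string {
	byID := make(map[string]string, len(users))
	for _, u := range users {
		byID[u.ID] = u.Name
	}
	names := make([]string, 0, len(items))
	for _, it := range items {
		names = append(names, byID[it.OwnerID])
	}
	return names
}
```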

0 views
Grumpy Gamer 2 months ago

Death By Scrolling Part 4

I was having a discussion with someone on Mastodon about unit testing games and how it is a next to impossible job. Once clarified, I think we saw eye-to-eye on it, but I do hear about it a lot, mostly from programmers that don't live in game dev. Testing games is hard. So much of what fails during testing is due to random behavior by the player: them doing something you didn't anticipate. I break testing down into three groups.

1 - Unit testing

This is what I hear about the most, and how it would supposedly be a good way to test games. It is not. Unit testing will test code components of your game, but not the game. If you have a sorting routine, this will test that, but it has little effect on testing your game. I do unit testing, mostly on the engine commands. This gets run once a week or when I add a new feature. It's testing that all the commands or functions return correct values, but has little to do with "testing the game". It's also something that would be very hard to run from the build process since the entire engine needs to start up. This is why I run it by hand every so often. If it's not being run in a real engine environment, it's not accurate.

2 - Automation

I've been using automation since Thimbleweed Park. I called it TesterTron 3000; it ran through the game and randomly simulated clicks and tried to follow some logic. It found a few things, mostly bugs that would only happen if you clicked very fast. It was fun to watch and gave mostly peace-of-mind testing. It often broke because we changed the game's logic and it could no longer follow along. If I had a team that did nothing but keep TesterTron 3000 working, it would be more useful, but given the limited resources of an indie team, I'm not sure it would have been worth it. I have something similar in Death By Scrolling: it runs through the levels, randomly picks up things, and tries to attack enemies. I could have it follow a set sequence of key strokes, but then it's only testing what you know works, not the goofy stuff that real bugs are made of. I've run into programmers that build something like this, and at first it feels good, but as the game changes it falls apart and doesn't get used much after that. Maybe if you had a simple puzzle game, this might be useful. TesterTron 3000 is a good stress tester, and it needs to run overnight to be truly useful.

3 - Human Tester

The most important testing we do is with testers who play the game all day long. They look at the Git history, see what has changed, and beat on that. 99% of our bugs are found this way and I can't emphasize enough how important it is. Players do odd things that automation never will. While it is important to test your own code, you will never find the "good" bugs. Programmers are lousy testers because they test what they know; you need to be testing what you don't know. Be afraid of management that says they don't need testers because the programmer should test their own code. I grew up in a world where a bug meant you had to remake a million floppy disks and it would take months to get the change out to players, if they ever got it.

0 views
Anton Zhiyanov 3 months ago

Expressive tests without testify/assert

Many Go programmers prefer using if-free test assertions to make their tests shorter and easier to read. So, instead of writing if statements with : They would use (or its evil twin, ): However, I don't think you need and its 40 different assertion functions to keep your tests clean. Here's an alternative approach. The testify package also provides mocks and test suite helpers. We won't talk about these — just about assertions.

Equality • Errors • Other assertions • Source code • Final thoughts

The most common type of test assertion is checking for equality: Let's write a basic generic assertion helper: We have to use a helper function, because the compiler doesn't allow us to compare a typed value with an untyped : Now let's use the assertion in our test: The parameter order in is (got, want), not (want, got) like it is in testify. It just feels more natural — saying "her name is Alice" instead of "Alice is her name". Also, unlike testify, our assertion doesn't support custom error messages. When a test fails, you'll end up checking the code anyway, so why bother? The default error message shows what's different, and the line number points to the rest. is already good enough for all equality checks, which probably make up to 70% of your test assertions. Not bad for a 20-line testify alternative!

But we can make it a little better, so let's not miss this chance. First, types like and have an method. We should use this method to make sure the comparison is accurate: Second, we can make comparing byte slices faster by using : Finally, let's call from our function: And test it on some values: Works like a charm!

Errors are everywhere in Go, so checking for them is an important part of testing: Error checks probably make up to 30% of your test assertions, so let's create a separate function for them. First we cover the basic cases — expecting no error and expecting an error: Usually we don't fail the test when an assertion fails, to see all the errors at once instead of hunting them one by one. The "unexpected error" case (want nil, got non-nil) is the only exception: the test terminates immediately because any following assertions probably won't make sense and could cause panics. Let's see how the assertion works: So far, so good.

Now let's cover the rest of error checking without introducing separate functions (ErrorIs, ErrorAs, ErrorContains, etc.) like testify does. If is an error, we'll use to check if the error matches the expected value: Usage example: If is a string, we'll check that the error message contains the expected substring: Usage example: Finally, if is a type, we'll use to check if the error matches the expected type: Usage example: One last thing: doesn't make it easy to check if there was some (non-nil) error without asserting its type or value (like in testify). Let's fix this by making the parameter optional: Usage example: Now handles all the cases we need: And it's still under 40 lines of code. Not bad, right?

and probably handle 85-95% of test assertions in a typical Go project. But there's still that tricky 5-15% left. We may need to check for conditions like these: Technically, we can use . But it looks a bit ugly: So let's introduce the third and final assertion function — . It's the simplest one of all: Now these assertions look better: Here's the full annotated source code for , and : Less than 120 lines of code! I don't think we need forty assertion functions to test Go apps.
Three (or even two) are enough, as long as they correctly check for equality and handle different error cases. I find the "assertion trio" — Equal, Err, and True — quite useful in practice. That's why I extracted it into the github.com/nalgeon/be mini-package. If you like the approach described in this article, give it a try!
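The article's own listings don't survive this excerpt, so here is a rough sketch of the trio it names (Equal, Err, True). The full version described above goes further, honouring types with an Equal method, comparing byte slices with bytes.Equal, and matching error types as well as values; this cut-down version only shows the shape:

```go
package be

import (
	"errors"
	"reflect"
	"strings"
	"testing"
)

// Equal asserts that got equals want, in (got, want) order.
// A simplified stand-in for the helper described in the article.
func Equal[T any](tb testing.TB, got, want T) {
	tb.Helper()
	if !reflect.DeepEqual(got, want) {
		tb.Errorf("want %v, got %v", want, got)
	}
}

// Err asserts error expectations. With no wants it requires some non-nil
// error; a nil want requires no error; a string want must appear in the
// message; an error want is matched with errors.Is.
func Err(tb testing.TB, err error, wants ...any) {
	tb.Helper()
	if len(wants) == 0 {
		if err == nil {
			tb.Error("want error, got nil")
		}
		return
	}
	for _, want := range wants {
		switch w := want.(type) {
		case nil:
			if err != nil {
				tb.Fatalf("unexpected error: %v", err) // fail fast, as the article argues
			}
		case string:
			if err == nil || !strings.Contains(err.Error(), w) {
				tb.Errorf("want error containing %q, got %v", w, err)
			}
		case error:
			if !errors.Is(err, w) {
				tb.Errorf("want error %v, got %v", w, err)
			}
		default:
			tb.Errorf("unsupported want type %T", want)
		}
	}
}

// True asserts an arbitrary condition.
func True(tb testing.TB, cond bool) {
	tb.Helper()
	if !cond {
		tb.Error("expected condition to be true")
	}
}
```

With helpers shaped like this, calls read as Equal(t, got, 42), Err(t, err, nil), Err(t, err, "not found"), and True(t, ok), in the (got, want) order the article prefers.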

0 views
James Stanley 3 months ago

AI Test User

Today we're soft-launching AI Test User. It's a robot that uses Firefox like a human to test your website and find bugs all by itself. If you have a site you'd like to test, submit it here, no signup required.

It's a project that Martin Falkus and I are working on, currently using Claude Computer Use. The aim at the moment is just to do test runs on as many people's websites as we can manage, so if you have a website and you're curious what the robot thinks of it, fill in the form to tell us where your website is, how to contact you, and any specific instructions if there's something specific you want it to look at. Try it now »

AI Test User aims to provide value by finding bugs in your website before your customers do, so that you can fix them sooner and keep your product quality higher. The way it works is we have a Docker container running Firefox, based on the Anthropic Computer Use Demo reference implementation. At startup we automatically navigate Firefox to the customer's website, and then give the bot a short prompt giving it login credentials supplied by the customer (if any), any specific instructions provided by the customer, and asking it to test the website and report bugs. We record a video of the screen inside the Docker container so that the humans can see what the machine saw. It has a tool that it can use to report a bug. Attached to each bug report is a screenshot of Firefox at the time the bug was detected, and a reference to the timestamp in the screen recording, so when you load up the bug report you can see exactly what the bot saw, as well as the bot's description of the bug. We ask it to report bugs for everything from spelling errors and UI/UX inconveniences, all the way up to complete crashes.

I have put up a page at https://incoherency.co.uk/examplebug/ that purports to be a login form, except it is deliberately broken in lots of fun and confusing ways. Cursor one-shotted this for me, which was great. Haters will say Cursor is only good at writing weird bugs at scale. I say it's not only good at writing weird bugs at scale, but you have to admit it is good at writing weird bugs at scale. You can see an example report for the examplebug site, in which the AI Test User made a valiant but futile effort with my Kafkaesque login form that keeps deleting the text it enters.

For a more realistic example run, you could see this report for an example shop. It goes through a WooCommerce shop, finds the product we asked it to buy, checks it out using the Stripe dummy card number, and then checks that it received the confirmation email. It reported a bug on that one because it didn't get the confirmation email. Or you could see it ordering roast dinner ingredients from Sainsbury's, stopping just short of booking a delivery slot. Apparently Sainsbury's don't have Yorkshire puddings?? I'm not sure what went wrong there, but AI Test User dutifully submitted a bug report. Good bot.

I've been reading Eric Ries's "The Lean Startup" recently, and I have worked out that despite best efforts, we have actually already done too much programming work. The goal should be to get a minimum viable product in the hands of customers as soon as possible, so we should not have even implemented automatic bug reporting yet (let alone automatic bug deduplication, which seemed so important while I was programming it, but could easily turn out to be worthless).

We also give the machine access to a per-session "email inbox". This is a page hosted on host.docker.internal that just lists all of the emails received on its per-session email address. This basically only exists to handle flows based around email authentication. It doesn't have the ability to send any emails, just see what it received and click links. (Again, maybe we should have skipped this until after seeing if anyone wants to use it.)

One issue we've run into is the classic "LLM sycophancy" syndrome. Upon meeting with a big red error message, the bot would sometimes say "Great! The application has robust validation procedures", instead of reporting a bug! We don't have a great fix for that yet, other than saying in the initial prompt that we really, really want it to report bugs pretty please. It seems we are all prompt engineers on this blessed day.

We don't really know yet what to do about pricing. There is some placeholder pricing on the brochure site but that could easily have to change. One of the issues with this technology at the moment is that it's very slow and very expensive. Which people might not like. Being very slow isn't necessarily a problem; we have some tricks to automatically trim down the screen recording so that the user doesn't have to sit through minutes of uneventful robot thinking time. But being expensive definitely is a problem. We are betting that the cost will come down over time, but until that happens either it will have to provide value commensurate with the cost, or else it isn't economically viable yet. Time will tell.

We have found that the Anthropic API is not as reliable as we'd like. It's not uncommon to get "Overloaded" responses from the API for 10 minutes straight, meanwhile status.anthropic.com is all green and reporting no issues at all. We've also tried out an Azure-hosted version of OpenAI's computer-use service, and also found it very flakey. For now the Anthropic one looks better, but it may be that we would want to dynamically switch based on which is more reliable at any given time.

There is a spectrum of automated website testing, from fully-scripted static Playwright tests on one end, through AI-maintained Playwright tests, agentic tests with rigid flows, agentic tests that explore the site on their own and work out what to test, up to a fully-automated QA engineer which would decide on its own when and what to test, and work with developers to get the bugs fixed. AI Test User is (currently!) positioned somewhere between "agentic tests with rigid flows" and "agentic tests that explore on their own".

There are a few approximate competitors in the space of AI-powered website testing. QAWolf, Heal, and Reflect seem to be using AI to generate more traditional scripted tests. Autosana looks to be more like the agentic testing that we are doing, except for mobile apps instead of websites. I'm not aware of anyone doing exactly what we're doing but I would be surprised if we're the only ones. It very much feels like agentic testing's time has come, and now it's a race to see who can make a viable product out of it.

And there are further generalisations of the technology. There is a lot you can do with a robot that can use a computer like a human does. Joe Hewett is working on a product at Asteroid (YC-funded) where they are using a computer-using agent to automate browser workflows, especially for regulated industries like insurance and healthcare.

Interested in AI Test User? Submit your site and we'll test it today.
We're also interested in general comments and feedback, you can email [email protected] and we'll be glad to hear from you :).
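As an illustration of the bug-report mechanism described above (a tool call that captures a screenshot and a recording timestamp alongside the bot's description), here is a sketch with hypothetical names; the real service presumably wires something like this into the Claude tool-use loop:

```go
package tester

import "time"

// BugReport is a hypothetical shape for the report described above: the
// agent's description plus the evidence a human needs to replay it.
type BugReport struct {
	Description   string        // the bot's own account of what went wrong
	Severity      string        // "typo" .. "crash"
	ScreenshotPNG []byte        // Firefox at the moment the bug was spotted
	VideoOffset   time.Duration // where to seek in the session recording
	PageURL       string
}

// ReportBug is what a hypothetical "report_bug" tool handler might do with
// the agent's call: attach the current screenshot and recording offset.
func ReportBug(session *Session, description, severity string) (BugReport, error) {
	shot, err := session.Screenshot()
	if err != nil {
		return BugReport{}, err
	}
	return BugReport{
		Description:   description,
		Severity:      severity,
		ScreenshotPNG: shot,
		VideoOffset:   time.Since(session.RecordingStart),
		PageURL:       session.CurrentURL(),
	}, nil
}

// Session stands in for the per-run Docker/Firefox state.
type Session struct{ RecordingStart time.Time }

func (s *Session) Screenshot() ([]byte, error) { return nil, nil } // capture omitted
func (s *Session) CurrentURL() string          { return "" }       // lookup omitted
```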

0 views
Nicky Reinert 3 months ago

Monitor SSL Traffic On Android

Fantastic Preface

Monitoring the SSL traffic of Android apps is quite a challenge these days. While it's an important way to understand how apps work and find security flaws or possible data breaches, weakly encrypted traffic is also a potential security risk. These days, most apps use a thing called …

0 views
André Arko 3 months ago

You should delete tests

We've had decades of thought leadership around testing, especially coming from holistic development philosophies like Agile, TDD, and BDD. After all that time and several supposedly superseding movements, the developers I talk to seem to have developed a folk wisdom around tests. That consensus seems to boil down to simple but mostly helpful axioms, like "include tests for your changes" and "write a new test when you fix a bug to prevent regressions". Unfortunately, one of those consensus beliefs seems to be "it is blasphemy to delete a test", and that belief is not just wrong but actively harmful. Let's talk about why you should delete tests.

To know why we should delete tests, let's start with why we write tests in the first place. Why do we write tests? At the surface level, it's to see if our program works the way we expect. But that doesn't explain why we would write automated tests rather than simply run our program and observe if it works.

If you've ever tried to work on a project with no tests, I'm sure you've experienced the sinking sensation of backing yourself into a corner over time. The longer the project runs, the worse it gets, and eventually every possible change includes stressfully wondering if you broke something, wondering what you missed, and frantically deploying fix after revert after fix after revert as fast as possible because each frantic fix broke something else.

0 views

Measuring my Framework laptop's performance in 3 positions

A few months ago, I was talking with a friend about my ergonomic setup and they asked if being vertical helps it with cooling. I wasn't sure, because it seems like it could help, but it was probably such a small difference that it wouldn't matter. So, I did what any self-respecting nerd would do: I procrastinated. The question didn't leave me, though, so after those months passed, I did the second thing any self-respecting nerd would do: benchmarks.

What we want to find out is whether or not the position of the laptop would affect its CPU performance. I wanted to measure it in three positions: My hypothesis was that using it closed would slightly reduce CPU performance, and that using it normal or vertical would be roughly the same.

For this experiment, I'm using my personal laptop. It's one of the early Framework laptops (2nd batch of shipments) which is about four years old. It has an 11th gen Intel CPU in it, the i7-1165G7. My laptop will be sitting on a laptop riser for the closed and normal positions, and it will be sitting in my ergonomic tray for the vertical one. For all three, it will be connected to the same set of peripherals through a single USB-C cable, and the internal display is disabled for all three.

I'm not too interested in the initial boost clock. I'm more interested in what clock speeds we can sustain. What happens under a sustained, heavy load, when we hit a saturation point and can't shed any more heat? To test that, I'm doing a test using heavy CPU load. The load is generated by stress-ng, which also reports some statistics. Most notably, it reports CPU temperatures and clock speeds during the tests. Here's the script I wrote to make these consistent. To skip the boost clock period, I warm it up first with a 3-minute load. Then I do a 5-minute load and measure the CPU clock frequency and CPU temps every second along the way. We need since we're using an option ( ) which needs root privileges [1] and attempts to make the CPU run harder/hotter. Then we specify the stressor we're using with , which does some matrix calculations over a number of cores we specify. The remaining options are about reporting and logging.

I let the computer cool for a minute or two between each test, but not for a scientific reason. Just because I was doing other things. Since my goal was to saturate the temperatures, and they got stable within each warmup period, cooldown time wasn't necessary—we'd warm it back up anyway.

So, I ran this with the three positions, and with two core count options: 8, one per thread on my CPU; and 4, one per physical core on my CPU. Once it was done, I analyzed the results. I took the average clock speed across the 5-minute test for each of the configurations.

My hypothesis was partially right and partially wrong. When doing 8 threads, each position had different results: With 4 threads, the results were: So, I was wrong in one big aspect: it does make a clearly measurable difference. Having it open and vertical reduces temps by 3 degrees in one test and 5 in the other, and it had a higher clock speed (by 0.05 GHz, which isn't a lot but isn't nothing). We can infer that, since clock speeds improved in the heavier load test but not in the lighter load test, the lighter load isn't hitting our thermal limits—and when we do, the extra cooling from the vertical position really helps. One thing is clear: in all cases, the CPU ran slower when the laptop was closed. It's sorta weird that the CPU temps went down when closed in the second test.

I wonder if that's from being able to cool down more when it throttled down a lot, or if there was a hotspot that throttled the CPU but which wasn't reflected in the temp data, maybe a different sensor.

I'm not sure if having my laptop vertical like I do will ever make a perceptible performance difference. At any rate, that's not why I do it. But it does have lower temps, and that should let my fans run less often and be quieter when they do. That's a win in my book. It also means that when I run CPU-intensive things (say hi to every single Rust compile!) I should not close the laptop. And hey, if I decide to work from my armchair using my ergonomic tray, I can argue it's for efficiency: boss, I just gotta eke out those extra clock cycles.

I'm not sure that this made any difference on my system. I didn't want to rerun the whole set without it, though, and it doesn't invalidate the tests if it simply wasn't doing anything. ↩
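The script itself doesn't survive this excerpt. As a rough sketch of the protocol described (a 3-minute warmup to get past boost clocks, then a 5-minute measured run sampled once per second), assuming stress-ng's --matrix and --timeout options and the usual Linux cpufreq sysfs path; the author's real script relied on stress-ng's own temperature and frequency reporting instead:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// readMHz samples cpu0's current frequency from sysfs (kHz to MHz).
// The path is the usual Linux cpufreq location, not something from the post.
func readMHz() (float64, error) {
	b, err := os.ReadFile("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")
	if err != nil {
		return 0, err
	}
	khz, err := strconv.ParseFloat(strings.TrimSpace(string(b)), 64)
	return khz / 1000, err
}

// runStress builds a stress-ng command: --matrix spawns matrix-math
// workers, --timeout bounds the run.
func runStress(workers int, d time.Duration) *exec.Cmd {
	cmd := exec.Command("stress-ng",
		"--matrix", strconv.Itoa(workers),
		"--timeout", fmt.Sprintf("%ds", int(d.Seconds())))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd
}

func main() {
	const workers = 8 // one per thread; rerun with 4 for one per physical core

	// Warm up past the boost-clock period so we only measure sustained load.
	if err := runStress(workers, 3*time.Minute).Run(); err != nil {
		panic(err)
	}

	// Measured run: sample the clock once per second for five minutes.
	cmd := runStress(workers, 5*time.Minute)
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	var sum float64
	samples := 0
	for samples < 300 {
		<-tick.C
		if mhz, err := readMHz(); err == nil {
			sum += mhz
			samples++
		}
	}
	cmd.Wait()
	fmt.Printf("average sustained clock: %.0f MHz over %d samples\n",
		sum/float64(samples), samples)
}
```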

0 views