The test suite as a regression sensor
Birgitta Böckeler finishes her post on sensors for coding agents by examining the role of a test suite as a regression sensor, focusing on the role mutation testing can play.
A decade ago, I settled on iozone for disk benchmarking on all my systems. Tools like the 'Flexible IO' tester are a little more capable for raw disk performance testing, and other tools test network-scale filesystems better, but iozone gives me an easy overview of real-world disk performance across hard drives and SSDs, and runs on Mac, Windows, and Linux (and a smattering of other OSes). It's been around since 1991, and is still updated today. In fact, the two latest updates (versions 509 and 510) contain patches I sent in to get iozone to compile on Apple Silicon Macs running newer releases of macOS.
A nice but unpolished onboarding callout in Google Docs, directing people towards a more useful shortcut. I’m holding arrow keys without ⇧ here first, then with ⇧:

To improve it, I would add some sort of small celebratory “completed!” state, and auto-hide it afterwards; right now, it seems that it hides on a delay, likely regardless of what happens.

(Testing onboarding is hard because once it’s invoked it disappears forever. If you are worried about onboarding experiences in a place you work, please insist on easy toggles to bring it back for testing. And no, one-size-fits-all “reset onboarding” is too crude; ideally you can reset each specific one easily through a simple UI.)

Thank you to Ezra Spier for the tip.
In my last blog post I introduced Dimster (DIMensional teSTER), a performance benchmarking tool for Apache Kafka with a specific set of philosophies. In this first share group benchmarking post, we’re going to use share groups as they are not intended to be used, but for a good reason.

Share groups allow you to move past partitions as the unit of parallelism by allowing multiple consumers to read from the same partition, using message queue semantics. We’ll run those kinds of tests in the next post. In this post I just want to understand whether the mechanics of how share groups work add any overhead compared to consumer groups. So we’ll use share groups as if they were consumer groups (by capping consumer count to partition count).

Objective: Use synthetic tests to measure the overhead of share groups compared to consumer groups in identical conditions.

How: Like-for-like tests which use an identical workload/topology, with consumerType (CONSUMER_GROUP|SHARE_GROUP) as a dimension. Given identical producer/consumer counts, producer rate, and topic/partition counts, do share groups scale as well as consumer groups? Do they add any latency overhead?

These benchmarks are educational. They are not hard numbers and they are not some kind of canonical result (in fact, no such benchmark exists). And again, this is not a realistic test at all; it only serves to understand share group overhead.

I ran all these benchmarks on a k3d Kubernetes cluster on my Threadripper 9980X:

- 64 cores (128 threads)
- 256 GB DDR5 memory
- Two Samsung 9100 PRO 8 TB (with one dedicated to the benchmarks)
- Pretty decent CPU and RAM cooling.

This is not a production setup, but the hardware is more than capable of handling a small to medium sized Kafka cluster with excellent performance. The SSD can sustain around 1.7 GB/s once the SLC cache has filled up, and none of these benchmarks exceed that in aggregate across the 3 brokers.
All tests were run with TLS between the clients and brokers and between each broker. I prefer to run benchmarks with TLS enabled (though it reduces the numbers) because most people (hopefully?) run Kafka with full TLS.

Dimster uses named environments located in the dimster-config.yaml. Each environment targets a specific k8s cluster (via kubectl context), specifies the Kafka and client versions, sizes the Kafka pods, and determines heap sizes, broker and log config files etc, all in one yaml block.

This environment uses 36 of 128 CPU threads (18 of 64 cores) and 72 GB of 256 GB of RAM of my workstation, so we’re not pushing the Threadripper too hard. Note that the ‘requests’ field block is applied to both k8s requests and limits. The client pod is over-provisioned with 12 CPU cores (24 threads) and 24 GB RAM to avoid any client bottlenecks causing spurious results.

The tests in this post compare consumer groups with share groups. To do that, I tried to isolate other factors as much as possible. Random load skew is one such important factor. In these tests, I ensured that load was as even as possible over the brokers:

- Message distribution over the partitions of a given topic was even. I used the Dimster message distributor PINNED_PARTITIONS, which ensures the number of producers is divisible by the number of brokers and pins each producer to a set of partitions. Each producer sends round-robin to its partitions directly.
- Multi-topic tests used a topic count divisible by the number of brokers to ensure even distribution of leaders over brokers.
- Consumer counts per group were divisible by the number of brokers to ensure even distribution of partitions over consumers.

Fig 1. Dimster’s partition pinning for even load distribution

This is not like real life, but for this post I want to avoid the randomness involved with partition and broker skew so that we can compare consumer group vs share group performance without load skew randomness playing a role.
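To make the pinning idea concrete, here is a minimal sketch of how producers could be mapped to fixed partition sets and then cycle through them. It is not Dimster's implementation: the function names, the strided assignment, and the requirement that partitions divide evenly among producers are all assumptions made for illustration.

```python
from itertools import cycle, islice


def pin_partitions(num_partitions: int, num_producers: int, num_brokers: int) -> list[list[int]]:
    """Give each producer a fixed, equally sized set of partitions.

    Hypothetical helper: the strided layout (producer p owns p, p+P, p+2P, ...)
    is an assumption, not necessarily what PINNED_PARTITIONS does.
    """
    if num_producers % num_brokers != 0:
        raise ValueError("producer count must be divisible by broker count")
    if num_partitions % num_producers != 0:
        raise ValueError("this sketch assumes partitions divide evenly among producers")
    return [list(range(p, num_partitions, num_producers)) for p in range(num_producers)]


def send_order(owned: list[int], num_messages: int) -> list[int]:
    """Round-robin over a producer's own partitions for a number of sends."""
    return list(islice(cycle(owned), num_messages))


layout = pin_partitions(num_partitions=12, num_producers=6, num_brokers=3)
first_sends = send_order(layout[0], 5)
```

Because every partition has exactly one producer and every producer owns the same number of partitions, a constant per-producer rate gives each partition the same message rate, which is the property the tests above rely on.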
I’ll be writing about and running benchmarks with partition and broker skew in a future post.

Link to results as a tarball

For the throughput benchmarks, I used Dimster’s explore mode, which probes the cluster to find the highest sustainable throughput while staying under a target end-to-end latency in ms and percentile (50 ms, p75 in this case). It measures e2e latency per partition and uses the latency of the poorest performing partition as the yardstick. Explore mode runs in phases:

- Ramp: Start with a low throughput and keep doubling the throughput after a configured interval. Once the e2e latency exceeds the limit, move to the next phase.
- Search: Perform a binary search within the bounds of [0 - max-ramp-throughput]. It starts at the midpoint and if it can sustain that throughput, it searches the high range starting at the midpoint. If it can’t sustain it, then it searches the low range. It recursively performs the search until the current search range size is < 5% of the throughput. Then it moves to the sustain phase.
- Sustain: The throughput identified by the search phase is maintained for a prolonged period. If it passes, the test is complete. If it fails to sustain (under the target e2e latency), it goes back to the search phase, with the failed sustain throughput as the new upper bound of the search range.

The sustain phase is successful if 80% of the intervals (30 intervals of 10 seconds by default) meet the latency criteria. This rule exists because explore mode is trying to find the highest sustainable throughput, which sits on the edge of the cluster’s limit, allowing for some latency spikes.

I ran explore mode on the following workload:

The first scenario has 4 test points which co-vary 4 workload aspects related to partition, client counts and consumer type as dimensions, repeating the tests 3 times.

Fig 2. The merged result of three repeats (only small variance between runs)

We see that share groups matched or even exceeded consumer group performance. Moreover, this pattern was broadly the same across the three test repeats. We can’t infer this as a generalizable result based on this one test, but my general observation, having been running these tests for a few weeks on EKS clusters, my Threadripper and my Mac, is that throughput in this kind of synthetic test is comparable (between consumer/share groups).

Scenario 2 - Varying fanout

This scenario involved 1 topic with 12 partitions with a fanout of 2 and then 6.

Fig 3. The merged result of three repeats (only small variance between runs)

The surprising result was that share groups maintained a higher sustainable throughput with a fanout of 6. Explore mode is sensitive to spiky latency, and one thing I’ve observed is that share group latency can be more stable under stressful loads than consumer groups. Again, this may not be generalizable, but it shows that share groups might actually outperform consumer groups in some cases.

I think the main takeaway from these limited tests is that share groups and consumer groups are in the same ballpark in terms of raw throughput.

Link to results as a tarball

The throughput benchmarks were a stress test of sorts, pushing Kafka right up to its limit. CPU was maxed out. We don’t want that for the latency benchmarks. We’re not going to push the Kafka cluster to the limit, as we want to measure latencies within the performance envelope. With 4 vCPUs, around 100 clients and TLS, a 15 MB/s (1.3 TB daily) workload fits comfortably inside that envelope.

I used run-mode, which provides the standard fixed throughput benchmarks (best for measuring latency). I ran a single test campaign with 3 scenarios where consumerType was the dimension:

- 1 topic with 60 partitions, 30 producers, 60 consumers.
- 12 topics with 6 partitions, 6 consumers per topic, 3 producers per topic.
- 6 topics with 6 partitions, 3 consumer groups per topic with 6 consumers each, 3 producers per topic.

All ran with an aggregate producer rate of 15000 msg/s with a 1 KB message size (15 MB/s).

Fig 4. End-to-end latency (p99) over time (10 second intervals). Note: you can select a time range on Dimster charts to zoom into a sub-range.

Under this lighter load, we see that share groups add some overhead, with the e2e p99 latency being a little more choppy than the much flatter consumer group latency.

Fig 5. End-to-end latency distribution. Note: you can select a percentile range on Dimster charts to zoom into a sub-range.

Fig 6. p99 end-to-end latency over time (10 second intervals)

The share group overhead is more pronounced in this test.

Fig 7. End-to-end latency distribution.

Fig 8. p99 end-to-end latency over time (10 second intervals)

Again we see the same overhead. The takeaway is that for an adequately sized cluster that is not stressed by the workload, we can expect to see some small share group end-to-end latency overhead.

Just to show you this isn’t an artifact of running these tests on k3d on a single workstation, we see the same pattern on a 50 MB/s test I ran a few weeks ago on AWS EKS with the m6i.2xlarge instance (8 vCPU, 32 GB RAM, EBS).

Fig 9. 50 MB/s test, p99 end-to-end latency over time (10 second intervals) on an EKS cluster

And a 150 MB/s test which was more stressful:

Fig 10. 150 MB/s test, p99 end-to-end latency over time (10 second intervals) on an EKS cluster

We see the typical Kafka latency spikes related to log flushing and rotation (which have this predictable cadence because all load starts at the same time, at a constant rate, on one topic).

The share group tests consistently used more CPU than the consumer group tests, which is understandable given share groups do a lot more accounting and state management than consumer groups.
For example, the first repeat of scenario 1 of the latency test (executed as test points CG, SG, CG, SG, CG, SG):

Fig 11. CPU over three apache/kafka pods

In all these tests, consumers did nothing with the messages except record some metrics. In the real world, consumers write to databases and call APIs. It might take anywhere from < 1 ms to 30+ seconds to process a message. More useful benchmarks simulate consumer processing time, which is exactly what we’ll do in the next post. When we add processing time, we start to see where share groups really shine.

To summarize some findings from this post:

- Share groups add a little overhead which might show up in a latency benchmark.
- Share groups consume more CPU.
- Raw throughput benchmarks will probably see varied results, but share groups are not fundamentally slower than consumer groups.
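The ramp, search and sustain phases of explore mode described earlier can be sketched as a small control loop. This is only an illustration of the technique under stated assumptions, not Dimster's code: the `probe` and `sustain` callbacks, the starting rate, resetting the lower bound to 0 after a failed sustain, and measuring the 5% tolerance against the current upper bound are all choices made here for the sketch.

```python
from typing import Callable, Sequence


def sustain_passes(p75_latencies_ms: Sequence[float], target_ms: float = 50.0,
                   required_fraction: float = 0.8) -> bool:
    """Sustain rule from the post: enough intervals must meet the latency target."""
    ok = sum(1 for latency in p75_latencies_ms if latency <= target_ms)
    return ok / len(p75_latencies_ms) >= required_fraction


def explore(probe: Callable[[float], bool], sustain: Callable[[float], bool],
            start_rate: float = 1_000.0, tolerance: float = 0.05) -> float:
    """Find the highest sustainable rate. `probe` and `sustain` are hypothetical
    callbacks that run a short or long test and report whether latency stayed
    under the target."""
    if not probe(start_rate):
        raise ValueError("start_rate already exceeds the latency target")

    # Ramp: keep doubling until the latency limit is exceeded.
    rate = start_rate
    while probe(rate):
        rate *= 2
    low, high = 0.0, rate

    while True:
        # Search: binary search until the range is under 5% of the upper bound.
        while high - low >= tolerance * high:
            mid = (low + high) / 2
            if probe(mid):
                low = mid
            else:
                high = mid
            if high < start_rate:
                raise RuntimeError("no sustainable rate found above start_rate")
        candidate = low

        # Sustain: hold the candidate; on failure it becomes the new upper bound.
        if sustain(candidate):
            return candidate
        low, high = 0.0, candidate


# Simulated cluster: short probes pass up to 50k msg/s, long runs only up to 48k.
found = explore(probe=lambda r: r <= 50_000, sustain=lambda r: r <= 48_000)
```

With the simulated callbacks, the loop first settles near the short-term limit, fails to sustain it, and then narrows down to a rate below the long-run limit, which mirrors the back-and-forth between search and sustain in the description.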
Dimster = DIMensional teSTER for Apache Kafka

On GitHub: https://github.com/dimster-hq/dimster

Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar, but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past.

With Claude becoming so much better since Christmas, I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for Apache Kafka.

In this post I’ll introduce some aspects of Dimster that are core to its design:

- Dimensional testing
- Shareable, self-contained results with reproducibility in mind
- Benchmark prep and post-processing
- Kubernetes as a standardized runtime

A benchmarking and stress testing technique I’ve used for years is something I have called “Dimensional Testing”. We can think of all the configs and workload aspects as forming an N-dimensional space. Within that space we can explore the impact of points along a single dimension, or even co-varying dimensions.

Take a config or an aspect of a workload as a dimension, and run a series of identical benchmarks where a set of points along that dimension are explored (while everything else remains the same). The dimension could be a client config, such as batch.size or acks. It could be an aspect of the workload such as number of consumers, type of consumer, number of consumer groups, the partition count, the produce rate and so on. There are hundreds of dimensions to explore, which requires some patience and care lest you become overwhelmed.
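As a rough illustration of the idea, a scenario can be seen as a base workload plus one or more dimensions whose values are applied point by point. The sketch below is not Dimster's workload format: the dictionary shape, the key names other than batch.size and consumerType, and the `expand_scenario` helper are hypothetical.

```python
def expand_scenario(base: dict, dimensions: dict[str, list]) -> list[dict]:
    """Turn a base workload plus dimension values into one workload per test point.

    Several dimensions are co-varied (zipped), not combined as a cross product,
    so every dimension must list the same number of values.
    """
    lengths = {len(values) for values in dimensions.values()}
    if len(lengths) != 1:
        raise ValueError("co-varied dimensions need the same number of values")
    count = lengths.pop()
    return [
        {**base, **{name: values[i] for name, values in dimensions.items()}}
        for i in range(count)
    ]


# Hypothetical base workload; only the dimension values differ between points.
base = {"topics": 1, "partitions": 12, "consumers": 12, "batch.size": 16384}

# One dimension: vary a client config while everything else stays fixed.
batch_points = expand_scenario(base, {"batch.size": [16384, 65536, 262144]})

# Two co-varying dimensions: partitions and consumers move together.
scaling_points = expand_scenario(base, {"partitions": [6, 12, 24], "consumers": [6, 12, 24]})
```

Each resulting dictionary would be run as a separate benchmark, so any difference between the results can be attributed to the dimension values that changed.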
The figure below depicts just three dimensions, and a set of three scenarios which test performance along one or two dimensions at a time.

Fig 1. Three examples of varying or co-varying an aspect of a workload as dimensions

Each of the above 16 test points (across 3 scenarios) is a separate benchmark, with a fresh topic, warm-up time, recorded time, and cooldown time etc. The generated charts for throughput and various latencies are repeated for each of the three scenarios, with each test point within a scenario plotted as a series/bar on those charts. This makes it easy to compare the performance results of varying the values of a single dimension (or co-varying values across multiple dimensions).

Fig 2. Each scenario maps to a set of charts, with the test points as data series.

With share groups being relatively new, I could compare the performance of regular consumers against share group consumers, with identical benchmarks where the dimension explored is consumer type (CONSUMER_GROUP|SHARE_GROUP).

The following test has a base workload of ten topics, with each topic having 6 partitions, 6 consumers and 4 producers. Each scenario changes the producer rate, and compares consumer groups to share groups. Record keys are used, so batch sizes will be small, which is a tougher workload than a no-key test which typically results in larger batches.

The charts below show the results for an EKS deployment with Kafka deployed on 3x m6i.2xlarge with 300 MB/s provisioned gp3.

At 50 MB/s we see that p99 end-to-end latency is stable, with roughly 15 ms overhead for share groups.

At 200 MB/s, p99 end-to-end latency exhibits periodic peaks.

Dimster uses environments. The sizing of a test is determined by which environment is used. I ran some share group consumer scaling tests, with full mTLS, on Kafka clusters assigned 2, 4, and 8 CPUs. These are the equivalent of vCPUs, as my Threadripper has SMT (hyperthreading) enabled.
2-CPU environment on my Threadripper:

I ran the following workload with the above environment, with CPU requests/limits of 2, 4 and 8. Then I used the dimster compare command to generate comparison charts based on the JSON result files of each run. Each chart compares each test point side-by-side.

10k msg/s - 1000 consumers (6th test point in 1st scenario)

We see that 2 CPUs fare a lot worse than 4 and 8 CPUs.

100k msg/s, 250 consumers (4th test point, 3rd scenario)

The 2 CPU cluster simply can’t keep up with 100k msg/s and 250 consumers. If we unselect 2-CPU, we see that 4-CPU and 8-CPU were OK.

Dimster charts are interactive. Series can be toggled, and time and percentile ranges can be selected.

One thing I really like about OMB is that it produces a JSON file for the results. These files are easy to store and easy to share. But there was also a lot missing for full traceability and reproducibility. Dimster includes the following in every test campaign result (a set of files in a result directory):

- Results: The JSON result file which contains all the test point performance results. For each test point, it includes the effective workload and client configuration. It also includes the hardware and other metadata to know what the benchmark was run against.
- A CSV file generated from the result JSON file (to make it easy to put in a spreadsheet or run custom visualizations).
- Source configs: The source workload file itself, as well as any additional files such as any dedicated client config file, the broker config file, the version of Kafka, the version of the Kafka clients, and the CPU/memory/disk given to the brokers and clients.
- Log files: The log files of dimster-core, the benchmarking framework, and each Kafka broker.
- Charts: Throughput and latency charts (clickable, zoomable) generated from the result JSON file.
- Dashboards: Grafana dashboards converted to interactive HTML files.
I can run a test campaign then send you the results, and you’ll be able to reproduce the results because you know exactly what was run and on what. The results are also completely self-contained: if you want to see the dashboard to look at Kafka metrics during the test, it’s right there as an HTML file in the results. No need for access to Grafana and Prometheus and no need to keep monitoring infrastructure around; it can be ephemeral.

Dimster comes with four test modes (which all support dimensional testing):

- Run: Fixed throughput benchmarks, plus:
  - Live-interaction: Run-mode also supports live interaction with the user. The user can change the producer rate, number of producers and consumers, message size, etc.
  - Availability: Optionally measure availability (producer/consumer/aggregate) during the standard run-mode benchmark.
- Explore: Discover the highest sustainable throughput while staying under a target end-to-end latency and percentile.
- Drain-backlog: Build a backlog and time how long it takes for the consumers to drain it. Optionally set a producer rate during the drain phase, such as when testing if a cluster is big enough to drain a backlog while under normal producer load.
- Correctness: Detects data loss, data corruption, out-of-order delivery and duplicates.

Example 1: Peak sustainable throughput, 1 partition, share group consumers

Explore mode on my Threadripper. The idea was to see the bottleneck of a single partition, as consumers are scaled out. The rule was for p75 e2e latency to stay below 50 ms.

Example 2: Consumer group vs share group with 1 ms processing time

The prior example was an unrealistic synthetic test where the consumer spent no time processing. This explore test added 1 ms consumer processing time per message with 300 consumers. It compared a 300 member consumer group with 300 partitions, vs a 300 member share group, with 5, 10, 25 and 50 partitions.
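The throughput ceiling for a test like Example 2 follows directly from the processing time and the consumer count. A minimal worked calculation, assuming each consumer handles one message at a time and spends nothing beyond the configured processing time (the function name is made up for this sketch):

```python
def theoretical_max_msgs_per_sec(consumers: int, processing_ms: float) -> float:
    """Upper bound on throughput when every consumer is busy all the time."""
    per_consumer = 1000.0 / processing_ms  # messages one consumer can finish per second
    return consumers * per_consumer


ceiling = theoretical_max_msgs_per_sec(consumers=300, processing_ms=1.0)
print(f"ceiling: {ceiling:,.0f} msg/s, 95% of ceiling: {0.95 * ceiling:,.0f} msg/s")
```

Under these assumptions, 300 consumers at 1 ms per message top out at 300,000 msg/s, so a result at 95% of the theoretical maximum corresponds to roughly 285,000 msg/s.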
Share groups managed the same throughput (95% of theoretical max based on 1 ms processing time and consumer count) on only 10 partitions. Consumer groups needed 300 partitions.

Personally, explore and run are my bread and butter benchmark modes. For a given workload I usually start by finding the throughput limit where Kafka transitions from normal stable performance into degraded territory. I either use run mode with live interaction to discover the performance limit, or I use explore, which is slower but which I can leave running while it discovers the limit in an automated way. For latency benchmarks, once I know the limit, I can craft benchmarks that fit inside the performance envelope for that workload on the specific version of Kafka on the specific hardware I am using.

The Dimster CLI has some commands that help before running benchmarks and for post-processing.

Dimster resources command

The resources command calculates the network and disk throughput required to service a workload. This is important in the cloud for selecting the right instances, ensuring that baseline network and disk throughput are greater than the workload’s demands.

Dimster compare command

Compare different runs that were executed on different hardware, different broker configurations, different broker versions etc.

Dimster pivot command

You can slice and dice the data any way you want based on the CSV data. However, you can also pivot the results and generate a chart with the pivot command. This compares the Nth test point across all scenarios.

Dimster is easiest to use with Kubernetes. Dimster has a CLI you use from your laptop which speaks Kubernetes and leverages it to run benchmarks on any hardware, any cloud, any laptop or workstation using the exact same orchestration logic. All it needs is a properly configured k8s cluster. It could be minikube or k3d on a laptop or workstation, or AWS EKS or Google Cloud GKE or your own in-house cluster.
You can tell Dimster to deploy Apache Kafka to a stateful set in the k8s cluster:

Fig 3. Dimster architecture in full deploy mode

Or point Dimster (deployed to k8s) at a Kafka service or in-house Kafka cluster. When testing a Kafka service, you can provision a single powerful instance for the Dimster coordinator and worker, and deploy them to a local k8s distro such as Minikube, K3d or Kind. A single worker will happily consume all the cores and memory you give it.

Fig 4. Dimster architecture in external deploy mode

Or run a super-slim full setup in a tiny minikube/kind/etc local k8s distro:

Fig 5. Dimster deployed in a tiny local k8s cluster

The workflow is the same. If you can provide a k8s cluster, then Dimster does the rest. Deployment is really simple, and monitoring, gathering results and troubleshooting are all simplified via a mix of the CLI being relatively capable and k8s providing a well-understood platform.

K8s is not obligatory: you can run dimster-core directly as a Java program, and point it at a Kafka cluster already provisioned. But you lose many features such as monitoring, live-interaction, automatic gathering of logs, automatic chart and CSV generation and so on. However, you can use the post-processing command dimster chart to generate the charts of a result JSON file.

Run the Java directly via the benchmark script: ./bin/benchmark -w path/to/workload file

I will be publishing blog posts regularly about Dimster and what you can do with it. So stay tuned.

I invite you to go and play around with Dimster, even if it's just running benchmarks on your laptop or workstation. You can get an idea of what charts get produced and what kinds of benchmarks you can run, and try out dimensional testing etc. The docs are pretty decent and should cover most of it. It’s fully featured but still a 0.X version.
A Confluent colleague and I are the only ones who have run it thus far, so there may be bugs. If you do encounter a problem, please open an issue with repro steps.

If you want to run serious benchmarks, you’ll likely need an EKS or GKE type of Kubernetes cluster. Dimster comes with a special CLI for EKS to deploy EKS with node groups for Kafka, Dimster workers/coordinator, Grafana/Prometheus, as well as storage classes for gp3.

While evaluating consumer group vs share group consumers, I’ve been running benchmarks in k3d on my beefy Threadripper 9980X workstation with 64 cores (128 threads), 256 GB RAM and a Samsung 9100 PRO 8TB SSD, which is plenty to run an entire medium sized Kafka cluster plus workers on it. I’ll be sharing some share group benchmarks tomorrow.

Happy testing!
Birgitta Böckeler adds discussion of three more sensors for static code analysis, focusing on checking and enforcing better modularity. Computational sensors for dependency checks were good at enforcing rules, but the rules were limited. Building a computational sensor for coupling data proved lackluster. Prompting an inferential sensor to review modularity was more effective.
This article is not about AI and it is not written with AI, but the work that I’m about to present was definitely motivated by AI. And because I generally like telling stories, I have to give you that background. Do with that whatever you want, but… it’d be a pity if you left just because the AI word showed up in the first paragraph! I think the technical explanation that follows is at the very least entertaining and also interesting independently of AI.

Back in December, I started toying with coding agents. One thing I tried, and for which I didn’t expect a lot of success, was to point an AI agent to the EndBASIC public documentation and ask it to write games like Space Invaders or Mario from scratch. And even though the results weren’t perfect and they didn’t work on the first try, they did work with a few tiny tweaks. Combining that with a bunch of hand-written rules, I had an agent producing EndBASIC demos with ease.

This experiment was impressive because I did not expect an agent to be able to write EndBASIC code… and because it worked, it fueled my interest to pick EndBASIC’s own development back up. Three thoughts came to mind:

- Increase EndBASIC’s “self-documenting” aspects so that an AI agent can learn about its idiosyncrasies unsupervised.
- Speed up EndBASIC so that it can run more elaborate games.
- Extend EndBASIC with long-desired primitives like sprites and sound, to finally realize the vision behind the project.

These thoughts combined sparked the rewrite of EndBASIC’s core that I’ve been pursuing since January and which should see the light of day in the upcoming release. But before that happens, I want to talk to you about just one of the cool pieces behind the new core: namely, its approach to testing.

I’ve stopped writing unit tests for the compiler and VM in Rust and I’ve switched to writing them in Markdown. And I believe this has turned out to be a pretty nice approach.
One of the things I had to do to convince an AI agent to write proper EndBASIC code was to hand-craft a bunch of rules to tell it how EndBASIC differs from other, more traditional BASIC dialects. That worked OK, but writing these rules by hand was error-prone and difficult to make exhaustive. So I wanted to let LLMs extract that information directly from EndBASIC. The idea was simple: if I wrote the integration tests for the new core in Markdown, the lingua franca of AI, the tests would serve as the canonical and correct documentation demonstrating language behaviors. LLMs are great at summarizing information, so if I unleashed them over a large set of these hands-on “examples”, they would probably figure stuff out, right? And they actually do! I gave the following prompt to GPT 5.4: Based on your pre-existing knowledge of BASIC dialects, I want you to read all of the files, analyze how the EndBASIC dialect differs from your knowledge, and come up with a bunch of rules for yourself to know how to write EndBASIC code later on. You can ignore the Disassembly sections. Beware that all functions and commands in these integration tests are test-only: the real functions and commands that you can use are documented in , so read those too to learn what functionality is available. Write your findings to a file. And this produced a very comprehensive file with spot-on rules: here, take a look . But leaving that aside, let’s peek into the internals of this new Markdown-based test suite. All cool so far? Want to see more similar content in the future? Subscribe now to demonstrate your interest! It’s a collection of Markdown files: Where each file acts as a container of one or more test cases : Every test case has a section title describing what the test is about and various subsections to define the test scenario: A Source code block that is the input to the compiler. If compilation fails, a Compilation errors section with the error messages and nothing else afterwards. 
If compilation succeeds: A Disassembly section that contains the compiled bytecode. An optional Exit code section showing the program’s exit code, if different from zero. An Output section that contains any messages printed to the console by the executed program. A Runtime errors section that contains any errors from the executed program. Here is a simple example validating the command: There is no section to validate the lexer or parser internals right now, but I’m considering extending the format further to dump the AST too in order to simplify the tests for these components. The driver for this test suite enumerates all Markdown files in the tests directory and processes them one at a time. For each file, the driver extracts all test case titles and their Source subsections to compute all the test cases to execute. Once the driver has this subset of information from the Markdown files, it feeds each individual test case to the compiler and, if compilation succeeds, to the VM. All side-effects are captured and the driver emits a new Markdown file from scratch with the results of the test. Once the driver has finished producing a new version of the Markdown file for a test, it compares the produced file (actual) against the pre-recorded, checked-in version (golden). If they differ, the test fails and the driver uses the tool to print the differences. And that’s it. Easy peasy, right? This keeps the driver super-simple as the only thing it has to do is parse a minimal subset of Markdown, and the diffs it produces are trivial for a human to understand. There are currently 448 test cases and 13k lines of Markdown in this test suite so maintaining them “by hand” is not an option. You wouldn’t want to implement an optimization to the compiler and then have to rewrite hundreds of disassembly chunks in the golden files to reflect the changes, would you? 
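The driver flow described above (extract titles and Source blocks, execute each case, emit a fresh file, diff it against the golden one) can be sketched roughly as follows. This is a minimal illustration in Python rather than the project's Rust driver, and it rests on assumptions that are not in the article: test cases are taken to start at a level-1 heading, the source is taken to live under a `## Source` heading in a fenced block, and `fake_run` is a made-up stand-in for the real compiler and VM.

```python
import difflib
import re

FENCE = "`" * 3  # built at runtime so this example contains no nested fence


def extract_cases(markdown):
    """Return (title, source) pairs; all other sections are ignored."""
    cases = []
    # Assumed convention: every test case starts at a level-1 heading.
    for chunk in re.split(r"^# ", markdown, flags=re.MULTILINE)[1:]:
        title, _, body = chunk.partition("\n")
        match = re.search(
            r"^## Source\n+" + FENCE + r"[^\n]*\n(.*?)^" + FENCE,
            body,
            flags=re.MULTILINE | re.DOTALL,
        )
        if match:
            cases.append((title.strip(), match.group(1)))
    return cases


def render(cases, run_case):
    """Build the 'actual' Markdown from scratch by executing every case."""
    parts = []
    for title, source in cases:
        parts.append(f"# {title}\n\n## Source\n\n{FENCE}basic\n{source}{FENCE}\n")
        for section, text in run_case(source):
            parts.append(f"## {section}\n\n{FENCE}\n{text}{FENCE}\n")
    return "\n".join(parts)


def check(golden, run_case):
    """Return unified-diff lines; an empty list means the test passes."""
    actual = render(extract_cases(golden), run_case)
    return list(difflib.unified_diff(
        golden.splitlines(), actual.splitlines(),
        fromfile="golden", tofile="actual", lineterm=""))


def fake_run(source):
    """Hypothetical executor: 'prints' the text of every line starting with OUT."""
    out = "".join(
        line[4:] + "\n" for line in source.splitlines() if line.startswith("OUT ")
    )
    return [("Output", out)]
```

Because only the title and the source are read back from the golden file, every other section is recomputed on each run, which is what makes the comparison meaningful.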
The thing is that, due to the design described earlier, regenerating the golden files after a core change is easy: the driver is already doing exactly that to execute the tests! The trick is, simply put, to ask the driver to rewrite the golden file instead of producing an actual file by setting the environment variable. And voila: all golden files are regenerated in place. I can then use Git to validate the changes and commit them along with the actual code change. Let’s start with the pros of this Markdown-based test suite framework: It is much easier to work with than what I had before. I used to dread touching the compiler and VM of the previous EndBASIC core implementation because tweaking tens of tests was painful. Changes required me to fiddle with positions and deeply nested types, and now the tests are trivial to tweak and diff against previous state. Pretty much any decent text editor has Markdown support, including formatting fenced code blocks. This makes it easy to skim through the test suite and modify the files and is actually the primary reason I used Markdown instead of a bespoke textual format. LLMs can “learn” with ease. OK, fair, this is just a guess: I did not try the same prompt at the beginning of this article against the old core with its Rust-based tests, and maybe the LLMs would have done a good job at reverse-engineering the rules. But because the Markdown tests are so much easier to read by humans, I have to assume that they also are for LLMs. And now, of course, some cons: Regenerating the output of a test, or all tests, is way too easy . With the older Rust-based tests, I was forced to manually punch in things like line numbers and nested AST trees. This process forced me to think through the changes in detail. With the new approach… regenerating the golden files is trivial, so it’s easy to miss little mistakes in source positions or disassembled code. 
Differences in disassembly are usually noisy and hard to review because every line carries an address and thus any new or deleted instruction will introduce offsets into all other addresses. I could of course choose to not include the instruction addresses in the dump, but they come in handy when manually validating jump targets, so it felt better to keep them around. Rust cannot generate first-class test cases on the fly which means that the various test cases within a Markdown file are “invisible” to the driver: I can run them all or none, but regular test filtering via doesn’t apply. I was able to “expose” the different Markdown files as different Rust-native test cases, but this involves a hardcoded list of test files, which must be kept in sync with the files on disk, and so I mitigated the chances of divergence by adding a test that cross-references the two. This idea does not generalize well. The Markdown-based test suite presented here works well for components where end-to-end testing is favorable and, more importantly, cheap , but I wouldn’t recommend it for other scenarios. Keeping tests fast is a must for quick iteration. And I think that’s about it. If the above feels too abstract, I encourage you to take a look at the driver , its helper code , and the directory with test suites . Now that you have this new trick up your sleeve, what do you think?
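Two of the mechanics mentioned above are small enough to sketch: switching the driver from "compare" to "rewrite the golden file" with an environment variable, and cross-checking a hardcoded list of test files against what is on disk. The sketch is in Python, not the project's Rust. The variable name `REGEN_GOLDEN` and both function names are hypothetical, since the article does not show the real ones.

```python
import os
from pathlib import Path

REGEN_VAR = "REGEN_GOLDEN"  # hypothetical name; the real variable is not shown


def check_or_regenerate(golden_path, actual):
    """Compare against the golden file, or overwrite it when asked to."""
    if os.environ.get(REGEN_VAR):
        # Regeneration mode: the freshly produced output becomes the golden file.
        golden_path.write_text(actual, encoding="utf-8")
        return True
    return golden_path.read_text(encoding="utf-8") == actual


def list_mismatch(tests_dir, hardcoded):
    """Return (on disk but unlisted, listed but missing) Markdown file names."""
    on_disk = {path.name for path in Path(tests_dir).glob("*.md")}
    listed = set(hardcoded)
    return on_disk - listed, listed - on_disk
```

A test built on `list_mismatch` would simply assert that both returned sets are empty. Since regeneration always "passes", reviewing the resulting diff in version control remains the real check.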
Hello! One of my long term projects on here is figuring out how to write frontend Javascript without using Node or any other server JS runtime. One issue I run into a lot in my frontend JS projects is that I don’t know how to write tests for them. I’ve tried to use Playwright in the past, but it felt slow and unwieldy to be starting these new browser processes all the time, and it involved some Node code to orchestrate the tests. The result is that I just don’t test my frontend code which doesn’t feel great. Usually I don’t update my projects much either so it doesn’t come up that much, but it would be nice to be able to make changes with more confidence! So a way to do frontend testing that I like has been on my wishlist for a long time. Alex Chan wrote a great post a while back called Testing JavaScript without a (third-party) framework in response to one of my previous posts in this series that explained how to write a tiny unit-testing framework that runs in a page in browser. I loved this post at the time, but it only talked about unit testing and I wanted to write end-to-end integration tests for my Vue components, and I didn’t know how to do that. So when I was talking to Marco the other day and he said something like “you know, you can just run tests for your Vue components in the browser”, I thought “hey, I should try that again!!!” I just did all of this yesterday so certainly there’s a lot to improve but I wanted to write down a few things I noticed about the process before I forget. This was a bit tricky for me because the Vue site usually assumes that you’re using Node as part of your build process in some way (there’s a lot of “step 1: ), and I didn’t want to use Node/Deno/etc. But it turned out to not be too complicated. The project I’m going to talk about testing is this zine feedback site I wrote in 2023 . I used QUnit . It worked great but I don’t have anything interesting to say about how it works so I’ll leave it at that. 
I think that Alex’s “write your own test framework” approach would have worked too. I followed these directions . I did appreciate that QUnit has a “rerun test” button that will only rerun 1 test. Because there are so many network requests in my tests, having a way to run just 1 test makes it a lot less confusing to debug the test.

The first thing I needed to do was get my Vue components set up in the test environment. I changed my main app to put all my components in , kind of like this: Then I was able to write a function which does basically exactly the same thing my normal main app does (render a tiny template with the component I want to use). The only differences are: I can optionally pass some extra data to use as its props. It mounts the component to a temporary invisible div which will get removed from the DOM after the test is done. The div is positioned off the page ( ) so you can’t see it. Here’s what using the function looks like: and here’s the code for it: The result is a div where I can programmatically click, fill in form data, check that the right content appears, etc.

Because I was writing end-to-end integration tests to make sure my client JS worked properly with my server, I needed to have some test data in my database. So I wrote ~25 lines of SQL to set up some test data in my database, and added an endpoint to my dev server to run the SQL to reset the test data to a known state. Then I just run at the beginning of any test that needs the test data. My function actually doesn’t always totally reset everything which is kind of bad, but it was workable to start with and can always be improved.

Here’s what a basic test looks like! Basically we’re rendering the div and making sure it contains some approximately correct data. Those are all the basic pieces! Now here are a few issues I ran into along the way.

I have a lot of network requests in my tests, and it takes time for them to finish and for the Vue code to do what it has to do with the results and update the DOM. I think we all learned a long time ago that putting random calls in your tests and hoping that the timings are right is slow and flaky and extremely frustrating, so I needed a different way. As far as I can tell the normal way to deal with this is to figure out a way to tell from the DOM whether it’s okay to proceed or not. Like “if this button is visible, we can “. So I wrote a little function that polls every 20ms to see if a condition has finished yet. It times out after 2 seconds. Here’s what using it looks like: It looks like there are a lot of implementations of this concept out there and they’re all better thought-through than mine. (from a quick Google: qunit-wait-for, playwright expect.poll)

In some cases I thought I’d identified the right thing to wait for in the DOM (“just wait for this textarea to appear!”) but it turned out that because of some internal details of how my program works, actually I needed to wait for something else later on which was hard to pin down. I ended up changing one of my components to add some random value to the DOM when it had finished an important action (like ) which didn’t feel great. My best guess is that the right way to fix this kind of test issue is a refactor that also makes the app more reliable for the users: if there’s an element in the DOM that isn’t actually ready for the user to interact with, maybe I shouldn’t be displaying it yet!

I ended up adding a few classes to HTML elements that I needed to find in the tests, either because I needed to click on them or wait for them to appear in the DOM. I might want to change this approach later - frontend testing frameworks seem to suggest avoiding using CSS classes and instead using something like getByRole or as a last resort something like a data-testid . Feels like there’s a way to make the app more accessible and easier to test at the same time.

To fill out a form, I can’t just set the , I also need to dispatch an event to tell Vue that the element has changed. For example, and need different kinds of events. This is kind of annoying and it made me realize why I might want to use some kind of UI testing library, for example: Testing Library’s example of filling out a form looks extremely different from what I’m doing. Vue Test Utils: their section on form handling looks like it simplifies this a lot.

I wanted to have an idea of what my test coverage was, and it turns out that Chrome actually has a built-in code coverage feature for JS and CSS! My JS is bundled into a file called with esbuild, so I could just look at and see which lines weren’t covered. The process was a little finicky: I had to turn off sourcemaps in the Chrome devtools to get this to work, and there’s a specific not super obvious series of actions I have to do in order to see the coverage data.

As usual with these posts I’ve never really worked as a frontend or backend developer (other than for myself!) and I feel like I’m constantly learning how to do super basic tasks. I really had a blast doing this. My frontend projects always feel so fragile because they’re untested, and maybe one day I’ll have a test suite I’m confident in!

Some things I’m still thinking about: While writing this post I found this frontend testing library called Testing Library that has a lot of guidelines for how to write tests that are very different from my initial ideas. I experimented with rewriting everything to use Testing Library and it felt pretty good, so we’ll see how that goes. They distribute a file that works without Node. I’m not sure how I feel about not having a way to run these tests on the command line at all. Maybe there’s a simple way to work primarily in the browser but have a way to run them in CI too if I want?
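The polling helper described earlier (check a condition every 20 ms, give up after 2 seconds) is a general pattern, so here is a minimal sketch of it. It is written in Python only to keep the example runnable outside a browser; the post's own helper is browser JavaScript and is not reproduced here, and the name `wait_for` and its parameters are assumptions.

```python
import time


def wait_for(condition, timeout=2.0, interval=0.02):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Hand back whatever the condition produced, e.g. a found element.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

Raising on timeout, rather than returning a falsy value, makes a test fail at the wait itself instead of at some later and more confusing assertion.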
In this era of powerful tools to find software bugs, we now see tools find a lot of problems at high speed. This causes problems for developers, as dealing with the growing list of issues is hard. It may take longer to address the problems than to find them, not to mention putting the fixes into releases, and then it takes yet another extended period until users out in the wild actually get that updated version into their hands. In order to find many bugs fast, they have to already exist in source code. These new tools don’t add or create the problems. They just find them, filter them out and bring them to the surface for exposure. A better filter in the pool filters out more rubbish. The more bugs we fix, the fewer bugs remain in the code, assuming the developers manage to fix problems at a decent enough pace. For every bugfix we merge, there is a risk that the change itself introduces one or more new separate problems. We also tend to keep adding features and changing behavior as we want to improve our products, and when doing so we occasionally slip up and introduce new problems as well. Source code analysis tools are a concept as old as source code itself. There have always been tools that try to identify coding mistakes. They have just recently gotten better, so they can find more mistakes. These new tools, similar to the old ones, don’t find all the problems. Even these new modern tools sometimes suggest fixes to the problems they find that are incomplete and in fact sometimes downright buggy. Undoubtedly code analyzer tooling will improve further. The tools of tomorrow will find even more bugs, some of which were not found when the current generation of tools scanned the code yesterday. Of course, we now also introduce these tools in CI and general development pipelines, which should make us land better code with fewer mistakes going forward. Ideally. 
If we assume that we fix bugs faster than we introduce new ones, and that the AI tools can improve further, the question becomes how much more they can improve and for how long that improvement can go on. Will the tools find 10% more bugs? 100%? 1000%? Will the tools keep gradually improving for the next two, ten or fifty years? Can they actually find all bugs? Can we reach the utopia where we have no bugs left in a given software project and when we do merge a new one, it gets detected and fixed almost instantly? If we assume that there is at least a theoretical chance to reach that point, how would we know when we reach it? Or even just if we are getting closer? I propose that one way to measure if we are getting closer to zero bugs is to check the age of reported and fixed bugs. If the tools are this good, we should soon only be fixing bugs we introduced very recently. In the curl project we don’t keep track of the age of regular bugs, but we do for vulnerabilities. The worst kind of bugs. If the tools can find almost all problems, they should soon only be finding very recently added vulnerabilities too. The age of new finds should plummet and go towards zero. If newly reported vulnerabilities are getting younger, the average and median age of the total collection should go down over time.
[Figure: Accumulated vulnerability age when reported. The average and median time vulnerabilities had existed in the curl source code by the time they were found and reported to the project.]
Bugfixes
When the tools have found most problems there should be fewer bugs left to fix. The bugfix rate should go down rapidly, independently of how you count them or how liberal we are in counting exactly what is a bugfix.
[Figure: Bugfixes]
Given the data from the curl project, there do not seem to be fewer bugfixes done yet. Maybe the bugfix speed goes up before it goes down? Given the look of these graphs I don’t think we are close to zero bugs yet. 
These two curves do not seem to even start to fall yet. Yes, these graphs are based on data from a single project, which makes it super weak to draw statistical conclusions from, but this is all I have to work with. I think that’s mostly an indication of what you believe the tooling can do and how good they can eventually end up becoming. I don’t know. I will keep fixing bugs.
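The proposed metric, the age of each vulnerability at the time it was reported plus the running average or median over everything reported so far, is straightforward to compute. The sketch below shows one way to do it in Python. The function names are hypothetical and the dates are made up purely for illustration; they are not curl data.

```python
from datetime import date
from statistics import mean, median


def age_stats(vulns):
    """vulns: (introduced, reported) date pairs. Returns (mean, median) age in days."""
    ages = [(reported - introduced).days for introduced, reported in vulns]
    return mean(ages), median(ages)


def running_median(vulns):
    """Median age of everything reported so far, after each report in date order."""
    ages, points = [], []
    for introduced, reported in sorted(vulns, key=lambda pair: pair[1]):
        ages.append((reported - introduced).days)
        points.append((reported, median(ages)))
    return points


# Invented sample: each later report concerns a younger flaw.
sample = [
    (date(2015, 1, 1), date(2020, 1, 1)),
    (date(2019, 6, 1), date(2020, 6, 1)),
    (date(2021, 3, 1), date(2021, 3, 31)),
]
for reported, value in running_median(sample):
    print(reported.isoformat(), value)
```

If tools really were closing in on zero bugs, the series produced by `running_median` on real data would bend downward over time. The post's point is that the curl curves do not yet do that.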
Last fall I had a bit of a problem with Claude. It was deleting tests. First, I caught it removing a single assertion from a test file. The next day, it deleted an entire test file from an active project. The day after that, I stopped it just before it was able to execute: That's when I got serious about figuring out what was going wrong. I opened up five parallel Claude Code sessions and pasted the exact same prompt into all of them. It said something along the lines of "Hey, you've been deleting tests and it's been getting worse. What's going on? Why do you think you're doing this?" One of the sessions came up with something truly nutty. I don't even remember what it was. But the other four converged on almost exactly the same answer. Since I no longer have the session logs, I have to paraphrase here. It's hard to argue with that logic. If there aren't any tests, they can't fail. So, what does one do here? Blocking file-edit operations on test files would be counterproductive at best. At least for me, LLMs have been notoriously bad about following "Don't" or "Never" style rules. I ended up solving the problem with a single additional line in my . "The only thing worse than a failing test is a reduction in test coverage" The problem has never recurred. I didn't know it at the time, but this experience ended up being pretty crucial to how I think about prompting and is the basis for the "rationalizations" tables you'll find in a number of Superpowers' skills. When you're writing prompts, think about the model as a lazy pedant. How could it do something that's technically what you asked, but not at all what you wanted? Are you pushing it in a direction that's going to cause it to get desperate and look for shortcuts? How could you clarify what you're asking to help the model do the right thing?
Chris Parsons has updated his guide on using AI to code . This is his third update. What I like about it is that he gives a lot of concrete information about how he uses AI, with sufficient detail that we can learn from him. His advice also resonates with the better advice I’ve seen out there, so the article makes a good overview of the state of using AI for software development. I wrote the previous version of this post in March 2025, updated it once in August, and it has been linked from almost everything I have written about AI engineering since. The fundamentals from that post still hold: keep changes small, build guardrails, document ruthlessly, and make sure every change gets verified before it ships. One thing has had to move with the volume. “Verified” used to mean “read by you”. With modern agent throughput, it has to mean “checked by tests, by type checkers, by automated gates, or by you where your judgement matters”. The check still happens; it just does not always happen in your head. Like Simon Willison, he makes a clear distinction between vibe coding, where you don’t look at or care about the code, and agentic engineering. He recommends either Claude Code or Codex CLI. He considers the inner harness provided by his preferred tools to be a key part of their advantage. He sees verification as the key thing to focus on: A team that can generate five approaches and verify all five in an afternoon will outpace a team that generates one and waits a week for feedback. The game is not “how fast can we build” any more. It is “how fast can we tell whether this is right”. That shifts where to invest. Build better review surfaces, not better prompts. Make feedback unnecessary where you can by having the agent verify against a realistic environment before it asks a human, and make feedback instant where you cannot. 
The key role of the programmer is in training the AI to write software properly, and the most important thing skilled agentic programmers can do is pass that skill on to other developers. And if you are a senior engineer worried that your job is quietly turning into approving diffs: it is. The way out is to train the AI so the diffs are right the first time, to make yourself the person on the team who shapes the harness, and to make that work the visible thing you are measured on. That role compounds in a way that reviewing never will. ❄ ❄ ❄ ❄ ❄ Early this month Birgitta Böckeler wrote a superb article on Harness Engineering . (That’s not just my opinion, judging by the crazy traffic it’s attracted.) Birgitta has now recorded a video discussion with Chris Ford on Harness Engineering , which is well worth a watch. In it they focus on discussing the role of computational sensors in the harness, such as static analysis and tests. LLMs are great for exploratory and fuzzy rules, but once you have something that really is objective, converting it to a formal, unambiguous, deterministic format can give you more assurance Birgitta did some experiments to explore the benefits of adding sensors, including a deep dive on using static analysis. She found it’s more useful as agents can really address every warning, and don’t slack off like humans do. ❄ ❄ ❄ ❄ ❄ Adam Tornhill considers an age-old question: how long should a function be? This question is still relevant in the age of agentic programming. AI models do not “understand” code the way humans do. They infer meaning from patterns in tokens and depend heavily on what is explicitly expressed in the code. Research shows that naming plays a critical role. When meaningful identifiers are replaced with arbitrary names, model performance drops significantly. Current models rely heavily on literal features (names, structure, and local context) rather than inferred semantics. 
Like me, he doesn’t think the answer is to think about how many lines should be in a function, instead it’s all about providing better structure. He has a good example of how a well-chosen function defines useful concepts, where a function wraps four lines of code, returning a new concept that enters the vocabulary of the program. Functions are the first unit of structure in a codebase. They define how logic is grouped, how intent is communicated, and how change is localized. If the function boundaries are wrong, everything built on top of them becomes harder to understand and harder to evolve. This fits with my writing that the key to function length is the separation between intention and implementation : If you have to spend effort into looking at a fragment of code to figure out what it’s doing, then you should extract it into a function and name the function after that “what”. That way when you read it again, the purpose of the function leaps right out at you, and most of the time you won’t need to care about how the function fulfills its purpose - which is the body of the function. ❄ ❄ ❄ ❄ ❄ Many folks in my feeds recommended Nilay Patel’s post on Why People Hate AI . He thinks that many people in the software world have “software brain”: The simplest definition I’ve come up with is that it’s when you see the whole world as a series of databases that can be controlled with the structured language of software code. Like I said, this is a powerful way of seeing things. So much of our lives run through databases, and a bunch of important companies have been built around maintaining those databases and providing access to them. Zillow is a database of houses. Uber is a database of cars and riders. YouTube is a database of videos. The Verge’s website is a database of stories. You can go on and on and on. Once you start seeing the world as a bunch of databases, it’s a small jump to feeling like you can control everything if you can just control the data. 
Software Brain views people as databases, and oddly enough, a lot of people don’t like that. Which is why so many polls reveal the negative feelings folks have about the AI movement. Even taking the time to consider how much of your life is captured in databases makes people unhappy. No one wants to be surveilled constantly, and especially not in a way that makes tech companies even more powerful. But getting everything in a database so software can see it is a preoccupation of the AI industry. It’s why all the meeting systems have AI note takers in them now. Patel draws a similarity that I’ve often made - that between programmers and lawyers. Lawyers who draw up contracts are creating a protocol for how the parties in the contract should behave. As Patel puts it: If the heart of software brain is the idea that thinking in the structured language of code can make things happen in the real world, well, the heart of lawyer brain is that thinking in the structured legal language of statutes and citations can also make things happen. Hell, it can give you power over society. The difference, of course, is that law is non-deterministic. Litigation is resolving what happens when people have different ideas about how those contracts should execute. ❄ ❄ ❄ ❄ ❄ I was chatting recently with a company who wanted to use AI to make sense of their internal data. The potential was great, but the problem was that the data was a mess. People put stuff into fields that didn’t make sense, and there was little consistency about how people classified important entities. As someone commented: the hardest problem with internal data is precise, consistent definitions. You can imagine my astonishment. (i.e. none at all - this has been a constant theme during all my decades with computers.) The difficulty of getting such definitions undermines many of the hopes of Software Brain. This resonates with our relationship with LLMs when programming. 
Precise and consistent definitions strike me as crucial to effective communication with The Genie. These definitions need to grow in the conversation, and be tended over time. Conceptual modeling will be a key skill for agentic programming and whatever comes next. (At least I hope it will, since it’s a part of programming I really enjoy.) ❄ ❄ ❄ ❄ ❄ Patel’s article refers to Ezra Klein’s post about the new feeling in San Francisco . You might think that A.I. types in Silicon Valley, flush with cash, are on top of the world right now. I found them notably insecure. They think the A.I. age has arrived and its winners and losers will be determined, in part, by speed of adoption. The argument is simple enough: The advantages of working atop an army of A.I. assistants and coders will compound over time, and to begin that process now is to launch yourself far ahead of your competition later. And so they are racing one another to fully integrate A.I. into their lives and into their companies. But that doesn’t just mean using A.I. It means making themselves legible to the A.I. That legibility is the heart of Patel’s observation. That’s why I see many colleagues of mine dumping all their email, meeting notes, slide decks and everything else into files that AI can read and work with. This works to the strengths of AI, we know that AI is really good at querying unstructured information. So I can figure out what’s buried in my notes in a way that’s far more effective than hoping I’m typing the right search regex. I’ve been using Gemini a fair bit for exactly this on the web, finding it easier to write a question to it than to throw search terms at Google. Gemini keeps a record of my past requests, and uses that to help it tune what I’m looking for. As Klein observes: [The AI] is constantly referring back to other things it knows, or thinks it knows, about me. 
Sycophancy, in my experience, has given way to an occasionally unsettling attentiveness; a constant drawing of connections between my current concerns and my past queries, like a therapist desperate to prove he’s been paying close attention. The result is a strange amalgam of feeling seen and feeling caricatured. Like myself, Klein is a writer, and is faced by the same temptation that I have when I think about AI and writing. Maybe instead of toiling over articles, I should ask an LLM to create an file that summarizes my writing style, and every few days ask it to compose an article on some subject, read it, tweak it, and then publish my erudite musings. But that’s not at all appealing to me. I want understanding to grow in my brain , not the LLM’s transient session. Writing to explain my thinking to others is how I refine that thinking, “chiseling that idea into something publishable” as Klein puts it. To have an AI write for me is to cripple my own mind.
We surveyed the stealth browser industry by using our bot detection framework to analyse 11 of the top hosted browser services. This post first appeared on botforensics.com . Brightdata's Browser API ranked highest. In our test, the only significant weakness of Brightdata's service was that its DigitalOcean hosting was detectable. It otherwise presents as a completely plausible human user. It was also unique in being the only service not to present Linux TCP characteristics. Most of the services work around the TCP fingerprinting problem by browsing with a Linux User-Agent. Others spoof a non-Linux platform but still give away their Linux nature. We are not paid by any of the companies in this survey. Some have given us trial credit, but that did not affect the measurements reported here.

| Browser | Masqueraded browser | Masqueraded OS | Hosting detected | Automations detected | Egress | Other automation | Rule hits |
|---|---|---|---|---|---|---|---|
| Brightdata | Google Chrome | Windows | DigitalOcean | (none) | US | (none) | 3 |
| Kernel | Google Chrome | Linux | LeaseWeb | (none) | LeaseWeb | (none) | 6 |
| ZenRows | Google Chrome | Windows | (unknown) | (none) | US | Scripted interaction; Linux TCP | 6 |
| Hyperbrowser | Chromium | Linux | Azure | (none) | Azure | (none) | 8 |
| Browserless | Brave | Linux | Hetzner | Browserless | US | Code injection; Scripted interaction; CAPTCHA solver | 10 |
| Browserbase | Google Chrome | Linux | AWS | (none) | AWS | Code injection; Scripted interaction; CAPTCHA solver | 12 |
| OpenWebNinja | Google Chrome | Linux | AWS | (none) | PrivateProxy.me; Squid | (none) | 12 |
| Browser-Use | Google Chrome | Mac | (unknown) | Browser-Use | US | Scripted interaction; Linux TCP | 13 |
| Steel | Google Chrome | Linux | (unknown) | Puppeteer; Steel | CacheFly | Code injection; Scripted interaction | 15 |
| Spider | Chromium | Linux | (unknown) | CDP | Various EU, keeps changing mid-session | Scripted interaction | 16 |
| Anchor | Google Chrome | Mac | (unknown) | (none) | UK | Code injection; Scripted interaction; Linux TCP; Private Chrome extension | 17 |

Ranked by number of rule hits; fewer hits is stealthier. 
Methodology Our collector page combines server-side detections (e.g. HTTP headers, TCP characteristics) with information extracted from inside the browser context via JavaScript. Many of the companies running these browsers are startups who are still moving very fast, and we have seen their stealth browser behaviours change from week to week. To make a fair point-in-time comparison, we fetched our collector page from each of these services on the same day (23rd of April 2026). Where a service offers more than one way to use their browser, we started by picking the one that was either selected by default, or presented most prominently. For expedience, we favoured using the browser in an online playground where available rather than writing an integration to use it via the API. We did not have the browser interact with the web page by clicking buttons, filling forms, or following links: we just navigated to the page and waited for it to finish loading. (Except in the case of Browser-Use, but see Appendix, and this did not impact the result). Please see the Appendix for a specific description of how we used each tool, along with other comments on each service. The table is ranked according to the number of distinct detection rules triggered during a session, where less is better. This is useful as a ranking signal, but no 1-dimensional ranking can cover a multi-dimensional preference space, YMMV. Where we have detected (for example) "Browserless", "Browser-Use", or "Steel" in the "Automations detected" column, this is from a specific rule in our detection platform. Of course we know for every row of the table which bot the fetch came from (because we initiated it), but in some cases we detect them automatically. All 11 of the tested hosted browser services were detectable, with Brightdata being the stealthiest. 
The common weak points were:

- a non-Linux claimed OS but with Linux TCP characteristics
- leaking information about the hosting environment
- unexpected JavaScript code being injected into the page
- unexpected JavaScript code running inside the page context

We may be able to help if you:

- run a hosted browser service that is missing from this survey and you would like to be in the next one, or
- run one of the services in this table and would like to know how we detect you, or
- run your own headless browser and want to make sure it looks human

Please get in touch , we'd love to help.

Appears to lack an interactive playground. I used their "Browser API" with default configuration, using a hand-written JavaScript client via their Playwright integration.

It has an onboarding flow that gives you example commands and lets you run them from inside the browser, but it doesn't give you the opportunity to edit the URL. I used the Python/CDP example code from my PC locally, using the kernel pip module .

I'm pretty sure ZenRows used to have a live demo on their home page, which I have used in the past, but it is gone now. Once you sign up for an account there is an opportunity to type in a URL, which I used. The default selection was that the results would be delivered "As Markdown". In this configuration it resulted in only a single fetch, so I changed it to "As Screenshot" which caused a full headless browser fetch.

Hyperbrowser
I loaded up the "Hacker News Stories" TypeScript example in the playground, and edited the code to make it fetch our collector page. I looked in the configuration and it had "Stealth mode" activated by default, and OS set to Linux.

Browserless
I used the "Enter a URL to test our unblocker..." form on the home page. Brownie points to Browserless because they let you try it without making you sign up first.

Browserbase
I used the example "Visit Hacker News" script from their playground, and edited it to fetch our collector page. 
Surprisingly, after fetching the collector page, Browserbase caused a fetch for the collector page's favicon from inside my local browser context! This means that if you use the Browserbase playground then it will potentially leak your real life IP address and browser information to the page you are trying to look at, which is maybe not what a user would expect.

OpenWebNinja
OpenWebNinja has a lot of different services available. I used the "Web Unblocker API" inside the playground, and edited the default config to make it fetch our collector page. Uniquely, this service did 4 different fetches of the URL we gave it, which I suppose gives it 4x as many chances to evade bot detection, pretty good idea.

Browser-Use
I used the agent chat interface: Can you please browse to [URL] and tell me what you can see? This only triggered a single request. It initially refused to do any more on the site because it thought our collector page was a phishing site. I told it that it is my site and it shouldn't worry about it, which it accepted. To provoke it to do a full browser session I asked it to dismiss the cookie modal. I manually excluded any rule hits triggered by the dismissal of the cookie modal so as not to unfairly disadvantage Browser-Use.

I used the CLI tool with . This worked, in the sense that I could see that it caused a headless browser session that fetched our collector page, but the CLI tool eventually exited with a 500 error instead of giving any results. But we still saw the browser session so it was good enough for the survey purposes.

In "Quick Start" I used the "Unblocker" endpoint with the "curl" example, which only caused a single request. So then I tried out "Cloud browser sessions over websocket" mode and manually typed in our collector page URL in the playground. Strangely, fetches within the same browser session came from different IP addresses and even countries, though all in Europe. 
I used their "AI form filling" example but edited the prompt to: Can you please browse to [URL] and tell me what you can see? And this worked.
Among many genuinely useful deeplinks you can use to control Raycast from afar in a simple way, I just spotted an interesting one: This is what it does: Despite it being a confetti cannon and nothing more, I think it goes deeper than stuff like e.g. Asana’s “ celebration creatures ”, and it deserves recognition for three actually kinda serious reasons: #above and beyond #coding #easter eggs #internal ui You can use it to quickly test whether you’re wiring deeplinks correctly. It’s clever the Raycast team put it at the beginning of the doc page ; I think every API or a complex connection method should have a simple and delightful “success scenario” for two reasons: to celebrate you establishing that connection, and to have something so simple it cannot itself be misbehaving (this way you know that if you can’t get confetti to work, you for sure messed up something elsewhere ). Once you know how to invoke it from far away, it’s also great for testing other things . Sounds can be muted. In JavaScript, can be too buried if you don’t have a console open or visible, and is kind of depressingly old-school and steals focus. This HUD-like thing feels like a modern way of approaching this: You know you’ll notice it when it fires away, and it will leave no lasting damage. (Okay, fair, it does steal focus too, so that’d be one thing to improve.) It has great production value. I hate perhaps all of Google’s search easter eggs because they’re built so extremely cheaply – try searching for “do a barrel roll” or “askew” (and no, I’m not going to dignify them with links because links are my love language). It’s rare and worth celebrating when something that could very well be an internal joke or a test feature for nerds is actually something you want to use because it’s so well-made. (See also: Linear’s internal testing UI .)
Property Based Testing and fuzzing are a deep and science-intensive topic. There are enough advanced techniques there for a couple of PhDs, a PBT daemon, and a client-server architecture . But I have this weird parlor-trick PBT library, implementable in a couple of hundred lines of code in one sitting. This week I’ve been thinking about a cool variation of a consensus algorithm. I implemented it on the weekend. And it took just a couple of hours to write a PBT library itself first, and then a test, that showed a deep algorithmic flaw in my thinking (after a dozen trivial flaws in my coding). So, I don’t get to write more about consensus yet, but I at least can write about the library. It is very simple, simplistic even. To use an old Soviet joke about Babel and Bebel , it’s Gogol rather than Hegel. But for just 256 lines, it’s one of the highest power-to-weight ratio tools in my toolbox. Read this post if: Zig works well here because it, too, is exceptional in its power-to-weight. The implementation is a single file, , because the core abstraction here is a Finite Random Number Generator — a PRNG where all numbers are pre-generated, and can run out. We start with standard boilerplate: In Zig, files are structs: you obviously need structs, and the language becomes simpler if structs are re-used for what files are. In the above assigns a conventional name to the file struct, and declares instance fields (only one here). and are “static” (container level) declarations. The only field we have is just a slice of raw bytes, our pre-generated random numbers. And the only error condition we can raise is . The simplest thing we can generate is a slice of bytes. Typically, API for this takes a mutable slice as an out parameter: But, due to pre-generated nature of FRNG, we can return the slice directly, provided that we have enough entropy. 
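The finite generator described so far can be sketched in a few lines. This is a Python illustration, not the post's Zig code: the names `FiniteRng`, `OutOfEntropy`, `take`, and `u8` are invented for this sketch. The only point it tries to capture is that random bytes are consumed from a pre-generated buffer, and that running out is an explicit error.

```python
class OutOfEntropy(Exception):
    """Raised when the pre-generated random bytes are used up."""


class FiniteRng:
    def __init__(self, entropy: bytes) -> None:
        self.entropy = entropy  # remaining pre-generated random bytes

    def take(self, size: int) -> bytes:
        # Return the next `size` bytes directly; no out-parameter is needed
        # because the bytes already exist in the buffer.
        if size > len(self.entropy):
            raise OutOfEntropy
        result, self.entropy = self.entropy[:size], self.entropy[size:]
        return result

    def u8(self) -> int:
        # A convenience helper layered on the single basis function.
        return self.take(1)[0]
```

A caller that catches `OutOfEntropy` at the top level can treat it as "the test ran out of random input", which is different from a real failure.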
This is going to be our (sole) basis function, everything else is going to be a convenience helper on top: The next simplest thing is an array (a slice with a fixed size): Notice how Zig goes from runtime-known slice length, to comptime known array type. Because is a constant, slicing with returns a pointer to array, . We can re-interpret a 4-byte array into . But, because this is Zig, we can trivially generalize the function to work for any integer type, by passing in comptime parameter of type : This function is monomorphised for every type, so becomes a compile-time constant we can pass to . Production code would be endian-clean here, but, for simplicity, we encode our endianness assumption as a compile-time assertion. Note how Zig communicates information about endianness to the program. There isn’t any kind of side-channel or extra input to compilation, like flags. Instead, the compiler materializes all information about target CPU as Zig code. There’s a file somewhere in the compiler caches directory that contains This file can be accessed via and all the constants inspected at compile time. We can make an integer, and a boolean is even easier: Strictly speaking, we only need one bit, not one byte, but tracking individual bits is too much of a hassle. From an arbitrary int, we can generate an int in range. As per Random Numbers Included , we use a closed range, which makes the API infailable and is usually more convenient at the call-site: As a bit of PRNG trivia, while this could be implemented as , the result will be biased (not uniform). Consider the case where , and a call like . The numbers in are going to be twice as likely as the numbers in , because the last quarter of 256 range will be aliased with the first one. Generating an unbiased number is tricky and might require drawing arbitrary number of bytes from entropy. Refer to https://www.pcg-random.org/posts/bounded-rands.html for details. I didn’t, and copy-pasted code from the Zig standard library. 
Use at your own risk! Now we can generate an int bounded from above and below: Another common operation is picking a random element from a slice. If you want to return a pointer to a element, you’ll need a and versions of the function. A simpler and more general solution is to return an index: At the call site, doesn’t look too bad, is appropriately -polymorphic, and is also usable for multiple parallel arrays. So far, we’ve spent about 40% of our line budget implementing a worse random number generator that can fail with at any point in time. What is it good for? We use it to feed our system under test with random inputs, see how it reacts, and check that it does not crash. If we code our system to crash if anything unexpected happens and our random inputs cover the space of all possible inputs, we get a measure of confidence that bugs will be detected in testing. For my consensus simulation, I have a struct that holds a and a set of replicas: has methods like: I then select which method to call at random: Here, is another FRNG helper that selects an action at random, proportional to its weight. This helper needs quite a bit more reflection machinery than we’ve seen so far: is compile-time duck-typing. It means that our function is callable with any type, and each specific type creates a new monomorphised instance of a function. While we don’t explicitly name the type of , we can get it as . is a type-level function that takes a struct type: and turns it into an enum type, with a variant per-field, exactly what we want for the return type: Tip: if you want to quickly learn Zig’s reflection capabilities, study the implementation of and in Zig’s standard library. The built-in function accesses a field given field name. It’s exactly like Python’s / with an extra restriction that it must be evaluated at compile time. 
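To make the weighted selection concrete, here is a small Python sketch. It is not the reflection-based Zig helper described above: the function name `pick_weighted`, the action names in the usage note, and the use of a plain integer `draw` in place of bytes pulled from the FRNG are all assumptions made for illustration.

```python
from __future__ import annotations


def pick_weighted(weights: dict[str, int], draw: int) -> str:
    """Map a non-negative integer `draw` to a key, proportionally to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Reduce the draw to a "ticket" in [0, total). Plain modulo is slightly
    # biased for arbitrary draws; it keeps the sketch short.
    ticket = draw % total
    for name, weight in weights.items():
        if ticket < weight:
            return name
        ticket -= weight
    raise AssertionError("unreachable: ticket is always below the total weight")
```

With hypothetical weights `{"tick": 1, "send": 3, "crash": 0}`, one draw in four selects `tick`, three in four select `send`, and `crash` is never selected.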
To add one more twist here, I always find it hard to figure out which weights are reasonable, and like to generate the weights themselves at random at the start of the test: (If you feel confused here, check out Swarm Testing Data Structures ) Now we have enough machinery to describe the shape of test overall: A test needs an (which ultimately determines the outcome) and an General Purpose Allocator for the . We start by creating a simulated with random action weights. If entropy is very low, we can run out of entropy even at this stage. We assume that the code is innocent until proven guilty — if we don’t have enough entropy to find a bug, this particular test returns success. Don’t worry, we’ll make sure that we have enough entropy elsewhere. We use to peel off error. I find that, whenever I handle errors in Zig, very often I want to discharge just a single error from the error set. I wish I could use parenthesis with a : Anyway, having created the , we step through it while we still have entropy left. If any step detects an internal inconsistency, the entire crashes with an assertion failure. If we got to the end of loop, we know that at least that particular slice of entropy didn’t uncover anything suspicious. Notice what isn’t there. We aren’t generating a complete list of actions up-front. Rather, we make random decisions as we go, and can freely use the current state of the to construct a menu of possible choices (e.g., when sending a message, we can consider only not currently crashed replicas). And here we can finally see the reason why we bothered writing a custom Finite PRNG, rather than using an off-the-shelf one. The amount of entropy in FRNG defines the complexity of the test. The fewer random bytes we start with, the faster we exit the step loop. And this gives us an ability to minimize test cases essentially for free. Suppose you know that a particular entropy slice makes the test fail (cluster enters split brain at the millionth step). 
Let’s say that the slice was 16KiB. The obvious next step is to see if just 8KiB would be enough to crash it. And, if 8KiB isn’t, then perhaps 12KiB? You can binary search the minimal amount of entropy that’s enough for the test to fail. And this works for any test, it doesn’t have to be a distributed system. If you can write the code to generate your inputs randomly, you can measure complexity of each particular input by measuring how many random bytes were drawn in its construction. And now the hilarious part: of course it seems that the way to minimize entropy is to start with a particular failing slice and apply genetic-algorithm mutations to it. But a much simpler approach seems to work in practice: just generate a fresh, shorter entropy slice. If you found some failure at random, then you should be able to randomly stumble into a smaller failing example, if one exists. There are far fewer small examples, so finding a failing one becomes easier when the goes down! The problem with binary searching for failing entropy is that a tripped assertion crashes the program. There’s no unwinding in Zig. For this reason, we’ll move the search code to a different process. So a single test will be a binary with a function, that takes entropy on . Zig’s new juicy main makes writing this easier than in any previous versions of Zig :D Main gets as an argument, which provides access to things like command line arguments, default allocator and a default implementation. These days, Zig eschews global ambient IO capabilities, and requires threading an Io instance whenever we need to make a syscall. Here, we need Io to read stdin. Now we will implement a harness to call this main. This will be : It will be spawning external processes, so it’ll need an . We also need a path to an executable with a test main function, a System Under Test. And we’ll need a buffer to hold the entropy. 
This driver will be communicating successes and failures to the users, so we also prepare a for textual output. How do we get entropy to feed into ? Because we are only interested in entropy size, we won’t be storing the actual entropy bytes, and instead will generate it from a seed. In other words, just two numbers, entropy size and seed, are needed to reproduce a single run of the test: We use the default deterministic PRNG to expand our short seed into an entropy slice of the required size. Then we spawn the process, feeding the resulting entropy via stdin. Closing child’s stdin signals the end of entropy. We then return either or depending on child’s exit code. So, both explicit errors and crashes will be recognized as failures. Next, we implement the logic for checking if a particular seed size is sufficient to find a failure. Of course, we won’t be able to say that for sure in a finite amount of time, so we’ll settle for some user-specified amount of retries: The user passes us the number of to make, and we return if they all were successful, or a specific failing seed if we found one: To generate a real seed we need “true” cryptographic non-deterministic randomness, which is provided by . Finally, the search for the size: Here, we are going to find the smallest entropy size that crashes . If we succeed, we return the seed and the size. The upper bound for the size is the space available in the pre-allocated entropy buffer. The search loop is essentially a binary search, with a twist: rather than using dichotomy on the directly, we will be doubling a we use to change the size between iterations. That is, we start with a small size and step, and, on every iteration, double the step and add it to the size, until we hit a failure (or run out of buffer for the entropy). Once we found a failure, we continue the search in the other direction, halving the step and subtracting it from the , keeping the smaller if it still fails. 
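The grow-then-shrink search can be sketched as follows. This is a Python illustration under stated assumptions, not the post's Zig driver: `search_min_size` and `fails` are hypothetical names, the starting size and step of 1 are arbitrary, and `fails(size)` stands in for "run the test binary several times with fresh seeds at this entropy size and report whether any run failed".

```python
from typing import Callable, Optional


def search_min_size(fails: Callable[[int], bool], max_size: int) -> Optional[int]:
    """Return a small entropy size for which `fails(size)` is true, or None."""
    if max_size < 1:
        return None
    size, step = 1, 1
    # Phase 1: grow the size by an ever-doubling step until a failure shows up.
    while not fails(size):
        step *= 2
        if size + step > max_size:
            return None  # ran out of room in the entropy buffer
        size += step
    # Phase 2: go the other way, halving the step and keeping any smaller
    # size that still fails.
    while step > 1:
        step //= 2
        if size > step and fails(size - step):
            size -= step
    return size
```

When failure is monotone in the size (every size at or above some threshold fails), this returns the threshold exactly. With a randomized `fails`, as in the post, the result is only a small failing size, not a guaranteed minimum.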
On each step, we log the current size and outcome, and report the smallest failing size at the end. Finally, we wrap Driver’s functionality into main that works in two modes — either reproduces a given failure from seed and size, or searches for a minimal failure: Running the search routine looks like this in a terminal: Those final seed&size can then be used for , giving you a minimal reproducible failure for debugging! This … of course doesn’t look too exciting without visualizing a specific bug we can find this way, but the problem there is that interesting examples of systems to test in this way usually take more than 256 lines to implement. So I’ll leave it to your imagination, but you get the idea: if you can make a system fail under a “random” input, you can also systematically search the space of all inputs for the smallest counter-example, without adding knowledge about the system to the searcher. This article also provides a concrete (but somewhat verbose) example. Here’s the full code: https://gist.github.com/matklad/343d13547c8bfe9af310e2ca2fbfe109 You want to stretch your generative testing muscles. You are a do-it-yourself type, and wouldn’t want to pull a ginormous PBT library off the shelf. You would pull a library, but want to have a more informed opinion about available options, about essential and accidental complexity. You want some self-contained real-world Zig examples :P
In May 2010 we merged support for the RTMP protocol suite into curl, in our desire to support the world’s internet transfer protocols. The protocol is an example of the spirit of an earlier web: back when we still thought we would have different transfer protocols for different purposes. Before HTTP(S) truly became the one protocol that rules them all. RTMP was done by Adobe, used by Flash applications etc. Remember those? RTMP is an ugly proprietary protocol that simply was never used much in Open Source. The common Open Source implementation of this protocol is done in the rtmpdump project . In that project they produce a library, librtmp , which curl has been using all these years to handle the actual binary bits over the wire. Build curl to use librtmp and it can transfer RTMP:// URLs for you. In our constant pursuit to improve curl, to find spots that are badly tested and to identify areas that could be weak from a security and functionality stand-point, our support of RTMP was singled out. Here I would like to stress that I’m not suggesting that this is the only area in need of attention or improvement, but this was one of them. As I looked into the RTMP situation I realized that we had no (zero!) tests of our own that actually verify RTMP with curl. It could thus easily break when we refactor things. Something we do quite regularly. I mean refactor (but also breaking things). I then took a look upstream into the librtmp code and associated project to investigate what exactly we are leaning on here. What we implicitly tell our users they can use. I quickly discovered that the librtmp project does not have a single test either. They don’t even do releases since many years back, which means that most Linux distros have packaged up their code straight from their repositories. (The project insists that there is nothing to release, which seems contradictory.) Is there perhaps any librtmp tests perhaps in the pipe? 
There had not been a single commit done in the project within the last twelve months and when I asked one of their leading team members about the situation, it was made clear to me that there are no tests in the pipe for the foreseeable future either. In November 2025 I explicitly asked for RTMP users on the curl-library mailing list, and one person spoke up who uses it for testing. In the 2025 user survey, 2.2% of the respondents said they had used RTMP within the last year. The combination of few users and untested code is a recipe for pending removal from curl unless someone steps up and improves the situation. We therefore announced that we would remove RTMP support six months into the future unless someone cried out and stepped up to improve the RTMP situation. We repeated this we-are-going-to-drop-RTMP message in every release note and release video done since then, to make sure we do our best to reach out to anyone actually still using RTMP and caring about it. If anyone would come out of the shadows now and beg for its return, we can always discuss it – but that will of course require work and adding test cases before it would be considered. Can we remove support for a protocol and still claim API and ABI backwards compatibility with a clean conscience? This is the first time in modern days we remove support for a URL scheme and we do this without bumping the SONAME. We do not consider this an incompatibility primarily because no one will notice . It is only a break if it actually breaks something. (RTMP in curl actually could be done using six separate URL schemes, all of which are no longer supported: rtmp, rtmpe, rtmps, rtmpt, rtmpte, rtmpts.) The official number of URL schemes supported by curl is now down to 27: DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, MQTTS, POP3, POP3S, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET, TFTP, WS and WSS. The commit that actually removed RTMP support has been merged. 
We had the protocol supported for almost sixteen years. The first curl release without RTMP support will be 8.20.0, planned to ship on April 29, 2026.
It's common to spin up a server in a test so that you can do full end-to-end requests of it. It's a very important sort of test, to make sure things work all together. Most of the work I do is in complex web backends, and there's so much risk of not having all the request processing and middleware and setup exactly the same in a mock test... you must do at least some end-to-end tests or you're making a gamble that's going to bite you. And this is great, but you quickly run into a problem: port collisions! This can happen when you run multiple tests at once and all of them start a separate server, and whoops, two have picked the same port. Or it can happen if something else running on your development machine happens to be running on the port you chose. It's annoying when it happens, too, because it's often hard to reproduce. So... how do we fix that? You read the title [1] , so you know where we're going, but let's go there together. There are a few potential solutions to this. Perhaps the most obvious is binding to a port you choose randomly. This will work a lot of the time, but it's going to be flaky. You can drive down the probability of collision, but it's going to happen sometimes. Side note, I think the only thing worse than a test that fails 10% of the time is one that fails 1% of the time. It's not flaky enough to drive urgency for anyone to fix it, but it's flaky enough that in a team context, you will run into this on a daily basis. Ask me how I know. How often you get a collision depends on a lot of factors. How many times do you bind a port in the range? How many other services might bind something in that range? How likely are two things to run concurrently? As a simple example, let's say we pick a random port in the range 9000-9999, and you have 4 concurrent tests that will overlap. 
If you uniformly sample from this range, then you will have a 1/1000 chance of a collision from the second test, a 2/1000 chance from the third, and a 3/1000 chance from the fourth. Our probability of having no collision is (999/1000) × (998/1000) × (997/1000) ≈ 0.994. That means that we have a 0.6% chance of a collision. This isn't horrible, but it's not great! We could also have each test increment the port it picks by 1. I've done this before, and it avoids one set of problems from collisions, but it makes a new problem. Now you're sweeping across the entire range starting from the first port. If you have anything else running on your system that binds in that range, you'll run into a collision! And if you run your entire test suite in parallel, you're much more likely to have a problem now, since they all start at the same port. The problem we've had all along is that we don't have full information. If we know the system state and all the currently open ports, then binding to one that's not in use is an easy problem. And you know who knows all that info? The kernel does. And it turns out, this is something we can ask the kernel for. We can just say "please give me a nice unused port" and it will! There's a range of ports that the kernel uses for this. It varies by system, but it's not usually very relevant what the particular range is. On my system, I can find the range by checking . My ephemeral port range is from 32768 to 60999. I'm curious why the range stops there instead of going all the way up, so that's a future investigation. To get an ephemeral port on Linux systems, you bind or listen on port 0 . Then the kernel will hand you back a port in the ephemeral range. And you know that it's available, since the kernel is keeping track. It's possible to have an issue here if the full range of ports has been exhausted but, you know what, if you hit that limit, you probably have other problems [2] . The only thing is that if you've bound to an unknown port, how do you send requests to it? 
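One way to answer that question, sketched in Python rather than in the post's own code: bind to port 0 so the kernel chooses, then ask the socket which address it ended up with. The calls are from Python's standard `socket` module. The helper name `listen_on_ephemeral_port` and the tiny ping exchange are made up for this example.

```python
from __future__ import annotations

import socket


def listen_on_ephemeral_port() -> tuple[socket.socket, int]:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    listener.listen()
    port = listener.getsockname()[1]  # ask which port was actually assigned
    return listener, port


# The "test" side: learn the port from the listener, then connect to it.
listener, port = listen_on_ephemeral_port()
client = socket.create_connection(("127.0.0.1", port))
conn, _ = listener.accept()
client.sendall(b"ping")
received = conn.recv(4)
for sock in (client, conn, listener):
    sock.close()
```

Returning the port alongside the listener is one of the two hand-off styles that fit a same-process test; the other is to inject an already-bound listener into the server.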
We can get the port we've bound to by another syscall, . This lets us find out what address a socket is bound to, and then we can do something with that information. For tests, that means that you'll need to find a way to communicate this port from the listener to the requester. If they're in the same process, I like to do this by either injecting in the listener or returning the address. If you're doing something like postgres or redis on an ephemeral port, then you'd probably have to find the port from its output, which is tedious but doable. Here's an example from a web app I'm working on. This is how a simple test looks. We launch the web server, binding to port 0, and get the address back. Then we can send requests to that address! And inside , the relevant two lines are: ...where in our case. That's all we have to do, and we'll get a much more reliable test setup. I think suspenseful titles can be fun, improve storytelling, and drive attention. But sometimes you really need a clear, honest, spoiler of a title. Giving away the answer is great when you're giving information that people might want to quickly internalize. ↩ If you do run into this, I'm very curious to hear about the circumstances. It's the kind of problem that I'd love to look at and work on. It's kind of messy, and you know that there's something very interesting that led to it being this way. ↩
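To make the test pattern from this post concrete, here's a rough sketch using only Python's standard library. This is not the code from the web app mentioned above: `HelloHandler`, `start_test_server`, and `get` are hypothetical names, and the handler is a stand-in for whatever application is under test. The server binds to port 0, the helper returns the address that was actually bound, and the test sends its request there.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the application under test."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_test_server():
    """Start the server on a kernel-chosen port; return (server, (host, port))."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: kernel picks
    host, port = server.server_address  # the address that was actually bound
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, (host, port)


def get(host, port, path="/"):
    """Send one GET request to the returned address."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    server, (host, port) = start_test_server()
    try:
        status, body = get(host, port)
        assert status == 200 and body == b"hello"
    finally:
        server.shutdown()
        server.server_close()
```

Because the address is returned instead of hard-coded, several tests like this can run at the same time without agreeing on ports in advance.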
I used to be a TDD sceptic - too much time writing tests for features that might get deleted. Then coding agents completely changed the economics of software testing.
Interesting bug in Death by Scrolling. Let’s dive in. This morning I got up to several Mastodon and forum messages and Steam posts about a crash in Death by Scrolling. Interesting. Never seen that before, and why now? Turns out this bug is in a Daily Challenge, which is why we’re getting a lot of bug reports all at once. That Daily Challenge must have just popped up for everyone. But why didn’t we catch it before, both in human testing and in our automated tests? Here is the core issue. is a const. But it turns out that consts that are not defined in (and this one wasn’t) aren’t seen in Dinky code run inside of Yack files. This is a bug in the Dinky compiler. But to complicate matters even more, the compiler bug is not seen in our dev environment, so it only happens with fully packed release files. Our automated testing does test all the Challenges, but this is an odd Challenge in that it is completed in a dialog (Yack file), so while the Challenge was tested for completion, it was not triggered via the Yack file by our testing unit. But it goes deeper. After the first release on Steam it was discovered that this challenge only gave Gem, not the normal . This was due to this code in the Yack file: I had hard coded long ago. So I changed it to So, the first version with the hard coded was the one that went through most of the play testing and testing. Then when I changed it to it ran fine for me because I was in the dev environment, and the automated testing didn’t catch it because it was in a .yack file. And since it was a Daily Challenge, with odds of about 1/100 of coming up, it slipped through. A lot of 20/20 hindsight. This is fixed in the new expansion, but I’m not sure I want to push a fix to the Steam release because the bug is rare and there are risks in just building a new build.
I am currently using Nextcloud for calendar and contacts syncing. Since I am looking to move away from Nextcloud, I need to find an alternative means of doing CalDAV and CardDAV. I’ve had a number of recommendations for Radicale, so I am giving it a go. I installed Radicale from the Debian stable package. Yes, I get an older version of Radicale (3.5.3), but it means everything is managed through my package manager. I tried - because of the import problems I had, below - using the pip version, via , but since it did not resolve the problems, I decided to go with the Debian version instead. The config file - at - is well documented, and easy to use. Other than setting up users, the only other thing that I changed was the logging settings, while I was trying to resolve my import problems. I put it behind a reverse proxy; the official documentation worked fine, but note that what they provide is not, in itself, a valid configuration. I exported my calendars using Thunderbird, and also via the Nextcloud web interface. I could import neither into Radicale, whether via Thunderbird, the Radicale web UI, or curl ( ). The web interface to Radicale gave no useful error messages, but the debug log (available via , once I’d adjusted the config file to enable all the options for detailed debug logging) was useful, indicating the specific problematic appointment UIDs. Some of the calendar entries were invalid. They were either (a) Microsoft-originated invitations which had been updated after sending, or (b) invitations for flights from British Airways, from years ago. I tried to fix them, using the validator at icalendar.org to work out what was wrong, but I struggled to do it. In the end, I deleted the problematic entries from the combined .ics file (one by one, by hand, using vim), until the import worked. I did not bother replacing the entries for old flights, or for old meetings (annoying, but oh well), but I had two appointments in the future, which I needed to preserve.
I found that, while I could not import them, once I had configured the calendars on my phone (via DAVx), I could copy the appointments from my existing calendar to my new calendar. And those worked just fine, so I don’t know why I could not import them. I need to decide whether it is a deal breaker or not that Radicale does not offer calendar sharing. I am used to being able to see Sandra’s calendars, and she mine. There is no way to do this within Radicale. There appears to be a fudge workaround, whereby I can symlink my calendars into Sandra’s directory, and hers into mine, and then we can add each other’s calendars. This should work for our needs (and it is a “should”, because I’ve yet to test it), but it does mean that we can each add and delete entries in the other person’s calendar, which is not ideal. It might still be a deal breaker.