Latest Posts (20 found)
matklad 2 days ago

Learning Software Architecture

In reply to an email asking about learning software design skills as a researcher physicist: I was attached to a bioinformatics lab early in my career, so I think I understand what you are talking about, the phenomenon of “scientific code”! My thoughts: First meta observation is that “software design” is something best learned by doing. While I had some formal “design” courses at the University, and I was even “an architect” for our course project, that stuff was mostly make-believe, kindergarteners playing fire-fighters. What really taught me how to do stuff was an accident of my career, where my second real project ( IntelliJ Rust ) propelled me to a position of software leadership, and made design my problem. I did make a few mistakes in IJ Rust, but nothing too horrible, and I learned a lot. So that’s good news — software engineering is simple enough that an inquisitive mind can figure it out from first principles (and reading random blog posts). Second meta observation, the bad news: Conway’s law is important. Softwaregenesis repeats the social architecture of the organization producing software. Or, as put eloquently by neugierig , If I were to summarize what I learned in a single sentence, it would be this: we talk about programming like it is about writing code, but the code ends up being less important than the architecture, and the architecture ends up being less important than social issues. I suspect that the difference you perceive between industrial and scientific software is not so much about software-building knowledge, but rather about the field of incentives that compels people to produce the software. Something like “my PhD needs to publish a paper in three months” is perhaps a significant explainer? Two things you can do here. One , at times you get a chance to design or nudge an incentive structure for a project. This happens once in a blue moon, but is very impactful. This is the secret sauce behind TIGER_STYLE , not the set of rules per se, but the social context that makes this set of rules a good idea. Two , you can speedrun the four stages of grief to acceptance. Incentive structure is almost never what you want it to be, but, if you can’t change it, you can adapt to it. This is also true about most industrial software projects — there’s never a time to do a thing properly, you must do the best you can, given constraints. Let me use rust-analyzer as an example. The physical reality of the project is that it’s simultaneously very deep (it’s a compiler! Yay!) and very wide (opposite to an LLM, a classical IDE is a lot of purpose-built special features). The social reality is that “deep compiler” can attract a few brilliant dedicated contributors, and that the “breadth features” can be a good fit for an army of weekend warriors, people who learn Rust, who don’t have sustained capacity to participate in the project, but who can sink an hour or two to scratch their own itch. My insistence that doesn’t require building , that it builds on stable, that it doesn’t have any C dependencies, and that the entire test suite takes seconds, was in the service of the goal of attracting high-impact contributors. I was wrangling the build system to make sure people can work on the borrow checker without thinking about anything else. To attract weekend warriors, the internals of rust-analyzer are split into multiple independent features, where each feature is guarded by at runtime. The thinking was that I explicitly don’t want to care too much about quality there, that the bar for getting a feature PR in is “happy path works & tested”. It’s fine if the code crashes, it will only attract further contributors, provided that: In contrast, when working on the core spine which provided support for features, I was very relatively more pedantic about quality. A word of caution about adapting to, rather than fixing incentive structure — the future is uncertain, and tends to happen in the least convenient manner. The original motivation behind rust-analyzer experiment was to avoid the need to write a parallel compiler (the one in IntelliJ Rust), and to prototype a better architecture for LSP, so that the learnings could be backported to . So, even in core (especially in core), the code was very experimental. Oh well. Stuck with one more compiler now, I guess? I might hazard a guess that something similar happened to uutils project, which started as the primary destination for people learning Rust, and ended up as Ubuntu coreutils implementation. Third , now to some concrete recommendations. Sadly, I don’t know of a single book I can recommend which contains the truths. I suspect one can only find such a book in an apocryphal short story by Borges: practice seems to be an essential element here. But here are some things worth paying attention to: Boundaries talk by Gary Bernhardt is all-time favorite. It contains solid object-level advice, and, for me, it triggered the meta inquiry. How to Test is something I wish I had. I immediately understood the importance of testing, but it took me a long time to grow arrogant enough to admit that most widely-cited testing advice is shamanistic snake-oil, and to conceptualize what actually works. ∅MQ guide and, more generally, writings by Pieter Hintjens introduced me to Conway’s Law thinking. That “feature development” architecture of rust-analyzer? – optimistic merging , applied. Reflections on a decade of coding by Jamii is excellent, goes very meta. It is intentionally the first of my links . Ted Kaminski blog is the closest there is to a coherent theory of software development, appropriately framed as a set of notes to a non-existing book! As for the actual books, Software Engineering at Google and Ousterhout’s The Philosophy of Software Design are often recommended. They are good. SWE, in particular, helped me with a couple of important names . But they weren’t ground breaking for me. the quality is isolated to a feature, and doesn’t spill over, at runtime, the crash is invisible to the user (it’s crucial that rust-analyzer features work with an immutable snapshot, and can’t poison the data).

0 views
matklad 6 days ago

Steering Zig Fmt

Two tips on using effectively. Read this if you are writing Zig, or if you are implementing a code formatter. For me, is better than any other formatter I used: , the one in IntelliJ, . is steerable. For every syntactic construct, it has several variations for how it might be laid out. The variation used is selected by looking at what’s currently in a file. Easier to show a pair of examples: Depending on the trailing comma, function call is formatted on a single line, or with one argument per line. The way this plays out in practice is that you decide how you want to lay out the code, add a couple of , hit the reformat shortcut ( , p is mine ), and does the rest. For me, this works better than the alternative of the formatter guessing. 90% of great formatting are blank lines between logical blocks and tasteful choice of intermediate variables, so you might as well lean into key choices, rather than eliminate them. I know of one non-trivial formatting customization point: columnar layout for arrays: One would think that trailing comma would lead to a number-per-line layout, but, for arrays, also takes note of the first line break. In this case, the line break comes after the first three items, so we get three numbers per line, aligned: How cool is that! Furthermore, with judicious use of ( array concatenation ), you can vary the number of items per line. When I need to pass pairs to subprocess, I often go for formatting like this:

0 views
matklad 1 weeks ago

Minimal Viable Zig Error Contexts

Out of the box, Zig provides minimal and sufficient facilities for error handling — strongly-typed error codes . Error reporting is left to the user. Idiomatic solution is to pass a out parameter (“sink”) to materialize human-readable strings as needed. Diagnostics pattern works well for “production” code, but for more script-y code it adds too much friction relative to the default option of a plain , which of course gives a less than ideal message on failure: Error trace is helpful, but knowing which file is the problem is even more so. The first attempt at finding a middle ground between fully-fledged diagnostics sink pattern and a plain try is something like this: Unsatisfactory. The friction is high, you need to come up with a reasonably-sounding error message, the “happy path” of the code is obscured, and you need to repeat this for every fallible operation. A worse-is-better version of the above code is That is, just log error context as pairs, guarded by . The result is not pretty, but passable: The friction is reduced a lot: There’s one huge drawback though — the error message is logged, even if the error is subsequently handled. This is especially important in Zig 0.16, where cancelation ( serendipitous-success ) is a possible error for any IO-ing operation, and which is intended to be handled, rather than reported. Generalizing: This does feel like a better error management strategy than decorating errors individually, when they happen. I wonder which language features facilitate this style? No need to come up with any error messages beyond existing variable names. No need to change any of the s. The context is set per-block. If a function does several fallible operations on a file, the path needs to be specified only once. The context is “telescopic” every function in the call-stack can add its own context. Happy path adds context to all operations in-progress. Errors materialize current context.

0 views
matklad 3 weeks ago

256 Lines or Less: Test Case Minimization

Property Based Testing and fuzzing are a deep and science-intensive topic. There are enough advanced techniques there for a couple of PhDs, a PBT daemon, and a client-server architecture . But I have this weird parlor-trick PBT library, implementable in a couple of hundred lines of code in one sitting. This week I’ve been thinking about a cool variation of a consensus algorithm. I implemented it on the weekend. And it took just a couple of hours to write a PBT library itself first, and then a test, that showed a deep algorithmic flaw in my thinking (after a dozen trivial flaws in my coding). So, I don’t get to write more about consensus yet, but I at least can write about the library. It is very simple, simplistic even. To use an old Soviet joke about Babel and Bebel , it’s Gogol rather than Hegel. But for just 256 lines, it’s one of the highest power-to-weight ratio tools in my toolbox. Read this post if: Zig works well here because it, too, is exceptional in its power-to-weight. The implementation is a single file, , because the core abstraction here is a Finite Random Number Generator — a PRNG where all numbers are pre-generated, and can run out. We start with standard boilerplate: In Zig, files are structs: you obviously need structs, and the language becomes simpler if structs are re-used for what files are. In the above assigns a conventional name to the file struct, and declares instance fields (only one here). and are “static” (container level) declarations. The only field we have is just a slice of raw bytes, our pre-generated random numbers. And the only error condition we can raise is . The simplest thing we can generate is a slice of bytes. Typically, API for this takes a mutable slice as an out parameter: But, due to pre-generated nature of FRNG, we can return the slice directly, provided that we have enough entropy. This is going to be our (sole) basis function, everything else is going to be a convenience helper on top: The next simplest thing is an array (a slice with a fixed size): Notice how Zig goes from runtime-known slice length, to comptime known array type. Because is a constant, slicing with returns a pointer to array, . We can re-interpret a 4-byte array into . But, because this is Zig, we can trivially generalize the function to work for any integer type, by passing in comptime parameter of type : This function is monomorphised for every type, so becomes a compile-time constant we can pass to . Production code would be endian-clean here, but, for simplicity, we encode our endianness assumption as a compile-time assertion. Note how Zig communicates information about endianness to the program. There isn’t any kind of side-channel or extra input to compilation, like flags. Instead, the compiler materializes all information about target CPU as Zig code. There’s a file somewhere in the compiler caches directory that contains This file can be accessed via and all the constants inspected at compile time. We can make an integer, and a boolean is even easier: Strictly speaking, we only need one bit, not one byte, but tracking individual bits is too much of a hassle. From an arbitrary int, we can generate an int in range. As per Random Numbers Included , we use a closed range, which makes the API infailable and is usually more convenient at the call-site: As a bit of PRNG trivia, while this could be implemented as , the result will be biased (not uniform). Consider the case where , and a call like . The numbers in are going to be twice as likely as the numbers in , because the last quarter of 256 range will be aliased with the first one. Generating an unbiased number is tricky and might require drawing arbitrary number of bytes from entropy. Refer to https://www.pcg-random.org/posts/bounded-rands.html for details. I didn’t, and copy-pasted code from the Zig standard library. Use at your own risk! Now we can generate an int bounded from above and below: Another common operation is picking a random element from a slice. If you want to return a pointer to a element, you’ll need a and versions of the function. A simpler and more general solution is to return an index: At the call site, doesn’t look too bad, is appropriately -polymorphic, and is also usable for multiple parallel arrays. So far, we’ve spent about 40% of our line budget implementing a worse random number generator that can fail with at any point in time. What is it good for? We use it to feed our system under test with random inputs, see how it reacts, and check that it does not crash. If we code our system to crash if anything unexpected happens and our random inputs cover the space of all possible inputs, we get a measure of confidence that bugs will be detected in testing. For my consensus simulation, I have a struct that holds a and a set of replicas: has methods like: I then select which method to call at random: Here, is another FRNG helper that selects an action at random, proportional to its weight. This helper needs quite a bit more reflection machinery than we’ve seen so far: is compile-time duck-typing. It means that our function is callable with any type, and each specific type creates a new monomorphised instance of a function. While we don’t explicitly name the type of , we can get it as . is a type-level function that takes a struct type: and turns it into an enum type, with a variant per-field, exactly what we want for the return type: Tip: if you want to quickly learn Zig’s reflection capabilities, study the implementation of and in Zig’s standard library. The built-in function accesses a field given field name. It’s exactly like Python’s / with an extra restriction that it must be evaluated at compile time. To add one more twist here, I always find it hard to figure out which weights are reasonable, and like to generate the weights themselves at random at the start of the test: (If you feel confused here, check out Swarm Testing Data Structures ) Now we have enough machinery to describe the shape of test overall: A test needs an (which ultimately determines the outcome) and an General Purpose Allocator for the . We start by creating a simulated with random action weights. If entropy is very low, we can run out of entropy even at this stage. We assume that the code is innocent until proven guilty — if we don’t have enough entropy to find a bug, this particular test returns success. Don’t worry, we’ll make sure that we have enough entropy elsewhere. We use to peel off error. I find that, whenever I handle errors in Zig, very often I want to discharge just a single error from the error set. I wish I could use parenthesis with a : Anyway, having created the , we step through it while we still have entropy left. If any step detects an internal inconsistency, the entire crashes with an assertion failure. If we got to the end of loop, we know that at least that particular slice of entropy didn’t uncover anything suspicious. Notice what isn’t there. We aren’t generating a complete list of actions up-front. Rather, we make random decisions as we go, and can freely use the current state of the to construct a menu of possible choices (e.g., when sending a message, we can consider only not currently crashed replicas). And here we can finally see the reason why we bothered writing a custom Finite PRNG, rather than using an off-the-shelf one. The amount of entropy in FRNG defines the complexity of the test. The fewer random bytes we start with, the faster we exit the step loop. And this gives us an ability to minimize test cases essentially for free. Suppose you know that a particular entropy slice makes the test fail (cluster enters split brain at the millionth step). Let’s say that the slice was 16KiB. The obvious next step is to see if just 8KiB would be enough to crash it. And, if 8KiB isn’t, than perhaps 12KiB? You can binary search the minimal amount of entropy that’s enough for the test to fail. And this works for any test, it doesn’t have to be a distributed system. If you can write the code to generate your inputs randomly, you can measure complexity of each particular input by measuring how many random bytes were drawn in its construction. And now the hilarious part — of course it seems that the way to minimize entropy is to start with a particular failing slice and apply genetic-algorithm mutations to it. But a much simpler approach seems to work in practice — just generated a fresh, shorter entropy slice. If you found some failure at random, then you should be able to randomly stumble into a smaller failing example, if one exists — there are much fewer small examples, so finding a failing one becomes easier when the goes down! The problem with binary searching for failing entropy is that a tripped assertion crashes the program. There’s no unwinding in Zig. For this reason, we’ll move the search code to a different process. So a single test will be a binary with a function, that takes entropy on . Zig’s new juicy main makes writing this easier than in any previous versions of Zig :D Main gets as an argument, which provides access to things like command line arguments, default allocator and a default implementation. These days, Zig eschews global ambient IO capabilities, and requires threading an Io instance whenever we need to make a syscall. Here, we need Io to read stdin. Now we will implement a harness to call this main. This will be : It will be spawning external processes, so it’ll need an . We also need a path to an executable with a test main function, a System Under Test. And we’ll need a buffer to hold the entropy. This driver will be communicating successes and failures to the users, so we also prepare a for textual output. How we get entropy to feed into ? Because we are only interested in entropy size, we won’t be storing the actual entropy bytes, and instead will generate it from a seed. In other words, just two numbres, entropy size and seed, are needed to reproduce a single run of the test: We use default deterministic PRNG to expand our short seed into entropy slice of the required size. Then we spawn proces, feeding the resulting entropy via stdin. Closing child’s stdin signals the end of entropy. We then return either or depending on child’s exit code. So, both explicit errors and crashes will be recognized as failures. Next, we implement the logic for checking if a particular seed size is sufficient to find a failure. Of course, we won’t be able to say that for sure in a finite amount of time, so we’ll settle for some user-specified amount of retries: The user passes us the number of to make, and we return if they all were successfull, or a specific failing seed if we found one: To generate a real seed we need “true” cryptographic non-deterministic randomness, which is provided by . Finally, the search for the size: Here, we are going to find a smallest entropy size that crashes . If we succeed, we return the seed and the size. The upper bound for the size is the space available in the pre-allocated entropy buffer. The search loop is essentially a binary search, with a twist — rather than using dichotomy on the directly, we will be doubling a we use to change the size between iterations. That is, we start with a small size and step, and, on every iteration, double the step and add it to the size, until we hit a failure (or run out of buffer for the entropy). Once we found a failure, we continue the serach in the other direction — halving the step and subtracting it from the , keeping the smaller if it still fails. On each step, we log the current size and outcome, and report the smallest failing size at the end. Finally, we wrap Driver’s functionality into main that works in two modes — either reproduces a given failure from seed and size, or searches for a minimal failure: Running the search routine looks like this in a terminal: Those final seed&size can then be used for , giving you a minimal reproducible failure for debugging! This … of course doesn’t look too exciting without visualizing a specific bug we can find this way, but the problem there is that interesting examples of systems to test in this way usually take more than 256 lines to implement. So I’ll leave it to your imagination, but you get the idea: if you can make a system fail under a “random” input, you can also systematically search the space of all inputs for the smallest counter-example, without adding knowledge about the system to the searcher. This article also provides a concrete (but somewhat verbose) example. Here’s the full code: https://gist.github.com/matklad/343d13547c8bfe9af310e2ca2fbfe109 You want to stretch your generative testing muscles. You are a do-it-yourself type, and wouldn’t want to pull a ginormous PBT library off the shelf. You would pull a library, but want to have a more informed opinion about available options, about essential and accidental complexity. You want some self-contained real-world Zig examples :P

0 views
matklad 1 months ago

Consensus Board Game

I have an early adulthood trauma from struggling to understand consensus amidst a myriad of poor explanations. I am overcompensating for that by adding my own attempts to the fray. Today, I want to draw a series of pictures which could be helpful. You can see this post as a set of missing illustrations for Notes on Paxos , or, alternatively, you can view that post as a more formal narrative counter-part for the present one. The idea comes from my mathematics of consensus lecture, with versions in English and Russian . I am going to aggressively hand wave the details away, please refer to Notes for filling in the blanks. And, before we begin, I want to stress again that here I am focusing strictly on the mathematics behind the algorithm, on the logical structure of the universe that makes some things impossible, and others doable. Consensus is but a small part of the engineering behind real data management systems, and I might do something about pragmatics of consensus at some point, just not today ;) There’s a committee of five members that tries to choose a color for a bike shed, but the committee members are not entirely reliable. We want to arrive at a decision even if some members of the committee are absent. The fundamental idea underpinning consensus is simple majority vote. If R0, … R4 are the five committee members, we can use the following board to record the votes: A successful vote looks like this: Here, red collected 3 out of 5 votes and wins. Note that R4 hasn’t voted yet. It might, or might not do so eventually, but that won’t affect the outcome. The problem with voting is that it can get stuck like this: Here, we have two votes for red, two votes for blue, but the potential tie-breaker, R4, voted for green, the rascal! To solve split vote, we are going to designate R0 as the committee’s leader, make it choose the color, and allow others only to approve. Note that meaningful voting still takes place, as someone might abstain from voting — you need at least 50% turnout for the vote to be complete: Here, R0, the leader (marked with yellow napoleonic bicorne), choose red, R2 and R3 acquiesced, so the red “won”, even as R1 and R4 abstained (x signifies absence of a vote). The problem with this is that our designated leader might be unavailable itself: Which brings us to the central illustration that I wanted to share. What are we going to do now is to multiply our voting. Instead of conducting just one vote with a designated leader, the committee will conduct a series of concurrent votes, where the leaders rotate in round-robin pattern. This gives rise to the following half-infinite 2D board on which the game of consensus is played: Each column plays independently. If you are a leader in a column, and your cell is blank, you can choose whatever color. If you are a follower, you need to wait until column’s leader decision, and then you can either fill the same color, or you can abstain. After several rounds the board might end up looking like this: The benefit of our 2D setup is that, if any committee member is unavailable, their columns might get stuck, but, as long as the majority is available, some column somewhere might still complete. The drawback is that, while individual column’s decision is clear and unambiguous, the outcome of the board as whole is undefined. In the above example, there’s a column where red wins, and a column where blue wins. So what we are going to do is to scrap the above board as invalid, and instead require that any two columns that achieved majorities must agree on the color. In other words, the outcome of the entire board is the outcome of any of its columns, whichever finishes first, and the safety condition is that no two colors can reach majorities in different columns. Let’s take a few steps back when the board wasn’t yet hosed, and try to think about the choice of the next move from the perspective of R3: As R3 and the leader for your column, you need to pick a color which won’t conflict with any past or future decisions in other columns. Given that there are some greens and blues already, it feels like maybe you shouldn’t pick red… But it could be the case that the three partially filled columns won’t move anywhere in the future, and the first column gets a solid red line! Though choices! You need to worry about the future and the infinite number of columns to your right! Luckily, the problem can be made much easier if we assume that everyone plays by the same rules, in which case it’s enough to only worry about the columns to your left. Suppose that you, and everyone else is carefully choosing their moves to not conflict with the columns to the left. Then, if you chose red, your column wins, and subsequently some buffoon on the right chooses green, it’s their problem, because you are to their left. So let’s just focus on the left part of the board. Again, it seems like blue or green might be good bets, as they are already present on the board, but there’s a chance that the first column will eventually vote for red. To prevent that, what we are going to do is to collect a majority of participants (R0, R2, R3) and require them to commit to not voting in the first columns. Actually, for that matter, let’s prevent them from voting in any column to the left: Here, you asked R0, R2 and R3 to abstain from casting further votes in the first three columns, signified by black x. With this picture, we can now be sure that red can not win in the first column — no color can win there, because only two out of the five votes are available there! Still, we have the choice between green and blue, which one should we pick? The answer is the rightmost. R2, the participant that picked blue in the column to our immediate left, was executing the same algorithm. If they picked blue, they did it because they knew for certain that the second column can’t eventually vote for green. R2 got a different majority of participants to abstain from voting in the second column, and, while we, as R3, don’t know which majority that was, we know that it exists because we know that R2 did pick blue, and we assume fair play. That’s all for today, that’s the trick that makes consensus click, in the abstract. In a full distributed system the situation is more complicated. Each participant only sees its own row, the board as a whole remains concealed. Participants can learn something about others’ state by communicating, but the knowledge isn’t strongly anchored at time. By the time a response is received the answer could as well be obsolete. And yet, the above birds-eye view can be implemented in a few exchanges of messages. Please see the Notes for further details.

0 views
matklad 2 months ago

JJ LSP Follow Up

In Majjit LSP , I described an idea of implementing Magit style UX for jj once and for all, leveraging LSP protocol. I’ve learned today that the upcoming 3.18 version of LSP has a feature to make this massively less hacky: Text Document Content Request LSP can now provide virtual documents, which aren’t actually materialized on disk. So this: can now be such a virtual document, where highlighting is provided by semantic tokens, things like “check out this commit” are code actions, and “goto definition” jumps from the diff in the virtual file to a real file in the working tree.

0 views
matklad 2 months ago

Wrapping Code Comments

I was today years old when I realized that: It’s a good idea to limit line length to about 100 columns. This is a physical limit, the width at which you can still comfortably fit two editors side by side (see Size Matters ). Note an apparent contradiction: the optimal width for readable prose is usually taken to be narrower, 60–70 columns. The contradiction is resolved by noticing that, for code, indentation eats into usable space. Typically, code is much less typographically dense than prose. Still, I find comment blocks easier to read when they are wrapped narrower than the surrounding code. I want lines to be wrapped at 100, and content of comments to be wrapped at 70 (unless that pushes overall line to be longer than 100). That is, I want layout like this (using 20/30 rulers instead of 70/100, for illustrative purposes): This feels obvious in retrospect, but notably isn’t be well-supported by the tools? The VS Code extension I use allows configuring dedicated fill column for comments, but doesn’t make it relative , so indented comment blocks are always narrower than top-level ones. Emacs also doesn’t do relative wrapping out of the box! Aside on hard-wrapping: should we bother with wrapping comments at all? Can’t we rely on our editor to implement soft-wrapping? The problem with soft-wrapping is that you can’t soft-wrap text correctly without understanding its meaning. Consider a markdown list: If the first item is long enough to necessitate wrapping, the wrapped line should also be indented, which requires parsing the text as markdown first: Code and code comments ideally should be wrapped to a different column. For comments, the width should be relative to the start of the comment.

0 views
matklad 3 months ago

CI In a Box

I wrote , a thin wrapper around ssh for running commands on remote machines. I want a box-shaped interface for CI: That is, the controlling CI machine runs a user-supplied script, whose status code will be the ultimate result of a CI run. The script doesn’t run the project’s tests directly. Instead, it shells out to a proxy binary that forwards the command to a runner box with whichever OS, CPU, and other environment required. The hard problems are in the part: CI discourse amuses me — everyone complains about bad YAML, and it is bad, but most of the YAML (and associated reproducibility and debugging problems) is avoidable. Pick an appropriate position on a dial that includes What you can’t just do by writing a smidgen of text is getting the heterogeneous fleet of runners. And you need heterogeneous fleet of runners if some of the software you are building is cross-platform. If you go that way, be mindful that The SSH wire protocol only takes a single string as the command, with the expectation that it should be passed to a shell by the remote end. In other words, while SSH supports syntax like , it just blindly intersperses all arguments with a space. Amusing to think that our entire cloud infrastructure is built on top of shell injection ! This, and the need to ensure no processes are left behind unintentionally after executing a remote command, means that you can’t “just” use SSH here if you are building something solid. One of them is not UNIX. One of them has licensing&hardware constraints that make per-minute billed VMs tricky (but not impossible, as GitHub Actions does that). All of them are moving targets, and require someone to do the OS upgrade work, which might involve pointing and clicking . writing a bash script, writing a script in the language you already use , using a small build system , using a medium-sized one like or , or using a large one like or .

0 views
matklad 3 months ago

Vibecoding #2

I feel like I got substantial value out of Claude today, and want to document it. I am at the tail end of AI adoption, so I don’t expect to say anything particularly useful or novel. However, I am constantly complaining about the lack of boring AI posts, so it’s only proper if I write one. At TigerBeetle, we are big on deterministic simulation testing . We even use it to track performance , to some degree. Still, it is crucial to verify performance numbers on a real cluster in its natural high-altitude habitat. To do that, you need to procure six machines in a cloud, get your custom version of binary on them, connect cluster’s replicas together and hit them with load. It feels like, quarter of a century into the third millennium, “run stuff on six machines” should be a problem just a notch harder than opening a terminal and typing , but I personally don’t know how to solve it without wasting a day. So, I spent a day vibecoding my own square wheel. The general shape of the problem is that I want to spin a fleet of ephemeral machines with given specs on demand and run ad-hoc commands in a SIMD fashion on them. I don’t want to manually type slightly different commands into a six-way terminal split, but I also do want to be able to ssh into a specific box and poke it around. My idea for the solution comes from these three sources: The big idea of is that you can program distributed system in direct style. When programming locally, you do things by issuing syscalls: This API works for doing things on remote machines, if you specify which machine you want to run the syscall on: Direct manipulation is the most natural API, and it pays to extend it over the network boundary. Peter’s post is an application of a similar idea to a narrow, mundane task of developing on Mac and testing on Linux. Peter suggests two scripts: synchronizes a local and remote projects. If you run inside folder, then materializes on the remote machine. does the heavy lifting, and the wrapper script implements behaviors. It is typically followed by , which runs command on the remote machine in the matching directory, forwarding output back to you. So, when I want to test local changes to on my Linux box, I have roughly the following shell session: The killer feature is that shell-completion works. I first type the command I want to run, taking advantage of the fact that local and remote commands are the same, paths and all, then hit and prepend (in reality, I have alias that combines sync&run). The big thing here is not the commands per se, but the shift in the mental model. In a traditional ssh & vim setup, you have to juggle two machines with a separate state, the local one and the remote one. With , the state is the same across the machines, you only choose whether you want to run commands here or there. With just two machines, the difference feels academic. But if you want to run your tests across six machines, the ssh approach fails — you don’t want to re-vim your changes to source files six times, you really do want to separate the place where the code is edited from the place(s) where the code is run. This is a general pattern — if you are not sure about a particular aspect of your design, try increasing the cardinality of the core abstraction from 1 to 2. The third component, library, is pretty mundane — just a JavaScript library for shell scripting. The notable aspects there are: JavaScript’s template literals , which allow implementing command interpolation in a safe by construction way. When processing , a string is never materialized, it’s arrays all the way to the syscall ( more on the topic ). JavaScript’s async/await, which makes managing concurrent processes (local or remote) natural: Additionally, deno specifically valiantly strives to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself, unless explicitly marked — a sour spot of UNIX. Combining the three ideas, I now have a deno script, called , that provides a multiplexed interface for running ad-hoc code on ad-hoc clusters. A session looks like this: I like this! Haven’t used in anger yet, but this is something I wanted for a long time, and now I have it The problem with implementing above is that I have zero practical experience with modern cloud. I only created my AWS account today, and just looking at the console interface ignited the urge to re-read The Castle. Not my cup of pu-erh. But I had a hypothesis that AI should be good at wrangling baroque cloud API, and it mostly held. I started with a couple of paragraphs of rough, super high-level description of what I want to get. Not a specification at all, just a general gesture towards unknown unknowns. Then I asked ChatGPT to expand those two paragraphs into a more or less complete spec to hand down to an agent for implementation. This phase surfaced a bunch of unknowns for me. For example, I wasn’t thinking at all that I somehow need to identify machines, ChatGPT suggested using random hex numbers, and I realized that I do need 0,1,2 naming scheme to concisely specify batches of machines. While thinking about this, I realized that sequential numbering scheme also has an advantage that I can’t have two concurrent clusters running, which is a desirable property for my use-case. If I forgot to shutdown a machine, I’d rather get an error on trying to re-create a machine with the same name, then to silently avoid the clash. Similarly, turns out the questions of permissions and network access rules are something to think about, as well as what region and what image I need. With the spec document in hand, I turned over to Claude code for actual implementation work. The first step was to further refine the spec, asking Claude if anything is unclear. There were couple of interesting clarifications there. First, the original ChatGPT spec didn’t get what I meant with my “current directory mapping” idea, that I want to materialize a local as remote , even if are different. ChatGPT generated an incorrect description and an incorrect example. I manually corrected example, but wasn’t able to write a concise and correct description. Claude fixed that working from the example. I feel like I need to internalize this more — for current crop of AI, examples seem to be far more valuable than rules. Second, the spec included my desire to auto-shutdown machines once I no longer use them, just to make sure I don’t forget to turn the lights off when leaving the room. Claude grilled me on what precisely I want there, and I asked it to DWIM the thing. The spec ended up being 6KiB of English prose. The final implementation was 14KiB of TypeScript. I wasn’t keeping the spec and the implementation perfectly in sync, but I think they ended up pretty close in the end. Which means that prose specifications are somewhat more compact than code, but not much more compact. My next step was to try to just one-shot this. Ok, this is embarrassing, and I usually avoid swearing in this blog, but I just typoed that as “one-shit”, and, well, that is one flavorful description I won’t be able to improve upon. The result was just not good (more on why later), so I almost immediately decided to throw it away and start a more incremental approach. In my previous vibe-post , I noticed that LLM are good at closing the loop. A variation here is that LLMs are good at producing results, and not necessarily good code. I am pretty sure that, if I had let the agent to iterate on the initial script and actually run it against AWS, I would have gotten something working. I didn’t want to go that way for three reasons: And, as I said, the code didn’t feel good, for these specific reasons: The incremental approach worked much better, Claude is good at filling-in the blanks. The very first thing I did for was manually typing-in: Then I asked Claude to complete the function, and I was happy with the result. Note Show, Don’t Tell I am not asking Claude to avoid throwing an exception and fail fast instead. I just give function, and it code-completes the rest. I can’t say that the code inside is top-notch. I’d probably written something more spartan. But the important part is that, at this level, I don’t care. The abstraction for parsing CLI arguments feel right to me, and the details I can always fix later. This is how this overall vibe-coding session transpired — I was providing structure, Claude was painting by the numbers. In particular, with that CLI parsing structure in place, Claude had little problem adding new subcommands and new arguments in a satisfactory way. The only snag was that, when I asked to add an optional path to , it went with , while I strongly prefer . Obviously, its better to pick your null in JavaScript and stick with it. The fact that is unavoidable predetermines the winner. Given that the argument was added as an incremental small change, course-correcting was trivial. The null vs undefined issue perhaps illustrates my complaint about the code lacking character. is the default non-choice. is an insight, which I personally learned from VS Code LSP implementation. The hand-written skeleton/vibe-coded guts worked not only for the CLI. I wrote and then asked Claude to write the body of a particular function according to the SPEC.md. Unlike with the CLI, Claude wasn’t able to follow this pattern itself. With one example it’s not obvious, but the overall structure is that is the AWS-level operation on a single box, and is the CLI-level control flow that deals with looping and parallelism. When I asked Claude to implement , without myself doing the / split, Claude failed to noticed it and needed a course correction. However , Claude was massively successful with the actual logic. It would have taken me hours to acquire specific, non-reusable knowledge to write: I want to be careful — I can’t vouch for correctness and especially completeness of the above snippet. However, given that the nature of the problem is such that I can just run the code and see the result, I am fine with it. If I were writing this myself, trial-and-error would totally be my approach as well. Then there’s synthesis — with several instance commands implemented, I noticed that many started with querying AWS to resolve symbolic machine name, like “1”, to the AWS name/IP. At that point I realized that resolving symbolic names is a fundamental part of the problem, and that it should only happen once, which resulting in the following refactored shape of the code: Claude was ok with extracting the logic, but messed up the overall code layout, so the final code motions were on me. “Context” arguments go first , not last, common prefix is more valuable than common suffix because of visual alignment. The original “one-shotted” implementation also didn’t do up-front querying. This is an example of a shape of a problem I only discover when working with code closely. Of course, the script didn’t work perfectly the first time and we needed quite a few iterations on the real machines both to fix coding bugs, as well gaps in the spec. That was an interesting experience of speed-running rookie mistakes. Claude made naive bugs, but was also good at fixing them. For example, when I first tried to after , I got an error. Pasting it into Claude immediately showed the problem. Originally, the code was doing and not . The former checks if instance is logically created, the latter waits until the OS is booted. It makes sense that these two exist, and the difference is clear (and its also clear that OS booted != SSH demon started). Claude’s value here is in providing specific names for the concepts I already know to exist. Another fun one was about the disk. I noticed that, while the instance had an SSD, it wasn’t actually used. I asked Claude to mount it as home, but that didn’t work. Claude immediately asked me to run and that log immediately showed the problem. This is remarkable! 50% of my typical Linux debugging day is wasted not knowing that a useful log exists, and the other 50% is for searching for the log I know should exist somewhere . After the fix, I lost the ability to SSH. Pasting the error immediately gave the answer — by mounting over , we were overwriting ssh keys configured prior. There were couple of more iterations like that. Rookie mistakes were made, but they were debugged and fixed much faster than my personal knowledge allows (and again, I feel that is trivia knowledge, rather than deep reusable knowledge, so I am happy to delegate it!). It worked satisfactorily in the end, and, what’s more, I am happy to maintain the code, at least to the extent that I personally need it. Kinda hard to measure productivity boost here, but, given just the sheer number of CLI flags required to make this work, I am pretty confident that time was saved, even factoring the writing of the present article! I’ve recently read The Art of Doing Science and Engineering by Hamming (of distance and code), and one story stuck with me: A psychologist friend at Bell Telephone Laboratories once built a machine with about 12 switches and a red and a green light. You set the switches, pushed a button, and either you got a red or a green light. After the first person tried it 20 times they wrote a theory of how to make the green light come on. The theory was given to the next victim and they had their 20 tries and wrote their theory, and so on endlessly. The stated purpose of the test was to study how theories evolved. But my friend, being the kind of person he was, had connected the lights to a random source! One day he observed to me that no person in all the tests (and they were all high-class Bell Telephone Laboratories scientists) ever said there was no message. I promptly observed to him that not one of them was either a statistician or an information theorist, the two classes of people who are intimately familiar with randomness. A check revealed I was right! https://github.com/catern/rsyscall https://peter.bourgon.org/blog/2011/04/27/remote-development-from-mac-to-linux.html https://github.com/dsherret/dax JavaScript’s template literals , which allow implementing command interpolation in a safe by construction way. When processing , a string is never materialized, it’s arrays all the way to the syscall ( more on the topic ). JavaScript’s async/await, which makes managing concurrent processes (local or remote) natural: Additionally, deno specifically valiantly strives to impose process-level structured concurrency, ensuring that no processes spawned by the script outlive the script itself, unless explicitly marked — a sour spot of UNIX. Spawning VMs takes time, and that significantly reduces the throughput of agentic iteration. No way I let the agent run with a real AWS account, given that AWS doesn’t have a fool-proof way to cap costs. I am fairly confident that this script will be a part of my workflow for at least several years, so I care more about long-term code maintenance, than immediate result. It wasn’t the code that I would have written, it lacked my character, which made it hard for me to understand it at a glance. The code lacked any character whatsoever. It could have worked, it wasn’t “naively bad”, like the first code you write when you are learning programming, but there wasn’t anything good there. I never know what the code should be up-front. I don’t design solutions, I discover them in the process of refactoring. Some of my best work was spending a quiet weekend rewriting large subsystems implemented before me, because, with an implementation at hand, it was possible for me to see the actual, beautiful core of what needs to be done. With a slop-dump, I just don’t get to even see what could be wrong. In particular, while you are working the code (as in “wrought iron”), you often go back to requirements and change them. Remember that ambiguity of my request to “shut down idle cluster”? Claude tried to DWIM and created some horrific mess of bash scripts, timestamp files, PAM policy and systemd units. But the right answer there was “lets maybe not have that feature?” (in contrast, simply shutting the machine down after 8 hours is a one-liner).

2 views
matklad 4 months ago

The Second Great Error Model Convergence

I feel like this has been said before, more than once, but I want to take a moment to note that most modern languages converged to the error management approach described in Joe Duffy’s The Error Model , which is a generational shift from the previous consensus on exception handling. C++, JavaScript, Python, Java, C# all have roughly equivalent , , constructs with roughly similar runtime semantics and typing rules. Even functional languages like Haskell, OCaml, and Scala feature exceptions prominently in their grammar, even if their usage is frowned upon by parts of the community. But the same can be said about Go, Rust, Swift, and Zig! Their error handling is similar to each other, and quite distinct from the previous bunch, with Kotlin and Dart being notable, ahem, exceptions. Here are some commonalities of modern error handling: First , and most notably, functions that can fail are annotated at the call side. While the old way looked like this: the new way is There’s a syntactic marker alerting the reader that a particular operation is fallible, though the verbosity of the marker varies. For the writer, the marker ensures that changing the function contract from infallible to fallible (or vice versa) requires changing not only the function definition itself, but the entire call chain. On the other hand, adding a new error condition to a set of possible errors of a fallible function generally doesn’t require reconsidering rethrowing call-sites. Second , there’s a separate, distinct mechanism that is invoked in case of a detectable bug. In Java, index out of bounds or null pointer dereference (examples of programming errors) use the same language machinery as operational errors. Rust, Go, Swift, and Zig use a separate panic path. In Go and Rust, panics unwind the stack, and they are recoverable via a library function. In Swift and Zig, panic aborts the entire process. Operational error of a lower layer can be classified as a programming error by the layer above, so there’s generally a mechanism to escalate an erroneous result value to a panic. But the opposite is more important: a function which does only “ordinary” computations can be buggy, and can fail, but such failures are considered catastrophic and are invisible in the type system, and sufficiently transparent at runtime. Third , results of fallible computation are first-class values, as in Rust’s . There’s generally little type system machinery dedicated exclusively to errors and expressions are just a little more than syntax sugar for that little Go spell. This isn’t true for Swift, which does treat errors specially. For example, the generic function has to explicitly care about errors, and hard-codes the decision to bail early: Swift does provide first-classifier type for errors. Should you want to handle an exception, rather than propagate it, the handling is localized to a single throwing expression to deal with a single specific errors, rather than with any error from a block of statements: Swift again sticks to more traditional try catch, but, interestingly, Kotlin does have expressions. The largest remaining variance is in what the error value looks like. This still feels like a research area. This is a hard problem due to a fundamental tension: The two extremes are well understood. For exhaustiveness, nothing beats sum types ( s in Rust). This I think is one of the key pieces which explains why the pendulum seemingly swung back on checked exceptions. In Java, a method can throw one of the several exceptions: Critically, you can’t abstract over this pair. The call chain has to either repeat the two cases, or type-erase them into a superclass, losing information. The former has a nasty side-effect that the entire chain needs updating if a third variant is added. Java-style checked exceptions are sensitive to “N to N + 1” transitions. Modern value-oriented error management is only sensitive to “0 to 1” transition. Still, if I am back to writing Java at any point, I’d be very tempted to standardize on coarse-grained signature for all throwing methods. This is exactly the second well understood extreme: there’s a type-erased universal error type, and the “throwableness” of a function contains one bit of information. We only care if the function can throw, and the error itself can be whatever. You still can downcast dynamic error value handle specific conditions, but the downcasting is not checked by the compiler. That is, downcasting is “save” and nothing will panic in the error handling mechanism itself, but you’ll never be sure if the errors you are handling can actually arise, and whether some errors should be handled, but aren’t. Go and Swift provide first-class universal errors, like Midori. Starting with Swift 4, you can also narrow the type down. Rust doesn’t really have super strong conventions about the errors, but it started with mostly enums, and then and shone spotlight on the universal error type. But overall, it feels like “midpoint” error handling is poorly served by either extreme. In larger applications, you sorta care about error kinds, and there are usually a few place where it is pretty important to be exhaustive in your handling, but threading necessary types to those few places infects the rest of the codebases, and ultimately leads to “a bag of everything” error types with many “dead” variants. Zig makes an interesting choice of assuming mostly closed-world compilation model, and relying on cross-function inference to learn who can throw what. What I find the most fascinating about the story is the generational aspect. There really was a strong consensus about exceptions, and then an agreement that checked exceptions are a failure , and now, suddenly, we are back to “checked exceptions” with a twist, in the form of “errors are values” philosophy. What happened between the lull of the naughts and the past decade industrial PLT renaissance? On the one hand, at lower-levels you want to exhaustively enumerate errors to make sure that: internal error handling logic is complete and doesn’t miss a case, public API doesn’t leak any extra surprise error conditions. On the other hand, at higher-levels, you want to string together widely different functionality from many separate subsystems without worrying about specific errors, other than: separating fallible functions from infallible, ensuring that there is some top-level handler to show a 500 error or an equivalent.

0 views
matklad 4 months ago

Parsing Advances

I find myself writing yet another toy parser, as one does during a Christmas break. It roughly follows Resilient LL Parsing Tutorial . Not because I need resilience, but mostly because I find producing a syntax tree and a collection of diagnostics a more natural fit for the problem than bailing out on the first error. One practical pitfall with the approach is infinite loops/recursion. Resilience sometimes means not consuming a token, and, if you do that in a loop or a Pratt recursive call, you’ll get yourself an annoying to debug error: For a concrete example, you might parse function argument list using code like this: The implicit contract here is that consumes at least one token, even if there are errors in the source code. If there’s some token that makes bail without consuming anything, the code loops forever, and you’ll need a debugger to get at the stack trace! The way I solved this issue traditionally is via a combination of two techniques: Fuel: parser has a field, which is decremented even by “readonly” lookahead methods, and topped up every time the parser consumes a token. Fuel is useful to make you parser crash somewhat cleanly, though the crash is typically still removed from problematic function by several stack frames. The second technique is to maintain a mental map of functions which always consume at least one token of input, and functions which might bail without consuming anything. And, whenever you write a loop or a recursive call, consult this map to be sure to call at least one token-consuming function. Hard and error prone! Well, I think I’ve figured something better today! You can assert that parser did advance when you expect it to. The smaller benefit here is that if parser didn’t advance, you get an immediate error. The bigger benefit is that these asserts materialize the mental map of advancing functions in the source code, so it doesn’t have to be mental anymore! This seems like an obvious idea in retrospect, but, well, took me more than one parser to figure it out! Concretely, I came up with the following base parser API: And here is the buggy function that lead to the error at the start of the article: The same function, but with advanced assertions: The new error message: and the fix:

0 views
matklad 4 months ago

Static Allocation For Compilers

TigerBeetle famously uses “static allocation” . Infamously, the use of the term is idiosyncratic: what is meant is not arrays, as found in embedded development, but rather a weaker “no allocation after startup” form. The amount of memory TigerBeetle process uses is not hard-coded into the Elf binary. It depends on the runtime command line arguments. However, all allocation happens at startup, and there’s no deallocation. The long-lived event loop goes round and round happily without . I’ve wondered for years if a similar technique is applicable to compilers. It seemed impossible, but today I’ve managed to extract something actionable from this idea? Static allocation depends on the physics of the underlying problem. And distributed databases have surprisingly simple physics, at least in the case of TigerBeetle. The only inputs and outputs of the system are messages. Each message is finite in size (1MiB). The actual data of the system is stored on disk and can be arbitrarily large. But the diff applied by a single message is finite. And, if your input is finite, and your output is finite, it’s actually quite hard to need to allocate extra memory! This is worth emphasizing — it might seem like doing static allocation is tough and requires constant vigilance and manual accounting for resources. In practice, I learned that it is surprisingly compositional. As long as inputs and outputs of a system are finite, non-allocating processing is easy. And you can put two such systems together without much trouble. routing.zig is a good example of such an isolated subsystem. The only issue here is that there isn’t a physical limit on how many messages can arrive at the same time. Obviously, you can’t process arbitrary many messages simultaneously. But in the context of a distributed system over an unreliable network, a safe move is to drop a message on the floor if the required processing resources are not available. Counter-intuitively, not allocating is simpler than allocating, provided that you can pull it off! Alas, it seems impossible to pull it off for compilers. You could say something like “hey, the largest program will have at most one million functions”, but that will lead to both wasted memory and poor user experience. You could also use a single yolo arena of a fixed size, like I did in Hard Mode Rust , but that isn’t at all similar to “static allocation”. With arenas, the size is fixed explicitly, but you can OOM. With static allocation it is the opposite — no OOM, but you don’t know how much memory you’ll need until startup finishes! The “problem size” for a compiler isn’t fixed — both the input (source code) and the output (executable) can be arbitrarily large. But that is also the case for TigerBeetle — the size of the database is not fixed, it’s just that TigerBeetle gets to cheat and store it on disk, rather than in RAM. And TigerBeetle doesn’t do “static allocation” on disk, it can fail with at runtime, and it includes a dynamic block allocator to avoid that as long as possible by re-using no longer relevant sectors. So what we could say is that a compiler consumes arbitrarily large input, and produces arbitrarily large output, but those “do not count” for the purpose of static memory allocation. At the start, we set aside an “output arena” for storing finished, immutable results of compiler’s work. We then say that this output is accumulated after processing a sequence of chunks, where chunk size is strictly finite. While limiting the total size of the code-base is unreasonable, limiting a single file to, say, 4 MiB (runtime-overridable) is fine. Compiling then essentially becomes a “stream processing” problem, where both inputs and outputs are arbitrary large, but the filter program itself must execute in O(1) memory. With this setup, it is natural to use indexes rather than pointers for “output data”, which then makes it easy to persist it to disk between changes. And it’s also natural to think about “chunks of changes” not only spatially (compiler sees a new file), but also temporally (compiler sees a new version of an old file). Is there any practical benefits here? I don’t know! But seems worth playing around with! I feel that a strict separation between O(N) compiler output and O(1) intermediate processing artifacts can clarify compiler’s architecture, and I won’t be too surprised if O(1) processing in compilers would lead to simpler code the same way it does for databases?

0 views
matklad 4 months ago

Newtype Index Pattern In Zig

In efficiency-minded code, it is idiomatic to use indexes rather than pointers. Indexes have several advantages: First , they save memory. Typically a 32-bit index is enough, a saving of four bytes per pointer on 64-bit architectures. I haven’t seen this measured, but my gut feeling is that this is much more impactful than it might initially seem. On modern architectures, saving memory saves time (and energy) as well, because the computing bottleneck is often the bit pipe between the memory and the CPU, not the computation per se. Dense data structures use CPU cache more efficiently, removing prohibitive latency of memory accesses. Bandwidth savings are even better: smaller item size obviously improves bandwidth utilization, but having more items in cache obviates the need to use the bandwidth in the first place. Best case, the working set fits into the CPU cache! Note well that memory savings are evenly spread out. Using indexes makes every data structure slightly more compact, which improves performance across the board, regardless of hotspot distribution. It’s hard to notice a potential for such saving in a profiler, and even harder to test out. For these two reasons, I would default to indexes for code where speed matters, even when I don’t have the code written yet to profile it! There’s also a more subtle way in which indexes save memory. Using indexes means storing multiple items in an array, but such dense storage contains extra information in relative positions of the items. If you need to store a list of items, you can often avoid materializing the list of indexes by storing a range “pointing” into the shared storage. Occasionally, you can even do UTF-8 trick and use just a single bit to mark the end of a list. The second benefit of indexes is more natural modeling of cyclic and recursive data structures. Creating a cycle fundamentally requires mutability somewhere (“tying the knot” in Haskell relies on mutability of lazy thunks). This means that you need to make some pointers nullable, and that usually gets awkward even without borrow checker behind your back. Even without cycles and just recursion, pointers are problematic, due to a combination of two effects: The combination works fine at small scale, but then it fails with stack overflow in production every single time, requiring awkward work-arounds. For example, serializes error traces from nested macro expansions as a deeply nested tree of JSON objects, which requires using stacker hack when parsing the output (which you’ll learn about only after crashes in the hands of macro connoisseur users). Finally , indexes greatly help serialization, they make it trivial to communicate data structures both through space (sending a network message) and time (saving to disk and reading later). Indexes are naturally relocatable, it doesn’t matter where in memory they are. But this is just a half of serialization benefit. The other is that, because everything is in few arrays, you can do bulk serialization. You don’t need to write the items one by one, you can directly arrays around (but be careful to not leak data via padding, and be sure to checksum the result). The big problem with “naive” indexes is of course using the right index with the wrong array, or vice verse. The standard solution here is to introduce a newtype wrapper around the raw index. @andrewrk recently popularized a nice “happy accident of language design” pattern for this in Zig. The core idea is to define an index via non-exhaustive : In Zig, designates a strongly-typed collection of integer constants, not a Rust-style ADT (there’s for that). By default an backing integer type is chosen by the compiler, but you can manually override it with syntax: Finally, Zig allows making enums non-exhaustive with . In a non-exhaustive enum, any numeric value is valid, and some have symbolic labels: and builtins switch abstraction level between a raw integer and an enum value. So, is a way to spell “ , but a distinct type”. Note that there’s no strong encapsulation boundary here, anyone can . Zig just doesn’t provide language-enforced encapsulation mechanisms. Putting everything together, this is how I would model n-ary tree with parent pointers in Zig: Some points of note: P.S. Apparently I also wrote a Rust version of this post a while back? https://matklad.github.io/2018/06/04/newtype-index-pattern.html pointers encourage recursive functions, and recursive data structures lead to arbitrary long (but finite) chains of pointers. As usual with indexes, you start with defining the collective noun first, a rather than a . In my experience, you usually don’t want suffix in your index types, so is just , not the underlying data. Nested types are good! feels just right. For readability, the order is fields, then nested types, then functions. In , we have a couple of symbolic constants. is for the root node that is stored first, for whenever we want to apply offensive programing and make bad indexes blow up. Here, we use for “null” parent. An alternative would be to use , but that would waste of space, or making the root its own parent. If you care about performance, its a good idea to sizes of structures, not to prevent changes, but as a comment that explains to the reader just how the large the struct is. I don’t know if I like or more for representing ranges, but I use the former just because the names align in length. Both and are reasonable shapes for the API. I don’t know which one I prefer more. I default to the former because it works even if there are several node arguments.

0 views
matklad 5 months ago

TigerBeetle Blog

Continuing the tradition , I’ve been also blogging somewhat regularly on TigerBeetle’s blog, so you might want to check those articles out or even subscribe (my favorite RSS reader is RSSSSR ): Today’s post is a video version of Notes on Paxos ! https://tigerbeetle.com/blog/ https://tigerbeetle.com/blog/atom.xml

0 views
matklad 6 months ago

Readonly Characters Are a Big Deal

I like Emacs UX as exemplified by Magit . I consider it to be a user interface paradigm on the same footing as UNIX pipes: An Engine for an Editor Pipes give you 1D read-only streams of characters which excel at batch processing. Emacs is all about interactive mutable 2D buffers of attributed text. Today I realized that an important feature of Emacs text buffers is read-only characters ( manual ). Like in any editor, you can mark an entire Emacs buffer as read-only. But you also can mark individual substrings read-only, so that you can edit anywhere except specific ranges This is a useful feature for bidirectional interaction. Consider the in-editor terminal I am using currently, which looks like this: The first line is the command to execute, this is typed by me manually, and then I hit a “submit” shortcut to actually run the command. Then goes the status line, which shows how long the command has been running so far and the exit code (when the command terminates). The status line is determined by the “terminal” itself. Finally, there’s output of the command itself, updated live. In this sort of the interface, command is modifiable by the user, but is read-only for the editor. Status is the opposite — the editor updates it every second, but the user should be prevented from touching it. And the output can be CRDT-style edited by both parties (I often find it useful to edit the output in place before pasting it elsewhere). Sadly, in VS Code I can’t prevent the user from editing the status, so my implementation is a bit janky, and this, I think, goes to the core of why I don’t see VS Code as a great platform for the kind of interactive tools I want to write. Read-only ranges are hard to implement! Text editing hates you as is, but this feature requires tracking text attributes in an intelligent way under modifications (see sticky properties ), and also feeds back into modifications themselves! No wonder Monaco, the editor engine underlying VS Code, lost this ability at some point. Still, I feel like “does it support sticky read-only attribute?” is a good litmus test to check if an editor can support interactive applications a-la Magit seamlessly.

0 views
matklad 6 months ago

On Async Mutexes

A short note on contradiction or confusion in my language design beliefs I noticed today. One of the touted benefits of concurrent programming multiplexed over a single thread is that mutexes become unnecessary. With only one function executing at any given moment in time data races are impossible. The standard counter to this argument is that mutual exclusion is a property of the logic itself, not of the runtime. If a certain snippet of code must be executed atomically with respect to everything else that is concurrent, then it must be annotated as such in the source code. You can still introduce logical races by accidentally adding an in the middle of the code that should be atomic. And, while programming, you are adding new s all the time! This argument makes sense to me, as well its as logical conclusion. Given that you want to annotate atomic segments of code anyway, it makes sense to go all the way to Kotlin-style explicit async implicit await. The contradiction I realized today is that for the past few years I’ve been working on a system built around implicit exclusion provided by a single thread — TigerBeetle! Consider compaction, a code that is responsible for rewriting data on disk to make it smaller without changing its logical contents. During compaction, TigerBeetle schedules a lot of concurrent disk reads, disk writes, and CPU-side merges. Here’s an average callback: This is the code ( source ) that runs when a disk read finishes, and it mutates — shared state across all outstanding IO. It’s imperative that no other IO completion mutates compaction concurrently, especially inside that monster of a function . Applying “make exclusion explicit” rule to the code would mean that the entire needs to be wrapped in a mutex, and every callback needs to start with lock/unlock pair. And there’s much more to TigerBeetle than just compaction! While some pairs of callbacks probably can execute concurrently relatively to each other, this changes over time. For example, once we start overlapping compaction and execution, those will be using our GridCache (buffer manager) at the same time. So explicit locking probably gravitates towards having just a single global lock around the entire state, which is acquired for the duration of any callback. At which point, it makes sense to push lock acquisition up to the event loop, and we are back to the implicit locking API! This seems to be another case of two paradigms for structuring concurrent programs . The async/await discussion usually presupposes CSP programming style, where you define a set of concurrent threads of execution, and the threads are mostly independent, sharing a little of data. TigerBeetle is written in a state machine/actor style, where the focal point is the large amount of shared state, which is evolving in discrete steps in reaction to IO events (there’s only one “actor” in TigerBeetle). Additionally, TigerBeetle uses manual callbacks instead of async/await syntax, so inserting an in the middle of critical section doesn’t really happen. Any new concurrency requires introducing an explicit named continuation function, and each continuation (callback) generally starts with a bunch of assertions to pin down the current state and make sure that the ground hasn’t shifted too far since the IO was originally scheduled. Or, as is the case with , sometimes the callback doesn’t assume anything at all about the state of the world and instead carries out an exhaustive case analysis from scratch.

0 views
matklad 8 months ago

Look Out For Bugs

One of my biggest mid-career shifts in how I write code was internalizing the idea from this post: Don’t Write Bugs Historically, I approached coding with an iteration-focused mindset — you write a draft version of a program, you set up some kind of a test to verify that it does what you want it to do, and then you just quickly iterate on your draft until the result passes all the checks. This was a great approach when I was only learning to code, as it allowed me to iterate past the things which were not relevant for me at that point, and focus on what matters. Who cares if it is or in the “паблик статик войд мэйн стринг а-эр-джи-эс”, it’s just some obscure magic spell anyway, and completely irrelevant to the maze-traversing thingy I am working on! Carrying over this approach past the learning phase was a mistake. As Lawrence points out, while you can spend time chasing bugs in the freshly written code, it is possible to dramatically cut the amount of bugs you introduce in the first place, if you focus on optimizing that (and not just the iteration time). It felt (and still feels) like a superpower! But there’s already a perfectly fine article about not making bugs, so I am not going to duplicate it. Instead, I want to share a related, but different super power: You can find bugs by just reading code. I remember feeling this superpower for the first time. I was investigating various rope implementations, and, as a part of that, I looked at the , the implementation powering IntelliJ, very old and battle tested code. And, by just reading the code, I found a bug, since fixed . It wasn’t hard, the original code is just 500 lines of verbose Java (yup, that’s all that you need for a production rope). And I wasn’t even trying to find a bug, it just sort-of jumped out at me while I was trying to understand how the code works. That is, you can find some existing piece of software, carefully skim through implementation, and discover real problems that can be fixed. You can do this to your software as well! By just re-reading a module you wrote last year, you might find subtle problems. I regularly discover TigerBeetle issues by just covering this or that topic on IronBeetle : bug discovered live , fixed , and PR merged . Here are some tips for getting better at this: The key is careful, slow reading. What you actually are doing is building the mental model of a program inside your head. Reading the source code is just an instrument for achieving that goal. I can’t emphasize this enough: programming is all about building a precise understanding inside your mind, and then looking for the diff between your brain and what’s in git. Don’t dodge an opportunity to read more of the code. If you are reviewing a PR, don’t review just the diff, review the entire subsystem. When writing code, don’t hesitate to stop and to probe and feel the context around. Go for or to understand the historical “why” of the code. When reading, mostly ignore the textual order, don’t just read each source file top-down. Instead, use these two other frames: Start at or subsystem equivalent, and use “goto definition” to follow an imaginary program counter. Identify the key data structures and fields, and search for all places where they are created and modified. You want to see a slice across space and time, state and control flow (c.f. Concurrent Expression Problem ). Just earlier today I used the second trick to debug an issue for which I haven’t got a repro. I identified as the key assignment that was recently introduced, then ctrl + f for , and that immediately revealed a gap in my mental model. Note how this was helped by the fact that the thing in question, , was always called that in the source code! If your language allows it, avoid , use proper names. Identify and collect specific error-prone patterns or general smells in the code. In Zig, if there’s an allocator and a in the same scope, you need to be very careful . If there’s an isolated tricky function, it’s probably fine. If there’s a tricky interaction between functions, it is a smell, and some bugs are lurking there. Bottom line: reading the code is surprisingly efficient at proactively revealing problems. Create space for calm reading. When reading, find ways to build mental models quickly, this is not entirely trivial.

0 views
matklad 8 months ago

Vibe Coding Terminal Editor

I “wrote” a small tool for myself as my biannual routine check of where llms are currently at. I think I’ve learned a bunch from this exercise. This is frustrating! I don’t want to learn by trial and error, I’d rather read someone’s blog post with lessons learned. Sadly, most of the writing on the topic that percolates to me tends to be high-level — easy to nod along while reading, but hard to extract actionable lessons. So this is what I want to do here, list specific tricks learned. Let me quickly introduce the project. It’s a VS Code extension that allows me to run “shell” inside my normal editor widget, such that the output is normal text buffer where all standard motion/editing commands work. So I can “goto definition” on paths printed as a part of backtrace, use multiple cursors to copy compiler’s suggestions, or just PageUp / PageDown to scroll the output. If you are familiar with Emacs, it’s Eshell , just worse: I now use to launch most of my compilation commands, as it has several niceties on top of what my normal shell provides. For example, by default only the last 50 lines of output are shown, but I can hit tab to fold and unfold full output. Such a simple feature, but such a pain to implement in a UNIX shell/terminal! What follows is an unstructured bag of things learned: I originally tried to use code normally, by iteratively prompting in the terminal until I get the output I want. This was frustrating, as it was too easy to miss a good place to commit a chunk of work, or to rein in a conversation going astray. This “prompting-then-waiting” mode also had a pattern of mental context switches not matching my preferred style of work. This article suggests a better workflow: https://harper.blog/2025/05/08/basic-claude-code/ Instead of writing your single prompt in the terminal, you write an entire course of action as a task list in document, and the actual prompt is then something along the lines of Read @plan.md, complete the next task, and mark it with . After finishes iterating on a step you look at the diff and interactively prompt for necessary corrections. When you are happy, and the conversation, to start the next step from the clean slate. The plan pattern reduces context switches, because it allows you to plan several steps ahead, while you are in the planning mode, even if it makes sense to do the work one step at a time. I often also work on continuing the plan when is working on the current task. A brilliant metaphor from another post https://crawshaw.io/blog/programming-with-agents is that prompting LLM for some coding task and then expecting it to one-shot a working solution is quite a bit like asking a candidate to whiteboard an algorithm during the interview. LLMs are clearly superhuman at whiteboarding, but you can’t go far without feedback. “Agentic” programming like allows LLMs to iterate on solution. LLMs are much better at whiteboarding than at iterating. My experience is that, starting with suboptimal solution, LLM generally can’t improve it by itself along the fuzzy aesthetic metrics I care about. They can make valid changes, but the overall quality stays roughly the same. However, LLMs are tenacious, and can do a lot of iterations. If you do have a value function, you can use it to extract useful work from random walk! A bad value function is human judgement. Sitting in the loop with LLM and pointing out mistakes is both frustrating and slow (you are the bottleneck). In contrast “make this test green” is very efficient at getting working (≠ good) code. LLMs are good at “closing the loop”, they can make the ends meet. This insight combined with the pattern gives my current workflow — spec ↔ code ↔ test loop. Here’s the story: I coded the first version of using just the pattern, but at some point I hit complexity wall. I realized that my original implementation strategy for syntax highlighting was a dead end, and I needed to change it, but that was hard to do without making a complete mess of the code. The accumulated reflected a bunch of historical detours, and the tests were too brittle and coupled to the existing implementation (more on tests later). This worked for incremental additions, but now I wanted to change something in the middle. I realized that what I want is not an append-only that reflects history, but rather a mutable that describes clearly how the software should behave. For normal engineering, this would have been “damn, I guess I need to throw one out and start afresh” moment. With , I added and all the code to the context and asked it to write file in the same task list format. There are two insights here: First , mutable spec is a good way to instruct LLM. When I want to apply a change to now, I prompt to update the spec first (unchecking any items that need re-doing), manually review/touch-up the spec, and use a canned prompt to align the code and tests with the spec. Second , that you can think of an LLM as a machine translation, which can automatically convert between working code, specification, and tests. You can treat any of those things as an input, as if you are coding in miniKanren ! I did have this idea of closing the loop when I started with , so I crafted the prompts to emphasize testing. You can guess the result! wrote a lot of tests, following all the modern “best practices” — a deluge of unit tests that were just needlessly nailing down internal API, a jungle of bug-hiding mocks, and a bunch of unfocused integration tests which were slow, flaky, and contained a copious amount of sleeps to paper over synchronization bugs. Really, this was eerily similar to a typical test suite you can find in the wild. I am wondering why is that? This is perhaps my main take away: if I am vibe-coding anything again, and I want to maintain it and not just one-shot it, I will think very hard about the testing strategy. Really, to tout my own horn, I think that perhaps How to Test? is the best article out there about agentic coding. Test iteration is a multiplier for humans, but a hard requirement for LLMs. Test must be very fast, non-flaky, and should end-to-end test application features , rather than code. Concretely, I just completely wiped out all the existing tests. Then I added testing strategy to the spec. There are two functions: The function waits for all outstanding async work (like external processes) to finish. This requires properly threading causality throughout the code. E.g., there’s a promise you can on to join currently running process. The function captures the entire state of the extension as a single string. There’s just one mock for the clock (another improvement on the usual terminal — process runtime is always show). Then, I prompted with something along the lines of Oups, looks like someone wiped out all the tests here, but the code and the spec look decent, could you re-create the test suite using function as per @spec.md? It worked. Again, “throw one away” is very cheap. That’s it! LLMs obviously can code. You need to hold them right. In particular, you need to engineer a feedback loop to let LLM iterate at its own pace. You don’t want human in the “data plane” of the loop, only in the control plane. Learn to architecture for testing . LLM drastically reduce the activation energy for writing custom tools. I wanted something like forever, but it was never the most attractive yak to shave. Well, now I have the thing, I use it daily. LLMs don’t magically solve all software engineering problems. The biggest time sink with was solving the problem, but LLMs are not yet at the “give me UNIX, but without mess” stage. LLMs don’t solve maintenance. A while ago I wrote about LSP for jj . I think I can actually code that up in a day with Claude now? Not the proof of concept, the production version with everything I would need. But I don’t want to maintain that. I don’t want to context switch to fix a minor bug, if I am the only one using the tool. And, well, if I make this for other people, I’d definitely be on the hook for maintaining it :D

0 views
matklad 8 months ago

Ads Are a Positional Good

Non-technical armchair economics post, where I explain my pet theory for why everything on and outside of the internet is absolutely infested with ads. The traditional explanation is that any paid service will lose out to a “free” service funded by advertising. I think there’s some truth to it! Another explanation is transaction overhead. You basically can’t pay 1 cent for something, as just the annoyance of using the debit card is greater than the value. Contrast with the ease of dismissing a cookie consent banner! I think this is also important, but, at best, a part of the story. The simple fact is that even paying for things doesn’t rid you of those pesky attention vampires! When I go to the cinema, I buy the ticket, but still have to sit through ten minutes of advertisement. Similarly, I pay for subscription to economist.com, and I still see a lot of ads if I disable my ad-blocker. I think there’s a more fundamental force in play here, which is, you guessed it: ads are a positional good. Let me unpack! Positional Good is a term from economics. To explain it, let’s start with a normal good, like bread. You want bread because you derive value from it, you avoid starvation. So you are willing to buy it, and the price you are willing to pay is proportional to the value you personally derive from it. In general, you want to pay only so much for bread, because your hunger is finite. In contrast, the value of a positional good is indirect. It is how much less of the good the others have. What matters is not the absolute amount of good you have, but your relative position among other actors. A standard (though somewhat muddled) example is a University degree. There’s certain intrinsic value in having a degree from a more prestigious University, even if we keep the actual level of knowledge attained fixed. “Better” degree positions you higher relative to other job applicants. So prospective students compete for a limited number of seats not so much with the underlying hard granite of science, but rather with each other, and only the brightest go to the best University (which creates a feedback loop, but this is the other mechanism, not covered by the article). I think it is the same for ads! If you are in a supermarket and want to buy a soft drink, your decision (simplifying, of course!) is based on how much craving you have and how much do you value your craving in euros. If you are particularly thirsty, you might even buy a larger bottle, but unlikely more than one. Now imagine that you are a soft drink company, and you are considering an ad slot before a movie. How much should you pay? Well, it’s easy — just a little bit more than your rival company! Your ads aren’t going to meaningfully affect the amount of thirst people have, but they certainly can nudge the decision of which soft drink to buy. And, given that the other company is locked into the same logic, you are going to spend basically as much as you can afford. And that is my mental model. While advertising creates some value directly, by informing customers of their options, I think that the bulk of ads is pure competitive value destruction, which redirects surplus value created by productive economic activity to companies hosting advertisement battlegrounds, with human attention being merely a collateral damage.

0 views
matklad 8 months ago

Links

If you have a blog, consider adding a “links” page to it, which references resources that you find notable: https://matklad.github.io/links.html I’ve started my links page several years ago, mostly because I found myself referring to the same few links repeatedly in various discussions, and not all the links were easily searchable. Note that the suggestion is different from more typical “monthly links roundup”, which is nice to maintain Substack engagement/community, but doesn’t contribute to long-term knowledge distilling. It is also different from the exhaustive list of everything I’ve read on the Internet. It is relatively short, considering its age.

0 views