Posts in Programming (20 found)

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei
SOSP'25

I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example.

Fig. 1 illustrates the design space for key-value stores:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

One axis is preemptive vs non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread per stage architecture.

Pipelining Advantages

A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples:

- As we've seen in other networking papers, a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC, where they are then consumed by software.
- Key-value stores frequently have hot tuples, and there are advantages to caching these (example here).
It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both.

The key disadvantage of a TPS architecture is load balancing. One stage could become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because data associated with each request flows down the pipeline with the request itself.

Fig. 3 shows the pipeline proposed in this paper:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implements the cache-resident layer, and a different set of threads (pinned to other CPU cores) implements the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment.

The NIC writes request packets into a single queue. The cache-resident threads cooperatively read requests from this queue: if there are N threads in the pool, each thread deterministically claims a disjoint subset of the requests (e.g., by request sequence number modulo N). Next, threads check to see if the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change. This enables fast lookups without costly synchronization between threads.
A background thread gathers statistics to determine the set of items to be cached in the next epoch, and has the ability to atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that it is highly likely that hot keys will be stored in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue).

Typically, the LLC is treated like a global resource (shared by all cores). But this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses "Intel(R) Resource Director Technology" to control which ways of the LLC are assigned to each layer.

The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing requests, data values are copied directly into the outgoing network buffer.

The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy (each CR thread cycles through the MR threads as destinations). Dequeue operations must potentially scan queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead.

Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations); uTPS-T is this work:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

Dangling Pointers

The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches.
For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this.


Go proposal: Secret mode

Part of the Accepted! series, explaining the upcoming Go changes in simple terms.

Automatically erase used memory to prevent secret leaks. Ver. 1.26 • Stdlib • Low impact

The new package lets you run a function in secret mode. After the function finishes, it immediately erases (zeroes out) the registers and stack it used. Heap allocations made by the function are erased as soon as the garbage collector decides they are no longer reachable. This helps make sure sensitive information doesn't stay in memory longer than needed, lowering the risk of attackers getting to it. The package is experimental and is mainly for developers of cryptographic libraries, not for application developers.

Cryptographic protocols like WireGuard or TLS have a property called "forward secrecy". This means that even if an attacker gains access to long-term secrets (like a private key in TLS), they shouldn't be able to decrypt past communication sessions. To make this work, session keys (used to encrypt and decrypt data during a specific communication session) need to be erased from memory after they're used. If there's no reliable way to clear this memory, the keys could stay there indefinitely, which would break forward secrecy.

In Go, the runtime manages memory, and it doesn't guarantee when or how memory is cleared. Sensitive data might remain in heap allocations or stack frames, potentially exposed in core dumps or through memory attacks. Developers often have to use unreliable "hacks" with reflection to try to zero out internal buffers in cryptographic libraries. Even so, some data might still stay in memory where the developer can't reach or control it. The solution is to provide a runtime mechanism that automatically erases all temporary storage used during sensitive operations. This will make it easier for library developers to write secure code without using workarounds.
Add the package with its functions. The current implementation has several limitations:

- Only supported on linux/amd64 and linux/arm64. On unsupported platforms, the function is invoked directly, without the extra protection.
- Protection does not cover any global variables that the function writes to.
- Trying to start a goroutine inside the protected function causes a panic.
- If the function's exit runs deferred functions, erasure is delayed until they have all executed.
- Heap allocations are only erased if ➊ the program drops all references to them, and ➋ the garbage collector then notices that those references are gone. The program controls the first part, but the second part depends on when the runtime decides to act.
- If the function panics, the panicked value might reference memory allocated inside it. That memory won't be erased until (at least) the panicked value is no longer reachable.
- Pointer addresses might leak into data buffers that the runtime uses for garbage collection. Do not put confidential information into pointers.

The last point might not be immediately obvious, so here's an example. If an offset in an array is itself secret (you have an array and the secret key always starts at some secret offset), don't create a pointer to that location. Otherwise, the garbage collector might store this pointer, since it needs to know about all active pointers to do its job. If someone launches an attack to access the GC's memory, your secret offset could be exposed.

The package is mainly for developers who work on cryptographic libraries. Most apps should use higher-level libraries that use it behind the scenes. As of Go 1.26, the package is experimental and can be enabled with a build-time setting.

As a usage example: generate a session key and encrypt a message using AES-GCM. Note that secret mode protects not just the raw key, but also the cipher structure (which contains the expanded key schedule) created inside the function. This is a simplified example, of course: it only shows how memory erasure works, not a full cryptographic exchange. In real situations, the key needs to be shared securely with the receiver (for example, through key exchange) so decryption can work.

𝗣 21865 • 𝗖𝗟 704615 • 👥 Daniel Morsing, Dave Anderson, Filippo Valsorda, Jason A. Donenfeld, Keith Randall, Russ Cox


Has the cost of building software just dropped 90%?

Agentic coding tools are dramatically reducing software development costs. Here's why 2026 is going to catch a lot of people off guard.

pkh.me 3 days ago

A series of tricks and techniques I learned doing tiny GLSL demos

In the past two months or so, I spent some time making tiny GLSL demos. I wrote an article about the first one, Red Alp. There, I went into details about the whole process, so I recommend checking it out first if you're not familiar with the field. We will look at 4 demos: Moonlight, Entrance 3, Archipelago, and Cutie. But this time, for each demo, we're going to cover one or two things I learned from it. It won't be a deep dive into every aspect because it would be extremely redundant. Instead, I'll take you along a journey of learning experiences.

See it on its official page, or play with the code on its Shadertoy port. In Red Alp, I used volumetric raymarching to go through the clouds and fog, and it took quite a significant part of the code to make the absorption and emission convincing. But there is an alternative technique that is surprisingly simpler. In the raymarching loop, the color contribution at each iteration becomes 1/d or c/d, where d is the distance to the scene at the current ray position, and c an optional color tint if you don't want to work in grayscale. Some variants exist, for example 1/d^2, but we'll focus on 1/d. Let's see how it looks in practice with a simple cube raymarch where we use this peculiar contribution: The signed distance function of the cube is from the classic Inigo Quilez page. For the rotation you can refer to Xor's or Blackle's articles. For the general understanding of the code, see my previous article on Red Alp.

The first time I saw it, I wondered whether it was a creative take, or if it was backed by physical properties. Let's simplify the problem with the following figure: The glowing object sends photons that spread all around it. The further we go from the object, the more spread out these photons are, basically following the inverse square law 1/r^2, which gives the photon density, where r is the distance to the target object.
Let's say we send a ray and want to know how many photons are present along the whole path. We have to "sum", or rather integrate, all these photons density measures along the ray. Since we are doing a discrete sampling (the dots on the figure), we need to interpolate the photons density between each sampling point as well. Given two arbitrary sampling points and their corresponding distance d_n and d_{n+1} , any intermediate distance can be linearly interpolated with r=\mathrm{mix}(d_n,d_{n+1},t) where t is within [0,1] . Applying the inverse square law from before ( 1/r^2 ), the integrated photons density between these 2 points can be expressed with this formula (in t range): t being normalized, the \Delta t is here to covers the actual segment distance. With the help of Sympy we can do the integration: So the result of this integration is: Now the key is that in the loop, \Delta t stepping is actually d_{n+1} , so we end up with: And we find back our mysterious 1/d . It's "physically correct", assuming vacuum space. Of course, reality is more complex, and we don't even need to stick to that formula, but it was nice figuring out that this simple fraction is a fairly good model of reality. In the cube example we didn't go through the object, using . But if we were to add some transparency, we could have used instead, where could be interpreted as absorption and the pass-through, or transparency. I first saw this formula mentioned in Xor article on volumetric . To understand it a bit better, here is my intuitive take: the causes a potential penetration into the solid at the next iteration, which wouldn't happen otherwise (or only very marginally). When inside the solid, the causes the ray to continue further (by the amount of the distance to the closest edge). Then the multiplication by makes sure we don't penetrate too fast into it; it's the absorption, or "damping". This is basically the technique I used in Moonlight to avoid the complex absorption/emission code. 
See it on its official page, or play with the code on its Shadertoy port. This demo was probably one of the most challenging, but I'm pretty happy with its atmospheric vibe. It's kind of different from the usual demos of this size. I initially tried with some voxels, but I couldn't make it work with the light under 512 characters (the initialization code was too large, not the branchless DDA stepping). It also had annoying limitations (typically the animation was unit bound), so I fell back to a classic raymarching.

The first thing I did differently was to use an L-∞ norm instead of an euclidean norm for the distance function: every solid is a cube, so it's appropriate to use simpler formulas. For the light, it's not an illusion, it's an actual light: after the first raymarch to a solid, the ray direction is reoriented toward the light and the march runs again (implemented as a macro). Hitting a solid or not defines whether the fragment should be lit up or not.

A bad surprise of this demo was uncovering two driver bugs on mobile: one with tricky for-loop compounds on Snapdragon/Adreno (because I was trying hard to avoid macros and functions), and one with chained assignments on Imagination/PowerVR (typically affecting the Google Pixel 10 Pro). The first was worked around with a macro (actually saving 3 characters in the process), but the 2nd one had to be unpacked and made me lose 2 characters.

Another thing I studied was how to set up the camera in a non-perspective isometric or dimetric view. I couldn't make sense of the maths from the Wikipedia page (it just didn't work), but Sympy rescued me again: inspecting the matrices and factoring out the common terms, we obtain the transform matrices. The ray direction is common to all fragments, so we use the central UV coordinate (0,0) as reference point. We push it forward for convenience: (0,0,1), and transform it with our matrix. This gives the central screen coordinate in world space. Since the obtained point coordinate is relative to the world origin, to go from that point to the origin, we just have to flip its sign. That gives us the ray direction formula.

To get the ray origin of every other pixel, the remaining question is: what is the smallest distance we need to step back the screen coordinates such that, when applying the transformation, the view wouldn't clip into the ground at y=0? This requirement can be modeled as an inequality, the -1 being the lowest y-screen coordinate (which we don't want into the ground). The lazy bum in me just asks Sympy to solve it for me: we get z>\sqrt{2} for isometric, and z>\sqrt{3} for dimetric. With an arbitrary scale of the coordinate, plus an arbitrary small value to make sure the y-coordinate ends up above 0, we end up with the final setup. In Entrance 3, I used a rough approximation of the isometric setup.

See it on its official page, or play with the code on its Shadertoy port. For this infinite procedurally generated Japan, I wanted to break with my red/orange obsession. Technically speaking, it's actually fairly basic if you're familiar with Red Alp. I used the same noise for the mountains/islands, but the water uses a different noise. The particularity of the per-octave noise curve is that it shifts the coordinate by its own derivative. This is some form of domain warping that gives the nice effect here. I'm really referring to the x-axis position here: it is not needed to work with the z-component (xz forms the flat plane), because each octave of the fbm has a rotation that "mixes" both axes, so the z contribution is actually baked in. I didn't come up with the formula; I first found it in this video by Acerola. I don't know if he's the original author, but I've seen the formula replicated in various places.

See it on its official page, or play with the code on its Shadertoy port. Here I got cocky and thought I could manage to fit it in 512 chars. I failed, by 90 characters. I did use the smoothmin operator for the first time: every limb of Cutie's body is composed of two spheres creating a rounded cone (two spheres of different sizes smoothly merged like metaballs). Then I used simple inverse kinematics (IK) for the animation. Using leg parts with a size of 1 helped simplify the formula and make it shorter.

You may be wondering about the smooth visuals themselves: I didn't use the depth map but simply the number of iterations. Due to the nature of the raymarching algorithm, when a ray passes close to a shape, it slows down significantly, increasing the number of iterations. This is super useful because it exaggerates the contours of the shapes naturally. The count is wrapped into an exponential, but it defines the output color directly.

I will continue making more of those, keeping my artistic ambition low because of the 512 characters constraint I'm imposing on myself. Many people called me out on this one, but there are actually many arguments for it:

- A tiny demo has to focus on one or two very scoped aspects of computer graphics, which makes it perfect as a learning support.
- It's part of the artistic performance: it's not just techniques and visuals, the wizardry of the code is part of why it's so impressive. We're in an era of visuals, people have been fed with the craziest VFX ever. But have they seen them done with a few hundred bytes of code?
- The constraint helps me finish the work: when making art, there is always this question of when to stop. Here there is an intractable point where I just cannot do more and I have to move on. Similarly, it prevents my ambition from tricking me into some colossal project I will never finish or even start.
- That format has a ton of limitations, and that's its strength. Working on such a tiny piece of code for days/weeks just brings me joy. I do feel like a craftsperson, spending an unreasonable amount of time perfecting it, for the beauty of it.
- I'm trying to build a portfolio, and it's important for me to keep it consistent. If the size limit was different, I would have done things differently, so I can't change it now.

Why 512 in particular? It happens to be the size of a toot on my Mastodon instance so I can fit the code there, and I found it to be a good balance. If I had hundreds more characters, Red Alp might have had birds, the sky opening to cast a beam of light on the mountains, etc.


AMD GPU Debugger

I've always wondered why we don't have a GPU debugger similar to the one used for CPUs: a tool that allows pausing execution and examining the current state. This capability feels essential, especially since the GPU's concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD's ROCm environment. Unfortunately, its scope is limited to that environment. Still, this shows it's technically possible. I then found a helpful series of blog posts by Marcell Kiss, detailing how he achieved this, which inspired me to try to recreate the process myself.

The best place to start learning about this is RADV. By tracing what it does, we can find how to do it. Our goal here is to run the most basic shader without using Vulkan, aka RADV in our case.

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open("/dev/dri/cardX"). Tracing further, we find that RADV calls into libdrm, a library that acts as middleware between user-mode drivers (UMD) like RADV and kernel-mode drivers (KMD) like the amdgpu driver. When we try to do some actual work, we have to create a context, which can be achieved by calling into libdrm again. Next up, we need to allocate 2 buffers: one of them for our code, and the other for writing our commands into. We do this by calling a couple of functions; here's how I do it:

Here we're choosing the domain and assigning flags based on the params; some buffers we will need uncached, as we will see.

Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped, for ease of use. We have to map the memory to both the GPU and the CPU virtual address space. The KMD creates the page table when we open the DRM file, as shown here. So map it to the GPU VM and, if possible, to the CPU VM as well.
Here, at this point, there's a libdrm function that does all of this setup for us and maps the memory, but I found that even when specifying the uncached flag, it doesn't always tag the page as uncached. I'm not quite sure if it's a bug in my code or something in libdrm; anyways, I opted to do it manually here and issue the IOCTL call myself:

Now we have the context and 2 buffers. Next, fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing.

Let's compile our code. We can use the clang assembler for that, like this: The bash script compiles the code, and then we're only interested in the actual machine code, so we use objdump to figure out the offset and the size of the section and copy it to a new file called asmc.bin. Then we can just load the file and write its bytes to the CPU-mapped address of the code buffer.

Next up, filling in the commands. This was extremely confusing for me because it's not well documented. It was mostly learning how RADV does things and trying to do similar things. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck.

The commands are encoded in a special format called PM4, which has multiple packet types. We only care about type-3 packets: each packet has an opcode and the number of bytes it contains. The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are the rsrc registers, which are responsible for a number of configurations; pgm_[lo/hi], which hold the pointer to the code buffer; and the registers responsible for the number of threads inside a work group.
All of those are set using register-write packets. It's worth mentioning that we can set multiple registers in 1 packet if they're consecutive. Then we append the dispatch command. Now we want to write those commands into our buffer and send them to the KMD. Here is a good point to make a more complex shader that outputs something, for example writing 1 to a buffer.

No GPU hangs?! Nothing happened?! Cool, cool. Now we have a shader that runs on the GPU; what's next? Let's try to hang the GPU by pausing the execution, aka make the GPU trap. The RDNA3 ISA manual does mention 2 registers, TBA and TMA; here's how it describes them respectively:

TBA: Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit [62] is replicated into address bit [63]). Accessed via S_SENDMSG_RTN.

TMA: Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.

You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual. We know from Marcell Kiss's blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when encountering a trap. The TBA register has a special bit that indicates whether the trap handler is enabled. Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can utilize the debugfs interface. Luckily, we have UMR, which uses that debugfs interface, and it's open source; we copy AMD's homework here, which is great. The amdgpu KMD has a couple of files in debugfs; one of them is a register-access file, which is an interface to a helper in the kernel that writes to the registers.
It works by simply opening the file, seeking to the register's offset, and then writing; it also performs some synchronisation and writes the value correctly. We need to provide more parameters about the register before writing to the file, tho, and we do that by using an ioctl call. Here are the ioctl arguments:

There are 2 structs because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation. Turns out our registers here are SRBM registers, banked by VMIDs, meaning each VMID has its own TBA and TMA registers.

Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate a virtual memory address. The context is created when we open the DRM file. VMIDs get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch. We can obtain the VMID of the dispatched process by querying the corresponding register with s_getreg_b32. I do a hack here, by enabling the trap handler in every VMID. There are 16 of them, the first being special and used by the KMD, and the last 8 allocated to the amdkfd driver. We loop over the remaining VMIDs and write to those registers. This can cause issues for other processes using other VMIDs, by enabling trap handlers in them and writing the virtual address of our trap handler, which is only valid within our virtual memory address space. It's relatively safe tho, since most other processes won't cause a trap[^1].

[^1]: Other processes need to have a s_trap instruction or have trap-on-exception flags set, which is not true for most normal GPU processes.

Now we can write to TMA and TBA, here's the code: And here's how we write to TBA and TMA: If you noticed, I'm using bitfields. I use them because working with them is much easier than macros, and while the bit layout is not guaranteed by the C spec, it's guaranteed by the System V ABI, which Linux adheres to.

Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader, if we added a s_trap instruction to it or enabled the relevant bit in the rsrc3[^2] register.

[^2]: Available since RDNA3, if I'm not mistaken.

Now, let's try to write a trap handler. If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is nice to make sure it's actually being run. We need 2 things: our trap handler, and some scratch memory to use when needed, whose address we will store in the TMA register. The trap handler is just a normal program running in privileged state, meaning we have access to special registers like TTMP[0-15].

When we enter a trap handler, we need to first ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching, by saving a copy of the registers and the program counter, etc. The problem, tho, is that we don't have a stable ABI for GPUs, or at least not one I'm aware of, and compilers use all the registers they can, so we need to save everything. AMD GPUs' Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders. The problem is they're not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here, since I didn't have luck understanding how it works, and for some other reasons I'll discuss later in the post.
The workaround here is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, allowing us to use more instructions to copy the remaining registers. The main idea is to make use of a store instruction which adds the index of the current thread within the wave to the writing address, aka $$ ID_{thread} * 4 + address $$ This allows us to write a unique value per thread using only TTMP registers, which are unique per wave, not per thread[^3], so we can save the context of a single wave.

[^3]: VGPRs are unique per thread, and SGPRs are unique per wave.

The problem is that if we have more than 1 wave, they will overlap, and we will have a race condition. Here is the code:

Now that we have those values in memory, we need to tell the CPU: hey, we got the data; and pause the GPU's execution until the CPU issues a command. Also, notice we can just modify those values from the CPU. Before we tell the CPU, we need to write some values that might help it. Here they are:

Now the GPU should just wait for the CPU, and here's the spin code, implemented as described by Marcell Kiss here:

The main loop on the CPU side is: enable the trap handler, then dispatch the shader, then wait for the GPU to write some specific value at a specific address to signal all data is there, then examine and display it, and tell the GPU all clear, go ahead. Now that our uncached buffers are in play, we just keep looping and checking whether the GPU has written the register values. When it does, the first thing we do is halt the wave by writing to the appropriate register, to allow us to do whatever with the wave without causing any issues. Tho if we halt for too long, the GPU CP will reset the command queue and kill the process, but we can change that behaviour by adjusting the lockup_timeout parameter of the amdgpu kernel module.

From here on, we can do whatever we want with the data we have: all the data we need to build a proper debugger.
We will come back to what to do with the data in a bit; let's assume we did what was needed for now. Now that we're done on the CPU side, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave, and the trap handler should pick it up. We can resume by writing to the register again: Then the GPU should continue. We need to restore everything and return the program counter to the original address. Depending on whether it's a hardware trap or not, the program counter may point to the instruction before or to the instruction itself. The ISA manual and Marcell Kiss's posts explain that well, so refer to them. Now we can run compiled code directly, but we don't want people to compile their code manually, extract the text section, and hand it to us. The plan is to take SPIR-V code, compile it correctly, then run it, or, even better, integrate with RADV and let RADV give us more information to work with. My main plan was to fork RADV and have it report the Vulkan calls to us, so we'd get a better view of the GPU's work and know the buffers, textures, and so on it's using. That seems like a lot more work, though, so I'll keep it in mind but won't do it for now, unless someone is willing to pay me for it ;). For now, let's just use RADV's compiler . Luckily, RADV has a mode where it does no actual work and opens no DRM files, just a fake Vulkan device, which is perfect for our case here, since we care about nothing other than compiling code. We can enable it by setting the env var , then we just call what we need like this: Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent. Let's see how we can make it an actual debugger.
We talked earlier about CPs natively supporting context-switching; this appears to be a compute-specific feature, which prevents us from implementing it for other types of shaders. That said, it appears that mesh shaders and raytracing shaders are just compute shaders under the hood, which would allow us to use that functionality. For now, debugging one wave feels like enough, and we can also modify the wave parameters to debug specific indices. Here are some of the features. For stepping, we can use 2 bits: one in and the other in . They're and , respectively. The former enters the trap handler after each instruction, and the latter enters before the first instruction. This means we can automatically enable instruction-level stepping. Regarding breakpoints, I haven't implemented them, but they're rather simple to implement here: since we have the base address of the code buffer and know the size of each instruction, we can calculate the program counter locations ahead of time, keep a list of them available to the GPU, and binary-search it in the trap handler. The ACO shader compiler does generate instruction-level source code mapping, which is good enough for our purposes here. By taking the offset[^4] of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info. [^4]: We can get that by subtracting the address of the code buffer from the current program counter. We can implement this by marking the GPU page as protected. On a GPU fault, we enter the trap handler, check whether the address is within the range of our buffers and textures, and then act accordingly. Also, looking at the registers, we can find these: which suggests that the hardware already supports this natively, so we don't even need to do that dance. It needs more investigation on my part, though, since I didn't implement this.
This needs some serious plumbing, since we need to make NIR (Mesa's intermediate representation) optimisation passes propagate debug info correctly. I already started on this here . Then we need to make ACO track variables and store that information. This requires ditching the simple UMD we made earlier and using RADV, which is what should happen eventually. Then our custom driver could pause before a specific frame, or be triggered by a key, and ask before each dispatch whether to attach to it, or something similar. Since we'd have a full, proper Vulkan implementation, we'd already have all the information we need, like buffers, textures, push constants, types, variable names, etc. That would be a much better and more pleasant debugger to use. Finally, here's some live footage: ::youtube{url="https://youtu.be/HDMC9GhaLyc"} Here is some incomplete user-mode page-walking code for gfx11, aka the RX 7900 XTX

0 views
Ivan Sagalaev 4 days ago

Have you accepted AI yet?

Is this platform still massively against AI or has it moved more towards acceptance? — Armin Ronacher If you leave a bad argument from a prominent person unchallenged, it starts looking like accepted common wisdom. Armin Ronacher has enough clout, and his stated position sounds wrong enough to me, to not let it just slide. Here's my argument, for those who still value what I have to say. I'm not directing this at Armin himself, as the tone of his first message should tell you everything about his readiness to have his position challenged. Despite the "have the peons already accepted their fate" tone, some people did try to have an earnest discussion. You can read the whole thread , but here's a few of Armin's replies that caught my eye: Armin, any chance I can convince you to use the term "LLMs" instead of "AI" when you want to talk about LLMs? Or maybe "generative AI" if you think LLM is not flashy enough? AI is an umbrella term that covers a lot of things, some good, some bad. @[email protected] I don’t think it really matters. LLM are a subset of AI. I’m not convinced why being more precise here would matter? Several people did explain why language matters, yet Armin insisted: LLMs can't be generalized as AI @[email protected] LLMs are part of AI, even if you disagree with them for some reason. I'm sure that, as a programmer, Armin understands that "LLMs are part of AI" does not imply they can be generalized as AI. He just wants to paint the entire argument as coming from "disagreement". @[email protected] There are specific concerns and there are abstract fears . It's impossible to work with the latter, it's possible to do a lot with the former. As an example I have a lot of concern about how society is going to deal with AI and that's also something that I'm trying to understand and work in the right way with. But that is a lot more nuance and complex than a policing the use of the word AI which does very little to navigate those complexities. 
(Emphasis mine.) This was the theme throughout the thread: people were arguing that the current LLM situation is its own separate topic, while Armin kept dismissing it as "word policing", "abstract fears" and "watering down the discourse". He never substantiated how he's "trying to understand and work in the right way with" societal problems. The people might have a point here. See, nobody was "massively against" AI when it was called ML and used for image recognition and translating text. Like any technology, it was also abused and heavily criticized by ethicists. But it wasn't until OpenAI launched its polished product, kick-starting the next renaissance insane bubble we're in now, that it started affecting many more people. The adverse effects are widespread and profound: informational pollution , further surrender of privacy , "vibed" code nobody knows how to fix, untold wealth produced for surveillance kings , genocidal sociopaths and fascism enablers , environmental harm and, of course, a giant financial bubble, to name a random few. None of these are abstract; they are measurably, and sometimes painfully, affecting people right now. So it's hardly a surprise they want to talk about that instead of discussing quantization and context length. You can ignore politics only for so long… What gets me personally very easily irritated is the tone of inevitability. Being concerned is bad for your health, so you'd better accept the inevitable reality and only think within a comfortable, fenced area. Fuck that . But to be fair, Armin does say he has "a lot of concern about how society is going to deal with AI", so maybe he's just looking for a place where he could talk specifically about technology? That should be allowed, right? Well, Simon Willison , for one, does just that, without any backlash, and I'm sure there are others.

4 views
Anton Zhiyanov 5 days ago

Gist of Go: Concurrency internals

This is a chapter from my book on Go concurrency , which teaches the topic from the ground up through interactive examples. Here's where we started this book: Functions that run with are called goroutines. The Go runtime juggles these goroutines and distributes them among operating system threads running on CPU cores. Compared to OS threads, goroutines are lightweight, so you can create hundreds or thousands of them. That's generally correct, but it's a little too brief. In this chapter, we'll take a closer look at how goroutines work. We'll still use a simplified model, but it should help you understand how everything fits together. Concurrency • Goroutine scheduler • GOMAXPROCS • Concurrency primitives • Scheduler metrics • Profiling • Tracing • Keep it up At the hardware level, CPU cores are responsible for running parallel tasks. If a processor has 4 cores, it can run 4 instructions at the same time — one on each core. At the operating system level, a thread is the basic unit of execution. There are usually many more threads than CPU cores, so the operating system's scheduler decides which threads to run and which ones to pause. The scheduler keeps switching between threads to make sure each one gets a turn to run on a CPU, instead of waiting in line forever. This is how the operating system handles concurrency. At the Go runtime level, a goroutine is the basic unit of execution. The runtime scheduler runs a fixed number of OS threads, often one per CPU core. There can be many more goroutines than threads, so the scheduler decides which goroutines to run on the available threads and which ones to pause. The scheduler keeps switching between goroutines to make sure each one gets a turn to run on a thread, instead of waiting in line forever. This is how Go handles concurrency. The Go runtime scheduler doesn't decide which threads run on the CPU — that's the operating system scheduler's job. 
The Go runtime makes sure all goroutines run on the threads it manages, but the OS controls how and when those threads actually get CPU time. The scheduler's job is to run M goroutines on N operating system threads, where M can be much larger than N. Here's a simple way to do it: Take goroutines G11-G14 and run them: Goroutine G12 got blocked while reading from the channel. Put it back in the queue and replace it with G15: But there are a few things to keep in mind. Let's say goroutines G11–G14 are running smoothly without getting blocked by mutexes or channels. Does that mean goroutines G15–G20 won't run at all and will just have to wait ( starve ) until one of G11–G14 finally finishes? That would be unfortunate. That's why the scheduler checks each running goroutine roughly every 10 ms to decide if it's time to pause it and put it back in the queue. This approach is called preemptive scheduling: the scheduler can interrupt running goroutines when needed so others have a chance to run too. System calls The scheduler can manage a goroutine while it's running Go code. But what happens if a goroutine makes a system call, like reading from disk? In that case, the scheduler can't take the goroutine off the thread, and there's no way to know how long the system call will take. For example, if goroutines G11–G14 in our example spend a long time in system calls, all worker threads will be blocked, and the program will basically "freeze". To solve this problem, the scheduler starts new threads if the existing ones get blocked in a system call. For example, here's what happens if G11 and G12 make system calls: Here, the scheduler started two new threads, E and F, and assigned goroutines G15 and G16 from the queue to these threads. When G11 and G12 finish their system calls, the scheduler will stop or terminate the extra threads (E and F) and keep running the goroutines on four threads: A-B-C-D. This is a simplified model of how the goroutine scheduler works in Go. 
If you want to learn more, I recommend watching the talk by Dmitry Vyukov, one of the scheduler's developers: Go scheduler: Implementing language with lightweight concurrency ( video , slides ) We said that the scheduler uses N threads to run goroutines. In the Go runtime, the value of N is set by a parameter called . The runtime setting controls the maximum number of operating system threads the Go scheduler can use to execute goroutines concurrently. It defaults to the value of , which is the number of logical CPUs on the machine. Strictly speaking, is either the total number of logical CPUs or the number allowed by the CPU affinity mask, whichever is lower. This can be adjusted by the CPU quota, as explained below. For example, on my 8-core laptop, the default value of is also 8: You can change by setting environment variable or calling : You can also undo the manual changes and go back to the default value set by the runtime. To do this, use the function (Go 1.25+): Go programs often run in containers, like those managed by Docker or Kubernetes. These systems let you limit the CPU resources for a container using a Linux feature called cgroups . A cgroup (control group) in Linux lets you group processes together and control how much CPU, memory, and network I/O they can use by setting limits and priorities. For example, here's how you can limit a Docker container to use only four CPUs: Before version 1.25, the Go runtime didn't consider the CPU quota when setting the value. No matter how you limited CPU resources, was always set to the number of logical CPUs on the host machine: Starting with version 1.25, the Go runtime respects the CPU quota: So, the default value is set to either the number of logical CPUs or the CPU limit enforced by cgroup settings for the process, whichever is lower. Note on CPU limits Cgroups actually offer not just one, but two ways to limit CPU resources: Docker's and / set the quota, while sets the shares. 
Kubernetes' CPU limit sets the quota, while CPU request sets the shares. Go's runtime only takes the CPU quota into account, not the shares. Fractional CPU limits are rounded up: On a machine with multiple CPUs, the minimum default value for is 2, even if the CPU limit is set lower: The Go runtime automatically updates if the CPU limit changes. It happens up to once per second (less frequently if the application is idle). Let's take a quick look at the three main concurrency tools for Go: goroutines, channels, and select. A goroutine is implemented as a pointer to a structure. Here's what it looks like: The structure has many fields, but most of its memory is taken up by the stack, which holds the goroutine's local variables. By default, each stack gets 2 KB of memory, and it grows if needed. Because goroutines use very little memory, they're much more efficient than operating system threads, which usually need about 1 MB each. Their small size lets you run tens (or even hundreds) of thousands of goroutines on a single machine. A channel is implemented as a pointer to a structure. Here's what it looks like: The buffer array ( ) has a fixed size ( , which you can get with the builtin). It's created when you make a buffered channel. The number of items in the channel ( , which you can get with the builtin) increases when you send to the channel and decreases when you receive from it. The builtin sets the field to 1. Sending an item to an unbuffered channel, or to a buffered channel that's already full, puts the goroutine into the queue. Receiving from an empty channel puts the goroutine into the queue. The select logic is implemented in the function. It's a huge function that takes a list of select cases and (very simply put) works as follows: ✎ Exercise: Runtime simulator Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . 
If you are okay with just theory for now, let's continue. Metrics show how the Go runtime is performing, like how much heap memory it uses or how long garbage collection pauses take. Each metric has a unique name (for example, ) and a value, which can be a number or a histogram. We use the package to work with metrics. List all available metrics with descriptions: Get the value of a specific metric: Here are some goroutine-related metrics: In real projects, runtime metrics are usually exported automatically with client libraries for Prometheus, OpenTelemetry, or other observability tools. Here's an example for Prometheus: The exported metrics are then collected by Prometheus, visualized, and used to set up alerts. Profiling helps you understand exactly what the program is doing, what resources it uses, and where in the code this happens. Profiling is often not recommended in production because it's a "heavy" process that can slow things down. But that's not the case with Go. Go's profiler is designed for production use. It uses sampling, so it doesn't track every single operation. Instead, it takes quick snapshots of the runtime every 10 ms and puts them together to give you a full picture. Go supports the following profiles: The easiest way to add a profiler to your app is by using the package. When you import it, it automatically registers HTTP handlers for collecting profiles: Or you can register profiler handlers manually: After that, you can start profiling with a specific profile by running the command with the matching URL, or just open that URL in your browser: For the CPU profile, you can choose how long the profiler runs (the default is 30 seconds). Other profiles are taken instantly. After running the profiler, you'll get a binary file that you can open in the browser using the same utility. For example: The pprof web interface lets you view the same profile in different ways. 
My personal favorites are the flame graph , which clearly shows the call hierarchy and resource usage, and the source view, which shows the exact lines of code. You can also profile manually. To collect a CPU profile, use and : To collect other profiles, use : Profiling is a broad topic, and we've only touched the surface. To learn more, start with these articles: Tracing records certain types of events while the program is running, mainly those related to concurrency and memory: If you enabled the profiling server as described earlier, you can collect a trace using this URL: Trace files can be quite large, so it's better to use a small N value. After tracing is complete, you'll get a binary file that you can open in the browser using the utility: In the trace web interface, you'll see each goroutine's "lifecycle" on its own line. You can zoom in and out of the trace with the W and S keys, and you can click on any event to see more details: You can also collect a trace manually: Flight recording is a tracing technique that collects execution data, such as function calls and memory allocations, within a sliding window that's limited by size or duration. It helps to record traces of interesting program behavior, even if you don't know in advance when it will happen. The type (Go 1.25+) implements a flight recorder in Go. It tracks a moving window over the execution trace produced by the runtime, always containing the most recent trace data. Here's an example of how you might use it. First, configure the sliding window: Then create the recorder and start it: Continue with the application code as usual: Finally, save the trace snapshot to a file when an important event occurs: Use to view the trace in the browser: ✎ Exercise: Comparing blocks Practice is crucial in turning abstract knowledge into skills, making theory alone insufficient. The full version of the book contains a lot of exercises — that's why I recommend getting it . 
If you are okay with just theory for now, let's continue. Now you can see how challenging the Go scheduler's job is. Fortunately, most of the time you don't need to worry about how it works behind the scenes — sticking to goroutines, channels, select, and other synchronization primitives is usually enough. This is the final chapter of my "Gist of Go: Concurrency" book. I invite you to read it — the book is an easy-to-understand, interactive guide to concurrency programming in Go. Pre-order for $10 or read online

- Put all goroutines in a queue.
- Take N goroutines from the queue and run them.
- If a running goroutine gets blocked (for example, waiting to read from a channel or waiting on a mutex), put it back in the queue and run the next goroutine from the queue.

- CPU quota — the maximum CPU time the cgroup may use within some period window.
- CPU shares — relative CPU priorities given to the kernel scheduler.

- Go through the cases and check if the matching channels are ready to send or receive.
- If several cases are ready, choose one at random (to prevent starvation, where some cases are always chosen and others are never chosen).
- Once a case is selected, perform the send or receive operation on the matching channel.
- If there is a default case and no other cases are ready, pick the default.
- If no cases are ready, block the goroutine and add it to the channel queue for each case.

- Count of goroutines created since program start (Go 1.26+).
- Count of live goroutines (created but not finished yet). An increase in this metric may indicate a goroutine leak.
- Approximate count of goroutines running or blocked in a system call or cgo call (Go 1.26+). An increase in this metric may indicate problems with such calls.
- Approximate count of goroutines ready to execute, but not executing (Go 1.26+). An increase in this metric may mean the system is overloaded and the CPU can't keep up with the growing number of goroutines.
- Approximate count of goroutines executing (Go 1.26+). Always less than or equal to .
- Approximate count of goroutines waiting on a resource — I/O or sync primitives (Go 1.26+). An increase in this metric may indicate issues with mutex locks, other synchronization blocks, or I/O issues.
- The current count of live threads that are owned by the runtime (Go 1.26+).
- The current setting — the maximum number of operating system threads the scheduler can use to execute goroutines concurrently.

- CPU . Shows how much CPU time each function uses. Use it to find performance bottlenecks if your program is running slowly because of CPU-heavy tasks.
- Heap . Shows the heap memory currently used by each function. Use it to detect memory leaks or excessive memory usage.
- Allocs . Shows which functions have used heap memory since the profiler started (not just currently). Use it to optimize garbage collection or reduce allocations that impact performance.
- Goroutine . Shows the stack traces of all current goroutines. Use it to get an overview of what the program is doing.
- Block . Shows where goroutines block waiting on synchronization primitives like channels, mutexes and wait groups. Use it to identify synchronization bottlenecks and issues in data exchange between goroutines. Disabled by default.
- Mutex . Shows lock contentions on mutexes and internal runtime locks. Use it to find "problematic" mutexes that goroutines are frequently waiting for. Disabled by default.

- Profiling Go Programs
- Diagnostics

- goroutine creation and state changes;
- system calls;
- garbage collection;
- heap size changes;

1 views
Kaushik Gopal 5 days ago

Combating AI coding atrophy with Rust

It’s no secret that I’ve fully embraced AI for my coding. A valid concern ( and one I’ve been thinking about deeply ) is the atrophying of the part of my brain that helps me code. To push back on that, I’ve been learning Rust on the side for the last few months. I am absolutely loving it. Kotlin remains my go-to language. It’s the language I know like the back of my hand. If someone sends me a swath of Kotlin code, whether handwritten or AI generated, I can quickly grok it and form a strong opinion on how to improve it. But Kotlin is a high-level language that runs on a JVM. There are structural limits to the performance you can eke out of it, and for most of my career 1 I’ve worked with garbage-collected languages. For a change, I wanted a systems-level language, one without the training wheels of a garbage collector. I also wanted a language with a different core philosophy, something that would force me to think in new ways. I picked up Go casually but it didn’t feel like a big enough departure from the languages I already knew. It just felt more useful to ask AI to generate Go code than to learn it myself. With Rust, I could get code translated, but then I’d stare at the generated code and realize I was missing some core concepts and fundamentals. I loved that! The first time I hit a lifetime error, I had no mental model for it. That confusion was exactly what I was looking for. Coming from a GC world, memory management is an afterthought — if it requires any thought at all. Rust really pushes you to think through the ownership and lifespan of your data, every step of the way. In a bizarre way, AI made this gap obvious. It showed me where I didn’t understand things and pointed me toward something worth learning. Here’s some software that’s either built entirely in Rust or uses it in fundamental ways: Many of the most important tools I use daily are built with Rust. Can’t hurt to know the language they’re written in. Rust is quite similar to Kotlin in many ways. 
Both use strict static typing with advanced type inference. Both support null safety and provide compile-time guarantees. The compile-time strictness and higher-level constructs made it fairly easy for me to pick up the basics. Syntactically, it feels very familiar. I started by rewriting a couple of small CLI tools I used to keep in Bash or Go. Even in these tiny programs, the borrow checker forced me to be clear about who owns what and when data goes away. It can be quite the mental workout at times, which is perfect for keeping that atrophy from setting in. After that, I started to graduate to slightly larger programs and small services. There are two main resources I keep coming back to: There are times when the book or course mentions a concept and I want to go deeper. Typically, I’d spend time googling, searching Stack Overflow, finding references, diving into code snippets, and trying to clear up small nuances. But that’s changed dramatically with AI. One of my early aha moments with AI was how easy it made ramping up on code. The same is true for learning a new language like Rust. For example, what’s the difference 2 between these two: Another thing I loved doing is asking AI: what are some idiomatic ways people use these concepts? Here’s a prompt I gave Gemini while learning: Here’s an abbreviated response (the full response was incredibly useful): It’s easy to be doom and gloom about AI in coding — the “we’ll all forget how to program” anxiety is real. But I hope this offers a more hopeful perspective. If you’re an experienced developer worried about skill atrophy, learn a language that forces you to think differently. AI can help you cross that gap faster. Use it as a tutor, not just a code generator. I did a little C/C++ in high school, but nowhere close to proficiency.  ↩︎ Think mutable var to a “shared reference” vs. immutable var to an “exclusive reference”.  
↩︎

- fd (my tool of choice for finding files)
- ripgrep (my tool of choice for searching files)
- Fish shell (my shell of choice, recently rewritten in Rust)
- Zed (my text/code editor of choice)
- Firefox ( my browser of choice)
- Android?! That's right: Rust now powers some of the internals of the OS, including the recent Quick Share feature.

- Fondly referred to as " The Book ". There's also a convenient YouTube series following the book .
- Google's Comprehensive Rust course, presumably created to ramp up their Android team. It even has a dedicated Android chapter .

This worked beautifully for me.

0 views
Chris Coyier 5 days ago

The Jeopardy Phenomenon

There’s the thing where if you’re reading an article in the newspaper, and it’s about stuff you don’t know a ton about, it all seems well and good. Then you read another article in the same paper and it’s about something you know intimately (your job, your neighborhood, your hobby, etc) there is a good chance you’ll be like hey! that’s not quite right! I think of that as the Jeopardy Phenomenon. On the TV game show Jeopardy, if you don’t know the answer to a question, it can feel very much like jeez this quiz show is really hard ! But then if a category or question comes up around a topic you know a bit about, the question (or “answer” in reverse Jeopardy parlance) can feel very basic and simple. Like if the “answer” is about popular fantasy card games, the “question” is not going to be Android Netrunner, it’s going to be Magic: The Gathering. (and you’ll roll your eyes a little bit, because it’s like duh ) I think AI has the Jeopardy Phenomenon too. If you use it to generate code that is outside your expertise, you are likely to think it’s all well and good, especially if it seems to work at first pop. But if you’re intimately familiar with the technology or the code around the code it’s generating, there is a good chance you’ll be like hey! that’s not quite right!

0 views
underlap 5 days ago

Dead code

I’ve just been doing some prefactoring and I ran into an interesting situation with some dead code. Prefactoring is refactoring in preparation for a piece of development work. There are several situations in which prefactoring is useful. Sometimes code you are about to work on is unclear or difficult to understand. If you can see significant improvements, by way of refactoring, which make the code easier to understand, it’s a good idea to put these in place. Other times, the code makes the necessary changes just too difficult. In this case, it is sometimes possible to massage the code to get it into a better shape. While prefactoring, I came across a line of code which looked like it contained a bug. I didn’t want to get distracted, so I added a comment. After I’d finished my current round of prefactoring, I decided to explore the bug. I thought of a test which should fail and went to implement the test. I noticed a similar test – that was already passing – but assumed that test wasn’t provoking the bug. So I went ahead and wrote the new, somewhat more complex test. The new test passed as well! I’m used to the stages of understanding a bug, including “How could the code fail like that?” and “How could the code ever work in the first place?”, so I pressed on and ran the test with logging turned on. Trawling through the logs, I couldn’t see where an incorrect message was being sent. I even added a couple of logs to the testcase to pinpoint the problem: still no progress. I added logging to another function to give me more insight. But it seemed like messages were consistently being sent with the correct second parameter (i.e. not ). What was going on? Finally, I looked again at the context of the supposedly buggy line of code: Huh! I didn’t remember adding that attribute (to suppress a compiler warning about dead code) on the function. But, looking at the file history, I added it in a commit entitled “Tidy up” six months ago. 
I clearly intended this code to become live at some point in the future, but hadn’t remembered to delete it when it was no longer needed. If this wasn’t a solo project, a colleague would probably have noticed the problem. But, for some reason, I was blind to the attribute. I deleted the dead code (!), reverted the new test, and decided to write up my findings here, partly as penance and partly to entertain (and possibly inform) others. It would have been far better not to suppress the dead code compiler warning, so there would have been a permanent reminder of the need to make use of the code or delete it.

Abhinav Sarkar 5 days ago

Solving Advent of Code 2025 in Janet: Day 1–4

I’m solving the Advent of Code 2025 in Janet. After doing the last five years in Haskell, I wanted to learn a new language this year. I’ve been eyeing the “New Lisps” 1 for a while now, and I decided to learn Janet. Janet is a Clojure-like Lisp that can be interpreted, embedded and compiled, and comes with a large standard library with concurrency, HTTP and PEG parser support. I want to replace Python with Janet as my scripting language. Here are my solutions for Dec 1–4. This post was originally published on abhinavsarkar.net.

All my solutions follow the same structure because I wrote a template to create new empty solutions. Actually, I added a fair bit of automation this time to build, run, test and benchmark the solutions.

Day 1 was a bit mathy but it didn’t take too long to figure out. I spent more time polishing the solution to be idiomatic Janet code. The PEG grammar to parse the input was the most interesting part for me on the day. If you know Janet, you may notice this is not the cleanest code, but that’s okay, it was my day 1 too.

The most interesting part of the day 2 solution was the macro that reads the input at compile-time and creates a custom function to check whether a number is in one of the given ranges. This turned out to be almost 4x faster than writing the same thing as a function. Notice the PEG grammar to parse the input. So short and clean! I also leaned into the imperative and mutable nature of the Janet data structures. The code is still not the cleanest as I was still learning.

The first part of day 3 was pretty easy to solve, but using the same solution for the second part just ran forever. I realized that this is a dynamic programming problem, but I don’t like doing array-based solutions, so I simply rewrote the solution to add caching. And it worked! It is definitely on the slower side, but I’m okay with it. The code has become a little more idiomatic Janet.

Day 4 is when I learned more about Janet control flow structures.
The solution for part 2 is a straightforward breadth-first traversal. The interesting parts are the , and statements. So concise and elegant!

That’s it for now. The next note will drop after 4 or 5 days. You can browse the code repo to see the full setup. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

The new Lisps that interest me are: Janet, Fennel and Jank. ↩︎


Struggling Towards an Algebraic Theory of Music

For the last few months, I’ve been trying to come up with a nice, denotational basis for what music is. But I’m running out of steam on the project, so I thought I’d write up what I’ve figured out, and what I’ve tried that doesn’t work. Hopefully this will inspire someone to come tell me what I’m being stupid about and help get the whole process unstuck.

It’s tempting to gesticulate wildly, saying that music is merely a function from time to wave amplitudes, e.g. something of the form: While I think it’s fair to say that this is indeed the underlying denotation of sound, this is clearly not the denotation of music. For example, we can transpose a song up a semitone without changing the speed—something that’s very challenging without a great deal of in the waveform representation. And we can play a musical phrase backwards, which is probably impossible in a waveform for any timbral envelope. Since we now have two examples of things it is reasonable to want to do with musical objects, which cannot be expressed in terms of a function , we must concede that waveforms-over-time cannot be the denotation of music.

Music is obviously temporal, so keeping the “function from time” part seems relevant. But a function from time to what? As a first attempt: which, for a given time, returns a set of notes starting at that time, and how long they ought to be played for. An immediate improvement would be to parameterize the above over notes: It’s tempting to try to eliminate more of the structure here with our parametricity, but I was unable to do so. In contrapuntal music, we will want to be able to express two notes starting at the same moment, but ending at different times.

One alluring path here could be to write monophonic voices, and combine them together for polyphony: Such an encoding has many unfavorable traits. First, it just feels yucky. Why are there two layers of ?
Second, now I-as-a-composer need to make a choice of which voice I put each note in, despite the fact that this is merely an encoding quirk. So no, I don’t think this is a viable path forward.

So let’s return to our best contender: This definition is trivially a monoid, pointwise over the time structure: If we think about abstract sets here, rather than , such an object is clearly a functor. There are many possible applicatives here, but the pointwise zipper seems most compelling to me. Pictorially: Such an applicative structure is quite nice! It would allow us to “stamp” a rhythm on top of a pure representation of a melody. However, the desirability of this instance is a point against , since by Conal Elliott’s typeclass morphism rule, the meaning of the applicative here ought to be the applicative of the meaning. Nevertheless, any other applicative structure would be effectively useless, since it would require the notes on one side to begin at the same time as the notes on the other. To sketch: Good luck finding a musically meaningful for such a thing!

Ok, so let’s say we commit to the pointwise zippy instance as our applicative instance. Is there a corresponding monad? Such a thing would substitute notes with more music. My first idea of what to do with such a thing would be to replace chords with texture. For example, we could replace chords with broken chords, or with basslines that target the same notes.

Anyway, the answer is yes, there is such a monad. But it’s musically kinda troublesome. Assume we have the following function: which will convert a into its notes and an optional temporal interval (optional because goes on forever.) Then, we can write our bind as: where changes when a piece of music occurs. We are left with a hole of type: whose semantics sure better be that it forces the given to fit in the allotted time.
There are two reasonable candidates here: where changes the local interpretation of time such that the entire musical argument is played within the given duration, and just takes the first ’s worth of time. Truncate is too obviously unhelpful here, since the continuation doesn’t know how much time it’s been given, and thus most binds will drop almost all of their resulting music. Therefore we will go with . Which satisfies all of the algebraic (monad) laws, but results in some truly mystifying tunes.

The problem here is that this is not an operation which respects musical meter. Each subsequent bind results in a correspondingly smaller share of the pie. Thus by using only bind and mconcat, it’s easy to get a bar full of quarter notes, followed by a bar of sixty-fourth notes, followed by two bars full of 13-tuplets. If you want to get a steady rhythm out of the whole thing, you need a global view on how many binds deep you’re ever going to go, and you need to ensure locally that you only produce a small powers-of-two number of notes, or else you will accidentally introduce tuplets. It’s a mess. But algebraically it’s fine.

The above foray into monads seems tentatively promising for amateur would-be algorithmic composers (read: people like me.) But I have been reading several books on musical composition lately, and my big takeaway from them is just how damn contextual notes are. So maybe this means we want more of a comonadic interface. One in which you can every note, by taking into account all of the notes in its local vicinity. This feels just as right as the monadic approach does, albeit in a completely different way. Being able to give a comonad instance for would require us to somehow reckon with having only a single at any given time. Which appeals to my functional programmer soul, but again, I don’t know how to do it. But imagine if we did have a comonadic instance.
We could perform voice leading by inspecting what the next note was, and by futzing around with our pitch. We could do some sort of reharmonization by shifting notes around according to what else is happening.

But maybe all of this is just folly. Music as it’s actually practiced doesn’t seem to have much of the functionally-compositional properties we like—i.e., that we can abstract and encapsulate. Instead, a happy melody takes on a different character when played over major vs minor chords. Adding a dissonant interval can completely reconceptualize other notes.

It feels like a bit of a bummer to end like this, but I don’t really know where to go from here. I’ve worked through something like six completely-different approaches over the last few months, and what’s documented here is the most promising bits and pieces. My next thought is that maybe music actually forms a sheaf, which is to say that it is a global solution that respects many local constraints. All of this research into music has given me many more thoughts about music qua music, which I will try to articulate the next time I have an evening to myself. Until then.
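Pulling together the “best contender” representation and its pointwise monoid from above, here is a minimal Python sketch. The names (Note, Music, silence, mappend) and the note encoding are illustrative assumptions of mine, not the author’s definitions:

```python
from typing import Callable, FrozenSet, Tuple

# Assumed encoding: a note is a (pitch, duration) pair; Music maps each
# onset time to the set of notes that begin at that time.
Note = Tuple[str, float]                  # (pitch name, duration in beats)
Music = Callable[[float], FrozenSet[Note]]

def silence(t: float) -> FrozenSet[Note]:
    """The monoid identity: no notes start at any time."""
    return frozenset()

def mappend(m1: Music, m2: Music) -> Music:
    """Combine two pieces pointwise over time (set union at each onset)."""
    return lambda t: m1(t) | m2(t)

# A tiny example: a held C against two quarter-note Gs.
low  = lambda t: frozenset({("C3", 2.0)}) if t == 0.0 else frozenset()
high = lambda t: frozenset({("G3", 1.0)}) if t in (0.0, 1.0) else frozenset()
both = mappend(low, high)
```

Here `both(0.0)` contains both notes while `both(1.0)` contains only the second G, showing how two voices can start together yet end at different times without the nested-voice encoding.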


The Durable Function Tree - Part 1

In my last post I wrote about why and where determinism is needed in durable execution (DE). In this post I'm going to explore how workflows can be formed from trees of durable function calls based on durable promises and continuations. Here's how I'll approach this:

- Building blocks: Start with promises and continuations and how they work in traditional programming.
- Making them durable: How promises and continuations are made durable.
- The durable function tree: How these pieces combine to create hierarchical workflows with nested fault boundaries.
- Function trees in practice: A look at Temporal, Restate, Resonate and DBOS.
- Responsibility boundaries: How function trees fit into my Coordinated Progress model and responsibility boundaries.
- Value-add: What value does durable execution actually provide?
- Architecture discussion: Where function trees sit alongside event-driven choreography, and when to use each.

At their core, most durable execution frameworks organize work as hierarchical trees of function calls. A root function invokes child functions, which may invoke their own children, forming a tree. In some frameworks, such as Temporal, the workflow task is the parent function and each activity is a leaf function in a two-level tree. Other frameworks support arbitrary function trees where each function returns a durable promise to its caller. When a child function completes, its promise resolves, allowing the parent to resume. The durable execution engine manages this dance of invoking functions, caching results, handling retries, and suspending functions that are waiting on remote work. I'll refer to this pattern as the durable function tree, though it manifests differently across frameworks.

In this series, I use the term side effect for any operation whose result depends on the external world rather than the function's explicit inputs.
That includes the obvious mutations, such as writing to a database or sending an email, but also non-mutating operations whose results are not guaranteed to be the same across re-execution (such as retrieving a record from a database). Even though these operations look like pure reads, they are logically side effects because they break determinism (ah yes, the curse of determinism): the value you obtain today may differ from the value you obtain when the function is retried tomorrow. So in these posts, side effect means: anything external that must be recorded (and replayed) because it cannot be deterministically re-executed.

Promises and futures are programming language constructs that act as handles or placeholders for a future result. They are coordination primitives. The promise and the future are closely related concepts:

- A promise is a write-once container of a value, where the writer sets the value now or at some point in the future. Setting the value resolves the promise.
- A future is a read-only interface to the promise, so the bearer can only check whether it is resolved or not.

Fig 1. The promise/future as a container for a future value.

As a container for a future value, the bearer can await the promise/future, which will block until it has been resolved. While technically distinct (a promise is writable, a future is read-only), most languages and frameworks blur this distinction. For simplicity, I'll use the term "promise" throughout. The basic pattern: the function creates a promise, launches asynchronous work that will eventually resolve it, and returns immediately. The caller can await the promise right away or continue with other work. Developers are generally comfortable with functions returning promises: invoke a function, get a handle, await its result. Usually we're waiting for some IO to complete or a call to an API.
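As a rough illustration of the create–resolve–await pattern described above (using Python's standard `concurrent.futures`, not any particular framework's API; the function name and the value 42 are just placeholders):

```python
import concurrent.futures
import threading
import time

def fetch_answer() -> concurrent.futures.Future:
    """Create a promise, launch async work that resolves it, return immediately."""
    promise = concurrent.futures.Future()

    def work():
        time.sleep(0.01)          # stand-in for IO or an API call
        promise.set_result(42)    # resolving the promise (write-once)

    threading.Thread(target=work).start()
    return promise

# The caller gets a handle back right away...
handle = fetch_answer()
# ...and can do other work here before blocking on the result.
print(handle.result())  # blocks until resolved
```

Note that `set_result` may only be called once per future, matching the write-once semantics discussed later for durable promises.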
In fact, when a function executes, it might be the root of a tree of function calls, each passing back promises/futures to its caller, forming a promise chain. Promises and continuations are related but distinct concepts:

- A promise is a synchronization primitive for a value that doesn't yet exist.
- A continuation is a control-flow primitive representing what the program should do once that value exists.

In JavaScript-style APIs, continuations appear explicitly in then, catch, and finally. Modern async/await syntax hides this continuation-passing behind synchronous-looking code: when execution hits await, the current function suspends and everything after the await becomes the continuation, the code that will resume once the promise resolves.

A durable promise is a promise that survives process crashes, machine failures, and even migration to different servers. We can model this as a durable write-once register (WOR) with a unique, deterministic identity. The key properties:

- Deterministic identity: The promise ID is derived deterministically from the execution context (parent function ID, position in code, or explicitly defined by the developer).
- Write-once semantics: Can only be resolved once.
- Durable storage: Both the creation and resolution are recorded persistently.
- Globally accessible: Any service that knows the promise ID can resolve it or await it.

Durable execution engines (DEEs) generally implement this logical WOR by recording two entries in the function's execution history: one when the promise is created, another when it's resolved. This history is persisted and used to reconstruct state during replay. There are also additional concerns, such as timeouts and cancellation of the promise, beyond its creation and resolution. When you write a durable call such as get_customer(231), behind the scenes the framework SDK:

- Checks if a durable promise for get_customer(231) already exists.
- If resolved: returns the stored value immediately.
- If unresolved or doesn't exist: executes (or re-executes) the work.

Traditional promises suspend execution by capturing the call stack and heap state; everything is still running in a single process. Durable execution engines typically don't do this, as capturing and persisting arbitrary program state is complex and fragile. Instead, they implement continuations through replay and memoization:

1. The function executes from the top.
2. Each await checks if its durable promise is already resolved.
3. If yes: use the stored result and continue (this is fast, it's just a lookup).
4. If no: execute the work, resolve the promise, record the result.
5. On failure: restart from step 1.

Consider this example. First execution:

- Executes get_customer, resolves Promise 1, stores result.
- Executes check_inventory, resolves Promise 2, stores result.
- Starts charge_customer, crashes mid-execution.

Second execution (after crash):

- Re-runs from the top.
- get_customer: Promise 1 already resolved → returns stored result instantly.
- check_inventory: Promise 2 already resolved → returns stored result instantly.
- charge_customer: Promise 3 unresolved → executes the work.
- Completes successfully.

This is why determinism matters (from the previous post). The function must take the same code path on replay to encounter the same promises in the same order. If control flow were non-deterministic, a replayed execution might skip a promise or try to await a different promise entirely, breaking the memoization.

Let's now introduce the durable function tree. Durable functions can call other durable functions, creating trees of execution. Each function invocation returns a durable promise to the caller.

Fig 2. A tree of function calls, returning durable promises. Execution flows down the tree; promise resolution flows back up.

This produces a tree of function calls, where each function is a control flow which executes various side effects.
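The replay-and-memoization mechanism described above can be sketched in a few lines of Python. This is a toy, not any real framework's SDK: the in-memory `history` dict stands in for durable storage, and the function and promise names are illustrative:

```python
# Toy sketch of replay + memoization. "history" stands in for the durable
# store; keys are deterministic promise IDs derived from the call context.
history: dict[str, object] = {}
executions: list[str] = []   # tracks which side effects actually ran

def durable(promise_id: str, work):
    """Return the recorded result if the promise is resolved; otherwise run
    the work, record the result, and resolve the promise."""
    if promise_id in history:          # already resolved → instant lookup
        return history[promise_id]
    result = work()                    # execute the side effect
    history[promise_id] = result       # resolve + persist
    return result

def process_order(order_id: int):
    # On replay, resolved steps return instantly from history; only the
    # first unresolved step actually re-executes its work.
    customer = durable(f"{order_id}/get_customer",
                       lambda: executions.append("get_customer") or "cust-231")
    stock = durable(f"{order_id}/check_inventory",
                    lambda: executions.append("check_inventory") or True)
    charge = durable(f"{order_id}/charge_customer",
                     lambda: executions.append("charge_customer") or "ok")
    return customer, stock, charge

process_order(7)   # first execution: all three side effects run
process_order(7)   # "replay": every promise resolved, no work re-executed
```

A crash between steps corresponds to `history` retaining the resolved entries while the function restarts from the top, exactly as in the crash example above.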
Side effects can be run from the local context or from a reliable remote context (such as another durable function), and the difference matters.

Local-context side effects run within the function's execution context:

- Database queries
- S3 operations
- HTTP calls to external APIs
- Local computations with side effects

Local-context side effects have these characteristics:

- Execute synchronously (even if using async syntax, the result is received by the same context).
- Cannot be retried independently (only by replaying the parent function).
- Require the function to keep running (e.g., maintaining a TCP connection for a database response).

Remote-context side effects run in a separate reliable context:

- Another durable function.
- A durable timer (managed by the DEE).
- Work queued for external processing, with an attached durable promise for the third party to resolve.

Remote-context side effects behave differently:

- Can be retried independently of the caller.
- Continue progressing even if the caller crashes or suspends.
- The caller awaits a promise, not a direct response. It is asynchronous: the caller context that receives the result may be a re-execution running on a different server, hours, days or months later.

The distinction between local and remote matters because remote-context side effects create natural suspension points, which become important for durable function trees. Let's use the tree from fig 2. It has a mix of local-context side effects (such as db commands and HTTP calls) and remote-context side effects, i.e., calls to other durable functions (or timers). When a function is waiting only on promises from remote side effects, it can be suspended (meaning terminated, with all execution state discarded). The function doesn't need to sit in memory burning resources while waiting hours or days for remote work to complete.

Fig 3.
Our durable function tree seen as a tree with local-context and remote-context side effects.

Let's imagine that the payment provider is down for two hours, so func3 cannot complete. The execution flow of the tree:

Func1 runs:
- Executes getCustomer (local work, cannot suspend here).
- Invokes func2, and receives a durable promise.
- There is no other local work to run right now; only waiting on remote-context side effects.
- Func1 suspends: completely terminated, no resources held.

Func2 runs:
- Executes checkInventory (local work, cannot suspend here).
- Invokes func3 and func4, receiving durable promises.
- There is no other local work to run right now; only waiting on remote-context side effects.
- Func2 suspends: completely terminated, no resources held.

Func3 runs (concurrently with func4):
- Payment provider is down, so the payment fails.
- Func3 is retried repeatedly by the DEE.
- Two hours later, func3 completes and resolves its promise.

Func4 runs (concurrently with func3):
- Executes uploadInvoice (local work, cannot suspend here).
- Executes updateOrder (local work, cannot suspend here).
- Resolves its promise.

Func2 resumes, re-executed from the top by the DEE:
- checkInventory: already resolved → instant return.
- Func3: already resolved → instant return.
- Func4: already resolved → instant return.
- Resolves its promise to func1.

Func1 resumes, re-executed from the top by the DEE:
- getCustomer: already resolved → instant return.
- func2: already resolved → instant return with result.
- Continues to logAudit (local work) and completes.

Without suspension, either:
- The whole tree would need to be re-executed from the top repeatedly until func3 completes after two hours.
- Or, each function in the tree, from func3 and up, would need to retry independently every few minutes for those two hours the payment provider is down, just to check if their child promises have been resolved.
With function suspension, we avoid the need to repeatedly retry over long time periods and only resume a function once its child promise(s) have been resolved, all the while consuming zero resources while waiting.

Local side effects don't allow suspension because the function must remain running for the side effect to complete. You can't suspend while waiting for a database response: the TCP connection would be lost and the response would never arrive. The same goes for API calls that are not directly managed by the durable execution engine; these are treated like any other locally-run side effect.

What makes this durable function tree structure particularly powerful for fault tolerance is that each node can fail, retry, and recover independently without affecting its siblings or ancestors. If func3 crashes, only func3 needs to retry:

- func2 remains suspended.
- func4's completed work is preserved.
- func1, also suspended, doesn't even know a failure occurred.

The tree structure creates natural fault boundaries: failures are contained to a single branch and don't cascade upward until that branch exhausts its retries or reaches a defined timeout. This means a complex workflow with dozens of steps can have a single step fail and retry hundreds of times without forcing the entire workflow to restart from scratch. Portions of the tree can remain suspended indefinitely, until a dependent promise allows resumption of the parent function.

Different durable execution engines make different choices about tree depth and suspension points. Temporal uses a two-layer model where workflows orchestrate activities. The workflow is the root function (run as a workflow task) and each activity is a leaf function (each run as a separate activity task). Each activity is considered a single side effect. Child workflows add depth to the tree, as a parent workflow can trigger and await the result of a child.

Fig 4. Temporal's two-layer workflow→activity model.
Because each activity is a separately scheduled task that could run on any worker, from the workflow task's perspective activities are remote-context side effects, which allows the workflow task to be suspended. In fact, if a workflow has three activities to execute, then the workflow will be executed across four workflow tasks in order to complete (as the first three workflow tasks end up suspending on an activity invocation).

Fig 5. Workers poll Temporal Server task queues for tasks, and then execute those tasks. Activities are invoked via commands, which Temporal Server derives events and tasks from.

Even when an activity fails, Temporal re-executes the parent workflow from the top, which re-encounters the failed activity. In Temporal, the workflow task, run on workers, drives forward progress. If an activity needs to be scheduled again, that is driven from the workflow task. In turn, Temporal detects the need to reschedule a workflow task when an activity times out (rather than detecting the error directly). Temporal is a push/pull model where:

- Workers pull tasks (workflow/activity) from task queues in Temporal Server.
- Workers drive forward progress by pushing (sending) commands to Temporal Server (which in turn leads to the server creating more tasks to be pulled).

Restate supports arbitrary tree depth: functions calling functions calling functions. Each function execution can progress through multiple side effects before suspending when awaiting remote Restate services (durable functions), timers, or delegation promises. Failed functions are retriggered independently by Restate Server rather than requiring parent re-execution. Where Temporal drives progress of an activity via scheduling a workflow task, Restate drives progress by directly invoking the target function from the engine itself. This makes sense as there is no separate workflow and activity task specialization.
If func1 is waiting on func2, then func1 can suspend while Restate executes (and possibly retries) func2 independently until it completes or reaches a retry or time limit, only then waking up func1 to resume. Therefore we can say Restate is purely a push model. Restate Server acts as a man-in-the-middle: it invokes functions, and functions send commands and notifications back to Restate Server, which it reacts to. In its man-in-the-middle position, it can also subscribe to Kafka topics and invoke a function for each event.

Fig 6. Invocations are driven by Restate Server. Functions will suspend when they await Restate-governed remote side effects (and have no pending local side effects). Restate detects when a suspended function should be resumed and invokes it. Note this diagram omits the notifications sent from the Restate client back to Restate Server relating to the start and end of each local side effect.

Resonate is definitely worth a mention here too. It falls into the arbitrary function tree camp, and goes further by defining a protocol for this pattern. The Resonate model looks the simplest (everything is a function, either local or remote), though I haven't played with it yet. I recommend reading Dominik Tornow's writing and talks on this subject of distributed async/await as trees of functions returning promises.

DBOS has some similarities with Temporal in that it is also a two-level model with workflows and steps, except most steps are local-context (run as part of the parent function). DBOS workflows mostly operate as a single function with local-context side effects, except for a few cases like durable timers, which act as remote-context side effects and provide suspension points. A DBOS workflow can also trigger another workflow and await the result, providing another suspension point (as the other workflow is a remote-context side effect). In this way, DBOS can form function trees via workflows invoking workflows (as workflows are basically functions).
DBOS also uses a worker model, where workers poll Postgres for work (similar to Temporal workers polling task queues). Because steps are local-context side effects (such as db commands and API calls), a typical workflow does not suspend (unless it awaits a timer or another workflow). This differentiates it from Temporal, which schedules all activities as remote-context side effects (activity tasks are run as independent units of work on any worker).

Fig 7. DBOS workers poll Postgres for work. Functions will suspend when they await a timer or another DBOS workflow. The logic is mostly housed in the DBOS client, where the polling logic can detect when to resume a suspended workflow.

Despite their differences, Temporal, Restate and DBOS suspend execution for the same fundamental reason: the distinction between locally-run and remotely-run side effects. Temporal makes activities explicitly remote but only ever one layer deep; Restate and DBOS generally make side effects local-context but support remote-context side effects in the form of timers and other durable workflows/functions.

Durable execution frameworks sit on a continuum from more constrained to more flexible compositional models. At the constrained end, frameworks like Temporal and DBOS use two distinct abstractions: workflows (control flow logic) and activities/steps (side effects). Activities/steps are terminal leaves; only workflows can have children. This constraint provides helpful structure: it's clear what should be a workflow (multi-step coordination) versus an activity (a single unit of work). The tradeoff is less flexibility: if your "single unit of work" needs its own sub-steps, you must either break it into multiple activities or promote it to a child workflow.

At the flexible end, frameworks like Resonate treat everything as functions calling functions. There's no distinction between "orchestration" and "work". Any function can call any other function to arbitrary depth.
This provides maximum composability but requires discipline to avoid overly complex trees. Restate somewhat straddles both camps, as it offers multiple building blocks, so it's harder to pin down on this continuum. All positions on this continuum support function trees; the difference is how much structure the framework imposes versus how much freedom it provides. Constrained models offer guardrails against complexity, forcing you to think in terms of workflows and steps. Resonate and Restate provide more flexibility, functions calling functions, but inevitably this requires a bit more discipline.

Using what we've covered in part 1, in part 2 we'll take a step back and:

- Look at how durable execution compares to event-driven architecture in terms of fault domains/responsibility boundaries.
- Ask the question: what does durable execution actually provide us that we can't achieve by other means?
- Finally, look at how durable execution fits into the wider architecture, including event-driven architecture.
A promise is a synchronization primitive for a value that doesn't yet exist A continuation is a control-flow primitive representing what the program should do once that value exists Deterministic identity : The promise ID is derived deterministically from the execution context (parent function ID, position in code, explicitly defined by the developer). Write-once semantics : Can only be resolved once. Durable storage : Both the creation and resolution are recorded persistently. Globally accessible : Any service that knows the promise ID can resolve it or await it. Checks if a durable promise for get_customer(231) already exists. If resolved: returns the stored value immediately. If unresolved or doesn't exist: executes (or re-executes) the work. The function executes from the top Each await checks if its durable promise is already resolved If yes: use the stored result and continue (this is fast, it’s just a lookup) If no: execute the work, resolve the promise, record the result On failure: restart from step 1 Executes get_customer , resolves Promise 1, stores result Executes check_inventory , resolves Promise 2, stores result Starts charge_customer , crashes mid-execution Re-runs from top get_customer : Promise 1 already resolved → returns stored result instantly check_inventory : Promise 2 already resolved → returns stored result instantly charge_customer : Promise 3 unresolved → executes the work Completes successfully Database queries S3 operations HTTP calls to external APIs Local computations with side effects Execute synchronously (even if using async syntax, the result is received by the same context) Cannot be retried independently (only by replaying the parent function) Require the function to keep running (e.g., maintaining a TCP connection for a database response) Another durable function. A durable timer (managed by the DEE). Work queued for external processing with an attached durable promise for the 3rd party to resolve. 
Can be retried independently of the caller. Continues progressing even if the caller crashes or suspends. The caller awaits a promise, not a direct response. It is asynchronous: the caller context that receives the result may be a re-execution running on a different server hours, days, or months later. Func1 runs: Executes getCustomer (local work, cannot suspend here) Invokes func2 , and receives a durable promise. There is no other local work to run right now. Only waiting on remote-context side effects. Func1 suspends — completely terminated, no resources held Func2 runs: Executes checkInventory (local work, cannot suspend here) Invokes func3 and func4 , receiving durable promises. There is no other local work to run right now. Only waiting on remote-context side effects. Func2 suspends — completely terminated, no resources held Func3 runs (concurrently with func4 ) The payment provider is down, so the payment fails. Func3 is retried repeatedly by the DEE. Two hours later, func3 completes, resolves the promise Func4 runs (concurrently with func3 ) Executes uploadInvoice (local work, cannot suspend here) Executes updateOrder (local work, cannot suspend here) Resolves its promise. Func2 resumes — re-executed from the top by the DEE. checkInventory : already resolved → instant return Func3 : already resolved → instant return Func4 : already resolved → instant return Resolve promise to func1. Func1 resumes — re-executed from the top by the DEE. getCustomer : already resolved → instant return func2 : already resolved → instant return with result Continues to logAudit (local work) and completes. Without suspension, there would be two unattractive options. The whole tree would need to be re-executed from the top repeatedly until func3 completes after two hours. Or, each function in the tree, from func3 and up, would need to retry independently every few minutes for those two hours the payment provider is down just to check if their child promises have been resolved. With suspension, func2 remains suspended. func4 's completed work is preserved. 
func1 is also suspended, and doesn't even know a failure occurred. Workers pull tasks (workflow/activity) from task queues in Temporal Server. Workers drive forward progress by pushing (sending) commands to Temporal Server (which in turn leads to the server creating more tasks to be pulled). On the left, frameworks like Temporal and DBOS use two distinct abstractions: workflows (control flow logic) and activities/steps (side effects) . Activities/steps are terminal leaves; only workflows can have children. This constraint provides helpful structure. It's clear what should be a workflow (multi-step coordination) versus an activity (a single unit of work). The tradeoff is less flexibility: if your "single unit of work" needs its own sub-steps, you must either break it into multiple activities or promote it to a child workflow. On the right, frameworks like Resonate treat everything as functions calling functions . There's no distinction between "orchestration" and "work". Any function can call any other function to arbitrary depth. This provides maximum composability but requires discipline to avoid overly complex trees. Restate straddles both: because it offers multiple building blocks, it’s harder to pin down on this continuum.

Tara's Website 6 days ago

(very) late autumn 2025 update

(very) late autumn 2025 update Servus from Tara’s offline outpost! We reached -6°C here in the past few days and a dash of snow appeared. Nothing compared to the -12°C and the amount of snow I witnessed a few days ago in the Garmisch-Partenkirchen area. Beautiful… but… better experienced from the car! Even the UI of the car wasn’t able to display that temperature properly: the minus sign overlapped with another widget on the display.

The Coder Cafe 1 week ago

Build Your Own Key-Value Storage Engine—Week 3

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Last week, you built the first version of an LSM: an in-memory memtable for recent writes, immutable SSTables on disk, and a MANIFEST file listing the SSTable files. However, if the database crashes, data in the memtable would be lost. This week, you will focus on durability by introducing Write-Ahead Logging (WAL). A WAL is an append-only file on disk that records the same operations you keep in memory. How it works: On write, record it in the WAL and the memtable. On restart, you read the WAL from start to end and apply each record to the memtable. Introducing a WAL is not free, though. Writes are slower because each write also goes to the WAL. It also increases write amplification, the ratio of data written to data requested by a client. Another important aspect of durability is when to synchronize a file’s state with the storage device. When you write to a file, it may appear as saved, but the bytes may sit in memory caches rather than on the physical disk. These caches are managed by the OS’s filesystem, an abstraction over the disk. If the machine crashes before the data is flushed, you can lose data. To force the data to stable storage, you need to call a sync primitive. The simple, portable choice is to call fsync , a system call that flushes a file’s buffered data and required metadata to disk. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord For the WAL data format, you won’t use JSON like the SSTables, but NDJSON (Newline-Delimited JSON). It is a true append-only format with one JSON object per line. 
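A minimal sketch of that write path and recovery replay might look like this. The record fields ("op", "key", "value") are assumptions for illustration, since the post's exact schema is elided here:

```python
import json
import os

def wal_append(wal_fd, key, value):
    """Append one NDJSON record and force it to stable storage.
    wal_fd is a file descriptor opened in append mode."""
    rec = json.dumps({"op": "put", "key": key, "value": value}) + "\n"
    os.write(wal_fd, rec.encode())
    os.fsync(wal_fd)  # durable before the write is acknowledged

def wal_replay(path):
    """Rebuild the memtable by applying WAL records from start to end."""
    memtable = {}
    if not os.path.exists(path):
        return memtable
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec["op"] == "put":
                memtable[rec["key"]] = rec["value"]
    return memtable
```

Because replay applies records in order, a later put for the same key simply overwrites the earlier one, matching the memtable's update-in-place semantics.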
Append a record to the WAL file , opened with . Set the field to , and the and fields to the provided key and value. For example, writing : Update the memtable with the same logic as before: If the key exists, update the value. Otherwise, create a new entry. Acknowledge the HTTP request. Create an empty file if it doesn’t exist. Replay the WAL from start to end. For each valid line, apply it to the memtable. Keep the same flush trigger (2,000 entries) and the same logic (stop-the-world operation) as last week: Write the new SSTable: Flush the memtable as a new immutable JSON SSTable file with keys sorted (same as before). fsync the SSTable file. the parent directory of the SSTable to make the new filename persistent. Update the MANIFEST atomically: Read the current MANIFEST lines into memory and append the new SSTable filename. Open with . Write the entire list to from the start. Rename → . the parent directory of the MANIFEST. Reset the WAL: Truncate the WAL to zero length. the WAL file. If the server is unavailable, do not fail. Retry indefinitely with a short delay (or exponential backoff). To assess durability: Run the client against the same input file ( put.txt ). Stop and restart your database randomly during the run. Your client should confirm that no acknowledged writes were lost after recovery. Add a per-record checksum to each WAL record. On startup, verify records and stop at the first invalid/truncated one, discarding the tail. For reference, ScyllaDB checksums segments using CRC32; see its commitlog segment file format for inspiration. Regarding the flush process, if the database crashes after step 1 (write the new SSTable) and before step 2 (update the MANIFEST atomically), you may end up with a dangling SSTable file on disk. Add a startup routine to delete any file that exists on disk but is not listed in the MANIFEST. This keeps the data directory aligned with the MANIFEST after a crash. That’s it for this week! Your storage engine is now durable. 
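The per-record checksum extension can be sketched with CRC32, as the post suggests. The envelope format here (a "crc" field wrapping a "payload" string) is an assumption for illustration, not the series' prescribed format:

```python
import json
import zlib

def encode_record(rec: dict) -> str:
    """One NDJSON line that carries its own CRC32 over the payload."""
    payload = json.dumps(rec, sort_keys=True)
    crc = zlib.crc32(payload.encode())
    return json.dumps({"crc": crc, "payload": payload}) + "\n"

def replay_checked(lines):
    """Apply valid records in order; stop at the first corrupt or
    truncated record and discard the tail, as the exercise suggests."""
    memtable = {}
    for line in lines:
        try:
            outer = json.loads(line)
            payload = outer["payload"]
            if zlib.crc32(payload.encode()) != outer["crc"]:
                break  # corrupt record: discard this and everything after
            rec = json.loads(payload)
        except (ValueError, KeyError):
            break  # truncated or garbled line: discard the tail
        memtable[rec["key"]] = rec["value"]
    return memtable
```

On startup you would call replay_checked over the WAL file's lines; any half-written final record from a crash is silently dropped instead of crashing recovery.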
On restart, data that was in the memtable is recovered from the WAL. This is made possible by and the atomic update of the MANIFEST. Deletion is not handled yet. In the worst case, a miss can read all SSTables, which quickly becomes highly inefficient. In two weeks, you will add a endpoint and learn how SSTables are compacted so the engine can reclaim space and keep reads efficient. In your implementation, you used as a simple “make it durable now“ button. In practice, offers finer control both over what you sync and when you sync. What: (or opening the file with ) persists the data without pushing unrelated metadata, which is usually what you want for WAL appends. You can go further with to bypass the page cache and sync only the data you wrote, but that comes with extra complexity. When: While calling a sync primitive after every request is offered by systems that promise durability, it is often not the default. Many databases use group commit, which batches several writes into one call to amortize the cost while still providing strong guarantees. For additional information, see A write-ahead log is not a universal part of durability by . For example, RocksDB provides options for tuning WAL behavior to meet the needs of different applications: Synchronous WAL writes (what you implemented this week) Group commit. No WAL writes at all. If you want, you can also explore group commit in your implementation and its impact on durability and latency/throughput, since this series will not cover it later. Also, you should know that since a WAL adds I/O to the write path, storage engines use a few practical tricks to keep it fast and predictable. A common one is to preallocate fixed-size WAL segments at startup to: Avoid the penalty of dynamic allocation. Prevent write fragmentation. Align buffers for (an open (2) flag for direct I/O that bypasses the OS page cache). Missing direction in your tech career? 
At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. ❤️ If you enjoyed this post, please hit the like button.

Bill Mill 1 week ago

Advent of Code 2025

This year I've decided to do Advent of Code in Gleam , a programming language I know nothing about and have never written a line of code in. Previously: Advent of Code 2024 , 2023 problem log , and incomplete answers back to 2015 in github Not the easiest day 1! The problem centered around a ring of the numbers 0-99. It seems that the creator of gleam doesn't like tuples, so today I tried not to use any. I started with a custom type to represent a range: For part 1, I used a generative approach where I iterated from the start of each range through the end, and checked if the number was of the form . The heart of it is this function: For part 2, I got stuck in loops thinking about how to generate all possible combinations of numbers, so I instead took a filtering approach. This function checks if a number is symmetric in the way the problem states (assuming that there are no numbers with more than 10 digits): I find that pleasing! Then I iterated over every number in each range and checked it for symmetry, and summed the result: I didn't do any better at handling errors today! I think I will make that an emphasis tomorrow. Custom types were really simple, and I haven't yet hit any big gleam roadblocks. It takes me a minute to think of how to phrase problems recursively, but I like that it kind of forces you to break your code into small, clear functions. I wish there were better inline testing functionality in gleam; I don't want to put my tests out into a test directory. I'm kind of annoyed generally by how picky it is about the project structure - but I also see how it fits into the zen of the language, which minimizes your options to keep complexity under control. My answer for both parts is here I struggled with coming up with a simple way to express the crossings of zero for the second part, and I'm not super happy with how that came out You need a library to read files in gleam! 
to get the standard sync library it seems is super handy for debugging pipelines - just stick it between braces and it will show you what's going on I ought to figure out a better way to handle errors than just ing them and picking a random default value, it's definitely going to bite me eventually if I don't panic. I'm surprised there's not an or something like that! Maybe I'll write it as or something, but for today I just got by

Rob Zolkos 1 week ago

The Making of Fizzy, Told by Git

Today Fizzy was released and the entire source code of its development history is open for anyone to see . DHH announced on X that the full git history is available - a rare opportunity to peek behind the curtain of how a 37signals product comes together. I cloned down the repository and prompted Claude Code: “Can you go through the entire git history and write a documentary about the development of this application. What date the first commit was. Any major tweaks, changes and decisions and experiments. You can take multiple passes and use sub-agents to build up a picture. Make sure to cite commits for any interesting things. If there is anything dramatic then make sure to see if you can figure out decision making. Summarize at the end but the story should go into STORY.md” It responded with: “This is a fascinating task! Let me create a comprehensive investigation plan and use multiple agents to build up a complete picture of this project’s history.” Here is the story of Fizzy - as interpreted by Claude - from the trail of git commits. Enjoy! A chronicle of 18 months of development at Basecamp, told through 8,152 commits. At 1:19 PM on a summer Friday, Kevin McConnell typed the words that would begin an 18-month journey: Within hours, the foundation was laid. The team moved with practiced efficiency: By end of day, the skeleton of a Rails application stood ready. But what would it become? One month after inception, Jason Zimdars introduced the application’s first real identity: A “Splat” — the name evokes something chaotic, impactful, unexpected. Like a bug hitting your windshield on a summer drive. The original data model was simple: The next day brought the visual metaphor that would define the early application: The windshield was the canvas. Splats appeared on it like bugs on glass — colorful, slightly chaotic, each one a piece of information demanding attention. The commits reveal urgency. Something important was coming: The all-hands demo. 
Approximately one month after project inception, Fizzy (then still called “Splat”) was shown to the entire company. The pressure to polish was evident in the commit messages. Seven days after the windshield metaphor was established, Jason Zimdars typed four words that would reshape the application’s identity: The chaotic “splat” gave way to something gentler — bubbles floating on a windshield , like soap suds catching light. The animation changed from aggressive splattering to gentle floating: Perfect circles gave way to hand-drawn blob shapes. The team was discovering what their product was through the act of building it. A new interaction pattern emerged: When users “boosted” a bubble, it would puff up and float away — like champagne fizz rising. The animation: The metaphor was crystallizing. Bubbles. Fizzing. Effervescence. The name would come soon. In a single day, the application found its final name through two commits: 42 files changed. The model, controllers, views, tests — everything touched. Hours later: Fizzy. The name captured everything: the bubbles, the effervescence, the playful energy of the interface. Visual design had driven product naming — the team discovered what they were building through the act of building it. The flat list of bubbles needed structure: But “Projects” didn’t feel right. Eight days later: Then “Bucket” became “Collection.” Eventually, “Collection” would become “Board.” The terminology dance — Projects → Buckets → Collections → Boards — reveals a team searching for the right mental model. They ultimately landed on the familiar “Board” metaphor, aligning with tools like Trello and Linear. David Heinemeier Hansson, creator of Ruby on Rails and co-founder of Basecamp, made his first contribution with characteristic pragmatism: He deleted an unused image file. It was a statement of intent. 
Within two days, DHH’s fingerprints were everywhere: He upgraded the entire application to Rails 8 release candidate and systematically added HTTP caching throughout. DHH’s most distinctive contribution was his crusade against what he called “anemic” code — thin wrappers that explain nothing and add needless indirection. He used this term 15 times in commit messages: Philosophy: Code should either add explanatory value OR hide implementation complexity. Thin wrappers that do neither are “anemic” and should be eliminated. Then came April 2025. DHH made 323 commits in a single month — 55% of his total contributions compressed into 30 days. This was a surgical strike. He: His commit messages tell the story: In DHH’s philosophy: deletion is a feature, not a bug. After 10 months as “Bubbles,” another transformation: 333 files changed. “Pop” (completing a bubble) became “Closure” (closing a card). The playful metaphor gave way to task management vocabulary. The final architectural piece: Fizzy had become a kanban board . Cards lived in columns. Columns could be customized, colored, reordered. The application had evolved from “bugs on a windshield” to a sophisticated project management tool. Collections became Boards. The transformation was complete: Original (July 2024): Final (November 2025): A Claude-powered AI assistant that could answer questions about project content. Born, restricted to staff, then removed entirely. Perhaps replaced by the more ambitious MCP (Model Context Protocol) integration — making Fizzy AI-native at the protocol level rather than bolting on a chatbot. Emoji reactions for cards and comments. Added. Removed. Then added again. The git history shows healthy debate — not everything that ships stays shipped, and not everything removed stays gone. Saved custom views were replaced by ephemeral quick filters. Complexity gave way to simplicity. Predefined workflows with stages were removed in favor of ad-hoc column organization. 
Users would create their own structure. The MCP (Model Context Protocol) branch represents cutting-edge AI integration — allowing Claude and other AI assistants to interact with Fizzy programmatically. An manifest advertises Fizzy’s capabilities to AI clients. Status: Removed from main, but the infrastructure remains fascinating. This is one of the earliest explorations of making traditional web applications AI-native. Multiple parallel branches exploring different approaches to mobile column navigation. Scroll snapping. Contained scrolling. Swipeable columns. The problem remains unsolved — there’s no “one true way” for mobile kanban navigation. Making Fizzy work with SQLite in addition to MySQL. Simpler local development. Better portability. The search index was even sharded into 16 tables ( through ) for scale. The proprietary SAAS features were extracted into a separate gem. What remained was a clean, open-source Rails application. After 18 months of development, 8,152 commits, and countless pivots, Fizzy became open source. Jason Zimdars (2,217 commits) — The visual architect. From “Let’s try bubbles” to pixel-perfect polish. Jorge Manrubia (2,053 commits) — The engineering backbone. Consistent, prolific, essential. Andy Smith (1,007 commits) — Front-end craftsmanship and UI refinement. Mike Dalessio (875 commits) — Infrastructure, performance, the recent dashboard work. David Heinemeier Hansson (586 commits) — The architectural enforcer. Rails modernization and the war on anemic code. Kevin McConnell (351 commits) — Started it all with “New Rails app.” Jose Farias (341 commits) — Feature development and testing. Stanko K.R. (239 + 54 commits) — Security hardening and webhook restrictions. Jeffrey Hardy (100 commits) — Early infrastructure and modernization. Jason Fried (7 commits) — The occasional “Small copy adjustment” from the CEO. July 2024 (v0.1): September 2024 (v0.2): November 2025 (v1.0): The story of Fizzy is the story of discovery through building . 
The team didn’t know they were building a kanban board when they started with “splats on a windshield.” They found out through iteration. Key lessons: Names matter, but they can change. Splat → Bubble → Card. Project → Bucket → Collection → Board. The right name emerges through use. Deletion is a feature. Boosts, Fizzy Ask, custom views, workflows — removing the wrong features is as important as adding the right ones. Architecture evolves. The final column-based kanban system looks nothing like the original flat list of splats. DHH’s philosophy: Remove anemic code. Keep transactions short. Use the latest Rails. Delete more than you add. Design drives naming. “Fizzy” emerged from the visual metaphor of bubbles puffing up and floating away — the design informed the brand. Open source takes extraction. 18 months of SAAS development needed careful separation before the core could be shared. The git history of Fizzy is a masterclass in iterative product development. 8,152 commits. 25+ contributors. 18 months. One application that discovered its identity through the act of creation. 
“Let’s try bubbles.” — Jason Zimdars, July 31, 2024 Documentary compiled December 2, 2025 Based on analysis of the Fizzy git repository First Commit: June 21, 2024 Total Commits: 8,152 Contributors: 25+ Lines of Code Changed: Hundreds of thousands Name Changes: 4 (Splat → Bubble → Card; Project → Bucket → Collection → Board) Features Removed: At least 4 major ones DHH Commits in April 2025 Alone: 323 1:23 PM — Gemfile updated ( ) 3:47 PM — Rubocop configured ( ) 4:07 PM — Minimal authentication flow ( ) 4:29 PM — CSS reset and base styles ( ) 4:46 PM — Brakeman security scanning added ( ) Removed the entire Boosts feature ( ) — 299 lines across 27 files, gone Eliminated activity scoring ( , , ) Extracted RESTful controllers from overloaded ones ( , ) Enforced transaction discipline ( — “No long transactions!”) Splats on a Windshield Cards → Columns → Boards → Accounts 
July 24, 2024: “Handful of tweaks before all-hands” — Demo day pressure July 31, 2024: “Let’s try bubbles” — The visual pivot September 4, 2024: “Splat -> Fizzy” — Finding the name April 2025: DHH’s 323-commit refactoring blitz October 2025: “Remove Fizzy Ask” — The AI feature that didn’t survive November 28, 2025: “Initial README and LICENSE” — Going public Rails 8.x — Always on the latest, sometimes ahead of stable Hotwire (Turbo + Stimulus) — No heavy JavaScript framework Solid Queue & Solid Cache — Rails-native background jobs and caching SQLite + MySQL support — Database flexibility Kamal deployment — Modern container orchestration UUID primary keys — Using UUIDv7 for time-ordering Multi-tenancy — Account-based data isolation


The Decomposition of Switching Functions

The Decomposition of Switching Functions Robert Ashenhurst International Symposium on the Theory of Switching Functions'59 Say you are given a logic formula such as: and you want to rewrite it in a more structured form. One way to impose structure is to express as the composition of simpler functions. You can perform such a rewrite with the Ashenhurst-Curtis decomposition , and the result will look like this: This transformation is useful in logic synthesis. The task of synthesizing a circuit to implement the 4-input function can be broken down into synthesizing circuits to implement: The 2-input function The 3-input function The goal of the Ashenhurst-Curtis decomposition is to partition the inputs of a logic formula ( are the inputs in the running example) into two disjoint sets ( , ). The partitioning depends on the logic formula being decomposed. The idea is to decompose the formula into two simpler formulas: Formula 1 accepts as inputs Formula 2 accepts as inputs, along with the output of #1 A way to do this is to search through all possible partition matrices . Fig. 4 shows the partition matrix for the decomposition in the running example: Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf Think of a partition matrix as a two-dimensional truth table. Here is the original logic formula, along with the above partition matrix containing explicit values for : In other words, f is non-zero only when given the following inputs (which correspond to the 4 entries in the partition matrix with the value ). For a given logic formula, there exist many partition matrices. Only some are suitable for decomposing the logic formula. The key property to look for is the column multiplicity of the partition matrix. This column multiplicity is the number of unique column vectors in the partition matrix. The partition matrix above has a column multiplicity of 2 (i.e., there are two unique column vectors in the matrix, and ). 
If you can find a partition matrix with a column multiplicity less than or equal to two, then you can decompose the logic formula. Note that a partition matrix is defined not only by the values in the matrix, but also the partitioning of the input variables. If you have located a partition matrix with a column multiplicity no greater than two, then the row multiplicity must be no greater than 4. In particular, there are 4 kinds of row vectors that you will see in the partition matrix: Vectors of all 0s Vectors of all 1s A particular vector The inverse of In the running example, the row vectors are: [1,0,0,1] (we call this ) [0,1,1,0] (note how this is the inverse of ) To decompose f, first generate a function ϕ. The inputs to the function are the variables written at the top of the partition matrix ( and in this example). The truth table for ϕ is simply the first non-trivial row vector in the partition matrix ([1,0,0,1] in the example). Next, generate a function . The inputs to F are the variables written at the left of the partition matrix ( and in this example), and the output of . In this example, has 3 inputs, so the truth table defining has 8 entries. A simple algorithm is used to generate the entries in that truth table. The algorithm iterates over the row vectors in the partition matrix, and each row vector defines 2 elements in the truth table of . 
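Computing a partition matrix and its column multiplicity is mechanical. Since the running example's formula is elided above, the sketch below uses a hypothetical function f = x1 ⊕ x2 ⊕ x3 ⊕ x4 (0-indexed as x[0]..x[3] in code), which is decomposable under the partition {x1, x2} | {x3, x4}:

```python
from itertools import product

def partition_matrix(f, free_vars, bound_vars, n):
    """Build the partition matrix of an n-input boolean function.
    Rows are indexed by assignments to free_vars (left of the matrix),
    columns by assignments to bound_vars (top). Variable indices 0-based."""
    rows = []
    for fa in product([0, 1], repeat=len(free_vars)):
        row = []
        for ba in product([0, 1], repeat=len(bound_vars)):
            x = [0] * n
            for v, bit in zip(free_vars, fa):
                x[v] = bit
            for v, bit in zip(bound_vars, ba):
                x[v] = bit
            row.append(f(tuple(x)))
        rows.append(tuple(row))
    return rows

def column_multiplicity(matrix):
    """Number of distinct column vectors in the partition matrix."""
    return len(set(zip(*matrix)))
```

A multiplicity of 2 under some partition means the function admits a simple disjunctive decomposition; a threshold function like "at least two inputs set" yields multiplicity 3 under the same partition and thus fails the test there.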
If row vector is all zeros, then the truth table elements at indices and are 0 If row vector is all ones, then the truth table elements at indices and are 1 If row vector is , then the truth table element at index is 0, and the element at index is 1 If row vector is the inverse of , then the truth table element at index is 1, and the element at index is 0 These rules are described in Table 1: Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf The final decomposed function is: Until now, we’ve assumed that someone has handed us a partition matrix with column multiplicity no greater than two. But how does one find such a matrix? The paper proposes iterating through all partition matrices in the partition chart of the input formula. A partition chart contains many partition matrices, each one corresponding to a different partitioning of the input variables. A circled element in the partition chart corresponds to a in the truth table of the input logic formula. Fig. 5 shows partition charts for the running example. The one in the lower right corner is the partition matrix in the running example. Astute readers will notice that it is actually the transpose of the partition matrix (i.e., and are on the left, and are on the top), but that is no big deal: the matrix can be transposed as long as the variable names are transposed along with the matrix entries. Source: https://people.eecs.berkeley.edu/~alanmi/publications/other/ashenhurst1959.pdf Once you have generated a partition chart, it is straightforward to search for partition matrices which have column (or row) multiplicity no greater than two. These are the ones that can be used to generate the decomposed function. Thanks for reading Dangling Pointers! 
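Putting the pieces together, the whole construction (ϕ's truth table from the first non-trivial row vector, F filled in per the four cases of Table 1) can be sketched end to end. As before, this uses a hypothetical XOR function rather than the paper's elided running example:

```python
from itertools import product

def decompose(f, free_vars, bound_vars, n):
    """Sketch of the paper's recipe, assuming the given partition has
    column multiplicity <= 2. Returns (phi, F) with
    f(x) == F(free bits, phi(bound bits)). Variable indices 0-based."""
    def assign(fa, ba):
        x = [0] * n
        for v, b in zip(free_vars, fa):
            x[v] = b
        for v, b in zip(bound_vars, ba):
            x[v] = b
        return tuple(x)

    free_assigns = list(product([0, 1], repeat=len(free_vars)))
    bound_assigns = list(product([0, 1], repeat=len(bound_vars)))
    # Rows of the partition matrix: one per free-variable assignment.
    rows = [tuple(f(assign(fa, ba)) for ba in bound_assigns)
            for fa in free_assigns]

    ncols = len(bound_assigns)
    trivial = {(0,) * ncols, (1,) * ncols}
    alpha = next(r for r in rows if r not in trivial)  # first non-trivial row

    # phi's truth table is alpha itself, indexed by the bound assignment.
    phi_table = dict(zip(bound_assigns, alpha))

    # Fill F's truth table row by row, per the four cases of Table 1.
    F_table = {}
    for fa, row in zip(free_assigns, rows):
        if row in trivial:
            F_table[fa + (0,)] = F_table[fa + (1,)] = row[0]  # constant row
        elif row == alpha:
            F_table[fa + (0,)], F_table[fa + (1,)] = 0, 1     # f follows phi
        else:  # the inverse of alpha
            F_table[fa + (0,)], F_table[fa + (1,)] = 1, 0     # f is not-phi

    phi = lambda ba: phi_table[tuple(ba)]
    F = lambda fa, p: F_table[tuple(fa) + (p,)]
    return phi, F
```

Checking f(x) == F(free bits, ϕ(bound bits)) over all 2ⁿ inputs verifies that the decomposition reproduces the original function.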
