Posts in Haskell (20 found)
マリウス 1 week ago

Hold on to Your Hardware

Tl;dr at the end.

For the better part of two decades, consumers lived in a golden age of tech. Memory got cheaper, storage increased in capacity, and hardware got faster and absurdly affordable. Upgrades were routine, almost casual. If you needed more RAM, a bigger SSD, or a faster CPU or GPU, you barely had to wait a week for a discount offer and you moved on with your life. This era is ending. What’s forming now isn’t just another pricing cycle or a short-term shortage; it is a structural shift in the hardware industry that paints a deeply grim outlook for consumers. Today, I am urging you to hold on to your hardware, as you may not be able to replace it affordably in the future.

While I have always been a staunch critic of today’s consumer industry, as well as the ideas behind it, and a strong proponent of buying it for life (meaning, investing in durable, repairable, quality products), the industry’s shift has nothing to do with the protection of valuable resources or the environment. It is instead a move towards a trajectory that has the potential to erode technological self-sufficiency and independence for people all over the world.

In recent months the buzzword RAM-pocalypse has started popping up across tech journalism and enthusiast circles. It’s an intentionally dramatic term that describes the sharp increase in RAM prices, primarily driven by high demand from data centers and “AI” technology, which most people had considered a mere blip in the market. This presumed temporary blip, however, turned out to be a lot more than just that, with one manufacturer after the other openly stating that prices will continue to rise, with suppliers forecasting shortages of specific components that could last well beyond 2028, and with key players like Western Digital and Micron either completely disregarding or even exiting the consumer market altogether.
Note: Micron wasn’t just another supplier, but one of the three major players directly serving consumers with reasonably priced, widely available RAM and SSDs. Its departure leaves the consumer memory market effectively in the hands of only two companies: Samsung and SK Hynix. This duopoly certainly doesn’t compete on your wallet’s behalf, and it definitely wouldn’t be the first time it would optimize for margins.

The RAM-pocalypse isn’t just a temporary headline anymore, but has seemingly become long-term reality. However, RAM, and memory in general, is only the beginning. The main reason for the shortages and hence the increased prices is data center demand, specifically from “AI” companies. These data centers require mind-boggling amounts of hardware, specifically RAM, storage drives, and GPUs, which are themselves RAM-heavy units when built for “AI” workloads. The enterprise demand for specific components simply outpaces the current global production capacity, and outbids the comparatively poor consumer market. For example, OpenAI’s Stargate project alone reportedly requires approximately 900,000 DRAM wafers per month, which could account for roughly 40% of current global DRAM output. Other big tech giants including Google, Amazon, Microsoft, and Meta have placed open-ended orders with memory suppliers, accepting as much supply as available. The existing and future data centers of these companies are expected to consume 70% of all memory chips produced in 2026.

However, memory is just the first domino. RAM and SSDs are where the pain is most visible today, but rest assured that the same forces are quietly reshaping all aspects of consumer hardware. One of the most immediate and tangible consequences of this broader supply-chain realignment is a round of sharp, cascading price hikes across consumer electronics, with LPDDR memory standing out as an early pressure point that most consumers didn’t recognize until it was already unavoidable.
LPDDR is used in smartphones, laptops, tablets, handheld consoles, routers, and increasingly even low-power PCs. It sits at the intersection of consumer demand and enterprise prioritization, making it uniquely vulnerable when manufacturers reallocate capacity toward “AI” accelerators, servers, and data-center-grade memory, where margins are higher and contracts are long-term. As fabs shift production toward HBM and server DRAM, as well as GPU wafers, consumer hardware production quietly becomes non-essential, tightening supply just as devices become more power- and memory-hungry, all while continuing on their path to remain frustratingly unserviceable and un-upgradable.

The result is a ripple effect, in which device makers pay more for chips and memory and pass those costs on through higher retail prices, cut base configurations to preserve margins, or lock features behind premium tiers. At the same time, consumers lose the ability to compensate by upgrading later, because most components these days, like LPDDR, are soldered down by design. This is further amplified by scarcity, as even modest supply disruptions can spike prices disproportionately in a market where just a few suppliers dominate, turning what should be incremental cost increases into sudden jumps that affect entire product categories at once. In practice, this means that phones, ultrabooks, and embedded devices are becoming more expensive overnight, not because of new features, but because the invisible silicon inside them has quietly become a contested resource in a world that no longer builds hardware primarily for consumers.

In late January 2026, the Western Digital CEO confirmed during an earnings call that the company’s entire HDD production capacity for calendar year 2026 is already sold out. Let that sink in for a moment. Q1 hasn’t even ended and a major hard drive manufacturer has zero remaining capacity for the year.
Firm purchase orders are in place with its top customers, and long-term agreements already extend into 2027 and 2028. Consumer revenue now accounts for just 5% of Western Digital’s total sales, while cloud and enterprise clients make up 89%. The company has, for all practical purposes, stopped being a consumer storage company. And Western Digital is not alone. Kioxia, one of the world’s largest NAND flash manufacturers, admitted that its entire 2026 production volume is already in a “sold out” state, with the company expecting tight supply to persist through at least 2027 and long-term customers facing 30% or higher year-on-year price increases. Adding to this, the Silicon Motion CEO put it bluntly during a recent earnings call: “We’re facing what has never happened before: HDD, DRAM, HBM, NAND… all in severe shortage in 2026.” The Phison CEO has gone even further, warning that the NAND shortage could persist until 2030, and that it risks the “destruction” of entire segments of the consumer electronics industry. He also noted that factories are now demanding prepayment for capacity three years in advance, an unprecedented practice that effectively locks out smaller players.

The collateral damage of this can already be felt, and it’s significant. For example, Valve confirmed that the Steam Deck OLED is now out of stock intermittently in multiple regions “due to memory and storage shortages”. All models are currently unavailable in the US and Canada, the cheaper LCD model has been discontinued entirely, and there is no timeline for when supply will return to normal. Valve has also been forced to delay the pricing and launch details for its upcoming Steam Machine console and Steam Frame VR headset, directly citing memory and storage shortages. At the same time, Sony is considering delaying the PlayStation 6 to 2028 or even 2029, and Nintendo is reportedly contemplating a price increase for the Switch 2, less than a year after its launch.
Both decisions are seemingly driven by the same memory supply constraints. Meanwhile, Microsoft has already raised prices on the Xbox. Now you might think that everything so far is about GPUs and other gaming-related hardware, but that couldn’t be further from the truth. General computing devices like the Raspberry Pi are not immune to any of this either. The Raspberry Pi Foundation has been forced to raise prices twice in three months, with the flagship Raspberry Pi 5 (16GB) jumping from $120 at launch to $205 as of February 2026, a 70% increase driven entirely by LPDDR4 memory costs. What was once a symbol of affordable computing is rapidly being priced out of reach for the educational and hobbyist communities it was designed to serve. HP, on the other hand, seems to have already prepared for the hardware shortage by launching a laptop subscription service where you pay a monthly fee to use a laptop but never own it, no matter how long you subscribe. While HP frames this as a convenience, the timing, right in the middle of a hardware affordability crisis, makes it feel a lot more like a preview of a rented compute future. But more on that in a second.

“But we’ve seen price spikes before, due to crypto booms, pandemic shortages, factory floods and fires!”, you might say. And while we did live through those crises, things eventually eased when bubbles popped and markets or supply chains recovered. The current situation, however, doesn’t appear to be going away anytime soon, as it looks like the industry’s priorities have fundamentally changed. These days, the biggest customers are not gamers, creators, PC builders or even crypto miners anymore. Today, it’s hyperscalers. Companies that use hardware for “AI” training clusters, cloud providers, enterprise data centers, as well as governments and defense contractors. Compared to these hyperscalers, consumers are small fish in a big pond.
These buyers don’t care if RAM costs 20% more, and they don’t wait for Black Friday deals either. Instead, they sign contracts measured in exabytes and billions of dollars. With such clients lining up, the consumer market, in contrast, is suddenly an inconvenience for manufacturers. Why settle for smaller margins and deal with higher marketing and support costs, fragmented SKUs, price sensitivity and retail logistics headaches, when you can have behemoths throwing money at you? Why sell a $100 SSD to one consumer, when you can sell a whole rack of enterprise NVMe drives to a data center with circular, virtually infinite money? Guaranteed volume, guaranteed profit, zero marketing. The industry has answered these questions loudly.

All of this goes to show that the consumer market is not just deprioritized; it is being starved. In fact, IDC has already warned that the PC market could shrink by up to 9% in 2026 due to skyrocketing memory prices, and has described the situation not as a cyclical shortage but as “a potentially permanent, strategic reallocation of the world’s silicon wafer capacity”. Leading PC OEMs including Lenovo, Dell, HP, Acer, and ASUS have all signaled 15-20% PC price increases for 2026, with some models seeing even steeper hikes. Framework, the repairable laptop company, has also been transparent about rising memory costs impacting its pricing. And analyst Jukan Choi recently revised his shortage timeline estimate, noting that DRAM production capacity is expected to grow at just 4.8% annually through 2030, with even that incremental capacity concentrated on HBM rather than consumer memory. TrendForce’s latest forecast projects DRAM contract prices rising by 90-95% quarter over quarter in Q1 2026. And that is not a typo.

The price of hardware is one thing, but value-for-money is another aspect that appears to be only getting worse from here on. Already today consumer parts feel like cut-down versions of enterprise silicon.
As “AI” accelerators and server chips dominate R&D budgets, consumer improvements will slow even further, or arrive at higher prices justified as premium features. This is true for CPUs and GPUs, and it will be equally true for motherboards, chipsets, power supplies, networking, etc. We will likely see fewer low-end options, more segmentation, artificial feature gating, and generally higher baseline prices that, once established, won’t be coming back down again. As enterprise standards become the priority, consumer gear is becoming an afterthought that is being rebadged, overpriced, and poorly supported.

The uncomfortable truth is that the consumer hardware market is no longer the center of gravity, as we all were able to see at this year’s CES. It’s orbiting something much larger, and none of this is accidental. The industry isn’t failing, it’s succeeding, just not for you. And to be fair, from a corporate standpoint, this pivot makes perfect sense. “AI” and enterprise customers are rewriting revenue charts, all while consumers continue to be noisy, demanding, and comparatively poor. It is pretty clear that consumer hardware is becoming a second-class citizen, which means that the machines we already own are more valuable than we might think right now.

“But what does the industry think the future will look like if nobody can afford new hardware?”, you might be asking. There is a darker, conspiratorial interpretation of today’s hardware trends that reads less like market economics and more like a rehearsal for a managed future.
Businesses, having discovered that ownership is inefficient and obedience is profitable, are quietly steering society toward a world where no one owns compute at all, where hardware exists only as an abstraction rented back to the public through virtual servers, SaaS subscriptions, and metered experiences, and where digital sovereignty, which anyone with a PC tower under their desk once had, becomes an outdated, eccentric, and even suspicious concept.

Imagine a morning in said future, where an ordinary citizen wakes up, taps their terminal, a sealed device without ports, storage, or sophisticated local execution capabilities, and logs into their Personal Compute Allocation: a bundle of cloud CPU minutes, RAM credits, and storage tokens leased from a conglomerate whose logo has quietly replaced the word “computer” in everyday speech, just like “to search” has made way for “to google”. Installing software is no longer a concept, because software no longer exists as a thing, but only as a service tier in which every task routes through servers owned by entities. Entities that insist that this is all for the planet. Entities that outlawed consumer hardware years ago under the banner of environmental protectionism, citing e-waste statistics, carbon budgets, and unsafe unregulated silicon, while conveniently ignoring that the data centers humming beyond the city limits burn more power in an hour than the old neighborhood ever did in a decade. In this world, the ordinary citizen remembers their parents’ dusty Personal Computer, locked away in a storage unit like contraband. A machine that once ran freely, offline if it wanted, immune to arbitrary account suspensions and pricing changes.
As they go about their day, paying a micro-fee to open a document, losing access to their own photos because a subscription lapsed, watching a warning banner appear when they type something that violates the ever-evolving terms of service, and shouting “McDonald’s!” to skip the otherwise unskippable ads within every other app they open, they begin to understand that the true crime of consumer hardware wasn’t primarily pollution but independence. They realize that owning a machine meant owning the means of computation, and that by centralizing hardware under the guise of efficiency, safety, and sustainability, society traded resilience for convenience and autonomy for comfort. In this dyst… utopia, nothing ever breaks because nothing is yours, nothing is repairable because nothing is physical, and nothing is private because everything runs somewhere else, on someone else’s computer. The quiet moral, felt when the network briefly stutters and the world freezes, is that keeping old hardware alive was never nostalgia or paranoia, but a small, stubborn act of digital self-defense; a refusal to accept that the future must be rented, permissioned, and revocable at any moment.

If you think that dystopian “rented compute over owned hardware” future could never happen, think again. In fact, you’re already likely renting rather than owning in many different areas. Your means of communication are run by Meta, your music is provided by Spotify, your movies are streamed from Netflix, your data is stored in Google’s data centers, and your office suite runs on Microsoft’s cloud. Maybe even your car is leased instead of owned, and you pay a monthly premium for seat heating or sElF-dRiViNg, whatever that means.
After all, the average Gen Z and Millennial US consumer today apparently has 8.2 subscriptions, not including their DaIlY aVoCaDo ToAsTs and StArBuCkS cHoCoLate ChIp LaTtEs that the same Boomers responsible for the current (and past) economic crises love to dunk on.

Besides, look no further than what’s already happening in, for example, China, a country that manufactures massive amounts of the world’s sought-after hardware yet faces restrictions on buying that very hardware. In recent years, a complex web of export controls and chip bans has put a spotlight on how hardware can become a geopolitical bargaining chip rather than a consumer good. For example, export controls imposed by the United States in recent years barred Nvidia from selling many of its high-performance GPUs into China without special licenses, significantly reducing legal access to cutting-edge compute inside the country. Meanwhile, enforcement efforts have repeatedly busted smuggling operations moving prohibited Nvidia chips into Chinese territory through Southeast Asian hubs, with over $1 billion worth of banned GPUs reportedly moving through gray markets, even as official channels remain restricted. Coverage by outlets such as Bloomberg, as well as actual investigative journalism like Gamers Nexus, has documented these black-market flows and the lengths to which both sides go to enforce or evade restrictions, including smuggling networks and increased regulatory scrutiny. On top of this, Chinese regulators have at times restricted domestic tech firms from buying specific Nvidia models, further underscoring how government policy can override basic market access for hardware, even in the country where much of that hardware is manufactured. While some of these export rules have seen partial reversals or regulatory shifts, the overall situation highlights a world in which hardware access is increasingly determined by politics, security regimes, and corporate strategy, and not by consumer demand.
This should serve as a cautionary tale for anyone who thinks owning their own machines won’t matter in the years to come.

In an ironic twist, however, one of the few potential sources of relief may, in fact, come from China. Two Chinese manufacturers, CXMT (ChangXin Memory Technologies) and YMTC (Yangtze Memory Technologies), are embarking on their most aggressive capacity expansions ever, viewing the global shortage as a golden opportunity to close the gap with the incumbent big three (Samsung, SK Hynix, Micron). CXMT is now the world’s fourth-largest DRAM maker by production volume, holding roughly 10-11% of global wafer capacity, and is building a massive new DRAM facility in Shanghai expected to be two to three times larger than its existing Hefei headquarters, with volume production targeted for 2027. The company is also preparing a $4.2 billion IPO on Shanghai’s STAR Market to fund further expansion and has reportedly delivered HBM3 samples to domestic customers including Huawei. YMTC, traditionally a NAND flash supplier, is constructing a third fab in Wuhan with roughly half of its capacity dedicated to DRAM, and has reached 270-layer 3D NAND capability, rapidly narrowing the gap with Samsung (286 layers) and SK Hynix (321 layers). Its NAND market share by shipments reached 13% in Q3 2025, close to Micron’s 14%. What’s particularly notable is that major PC manufacturers are already turning to these suppliers. However, as mentioned before, with hardware having become a geopolitical topic, both companies face ongoing (US-imposed) restrictions. Hence, for example, HP has indicated it would only use CXMT chips in devices for non-US markets. Nevertheless, for consumers worldwide the emergence of viable fourth and fifth players in the memory market represents the most tangible hope of eventually breaking the current supply stranglehold.
Whether that relief arrives in time to prevent lasting damage to the consumer hardware ecosystem remains an open question, though.

Polymarket bet prediction: A non-zero percentage of people will confuse Yangtze Memory Technologies with the Haskell programming language.

The reason I’m writing all of this isn’t to create panic, but to help put things into perspective. You don’t need to scavenger-hunt for legacy parts in your local landfill (yet) or swear off upgrades forever, but you do need to recognize that the rules have changed. The market that once catered to enthusiasts and everyday users is turning its back. So take care of your hardware, stretch its lifespan, upgrade thoughtfully, and don’t assume replacement will always be easy or affordable. That PC, laptop, NAS, or home server isn’t disposable anymore. Clean it, maintain it, repaste it, replace fans and protect it, as it may need to last far longer than you originally planned. Also, realize that the best time to upgrade your hardware was yesterday and that the second best time is now. If you can afford sensible upgrades, especially RAM and SSD capacity, it may be worth doing sooner rather than later. Not for performance, but for insurance, because the next time something fails, it might be unaffordable to replace, as the era of casual upgrades seems to be over. Five-year systems may become eight- or ten-year systems. Software bloat will hurt more and will require re-thinking. Efficiency will matter again. And looking at it from a different angle, maybe that’s a good thing.

Additionally, the assumption that prices will normalize again at some point is most likely a pipe dream. The old logic of “wait a year and it’ll be cheaper” no longer applies when manufacturers are deliberately constraining supply. If you need a new device, buy it; if you don’t, however, there is absolutely no need to spend money on the minor yearly refresh cycle any longer, as the returns will keep diminishing.
And again, looking at it from a different angle, that is probably also a good thing.

Consumer hardware is heading toward a bleak future where owning powerful, affordable machines becomes harder or maybe even impossible, as manufacturers abandon everyday users to chase vastly more profitable data centers, “AI” firms, and enterprise clients. RAM and SSD price spikes, Micron’s exit from the consumer market, and the resulting Samsung/SK Hynix duopoly are early warning signs of a broader shift that will eventually affect CPUs, GPUs, and the entire PC ecosystem. With large manufacturers having sold out their entire production capacity to hyperscalers for the rest of the year while simultaneously cutting consumer production by double-digit percentages, consumers will have to take a back seat. Already today consumer hardware is overpriced, out of stock, or even intentionally delayed due to supply issues. In addition, manufacturers are pivoting towards consumer hardware subscriptions, where you never own the hardware. In the most dystopian trajectory, consumers might not buy any hardware at all, with the exception of low-end thin clients that are merely interfaces, and will rent compute through cloud platforms, losing digital sovereignty in exchange for convenience. And despite all of this sounding like science fiction, there is already hard evidence proving that access to hardware can in fact be politically and economically revoked. Therefore I am urging you to maintain and upgrade wisely, and hold on to your existing hardware, because ownership may soon be a luxury rather than the norm.


Type safe interpreters

It is well known that most software has bugs [Citation needed]. It is also well known that interpreters are software 1. One issue that one might run into in an interpreter is type safety: It may very well be possible for your typed language's interpreter to nevertheless end up in a state where it is asked to evaluate . What can we do about that? GADTs are a technique for increasing type safety in a program, by "indexing" a constructor by a type. One of the canonical examples for this is an -like. Imagine we wanted a type-safe way to ensure that in some cases, we could always extract the value inside. One way of doing this is to add an extra type parameter, and use that to mark whether something is inside. We will use OCaml for this demonstration. Note that we have used a different syntax than usual. Normally, we put a type parameter in the constructor, like , and then use that throughout ( ). However, as we are changing that type in the "return" of the constructor, we need a syntax that allows us to change that. marks a type parameter we "don't need to name" (as we always specify it), and in this case, we always specify both. Then, we add our elements before a , mimicking a function. The general form of this in OCaml is . Now, if we have something of type , we know it contains a value! The only other way a can be formed is via , but that gives it type , so it'd be type-incorrect. This means we can write a function like and have it be total; there is no missing case, because the missing case is a type error. Note that when writing a function like , we need to use this interesting piece of syntax: This is because otherwise OCaml eagerly thinks "Ah, must be !" (or equivalent for ) when seeing the case in the pattern match. The designator tells it to keep as general as possible, instead of refining it. How does this allow us to get a more type-safe interpreter, though? Well, let's start by introducing our expression type: We've added space for a type parameter.
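Since the post's inline code was lost in extraction, here is a reconstructed sketch of the option-like GADT described above. The names (`opt`, `filled`, `empty`, `get`) are hypothetical stand-ins, not necessarily the post's originals:

```ocaml
(* Two empty phantom types that index whether a value is present. *)
type filled
type empty

(* The second type parameter is fixed by each constructor's "return" type. *)
type (_, _) opt =
  | Some : 'a -> ('a, filled) opt
  | None : ('a, empty) opt

(* Total: the None case would be ill-typed here, so no case is missing.
   The `type a.` syntax keeps the type as general as possible. *)
let get : type a. (a, filled) opt -> a = function
  | Some x -> x
```

With this, `get (Some 42)` evaluates to `42`, while `get None` is rejected at compile time rather than failing at run time.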
What should integers and booleans look like? Maybe they should be indexed by their respective "meta" types. Now, we want to add a case for and . Naturally, we shouldn't be able to add two booleans, so let's restrict it to only taking in s. Similarly for negate. We've already added type safety! It's impossible to write , because expects a , but is a . Let's keep going: (or ) should take two integers and return a boolean, and should take a boolean as its first argument, but anything for its second and third. Hence note that we can still use generic parameters - we just need to think about when it makes sense to do so, to get the type safety we want. Now for functions. What should a function look like here? One way of implementing them is with a "Higher-order abstract syntax" (HOAS) 3 approach, where we use the functions of the host language to make functions in the interpreted language. In that case, we should take a function as the argument to our constructor, but then what is our result type? That function type! We'll need to know it later to safely write , so we "store" it in the type now. We also want the function to take and return an , so we can do constructions like . Note that the function is not an at first, as otherwise we would have no way to construct it in the first place. (Remember, this is how we make functions.) Then . It should take in a function, and a value, and then return the function applied to the value. This hence translates as: As a finale, we'll add pairs and projections. I won't elaborate on these, hoping that it's becoming obvious how they work. Now let's look at some examples and their types. We have that a construction like is not just an error, but a type error! This means it's caught at compile time! From here, the evaluator itself is very simple. We just write the interpretation of our language, as usual: The and cases are interesting. In the former, we need to return a function of type .
We can start by introducing a function , but we have , so we just apply it. In the latter, we need to to get a function from , which is of type . We can then pass it , which is of type to get a , which we must then evaluate again to get a . Now we can run the above examples. Remember, all of the above is completely type safe! We have successfully constructed an interpreter that can never type error. How does one produce such terms from parsed source, though? In short, it can be hard. One way is to have a "base" language that's parsed, and then have your typechecker produce a representation like the above that you then know is type correct. However, this does prevent typechecker bugs from sneaking through, which was our entire goal! To some extent, this is kicking the can down the road - you are now relying on your "host language"'s typechecker being correct. However, this is a numbers game; if you're writing in OCaml or Haskell or similar, the chances of that are very, very high unless you're using extremely weird features. The astute may have noted that this is essentially a version of denotational semantics. We're interpreting into the type-safe domain of the host language to ensure that the types always line up (as they must, for the interpretation to succeed, and the typechecker of the host language ensures that). No, I won't do the same joke twice. Most of the time. ↩ I am not going into what "ADT" stands for, as it's a whole nother debate. ↩ Higher-order abstract syntax. A bit too in-depth to explore in full right now, but worth exploring. ↩
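Putting the pieces together, here is a condensed reconstruction of the kind of GADT-indexed expression type and evaluator the post walks through. Constructor names are illustrative guesses, not necessarily the post's own:

```ocaml
(* Expressions indexed by the type they evaluate to. HOAS functions
   store the host function type ('a expr -> 'b expr) in the index,
   so App can be typed safely. *)
type _ expr =
  | Int  : int -> int expr
  | Bool : bool -> bool expr
  | Add  : int expr * int expr -> int expr              (* ints only *)
  | Neg  : int expr -> int expr
  | Eq   : int expr * int expr -> bool expr
  | If   : bool expr * 'a expr * 'a expr -> 'a expr     (* generic arms *)
  | Fun  : ('a expr -> 'b expr) -> ('a expr -> 'b expr) expr
  | App  : ('a expr -> 'b expr) expr * 'a expr -> 'b expr
  | Pair : 'a expr * 'b expr -> ('a * 'b) expr
  | Fst  : ('a * 'b) expr -> 'a expr
  | Snd  : ('a * 'b) expr -> 'b expr

(* Total evaluator: no failure case exists, because an ill-typed term
   like Add (Int 1, Bool true) cannot even be constructed. *)
let rec eval : type a. a expr -> a = function
  | Int n -> n
  | Bool b -> b
  | Add (a, b) -> eval a + eval b
  | Neg a -> - (eval a)
  | Eq (a, b) -> eval a = eval b
  | If (c, t, e) -> if eval c then eval t else eval e
  | Fun f -> (fun x -> f x)              (* return the host function *)
  | App (f, x) -> eval ((eval f) x)      (* eval f, apply, eval again *)
  | Pair (a, b) -> (eval a, eval b)
  | Fst p -> fst (eval p)
  | Snd p -> snd (eval p)
```

For example, `eval (App (Fun (fun x -> Add (x, Int 1)), Int 41))` evaluates to `42`, and swapping in `Bool true` for `Int 41` is a compile-time type error.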

Abhinav Sarkar 1 month ago

Implementing Co, a Small Language With Coroutines #5: Adding Sleep

In the previous post , we added channels to Co , the small language we are implementing in this series of posts. In this post, we add the primitive to it, enabling time-based coroutine scheduling. We then use sleep to build a simulation of digital logic circuits. This post was originally published on abhinavsarkar.net . This post is a part of the series: Implementing Co, a Small Language With Coroutines . Sleep is a commonly used operation in concurrent programs. It pauses the execution of the current Thread of Computation (ToC) for a specified duration, after which the ToC is resumed automatically. Sleep is used for various purposes: polling for events, delaying execution of an operation, simulating latency, implementing timeouts, and more. Sleep is generally implemented as a primitive operation in most languages, delegating the actual implementation to the underlying operating system. The operating system's scheduler removes the ToC from the list of runnable ToCs, places it in a list of sleeping ToCs, and after the specified duration, moves it back to the list of runnable ToCs for scheduling. Since Co implements its own ToC (coroutine) scheduler, we implement sleep as a primitive operation within the interpreter itself 1. We start by exposing and as built-in functions to Co: The built-in function takes one argument—the duration in milliseconds to sleep for. The function returns the current time in milliseconds since the Unix epoch. Both of them delegate to the functions explained next. The function evaluates its argument to a number, checks that it is non-negative, and then calls the function in the monad. calls and returns the milliseconds wrapped as a . The implementation of sleep is more involved than other built-in functions because it interacts with the coroutine scheduler. When a coroutine calls , we want to suspend the coroutine, and schedule it to be resumed after the specified duration.
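The two primitives bottom out in ordinary host-language calls. A minimal Haskell sketch, with hypothetical names (`currentMillis`, `sleepMillis`; the series' actual function names are elided above):

```haskell
import Control.Concurrent (threadDelay)
import Data.Time.Clock.POSIX (getPOSIXTime)

-- Current time in milliseconds since the Unix epoch.
currentMillis :: IO Integer
currentMillis = round . (* 1000) <$> getPOSIXTime

-- Sleep for the given number of milliseconds, rejecting negative
-- durations (mirroring the non-negativity check described above).
sleepMillis :: Integer -> IO ()
sleepMillis ms
  | ms < 0    = error "sleep: negative duration"
  | otherwise = threadDelay (fromIntegral ms * 1000)  -- threadDelay is in microseconds
```

In the actual interpreter, of course, the sleeping coroutine must not block the whole interpreter thread like this; that is exactly what the delayed-coroutine machinery below is for.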
There may be multiple coroutines in the sleep state at a time, and they must be resumed according to their wakeup time (the time at which sleep was called plus the sleep duration), and not in any other order. To be efficient, it is also important that the scheduler does not poll repeatedly for new coroutines to wake up and run, but instead waits till the right time. These are the two requirements for our coroutine scheduler. And the solution is: delayed coroutines.

The coroutines we have implemented so far were scheduled to run immediately. To implement sleep, we extend the coroutine concept with Delayed Coroutines—coroutines that are scheduled to run at a specific future time. The coroutine data type now holds an MVar to signal when the coroutine is ready to be run. The old-style coroutines that run immediately are created already ready to run. But delayed coroutines are different: the MVar used for signaling is created empty. We fork a thread 2 that sleeps 3 for the specified sleep duration, and then signals that the coroutine is ready to run by filling the MVar.

An MVar is a synchronization primitive 4—essentially a mutable box that can hold a value or be empty. When we call takeMVar on an empty MVar, it blocks until another thread fills it. This is what makes it powerful for our use case: instead of the interpreter repeatedly polling the queue asking "is this coroutine ready yet?", we let the interpreter wait on the MVar. The forked thread signals readiness at the right time by filling the MVar. The interpreter wakes up immediately—no wasted CPU cycles, no busy-waiting.

We already have a queue of coroutines in our interpreter state. It is a min-priority queue sorted by timestamps, which we have been using as a FIFO queue till now. Now we use it for its real purpose: storing delayed coroutines sorted by their wakeup times. The queue also tracks the maximum wakeup time of all coroutines in the queue.
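Python has no MVar, but a size-one blocking queue behaves similarly. This illustrative sketch (Python, not Co's actual GHC-backed implementation) shows the no-busy-wait handoff: the forked thread sleeps, then fills the box, and the waiting side blocks until then:

```python
import queue
import threading
import time

# A crude MVar stand-in: a queue.Queue(maxsize=1). `put` fills the box and
# `get` blocks while it is empty. (Python sketch; Co itself uses GHC's MVar.)
ready = queue.Queue(maxsize=1)

def waker(delay_s: float) -> None:
    time.sleep(delay_s)       # the forked thread sleeps for the duration...
    ready.put("wake up!")     # ...then signals readiness by filling the box

start = time.monotonic()
threading.Thread(target=waker, args=(0.05,), daemon=True).start()

msg = ready.get()             # blocks, with no polling, until the box is filled
elapsed = time.monotonic() - start
print(msg, elapsed >= 0.05)   # → wake up! True
```

The waiting side burns no CPU while blocked, which is exactly the property we want from the interpreter's scheduler.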
This information is useful for calculating how long the interpreter should sleep before termination. The core operations on the queue are: an enqueue function that enqueues a given value at a given time in the queue; a convenience function that enqueues a value at the current time, thus scheduling it to run immediately; and a dequeue function that dequeues the value with the lowest priority from the queue, which in our FIFO usage so far is the value that was enqueued first. A helper function returns the monotonically increasing current system time. Since the dequeue function removes the value with the lowest priority, if we use the wakeup time as the priority, it dequeues the coroutine that is to be run next. That works! The enqueue function calculates and tracks the maximum wakeup time of the coroutines as well.

Next, we implement the scheduling of delayed coroutines. The scheduling function enqueues a coroutine in the interpreter's coroutine queue with the specified wakeup time. We also improve the function that runs the next coroutine, making it wait for the coroutine to be ready before running it: its takeMVar call blocks till the thread that was forked when creating the coroutine wakes up and fills the MVar. So we don't have to poll the queue. That's all we have to do to get delayed coroutines.

With the infrastructure in place, the implementation of sleep becomes straightforward. When a coroutine calls sleep, we capture the current environment and the continuation—the code that should run after the sleep completes. We then create a new delayed coroutine with this continuation, schedule it for the future, and run the next coroutine in the queue. The scheduler machinery takes care of running the delayed coroutine at the right time.

We also modify the interpreter's termination logic from the previous post to handle delayed coroutines. It now sleeps till the last wakeup time before checking if the queue is empty, using a helper that calculates how long to sleep before the last coroutine becomes ready. That's all for sleeping.
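Abstracting away the Haskell specifics, the overall discipline (a min-heap keyed by wakeup time; sleep until the earliest deadline; then run) can be sketched in a few lines. This is an illustrative Python toy, not Co's actual implementation:

```python
import heapq
import time

# Toy scheduler sketch: keep (wakeup_time, task) pairs in a min-heap and run
# each task when its deadline arrives, sleeping (not polling) in between.
class Scheduler:
    def __init__(self):
        self.queue = []   # min-heap of (wakeup_time, seq, task)
        self.seq = 0      # tie-breaker so heapq never compares the tasks

    def schedule(self, delay_s: float, task) -> None:
        heapq.heappush(self.queue, (time.monotonic() + delay_s, self.seq, task))
        self.seq += 1

    def run(self) -> None:
        while self.queue:
            wakeup, _, task = heapq.heappop(self.queue)
            now = time.monotonic()
            if wakeup > now:
                # Wait until the earliest wakeup time instead of busy-polling.
                time.sleep(wakeup - now)
            task()

order = []
s = Scheduler()
s.schedule(0.02, lambda: order.append("second"))
s.schedule(0.01, lambda: order.append("first"))
s.run()
print(order)  # → ['first', 'second']
```

Note that tasks run in wakeup-time order, not in the order they were scheduled, which is exactly the first requirement above.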
This may be too much to take in, so let's go through some examples. Sleep can be used for polling/waiting for events, delaying execution, simulating latency, implementing timeouts, and more. Let's see some simple uses.

An interesting example of sleep is the infamous sleep sort, which sorts a list of numbers by spawning a coroutine for each number that sleeps for the duration of that number, then prints it. Running this program prints what we expect. Don't use sleep sort for actually sorting your numbers though. Moving on.

With sleep, we can implement JavaScript-like setTimeout and setInterval functions. setTimeout spawns a coroutine that sleeps for the specified duration and then calls the callback function. setInterval repeatedly calls a callback at a fixed interval, using setTimeout to reschedule itself. Running the above code prints two alternating messages every second, forever. Notice that the scheduling is not accurate down to the millisecond, but only approximate.

As a more complex example of using sleep, we implement a simulator for digital logic circuits, from basic logic gates to a ripple-carry adder. The idea is to model circuits as a network of wires and gates, where the wires carry digital signal values (0 or 1), and the logic gates transform input signals to output signals with a propagation delay. The digital circuit simulation example is from the Wizard Book. Quoting an example:

An inverter is a primitive function box [logic gate] that inverts its input. If the input signal to an inverter changes to 0, then one inverter-delay later the inverter will change its output signal to 1. If the input signal to an inverter changes to 1, then one inverter-delay later the inverter will change its output signal to 0.

But first, we'll need to make some lists. We implement a simple cons list (a singly linked list) using a trick from the book itself: we create an empty list, and grow it by prepending an element to it with a cons function; two accessor functions return the first element of a list and the rest of it.
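The closure trick for cons cells works in any language with closures. Here is a hedged Python rendition (the function names are mine, not Co's): a cell is a closure over its two parameters that hands one of them back when given a selector.

```python
# Closure-based cons cells, in the spirit of the Wizard Book trick.
# Names (cons/first/rest) are illustrative, not Co's actual names.
empty = None

def cons(head, tail):
    # The cell is just a closure; the selector picks which captured value to return.
    return lambda selector: head if selector == "first" else tail

def first(lst):
    return lst("first")

def rest(lst):
    return lst("rest")

lst = cons(1, cons(2, cons(3, empty)))  # the list [1, 2, 3]

items = []
while lst is not None:
    items.append(first(lst))
    lst = rest(lst)
print(items)  # → [1, 2, 3]
```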
Notice that a cell is just a closure that holds references to its first and rest parameters, and returns a selector function to retrieve them. Next, we define a helper function to call a list of actions, yielding after each one.

A wire holds a mutable signal value and a list of actions to call when the signal changes. A wire provides three operations: one returns the current signal value; another sets a new signal value and calls all the actions if the value changed; and the third adds an action to be called when the signal changes, calling it immediately as well. A connect function connects two wires, causing the signal from one to propagate to the other.

First, we define the basic logic operations, and a utility function to schedule a function to run after a delay. With these building blocks, we define the logic gates. Each gate computes its output based on its inputs and schedules the output update after a propagation delay specific to the gate. We add an action to each input wire, which runs when the input signals change, and sets the signal on the output wire after a delay.

Let's test an And gate. For probing, we define a helper that logs signal changes along with the milliseconds elapsed since the start of the run. The output shows that it works as expected. You can notice the sleep and the And gate delay in action.

Using the basic logic gates, next we build adders. A Half adder is a digital circuit that adds two bits: it has two input bits and two output bits, the sum and the carry. We simply connect the And, Or and Not gates with input, output and intermediate wires in our code as shown in the diagram. Nice and simple. Let's test it. The output matches binary addition. Correct! Notice again how the signal propagation through the gates is delayed.

Next up is the full adder. A Full adder adds three bits, two inputs and a carry-in. Notice that a full adder uses two half adders. Again, we follow the diagram and connect the wires. Let's skip the demo for the full adder and jump to something more exciting. A Ripple-carry adder chains together multiple full adders to add multi-bit numbers. The diagram below shows a four-bit adder. We create a ripple-carry adder that can add any number of bits.
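Stripped of wires and propagation delays, the boolean logic a half adder computes is tiny: the sum bit is XOR of the inputs, and the carry bit is AND. A quick Python check (independent of the Co simulation):

```python
# The combinational logic of a half adder, minus wires and delays:
# sum = a XOR b, carry = a AND b. (Python check, not the Co simulator.)
def half_adder(a: int, b: int) -> tuple[int, int]:
    return a ^ b, a & b  # (sum, carry)

for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"{a} + {b} = carry {c}, sum {s}")
# 1 + 1 yields carry 1, sum 0: that is, binary 10
```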
First we need some helper functions: one creates a list of wires to represent an N-bit input/output; another sets the bits of an N-bit wire list to a given N-bit value. Now we write the ripple-carry adder. It uses one full adder per bit, cascading the carry-out bit of each input bit-pair's sum to the next pair of bits. To demonstrate, let's add two 4-bit numbers. This one runs for a while because of the collective delays, so let me pick out the final output: we add the two numbers in binary and get the correct sum again. Everything works perfectly.

With the addition of sleep, we've completed our implementation of Co—a concurrent language with first-class coroutines, channels, and time-based scheduling. Over these five posts, we went from parsing source code to building a full interpreter that handles cooperative multitasking using coroutines. The key insight was realizing that coroutines are just environments plus continuations. By designing our interpreter to use continuation-passing style, we gained the ability to suspend execution at any point and resume it later. Channels built naturally on top of that, providing a way for coroutines to synchronize and pass messages. And sleep extended the scheduler to handle time-based execution, unlocking patterns like timeouts and periodic tasks. The examples we built along the way—a pubsub system, an actor system, and a digital circuit simulation—show what becomes possible once these primitives are in place. Starting with basic arithmetic and functions, we ended up with a language capable of expressing real concurrent programs.

What comes next? Maybe a compiler for Co? Stay tuned by subscribing to the feed or the email newsletter. The full code for the Co interpreter is available here. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!
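The arithmetic the ripple-carry adder implements can be checked in a few delay-free lines of Python (again outside the Co simulator): build a full adder from two half adders, then chain full adders, feeding each stage's carry-out into the next.

```python
# Delay-free model of the ripple-carry adder (a Python check, not the simulator).
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    s1, c1 = a ^ b, a & b          # first half adder
    s2, c2 = s1 ^ cin, s1 & cin    # second half adder
    return s2, c1 | c2             # (sum, carry-out)

def ripple_add(a_bits: list[int], b_bits: list[int]) -> list[int]:
    """Add two equal-length bit lists, least significant bit first."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

# 0b0110 (6) + 0b0111 (7), least significant bit first:
print(ripple_add([0, 1, 1, 0], [1, 1, 1, 0]))  # → [1, 0, 1, 1, 0], i.e. 0b01101 = 13
```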
1. The sleep implementation in Co is not interruptible. That is, if a coroutine is sleeping, it cannot be resumed before the specified duration. This is different from sleep implementations in most programming languages, where the sleep operation can be interrupted by sending a signal to the sleeping ToC. ↩︎
2. Threads in GHC are green threads and are very cheap to create and run. It is perfectly okay to fork a new one for each delayed coroutine. ↩︎
3. So in a way, we cheat here by using the sleep primitive provided by the GHC runtime to implement our sleep primitive. If we write a compiler for Co, we'll have to write our own runtime, where we'll have to implement our sleep function using the functionality provided by the operating system. ↩︎
4. To learn more about how MVars can be used to communicate between threads, read chapter 24 of Real World Haskell. ↩︎

This post is a part of the series: Implementing Co, a Small Language With Coroutines. If you liked this post, please leave a comment.

seated.ro 2 months ago

glimpses of the future

Glimpse can now build call graphs, showing you exactly how functions relate to each other in your codebase. This works by parsing your code with tree-sitter, extracting function definitions and calls, then resolving those calls to their actual definitions. Sometimes tree-sitter based resolution isn’t enough. Maybe you’re dealing with dynamic dispatch, generics, or just a language with particularly complex module resolution. For this, Glimpse can use LSPs to resolve definitions semantically. This spins up actual LSP servers and uses goto-definition / goto-implementation to resolve calls. It’s slower, but accurate. Glimpse will attempt to auto-install the LSP servers for you. Glimpse eagerly caches whatever it finds into an incremental index. But you can choose to pre-build the index ahead of time for instant queries. The index stores all the definitions, calls, and resolutions so subsequent queries are fast. Glimpse now supports: Go, Rust, C, C++, Python, TypeScript, JavaScript, Zig, Java, Scala, Nix, Lua, Ruby, C#, Kotlin, Swift, and Haskell. Each language has custom tree-sitter queries for extracting definitions, calls, and imports. The grammars are downloaded and compiled automatically on first use.
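The extract-definitions-then-resolve-calls pipeline can be illustrated in miniature. The toy below uses Python's stdlib ast module instead of tree-sitter, and resolves only same-module calls; it is an illustrative sketch, not Glimpse's code:

```python
import ast

# Toy call-graph extractor: collect function definitions, then resolve each
# call site against them. Only handles direct, same-module calls.
def call_graph(source: str) -> dict[str, set[str]]:
    tree = ast.parse(source)
    fns = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    defs = {fn.name for fn in fns}
    graph: dict[str, set[str]] = {fn.name: set() for fn in fns}
    for fn in fns:
        for node in ast.walk(fn):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in defs):
                graph[fn.name].add(node.func.id)  # resolved to a local definition
    return graph

src = """
def helper(): pass
def work(): helper()
def main(): work(); helper()
"""
print(sorted(call_graph(src)["main"]))  # → ['helper', 'work']
```

Dynamic dispatch, imports, and cross-module resolution are exactly the cases where this naive approach breaks down, which is why falling back to an LSP's goto-definition makes sense.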

Abhinav Sarkar 2 months ago

Polls I Ran on Mastodon in 2025

In 2025, I ran ten polls on Mastodon exploring various topics, mostly to outsource my research to the hivemind. Here are the poll results organized by topic, with commentary.

How do you pronounce JSON? (January 15, 2025) I'm in the "Jay-Son, O as in Otter" camp, which is the majority response. It seems like most Americans prefer the "Jay-Son, O as in Utter" option. Thankfully, only one person in the whole world says "Jay-Ess-On".

If someone were to write a new compiler book today, what would you prefer the backend to emit? (October 31, 2025) LLVM wins this poll hands down. It is interesting to see WASM beating other targets.

Which is your favourite Haskell parsing library? (November 3, 2025) I didn't expect Attoparsec to go toe-to-toe with Megaparsec. I did some digging, and it seems like Megaparsec is the clear winner when it comes to parsing programming languages in Haskell. However, for parsing file formats and network protocols, Attoparsec is the most popular one. I think that's wise, and I'm inclined to make the same choice.

If you were to write a compiler in Haskell, would you use a lens library to transform the data structures? (July 11, 2025) This one has mixed results. Personally, I'd like to use a minimal lens library if I'm writing a compiler in Haskell.

What do you think is the right length of programming-related blog posts (containing code) in terms of reading time? (May 18, 2025) As a writer of programming-related blog posts, this poll was very informative for me. 10-minute-long posts seem to be the most popular option, but my own posts are a bit longer, usually between 15–20 minutes.

Do you print blog posts or save them as PDFs for offline reading? (March 8, 2025) Most people do not seem to care about saving or printing blog posts. But I went ahead and added (decent) printing support for my blog posts anyway.

If you have a personal website and you do not work in academia, do you have your résumé or CV on your website? (August 30, 2025) I don't have a public résumé on my website either. I'd like to, but I don't think anyone visiting my website would read it.

Would people be interested in a series of blog posts where I implement the C compiler from the "Writing a C Compiler" book by Nora Sandler in Haskell? (November 11, 2025) Well, 84% of people voted "Yes", so this is (most certainly) happening in 2026!

If I were to release a service to run on servers, how would you prefer I package it? (December 30, 2025) Well, people surely love their Docker images. Surprisingly, many are okay with just source code and build instructions. Statically linked executables are more popular now, probably because of the ease of deployment. Many also commented that they'd prefer an OS-specific package like deb or rpm. However, my personal preference is a Nix package and NixOS module.

If you run services on Hetzner, do you keep a backup of your data entirely off Hetzner? (August 9, 2025) It is definitely wise to have an offsite backup. I'm still figuring out the backup strategy for my VPS.

That's all for this year. Let's see what polls I come up with in 2026. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! This post was originally published on abhinavsarkar.net. If you liked this post, please leave a comment.

matklad 2 months ago

Newtype Index Pattern In Zig

In efficiency-minded code, it is idiomatic to use indexes rather than pointers. Indexes have several advantages:

First, they save memory. Typically a 32-bit index is enough, a saving of four bytes per pointer on 64-bit architectures. I haven't seen this measured, but my gut feeling is that this is much more impactful than it might initially seem. On modern architectures, saving memory saves time (and energy) as well, because the computing bottleneck is often the bit pipe between the memory and the CPU, not the computation per se. Dense data structures use CPU cache more efficiently, removing the prohibitive latency of memory accesses. Bandwidth savings are even better: smaller item size obviously improves bandwidth utilization, but having more items in cache obviates the need to use the bandwidth in the first place. Best case, the working set fits into the CPU cache!

Note well that memory savings are evenly spread out. Using indexes makes every data structure slightly more compact, which improves performance across the board, regardless of hotspot distribution. It's hard to notice the potential for such a saving in a profiler, and even harder to test it out. For these two reasons, I would default to indexes for code where speed matters, even when I don't have the code written yet to profile it!

There's also a more subtle way in which indexes save memory. Using indexes means storing multiple items in an array, but such dense storage contains extra information in the relative positions of the items. If you need to store a list of items, you can often avoid materializing the list of indexes by storing a range "pointing" into the shared storage. Occasionally, you can even do the UTF-8 trick and use just a single bit to mark the end of a list.

The second benefit of indexes is more natural modeling of cyclic and recursive data structures. Creating a cycle fundamentally requires mutability somewhere ("tying the knot" in Haskell relies on mutability of lazy thunks). This means that you need to make some pointers nullable, and that usually gets awkward even without a borrow checker behind your back. Even without cycles and just recursion, pointers are problematic, due to a combination of two effects: pointers encourage recursive functions, and recursive data structures lead to arbitrarily long (but finite) chains of pointers. The combination works fine at small scale, but then it fails with stack overflow in production every single time, requiring awkward work-arounds. For example, a tool that serializes error traces from nested macro expansions as a deeply nested tree of JSON objects requires using the stacker hack when parsing the output (which you'll learn about only after it crashes in the hands of macro connoisseur users).

Finally, indexes greatly help serialization: they make it trivial to communicate data structures both through space (sending a network message) and time (saving to disk and reading later). Indexes are naturally relocatable; it doesn't matter where in memory they are. But this is just half of the serialization benefit. The other half is that, because everything is in a few arrays, you can do bulk serialization. You don't need to write the items one by one; you can directly copy the arrays around (but be careful not to leak data via padding, and be sure to checksum the result).

The big problem with "naive" indexes is of course using the right index with the wrong array, or vice versa. The standard solution here is to introduce a newtype wrapper around the raw index. @andrewrk recently popularized a nice "happy accident of language design" pattern for this in Zig. The core idea is to define an index via a non-exhaustive enum.

In Zig, an enum designates a strongly-typed collection of integer constants, not a Rust-style ADT (there's union(enum) for that). By default, the backing integer type is chosen by the compiler, but you can manually override it with the enum(u32) syntax. Finally, Zig allows making enums non-exhaustive by adding a trailing _ member. In a non-exhaustive enum, any numeric value is valid, and some have symbolic labels. The @intFromEnum and @enumFromInt builtins switch the abstraction level between a raw integer and an enum value. So, enum(u32) { _ } is a way to spell "u32, but a distinct type". Note that there's no strong encapsulation boundary here; anyone can convert back and forth with those builtins. Zig just doesn't provide language-enforced encapsulation mechanisms.

Putting everything together, this is how I would model an n-ary tree with parent pointers in Zig. Some points of note:

- As usual with indexes, you start by defining the collective noun first, a tree rather than a node.
- In my experience, you usually don't want an Index suffix in your index types; the short name should name the index, not the underlying data.
- Nested types are good! Nesting the index type inside the container type feels just right.
- For readability, the order is fields, then nested types, then functions.
- In the index enum, we have a couple of symbolic constants: one for the root node that is stored first, and one for whenever we want to apply offensive programming and make bad indexes blow up. Here, we use the latter for the "null" parent. An alternative would be to use an optional, but that would waste space, or to make the root its own parent.
- If you care about performance, it's a good idea to assert the sizes of structures, not to prevent changes, but as a comment that explains to the reader just how large the struct is.
- Of the two spellings for representing ranges, I don't know which one I like more, but I use the former just because the names align in length.
- Both shapes are reasonable for the API; I don't know which one I prefer more. I default to the former because it works even if there are several node arguments.

P.S. Apparently I also wrote a Rust version of this post a while back? https://matklad.github.io/2018/06/04/newtype-index-pattern.html
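The pattern translates outside Zig too. Here's a hedged Python approximation (all names mine): NewType gives a distinct index type with the same runtime representation as int, and the tree stores nodes in a flat array addressed by those indexes.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import NewType

# NodeIndex mirrors Zig's `enum(u32) { _ }` trick: same representation as the
# raw integer, but a distinct type as far as the type checker is concerned.
NodeIndex = NewType("NodeIndex", int)

ROOT = NodeIndex(0)    # the root node is stored first
NULL = NodeIndex(-1)   # sentinel used for the "no parent" case

@dataclass
class Node:
    value: int
    parent: NodeIndex = NULL

@dataclass
class Tree:
    nodes: list[Node] = field(default_factory=list)

    def add(self, value: int, parent: NodeIndex) -> NodeIndex:
        self.nodes.append(Node(value, parent))
        return NodeIndex(len(self.nodes) - 1)

    def depth(self, i: NodeIndex) -> int:
        d = 0
        while self.nodes[i].parent != NULL:
            i = self.nodes[i].parent
            d += 1
        return d

tree = Tree()
root = tree.add(10, NULL)
child = tree.add(20, root)
grandchild = tree.add(30, child)
print(tree.depth(grandchild))  # → 2
```

As in Zig, there is no hard encapsulation: NewType is only a hint to static checkers, much like @enumFromInt lets anyone conjure an index.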

Abhinav Sarkar 2 months ago

Solving Advent of Code 2025 in Janet: Days 5–8

I’m solving the Advent of Code 2025 in Janet. After doing the last five years in Haskell, I wanted to learn a new language this year. I’ve been eyeing the “New Lisps” 1 for a while now, and I decided to learn Janet. Janet is a Clojure-like Lisp that can be interpreted, embedded and compiled, and comes with a large standard library with concurrency, HTTP and PEG parser support. I want to replace Python with Janet as my scripting language. Here are my solutions for December 5–8. This post was originally published on abhinavsarkar.net. This post is a part of the series: Solving Advent of Code 2025 in Janet.

All my solutions follow the same structure because I wrote a template to create new empty solutions. Actually, I added a fair bit of automation this time to build, run, test and benchmark the solutions.

Parsing the day 5 input was a bit involved because of the two different formats. Other than that, the range-merging function is the most interesting part. Since I sorted the ranges up front, I needed to do only one linear scan of the ranges, merging the current one with the previous one if possible. The trick here was to be correct about finding overlapping ranges and calculating the merged range. I made multiple mistakes but eventually figured it out.

Day 6 was entirely a parsing-based problem, and Janet was well suited to it. Parts 1 and 2 required the input to be parsed differently, so the parser is parameterized. In part 1, I ignored whitespace in numbers, while in part 2, it was significant. So I passed two different patterns to parse the numbers in the two parts. I had to write one helper function myself because it is not built into Janet. The rest of it was straightforward. Notice how I used threading macros to write the computations linearly.

I solved part 1 of day 7 by simply folding over the input rows, propagating the beam, and splitting it when required. I used a set of indices to keep track of the current indices at which the beam was present.
The only tricky thing here was using a dict to simulate a set, because Janet does not have sets built in. That's what the dict-manipulating code is doing. Part 2 was harder. I first wrote a brute-force solution to count the number of paths, but it never finished running. The number of paths is \(O(2^n)\), making it impossible to solve with brute force. There may be better solutions, but I simply added a dict-based cache, and that made it work.

Day 8 required me to do several new things. It was immediately clear to me that I needed a Disjoint Set to keep track of the connected points. So I wrote one in object-oriented Janet! Object-orientation in Janet is prototype-based, pretty much like JavaScript. You can see the methods in the code above.

I first computed all unique pairs and the distances between them, and sorted the pairs by distance. In part 1, I union-ed the closest \(k\) pairs, while in part 2, I kept going till all points were connected in one circuit. This worked, but it took really long to run: over 600ms. I was not satisfied. After a night's sleep, I realized that I do not need to sort all pairs but only the top \(k\), where \(k\) is much smaller than the total number of pairs (~500000). So I rewrote the function to use a max binary heap that keeps only the closest \(k\) pairs, passing \(k\) as a parameter, which, after a bit of experimentation, I set to 5500. The rest of the functions stayed unchanged. This change provided over 10x speedup, reducing the run time to under 60ms 2!

You can see the mutable nature of Janet in all its glory in this solution. I had several gotcha moments when I tried to mix higher-order functions, such as map and filter, with mutable data structures in Janet. Not only are they confusing together, they also result in slower code, because Janet does not have persistent data structures like Clojure: every map, filter, etc. results in a new array being created.
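For readers unfamiliar with the structure, here's a minimal disjoint-set (union-find) sketch, shown in Python rather than Janet, with path compression and union by size. It's illustrative, not my actual OO-Janet code:

```python
class DisjointSet:
    """Minimal union-find with path compression and union by size."""
    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        # Walk up to the root, halving the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra          # attach smaller tree under larger
        self.size[ra] += self.size[rb]
        return True

ds = DisjointSet(5)
ds.union(0, 1)
ds.union(1, 2)
print(ds.find(0) == ds.find(2))  # → True
print(ds.find(0) == ds.find(4))  # → False
```

Union-ing the closest pairs in distance order, as in part 1, is exactly the core step of Kruskal's minimum-spanning-tree algorithm.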
My advice is to not mix functional programming code with procedural programming code in Janet. That's it for now. The next note will drop after 4 or 5 days. You can browse the code repo to see the full setup. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

1. The new Lisps that interest me are: Janet, Fennel and Jank. ↩︎
2. You may ask why I didn't write the max-heap as OO-Janet code. Well, I did, and I found that it was 50% slower than the procedural version shown here. I guess the dispatch overhead for methods is too much. ↩︎

This post is a part of the series: Solving Advent of Code 2025 in Janet. If you liked this post, please leave a comment.

Abhinav Sarkar 2 months ago

Solving Advent of Code 2025 in Janet: Day 1–4

I’m solving the Advent of Code 2025 in Janet. After doing the last five years in Haskell, I wanted to learn a new language this year. I’ve been eyeing the “New Lisps” 1 for a while now, and I decided to learn Janet. Janet is a Clojure-like Lisp that can be interpreted, embedded and compiled, and comes with a large standard library with concurrency, HTTP and PEG parser support. I want to replace Python with Janet as my scripting language. Here are my solutions for Dec 1–4. This post was originally published on abhinavsarkar.net.

All my solutions follow the same structure because I wrote a template to create new empty solutions. Actually, I added a fair bit of automation this time to build, run, test and benchmark the solutions.

Day 1 was a bit mathy but it didn't take too long to figure out. I spent more time polishing the solution to be idiomatic Janet code. The PEG grammar to parse the input was the most interesting part for me on the day. If you know Janet, you can notice this is not the cleanest code, but that's okay, it was my day 1 too.

The most interesting part of the day 2 solution was a macro that reads the input at compile-time and creates a custom function to check whether a number is in one of the given ranges. This turned out to be almost 4x faster than writing the same thing as a function. Notice the PEG grammar to parse the input. So short and clean! I also leaned into the imperative and mutable nature of the Janet data structures. The code is still not the cleanest, as I was still learning.

The first part of day 3 was pretty easy to solve, but using the same solution for the second part just ran forever. I realized that this is a dynamic programming problem, but I don't like doing array-based solutions, so I simply rewrote the solution to add caching. And it worked! It is definitely on the slower side, but I'm okay with it. The code has become a little more idiomatic Janet. Day 4 is when I learned more about Janet control flow structures.
The solution for part 2 is a straightforward breadth-first traversal. The interesting parts are the control flow statements. So concise and elegant! That's it for now. The next note will drop after 4 or 5 days. You can browse the code repo to see the full setup. If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading!

1. The new Lisps that interest me are: Janet, Fennel and Jank. ↩︎

If you liked this post, please leave a comment.

Fernando Borretti 3 months ago

Ad-Hoc Emacs Packages with Nix

You can use Nix as a package manager for Emacs, like so: Today I learned you can also use it to create ad-hoc packages for things not in MELPA or nixpkgs.

The other day I wanted to get back into Inform 7; naturally, the first stack frame of the yak shave was to look for an Emacs mode. One exists, but isn't packaged anywhere. So I had to vendor it in. You can use git submodules for this, but I have an irrational aversion to submodules. Instead I did something far worse: I wrote a Makefile to download the file from GitHub, and used home-manager to copy it into place. Which is nasty. And of course this only works for small, single-file packages. And, on top of that: whatever dependencies your vendored packages need have to be listed alongside everything else, which confuses the packages you want with the transitive dependencies of your vendored packages. I felt like the orange juice bit from The Simpsons. There must be a better way!

And there is. With some help from Claude, I wrote this: Nix takes care of everything: commit pinning, security (with the SHA-256 hash), and dependencies for custom packages. And it works wonderfully. Armed with a new hammer, I set out to drive some nails.

Today I created a tiny Haskell project, and when I opened the file, I noticed it had no syntax highlighting. I was surprised to find there's no mode for it in MELPA. But coincidentally, someone started working on this literally three weeks ago! So I wrote a small expression to package this new mode:

A few weeks back I switched from macOS to Linux, and since I'm stuck on X11 because of stumpwm, I'm using XCompose to define keybindings for entering dashes, smart quotes, etc. It bothered me slightly that my XCompose file didn't have syntax highlighting. I found a mode in kragen's repo, but it's slightly broken (it's missing a call at the end). I started thinking about how hard it would be to write a Nix expression to modify the source after fetching, when I found that Thomas Voss hosts a patched version here.
Which made this very simple: Somehow the version of in nixpkgs unstable was missing the configuration option to use a custom shell. Since I want to use nu instead of bash, I had to package this myself from the latest commit: I started reading Functional Programming in Lean recently, and while there is a , it’s not packaged anywhere. This only required a slight deviation from the pattern: when I opened a file, I got an error about a missing JSON file. Consulting the README for , it says: If you use a source-based package manager (e.g. , Straight, or Elpaca), then make sure to list the directory in your Lean4-Mode package recipe. To do this I had to use rather than :

mcyoung 4 months ago

Why SSA?

If you’ve read anything about compilers in the last two decades or so, you have almost certainly heard of SSA compilers, a popular architecture featured in many optimizing compilers, including ahead-of-time compilers such as LLVM, GCC, Go, CUDA (and various shader compilers), Swift 1, and MSVC 2, and just-in-time compilers such as HotSpot C2 3, V8 4, SpiderMonkey 5, LuaJIT, and the Android Runtime 6. SSA is hugely popular, to the point that most compiler projects no longer bother with other IRs for optimization 7. This is because SSA is incredibly nimble at the types of program analysis and transformation that compiler optimizations want to do on your code. But why? Many of my friends who don’t do compilers often say that compilers seem like opaque magical black boxes, and SSA, as it often appears in the literature, is impenetrably complex. But it’s not! SSA is actually very simple once you forget everything you think your programs are actually doing. We will develop the concept of SSA form, a simple SSA IR, prove facts about it, and design some optimizations on it. I have previously written about the granddaddy of all modern SSA compilers, LLVM. This article is about SSA in general, and won’t really have anything to do with LLVM. However, it may be helpful to read that article to make some of the things in this article feel more concrete. SSA is a property of intermediate representations (IRs), primarily used by compilers for optimizing imperative code that targets a register machine. Register machines are computers that feature a fixed set of registers that can be used as the operands for instructions: this includes virtually all physical processors, including CPUs, GPUs, and weird things like DSPs.
SSA is most frequently found in compiler middle-ends , the optimizing component between the frontend (which deals with the surface language programmers write, and lowers it into the middle-end’s IR), and the backend (which takes the optimized IR and lowers it into the target platform’s assembly). SSA IRs, however, often have little resemblance to the surface language they lower out of, or the assembly language they target. This is because neither of these representations make it easy for a compiler to intuit optimization opportunities. Imperative code consists of a sequence of operations that mutate the executing machine’s state to produce a desired result. For example, consider the following C program: This program returns no matter what its input is, so we can optimize it down to this: But, how would you write a general algorithm to detect that all of the operations cancel out? You’re forced to keep in mind program order to perform the necessary dataflow analysis, following mutations of and through the program. But this isn’t very general, and traversing all of those paths makes the search space for large functions very big. Instead, you would like to rewrite the program such that and gradually get replaced with the expression that calculates the most recent value, like this: Then we can replace each occurrence of a variable with its right-hand side recursively… Then fold the constants together… And finally, we see that we’re returning , and can replace it with . All the other variables are now unused, so we can delete them. The reason this works so well is because we took a function with mutation, and converted it into a combinatorial circuit , a type of digital logic circuit that has no state, and which is very easy to analyze. The dependencies between nodes in the circuit (corresponding to primitive operations such as addition or multiplication) are obvious from its structure. 
For example, consider the following circuit diagram for a one-bit multiplier: This graph representation of a program has two huge benefits: The powerful tools of graph theory can be used to algorithmically analyze the program and discover useful properties, such as operations that are independent of each other or whose results are never used. The operations are not ordered with respect to each other except when there is a dependency; this is useful for reordering operations, something compilers really like to do. The reason combinatorial circuits are the best circuits is because they are directed acyclic graphs (DAGs), which admit really nice algorithms. For example, longest path in a graph is NP-hard (and because P ≠ NP 8, has complexity O(2^n)). However, if the graph is a DAG, it admits an O(n) solution! To understand this benefit, consider another program: Suppose we wanted to replace each variable with its definition like we did before. We can’t just replace each constant variable with the expression that defines it though, because we would wind up with a different program! Now, we pick up an extra term because the squaring operation is no longer unused! We can put this into circuit form, but it requires inserting new variables for every mutation. But we can’t do this when complex control flow is involved! So all of our algorithms need to carefully account for mutations and program order, meaning that we don’t get to use the nice graph algorithms without careful modification. SSA stands for “static single assignment”, and was developed in the 80s as a way to enhance the existing three-argument code (where every statement is in the form ) so that every program was circuit-like, using a very similar procedure to the one described above. The SSA invariant states that every variable in the program is assigned to by precisely one operation.
If every operation in the program is visited once, they form a combinatorial circuit. Transformations are required to respect this invariant. In circuit form, a program is a graph where operations are nodes, and “registers” (which is what variables are usually called in SSA) are edges (specifically, each output of an operation corresponds to a register). But, again, control flow. We can’t hope to circuitize a loop, right? The key observation of SSA is that most parts of a program are circuit-like. A basic block is a maximal circuital component of a program. Simply put, it is a sequence of non-control flow operations, and a final terminator operation that transfers control to another basic block. The basic blocks themselves form a graph, the control flow graph , or CFG. This formulation of SSA is sometimes called SSA-CFG 9 . This graph is not a DAG in general; however, separating the program into basic blocks conveniently factors out the “non-DAG” parts of the program, allowing for simpler analysis within basic blocks. There are two equivalent formalisms for SSA-CFG. The traditional one uses special “phi” operations (often called phi nodes , which is what I will call them here) to link registers across basic blocks. This is the formalism LLVM uses. A more modern approach, used by MLIR, is block arguments : each basic block specifies parameters, like a function, and blocks transferring control flow to it must pass arguments of those types to it. Let’s look at some code. First, consider the following C function which calculates Fibonacci numbers using a loop. How might we express this in an SSA-CFG IR? Let’s start inventing our SSA IR! It will look a little bit like LLVM IR, since that’s what I’m used to looking at. Every block ends in a , which transfers control to one of several possible blocks. In the process, it calls that block with the given arguments. One can think of a basic block as a tiny function which tails 10 into other basic blocks in the same function. 
aside Phi Nodes LLVM IR is… older, so it uses the older formalism of phi nodes. “Phi” comes from “phony”, because it is an operation that doesn’t do anything; it just links registers from predecessors. A operation is essentially a switch-case on the predecessors, each case selecting a register from that predecessor (or an immediate). For example, has two predecessors, the implicit entry block , and . In a phi node IR, instead of taking a block argument for , it would specify The value of the operation is the value from whichever block jumped to this one. This can be awkward to type out by hand and read, but is a more convenient representation for describing algorithms (just “add a phi node” instead of “add a parameter and a corresponding argument”) and for the in-memory representation, but is otherwise completely equivalent. It’s a bit easier to understand the transformation from C to our IR if we first rewrite the C to use goto instead of a for loop: However, we still have mutation in the picture, so this isn’t SSA. To get into SSA, we need to replace every assignment with a new register, and somehow insert block arguments… The above IR code is already partially optimized; the named variables in the C program have been lifted out of memory and into registers. If we represent each named variable in our C program with a pointer, we can avoid needing to put the program into SSA form immediately. This technique is used by frontends that lower into LLVM, like Clang. We’ll enhance our IR by adding a declaration for functions, which defines scratch space on the stack for the function to use. Each stack slot produces a pointer that we can from and to. Our Fibonacci function would now look like so: Any time we reference a named variable, we load from its stack slot, and any time we assign it, we store to that slot. This is very easy to get into from C, but the code sucks because it’s doing lots of unnecessary pointer operations. 
How do we get from this to the register-only function I showed earlier? aside Program Order We want program order to not matter for the purposes of reordering, but as we’ve written code here, program order does matter: loads depend on prior stores but stores don’t produce a value that can be used to link the two operations. We can restore not having program order by introducing operands representing an “address space”; loads and stores take an address space as an argument, and stores return a new address space. An address space, or , represents the state of some region of memory. Loads and stores are independent when they are not connected by a argument. This type of enhancement is used by Go’s SSA IR, for example. However, it adds a layer of complexity to the examples, so instead I will hand-wave this away. Now we need to prove some properties about CFGs that are important for the definition and correctness of our optimization passes. First, some definitions. The predecessors (or “preds”) of a basic block are the set of blocks with an outgoing edge to that block. A block may be its own predecessor. Some literature calls the above “direct” or immediate predecessors. For example, the preds of in our example are (the special name for the function entry-point) . The successors (no, not “succs”) of a basic block are the set of blocks with an outgoing edge from that block. A block may be its own successor. The successors of are and . The successors are listed in the loop’s . If a block is a transitive pred of a block , we say that weakly dominates , or that it is a weak dominator of . For example, , and both weakly dominate . However, this is not usually an especially useful relationship. Instead, we want to speak of dominators: A block is a dominator (or dominates ) if every pred of is dominated by , or if is itself. Equivalently, the dominator set of is the intersection of the dominator sets of its preds, plus .
The dominance relation has some nice order properties that are necessary for defining the core graph algorithms of SSA. We only consider CFGs which are flowgraphs, that is, all blocks are reachable from the root block , which has no preds. This is necessary to eliminate some pathological graphs from our proofs. Importantly, we can always ask for an acyclic path 11 from to any block . An equivalent way to state the dominance relationship is that every path from to contains all of ’s dominators. proposition dominates iff every path from to contains . First, assume every to path contains . If is , we’re done. Otherwise we need to prove each predecessor of is dominated by ; we do this by induction on the length of acyclic paths from to . Consider preds of that are not , and consider all acyclic paths p from to ; by appending to them, we have an acyclic path p′ from to , which must contain . Because both the last and second-to-last elements of this path are not , it must be within the path p, which is shorter than p′. Thus, by induction, dominates , and therefore . Going the other way, if dominates , consider a path p from to . The second-to-last element of p is a pred of ; if it is we are done. Otherwise, we can consider the path p′ made by deleting at the end. is dominated by , and p′ is shorter than p, so we can proceed by induction as above. Onto those nice properties. Dominance allows us to take an arbitrarily complicated CFG and extract from it a DAG, composed of blocks ordered by dominance. The dominance relation is a partial order. Dominance is reflexive and transitive by definition, so we only need to show blocks can’t dominate each other. Suppose distinct and dominate each other. Pick an acyclic path p from to . Because dominates , there is a prefix p′ of this path ending in . But because dominates , some prefix p′′ of p′ ends in .
But now p must contain twice, contradicting that it is acyclic. This allows us to write when dominates . There is an even more refined graph structure that we can build out of dominators, which follows immediately from the partial order theorem. The dominators of a basic block are totally ordered by the dominance relation. Suppose and , but neither dominates the other. Then, there must exist acyclic paths from to which contain both, but in different orders. Take the subpaths of those paths which follow , and , neither of which contains . Concatenating these paths yields a path from to that does not contain , a contradiction. This tells us that the DAG we get from the dominance relation is actually a tree, rooted at . The parent of a node in this tree is called its immediate dominator . Computing dominators can be done iteratively: the dominator set of a block is the intersection of the dominator sets of its preds, plus . This algorithm runs in quadratic time. A better algorithm is the Lengauer-Tarjan algorithm. It is relatively simple, but explaining how to implement it is a bit out of scope for this article. I found a nice treatment of it here . What’s important is we can compute the dominator tree without breaking the bank, and given any node, we can ask for its immediate dominator. Using immediate dominators, we can introduce the final, important property of dominators. The dominance frontier of a block is the set of all blocks not dominated by with at least one pred which dominates. These are points where control flow merges from distinct paths: one containing and one not. The dominance frontier of is , whose preds are and . There are many ways to calculate dominance frontiers, but with a dominance tree in hand, we can do it like this: algorithm Dominance Frontiers. For each block with more than one pred, for each of its preds, let be that pred. Add to the dominance frontier of and all of its dominators, stopping when encountering ’s immediate dominator.
We need to prove that every block examined by the algorithm winds up in the correct frontiers. First, we check that every examined block is added to the correct frontier. If , where is a pred of , and is ’s immediate dominator, then if , is not in its frontier, because must dominate . Otherwise, must be in ’s frontier, because dominates a pred but it cannot dominate , because then it would be dominated by , a contradiction. Second, we check that every frontier is complete. Consider a block . If an examined block is in its frontier, then must be among the dominators of some pred , and it must be dominated by ’s immediate dominator; otherwise, would dominate (and thus would not be in its frontier). Thus, gets added to ’s frontier. You might notice that all of these algorithms are quadratic. This is actually a very good time complexity for a compilers-related graph algorithm. Cubic and quartic algorithms are not especially uncommon, and yes, your optimizing compiler’s time complexity is probably cubic or quartic in the size of the program! Ok. Let’s construct an optimization. We want to figure out if we can replace a load from a pointer with the most recent store to that pointer. This will allow us to fully lift values out of memory by cancelling out store/load pairs. This will make use of yet another implicit graph data structure. The dataflow graph is the directed graph made up of the internal circuit graphs of each basic block, connected along block arguments. To follow a use-def chain is to walk this graph forward from an operation to discover operations that potentially depend on it, or backwards to find operations it potentially depends on. It’s important to remember that the dataflow graph, like the CFG, does not have a well defined “up” direction. Navigating it and the CFG requires the dominator tree. One other important thing to remember here is that every instruction in a basic block always executes if the block executes.
In much of this analysis, we need to appeal to “program order” to select the last load in a block, but we are always able to do so. This is an important property of basic blocks that makes them essential for constructing optimizations. For a given , we want to identify all loads that depend on it. We can follow the use-def chain of to find which blocks contain loads that potentially depend on the store (call it ). First, we can eliminate loads within the same basic block (call it ). Replace all instructions after (but before any other s, in program order) with ’s def. If is not the last store in this block, we’re done. Otherwise, follow the use-def chain of to successors which use , i.e., successors whose case has as at least one argument. Recurse into those successors, now replacing the pointer of interest with the parameters of the successor which were set to (more than one argument may be ). If successor loads from one of the registers holding , replace all such loads before a store to . We also now need to send into somehow. This is where we run into something of a wrinkle. If has exactly one predecessor, we need to add a new block argument to pass whichever register is holding (which exists by induction). If is already passed into by another argument, we can use that one. However, if has multiple predecessors, we need to make sure that every path from to sends , and canonicalizing those will be tricky. Worse still, if is in ’s dominance frontier, a different store could be contributing to that load! For this reason, dataflow from stores to loads is not a great strategy. Instead, we’ll look at dataflow from loads backwards to stores (in general, dataflow from uses to defs tends to be more useful), which we can use to augment the above forward dataflow analysis to remove the complex issues around dominance frontiers. Let’s analyze loads instead. For each in , we want to determine all stores that could potentially contribute to its value.
We can find those stores as follows: We want to be able to determine which register in a given block corresponds to the value of , and then find its last store in that block. To do this, we’ll flood-fill the CFG backwards in BFS order. This means that we’ll follow preds (through the use-def chain) recursively, visiting each pred before visiting their preds, and never revisiting a basic block (except we may need to come back to at the end). Determining the “equivalent” 12 of in (we’ll call it ) can be done recursively: while examining , follow the def of . If is a block parameter, for each pred , set to the corresponding argument in the case in ’s . Using this information, we can collect all stores that the load potentially depends on. If a predecessor stores to , we add the last such store in (in program order) to our set of stores, and do not recurse to ’s preds (because this store overwrites all past stores). Note that we may revisit in this process, and collect a store to from it if one occurs in the block. This is necessary in the case of loops. The result is a set of pairs. In the process, we also collected a set of all blocks visited, , which are dominators of which we need to plumb a through. This process is called memory dependency analysis , and is a key component of many optimizations. Not all contributing operations are stores. Some may be references to globals (which we’re disregarding), or function arguments or the results of a function call (which means we probably can’t lift this load). For example, if gets traced all the way back to a function argument, there is a code path which loads from a pointer whose stores we can’t see. It may also trace back to a stack slot that is potentially not stored to. This means there is a code path that can potentially load uninitialized memory. Like LLVM, we can assume this is not observable behavior, so we can discount such dependencies.
If all of the dependencies are uninitialized loads, we can potentially delete not just the load, but operations which depend on it (reverse dataflow analysis is the origin of so-called “time-traveling” UB). Now that we have the full set of dependency information, we can start lifting loads. Loads can be safely lifted when all of their dependencies are stores in the current function, or dependencies we can disregard thanks to UB in the surface language (such as loads or uninitialized loads). There is a lot of fuss in this algorithm about plumbing values through block arguments. A lot of IRs make a simplifying change, where every block implicitly receives the registers from its dominators as block arguments. I am keeping the fuss because it makes it clearer what’s going on, but in practice, most of this plumbing, except at dominance frontiers, would be happening in the background. Suppose we can safely lift some load. Now we need to plumb the stored values down to the load. For each block in (all other blocks will now be in unless stated otherwise). We will be building two mappings: one , which is the register equivalent to in that block. We will also be building a map , which is the value that must have in that block. Prepare a work queue, with each in it initially. Pop a block from the queue. For each successor (in ): If isn’t already defined, add it as a block argument. Have pass to that argument. If hasn’t been visited yet, and isn’t the block containing the load we’re deleting, add it to the queue. Once we’re done, if is the block that contains the load, we can now replace all loads to before any stores to with . There are cases where this whole process can be skipped, by applying a “peephole” optimization. For example, stores followed by loads within the same basic block can be optimized away locally, leaving the heavy-weight analysis for cross-block store/load pairs. Here’s the result of doing dependency analysis on our Fibonacci function.
Each load is annotated with the blocks and stores in . Let’s look at . Its contributing loads are in and . So we add a new parameter : in , we call that parameter with (since that’s stored to it in ), while in , we pass . What about L4? The contributing loads are also in and , but one of those isn’t a pred of . is also in the subgraph for this load, though. So, starting from , we add a new parameter to and feed (the stored value, an immediate this time) through it. Now looking at , we see there is already a parameter for this load ( ), so we just pass as that argument. Now we process , which was pushed onto the queue. gets a new parameter , which is fed ’s own . We do not re-process , even though it also appears in ’s gotos, because we already visited it. After doing this for the other two loads, we get this: After lifting, if we know that a stack slot’s pointer does not escape (i.e., none of its uses wind up going into a function call 13 ) or a write to a global (or a pointer that escapes), we can delete every store to that pointer. If we delete every store to a stack slot, we can delete the stack slot altogether (there should be no loads left for that stack slot at this point). This analysis is simple, because it assumes pointers do not alias in general. Alias analysis is necessary for more accurate dependency analysis. This is necessary, for example, for lifting loads of fields of structs through subobject pointers, and dealing with pointer arithmetic in general. However, our dependency analysis is robust to passing different pointers as arguments to the same block from different predecessors. This is the case that is specifically handled by all of the fussing about with dominance frontiers. This robustness ultimately comes from SSA’s circuital nature. Similarly, this analysis needs to be tweaked to deal with something like (a ternary, essentially).
s of pointers need to be replaced with s of the loaded values, which means we need to do the lifting transformation “all at once”: lifting some liftable loads will leave the IR in an inconsistent state, until all of them have been lifted. Many optimizations will make a mess of the CFG, so it’s useful to have simple passes that “clean up” the mess left by transformations. Here are some easy examples. If an operation’s result has zero uses, and the operation has no side-effects, it can be deleted. This allows us to then delete operations that it depended on that now have no uses. Doing this is very simple, due to the circuital nature of SSA: collect all instructions whose outputs have zero uses, and delete them. Then, examine the defs of their operands; if those operations now have no uses, delete them, and recurse. This bubbles up all the way to block arguments. Deleting block arguments is a bit trickier, but we can use a work queue to do it. Put all of the blocks into a work queue. Pop a block from the queue. Run unused result elimination on its operations. If it now has parameters with no uses, remove those parameters. For each pred, delete the corresponding arguments to this block. Then, place those preds into the work queue (since some of their operations may have lost their last use). If there is still work left, go to 1. There are many CFG configurations that are redundant and can be simplified to reduce the number of basic blocks. For example, unreachable code can help delete blocks. Other optimizations may cause the at the end of a function to be empty (because all of its successors were optimized away). We treat an empty as being unreachable (since it has no cases!), so we can delete every operation in the block up to the last non-pure operation. If we delete every instruction in the block, we can delete the block entirely, and delete it from its preds’ s.
This is a form of dead code elimination , or DCE, which combines with the previous optimization to aggressively delete redundant code. Some jumps are redundant. For example, if a block has exactly one pred and one successor, the pred’s case for that block can be wired directly to the successor. Similarly, if two blocks are each other’s unique predecessor/successor, they can be fused , creating a single block by connecting the input blocks’ circuits directly, instead of through a . If we have a ternary operation, we can do more sophisticated fusion. If a block has two successors, both of which have the same unique successor, and those successors consist only of gotos, we can fuse all four blocks, replacing the CFG diamond with a . In terms of C, this is this transformation: LLVM’s CFG simplification pass is very sophisticated and can eliminate complex forms of control flow. I am hoping to write more about SSA optimization passes. This is a very rich subject, and viewing optimizations in isolation is a great way to understand how a sophisticated optimization pipeline is built out of simple, dumb components. It’s also a practical application of graph theory that shows just how powerful it can be, and (at least in my opinion), is an intuitive setting for understanding graph theory, which can feel very abstract otherwise. In the future, I’d like to cover CSE/GVN, loop optimizations, and, if I’m feeling brave, getting out of SSA into a finite-register machine (backends are not my strong suit!). Specifically the Swift frontend before lowering into LLVM IR.  ↩ Microsoft Visual C++, a non-conforming C++ compiler sold by Microsoft  ↩ HotSpot is the JVM implementation provided by OpenJDK; C2 is the “second compiler”, which has the best performance among HotSpot’s Java execution engines.  ↩ V8 is Chromium’s JavaScript runtime.  ↩ SpiderMonkey is Firefox’s JavaScript runtime.  ↩ The Android Runtime (ART) is the “JVM” (scare quotes) on the Android platform.
↩ The Glasgow Haskell Compiler (GHC) does not use SSA; it (like some other pure-functional languages) uses a continuation-oriented IR (compare to Scheme’s ).  ↩ Every compiler person firmly believes that P ≠ NP, because program optimization is full of NP-hard problems and we would have definitely found polynomial ideal register allocation by now if it existed.  ↩ Some more recent IRs use a different version of SSA called “structured control flow”, or SCF. Wasm is a notable example of an SCF IR. SSA-SCF is equivalent to SSA-CFG, and polynomial time algorithms exist for losslessly converting between them (LLVM compiling Wasm, for example, converts its CFG into SCF using a “relooping algorithm”). In SCF, operations like switch statements and loops are represented as macro operations that contain basic blocks. For example, a operation might take a value as input, select a basic block to execute based on that, and return the value that basic block evaluates to as its output. RVSDG is a notable innovation in this space, because it allows circuit analysis of entire imperative programs. I am covering SSA-CFG instead of SSA-SCF simply because it’s more common, and because it’s what LLVM IR is. See also this MLIR presentation for converting between the two.  ↩ Tail calling is when a function call is the last operation in a function; this allows the caller to jump directly to the callee, recycling its own stack frame for it instead of requiring it to allocate its own.  ↩ Given any path from to , we can make it acyclic by replacing each subpath from to with a single node.  ↩ When moving from a basic block to a pred, a register in that block which is defined as a block parameter corresponds to some register (or immediate) in each predecessor. That is the “equivalent” of . One possible option for the “equivalent” is an immediate: for example, or the address of a global.
In the case of a global , assuming no data races, we would instead need alias information to tell if stores to this global within the current function (a) exist and (b) are liftable at all. If the equivalent is , we can proceed in one of two ways depending on optimization level. If we want loads of to trap (as in Go), we need to mark this load as not being liftable, because it may trap. If we want loads of to be UB, we simply ignore that pred, because we can assume (for our analysis) that if the pointer is , it is never loaded from.  ↩ Returned stack pointers do not escape: stack slots’ lifetimes end at function exit, so we return a dangling pointer, which we assume are never loaded. So stores to that pointer before returning it can be discarded.  ↩
↩ Microsoft Visual C++, a non-conforming C++ compiler sold by Microsoft  ↩ HotSpot is the JVM implementation provided by OpenJDK; C2 is the “second compiler”, which has the best performance among HotSpot’s Java execution engines.  ↩ V8 is Chromium’s JavaScript runtime.  ↩ SpiderMonkey is Firefox’s JavaScript runtime.  ↩ The Android Runtime (ART) is the “JVM” (scare quotes) on the Android platform.  

Abhinav Sarkar 4 months ago

A Fast Bytecode VM for Arithmetic: The Virtual Machine

In this series of posts, we write a fast bytecode compiler and a virtual machine for arithmetic in Haskell. We explore the following topics: In this final post, we write the virtual machine that executes our bytecode, and benchmark it. This post was originally published on abhinavsarkar.net . This post is part of the series: A Fast Bytecode VM for Arithmetic . Bytecode Virtual Machines (VMs) are known to be faster than AST-walking interpreters. That’s why many real-world programming languages these days are implemented with bytecode VMs, for example, Java , Python , PHP , and Raku . The reason is partially the flat and compact nature of bytecode itself. But VMs also have a few other tricks up their sleeves that make them highly performant. In this post, we write a VM for our arithmetic expression language, and explore some of these performance tricks. But first, we need to finish a pending task. We wrote some unit tests for our compiler in the last post, but unit tests cover only the cases we can think of. A compiler has to deal with any input, and with just unit tests we cannot be sure of its correctness. To test our compiler and other components for correctness, we use the QuickCheck library. QuickCheck is a property-based testing framework. The key idea of property-based testing is to write properties of our code that hold true for any input, and then to automatically generate a large number of arbitrary inputs and make sure that the properties are indeed true for them 1 2 . Since we are writing an arithmetic expression parser/compiler/VM, we generate arbitrary expression ASTs, and use them to assert certain invariants of our program. With QuickCheck, we write generators for the inputs for our tests. These generators are composable just like parser combinators are. We use the library-provided generators to write small generators that we combine to create larger ones. 
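To illustrate the sizing discipline such generators rely on, here is a QuickCheck-free toy sketch (the `Expr` type, the hand-rolled seed function standing in for QuickCheck's `Gen`, and all names are invented for illustration; the post's real generators differ). The key move is that recursive calls receive half the size, which bounds the depth of the generated tree:

```haskell
-- A toy AST standing in for the post's expression type.
data Expr = Num Int | Add Expr Expr | Mul Expr Expr
  deriving Show

-- A tiny linear-congruential step, standing in for QuickCheck's
-- source of randomness.
next :: Int -> Int
next s = (s * 1103515245 + 12345) `mod` 2147483648

-- A sized generator: leaves at small sizes, otherwise a binary node
-- whose operands are generated at *half* the size. The halving bounds
-- the depth at roughly log2(size), so generation always terminates.
genExpr :: Int -> Int -> Expr
genExpr seed size
  | size <= 1 || seed `mod` 4 == 0 = Num (seed `mod` 100)
  | otherwise =
      let s1   = next seed
          s2   = next s1
          node = if even seed then Add else Mul
      in node (genExpr s1 (size `div` 2)) (genExpr s2 (size `div` 2))

-- Depth of a generated expression, to check the size bound.
depth :: Expr -> Int
depth (Num _)   = 1
depth (Add a b) = 1 + max (depth a) (depth b)
depth (Mul a b) = 1 + max (depth a) (depth b)
```

A size-64 expression halves through 64, 32, 16, 8, 4, 2, 1, so its depth can never exceed 7 regardless of the seed.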
Let’s start: First come the basic generators: Moving on to composite generators: generates binary expressions with arbitrary binary operations. It recursively calls to generate the operands. The parameter controls the complexity of the generated expressions, and we halve the size of the operands (and so on recursively) so that we don’t end up with infinitely large expressions. generates expressions by generating an identifier, and then generating the assignment and body expressions recursively. We do the same trick of halving sizes here as well. Notice that the assignment is generated with the passed variable names in scope, whereas the body is generated with the new identifier added to the scope. uses the above generators to generate all kinds of expressions. At smaller sizes, it prefers to generate base expressions, while at larger sizes, it prefers composite ones. Due to the careful recursive halving of size in composite generators, we end up with expressions of finite sizes. Finally, we have some instances of QuickCheck’s type class to tie everything together: We can apply them in GHCi: Notice that the generated samples increase in complexity. With the generators in place, we define our properties next. Let’s test our parser first: This property is a simple round-trip test for the parser and printer: we parse the string representation of a generated expression, and assert that it gives back the same expression. The second property is a more involved round-trip test for the compiler and decompiler: This asserts that compiling an expression, then disassembling and decompiling it, and finally compiling it again should result in the original bytecode 3 . This requires a helper function to get the size of an expression: We run these tests in a later section. This ends our short detour. Now for the main event: the virtual machine. Our VM is a stack-based machine that operates on a stack of values and executes the compiled bytecode. Our goal is to be as fast as possible. 
For a quick reminder, these are our s: And now, the heart of the VM: The function is where the action happens. It is way more complex than , but the complexity has a reason, namely performance. runs inside the monad wrapped with the monad transformer. The monad lets us use mutable data structures locally while ensuring the function remains externally pure. The monad transformer adds support for throwing and propagating errors in a pure manner. We use for our stack, which is a mutable array of unboxed primitive types, in our case an array of values. Using a mutable unboxed array is much faster than using an immutable and/or boxed one like or , due to reduced allocation and/or pointer chasing. The core of the VM is the function, a tight, tail-recursive loop that GHC compiles into an efficient machine loop, as we see later. It takes the stack pointer ( ), instruction pointer ( ) 4 , and the stack as arguments. At the top of each loop iteration, a block of guard clauses checks for stack overflow, underflow, and other error conditions before branching on the current opcode. Placing these checks at the top instead of inside the opcode cases is a deliberate choice. This may make the code slightly harder to understand, but it significantly improves the performance of the loop by moving all branching to the beginning of the loop, resulting in code that is more friendly to the CPU’s Branch Predictor . Also notice how we reduce the number of checks by working with a range of opcodes at once in the guards. The checks are also sorted so as to be most performant, guided by profiling and benchmarking 5 . The handling of each opcode is actually pretty straightforward. We use different specific operations to read and write to the stack, while taking care of doing the required bounds and arithmetic checks. We also use the functions that we wrote earlier. After carrying out each operation, we reenter the loop by calling it tail-recursively with the right stack and instruction pointers. 
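The shape of this dispatch loop can be sketched in miniature. The sketch below is pure and list-based, with made-up opcodes, whereas the post's real loop runs in ST over a mutable unboxed array with the error checks hoisted into guards; only the tail-recursive structure is the point here:

```haskell
import Data.Word (Word8)

-- Hypothetical opcodes for a toy stack VM:
-- 0 = push the next byte as a literal, 1 = add, 2 = multiply, 3 = halt.
run :: [Word8] -> Either String Int
run = go []
  where
    -- A tight, tail-recursive loop: GHC compiles the self-call into a
    -- jump, so there is no call overhead per executed instruction.
    go stack code = case code of
      (0 : n : rest)                -> go (fromIntegral n : stack) rest
      (1 : rest) | (y:x:s) <- stack -> go (x + y : s) rest
      (2 : rest) | (y:x:s) <- stack -> go (x * y : s) rest
      (3 : _)    | [v]     <- stack -> Right v
      _ -> Left "invalid bytecode or stack underflow"
```

For example, `run [0,2,0,3,1,3]` pushes 2 and 3, adds them, and halts with `Right 5`.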
Finally, we make sure that the execution terminated correctly by checking the state of the stack, and return its first element. We see later that the VM is quite fast, but how does GHC achieve this performance? To see the magic, we can look at GHC’s intermediate language: Core. Core is a simpler functional language than Haskell, to which GHC compiles Haskell. The simpler nature of Core makes it easier for GHC to optimize it, and compile it further. We can get the Core code for a program by compiling with the GHC option . The actual Core code for our VM is too verbose to show here, but here is a simplified C-like pseudo-code version of our loop: A few key optimizations are worth pointing out: The loop: The tail-recursive function is compiled into a proper loop. The instruction is effectively a , which means there’s no function call overhead for each iteration of the VM loop. Unboxing: The Core code is full of primitive, unboxed types like , , and , and operations on them. These are raw machine integers and memory addresses, not boxed Haskell objects. This means operations on them are as fast as they would be in C. The stack operations are not function calls on a instance, but primitive memory reads and writes on a raw memory address. Inlining: The helper function is completely inlined into the main loop. For , the code for reading two values, adding them, and writing the result is laid out inline, and works on unboxed values and the array address. In short, GHC has turned our high-level, declarative Haskell code into a low-level loop that looks remarkably like one we would write in C. We get the safety and expressiveness of Haskell, while GHC does the heavy lifting to produce highly optimized code. It’s the best of both worlds! We must test the VM to make sure it works correctly 6 . 
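The central correctness property is that interpreting the AST and running the compiled bytecode must agree. Here is a self-contained toy rendering of that property (all types and names invented; the real test generates arbitrary ASTs with QuickCheck and compares the real interpreter against the real VM):

```haskell
-- A toy AST and toy postfix "bytecode".
data Expr = Num Int | Add Expr Expr | Mul Expr Expr deriving Show
data Op   = Push Int | OpAdd | OpMul deriving Show

-- Reference semantics: the AST-walking interpreter.
eval :: Expr -> Int
eval (Num n)   = n
eval (Add a b) = eval a + eval b
eval (Mul a b) = eval a * eval b

-- Compilation to postfix, and a toy stack machine to execute it.
compile :: Expr -> [Op]
compile (Num n)   = [Push n]
compile (Add a b) = compile a ++ compile b ++ [OpAdd]
compile (Mul a b) = compile a ++ compile b ++ [OpMul]

exec :: [Op] -> Int
exec = go []
  where
    go (v : _)      []           = v
    go st           (Push n : k) = go (n : st) k
    go (y : x : st) (OpAdd : k)  = go (x + y : st) k
    go (y : x : st) (OpMul : k)  = go (x * y : st) k
    go _ _                       = error "stack underflow"

-- The agreement property: both evaluation routes give the same answer.
prop_agree :: Expr -> Bool
prop_agree e = eval e == exec (compile e)
```

In the real test suite, QuickCheck feeds this kind of property thousands of generated expressions; here one would check it on hand-picked samples.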
We reuse the success and failure tests for the AST interpreter, as the bytecode interpreter should yield the same result: We also add a property-based test this time: for any given expression, interpreting the AST should produce the same result as compiling it to bytecode and executing it in the VM 7 . Our test suite is complete now: And finally, we run all tests together: Happily, all tests pass. Now for the fun part: benchmarking. We use the criterion library to benchmark the code. We have a benchmark suite to measure the performance of each pass, the two interpreters (AST and bytecode), and the full end-to-end runs 8 . We compile with the following GHC options: Here are the results in a more digestible format: Here are the times in a chart (smaller is better): Let’s break down these numbers: I can already see readers thinking, “Sure that’s fast, but is it faster than C/Rust/Zig/my favourite language?” Let’s find out. To get a better sense of our VM’s performance, I rewrote it in C. The C implementation is a classic manual approach: a hand-written tokenizer and recursive-descent parser, s with pointers for the AST, and manual memory management and error propagation. The VM is a simple loop with a statement for dispatching opcodes 9 . To compare our Haskell code against the C code, we need to write the last Haskell module, the CLI app that we demonstrated in the first post : We compile with the following GHC options 10 : And for the C version, we compile using GCC: Now, let’s see how they stack up against each other. We use hyperfine to run the two executables. Here’s a summary of the results: I have subtracted the times of previous passes to get the times for individual passes. Here’s the same in a chart (smaller is better): As expected, the C implementation is faster across the board, between 1.5x and 2.6x. The biggest difference is in parsing, where the hand-written C parser is more than twice as fast as our combinator-based one. 
On the other hand, the Haskell VM is only 50% slower than the C VM. In my opinion, the Haskell code’s performance is quite respectable, especially given the safety, expressiveness and conciseness benefits, as illustrated by the code sizes 11 : The Haskell implementation is almost half the size of the C code. I don’t know about you, but I’m perfectly happy with the half-as-small, half-as-fast trade-off. The benchmark results for the VMs become less surprising when I compare the C function with the GHC Core code for 12 . This structure is almost a 1-to-1 match with the GHC Core code we saw earlier. The C loop corresponds to the optimized function that GHC generates, the statement is almost identical to the analysis on the raw opcode byte, and the C stack array is equivalent to the GHC uses. GHC effectively compiles our high-level Haskell into low-level code that is structurally identical to what we wrote by hand in C 13 . This explains why the performance is in the same ballpark. The remaining performance gap is probably due to the thin layer of abstraction that the Haskell runtime still maintains, but it’s remarkable how close we can get to C-like performance. While our Haskell program is fast, we can still improve certain things: Parser optimizations : As the benchmarks showed, parsing is our slowest pass. For better performance, we could replace our Attoparsec-based combinator parser with a parser generator like Alex and Happy , or even write a recursive-descent parser by hand. Superinstructions : We could analyze the bytecode for common instruction sequences (like followed by ) and combine them into single superinstructions. This would reduce the instruction dispatch overhead, but may make compilation slower. Register-based VM : A register-based VM , which uses a small array of virtual registers instead of a memory-based stack, could significantly reduce memory traffic and improve performance. 
This would require a more complex compiler capable of register allocation. Just-in-Time (JIT) compilation : The ultimate performance boost could come from a JIT compiler . Instead of interpreting bytecode, we could compile it to native machine code at runtime, eliminating the interpreter entirely. Maybe we could use LLVM to build a JIT compiler in Haskell. And that’s a wrap! We successfully built a bytecode compiler and virtual machine in Haskell. We covered parsing, AST interpretation, compilation, and bytecode execution, as well as debugging and testing functionalities. Let’s update our checklist: The journey from a simple AST interpreter to a bytecode VM has been a rewarding one. We saw a significant performance improvement, learned about how compilers and VMs work, and how to write performant code in Haskell. While our Haskell implementation isn’t as fast as the hand-written C version, it’s far more concise and, I would argue, easier to reason about. It’s a great demonstration of Haskell’s power for writing high-performance—yet safe and elegant—code. See the full code at: If you have any questions or comments, please leave a comment below. If you liked this post, please share it. Thanks for reading! Actually, QuickCheck does not generate entirely arbitrary inputs. It generates arbitrary inputs with increasing complexity—where the complexity is defined by the user—and asserts the properties on these inputs. When a test fails for a particular input, QuickCheck also tries to simplify the culprit to find the simplest input for which the test fails. This process is called Shrinking in QuickCheck parlance. QuickCheck then shows this simplest input to the user for them to use to debug their code. ↩︎ Read this good introduction to QuickCheck if you are unfamiliar. ↩︎ Notice that we discard the expressions that do not compile successfully. ↩︎ and are not actual pointers, but indices into the stack and bytecode arrays respectively. 
↩︎ Guided by the GHC profiler, I tweaked the code in many different ways and ran benchmarks for every change. Then I chose the code that was most performant. ↩︎ It is extremely important to write good tests before getting your hands dirty with performance optimizations. In my case, the tests saved me many times from breaking the VM while moving code around for performance. ↩︎ We are using our AST interpreter as a definitional interpreter, assuming it to be correctly implemented because of its simpler nature. ↩︎ I ran all benchmarks on an Apple M4 Pro 24GB machine against a 142MB file generated using the expression generator we wrote earlier. ↩︎ I don’t claim to be a great or even a good C programmer. In fact, this C VM is the first substantial C code I have written in decades. I’m sure the code is not the most optimized. It may even be riddled with memory management bugs. If you find something wrong, please let me know in the comments. ↩︎ I tried various RTS options to tweak GHC garbage collection, but the defaults proved to be fastest. ↩︎ The lines of code are for only the overlapping functionalities between the C and Haskell versions. ↩︎ I did try using Direct Threading and Subroutine Threading in the C code, but they resulted in slower code than the switch-case variant. GCC may be smart enough in the case of this simple VM to optimize the switch-case to be faster than threaded code. ↩︎ You may have noticed that the C function is not laid out in the exact same manner as the Haskell function. In the case of C, moving the checks to the front did not yield a performance improvement. I suspect this may be because GCC is smart enough to do that optimization by itself. The nested were also no detriment to the performance of the C code. ↩︎ If you liked this post, please leave a comment. Parsing arithmetic expressions to Abstract Syntax Trees (ASTs). Unit testing for our parser. Interpreting ASTs. Compiling ASTs to bytecode. Disassembling and decompiling bytecode. 
Unit testing for our compiler. Property-based testing for our compiler. Efficiently executing bytecode in a virtual machine (VM). Unit testing and property-based testing for our VM. Benchmarking our code to see how the different passes perform. All the while keeping an eye on performance. The Compiler The Virtual Machine (you are here) Introduction Testing the Compiler The Virtual Machine Testing the VM Benchmarking the VM Benchmarking Against C Future Directions generates number expressions by using QuickCheck’s built-in function. generates variable expressions by choosing from the set of passed valid variable names. generates valid identifiers from combinations of letters a—z and A—Z, discarding ones that are reserved keywords. 
Parsing and decompiling are slow : At ~573ms and ~506ms, these are by far the slowest passes. This isn’t surprising. Parsing with parser combinators has a known trade-off of expressiveness for performance. Decompiling is a shift-reduce parser that reconstructs an AST from a linear stream of opcodes, and we didn’t spend any time optimizing it. Compilation is fast : At ~51ms, compilation is an order of magnitude faster than parsing. This is thanks to pre-calculating the bytecode size during the parsing phase, which allows us to pre-allocate a single and fill it in with low-level pointer operations. Bytecode interpretation is blazingly fast : At just ~16ms, our VM’s interpreter is over 3 times faster than the AST interpreter (~50ms), which proves our belief that bytecode interpreters are faster. End-to-end runs : Interestingly, the total time to run via bytecode (~638ms) is slightly slower than the run via AST (~617ms). This is because the cost of parsing, compiling, and then interpreting is higher than just parsing and interpreting. The real win for a bytecode VM comes when you compile once and run many times , amortizing the initial compilation cost. 
ArithVMLib.hs ArithVMSpec.hs ArithVMBench.hs ArithVMApp.hs 


Arrows to Arrows, Categories to Queries

I’ve had a little time off of work as of late, and been spending it in characteristically unwise ways. In particular, I’ve written a little programming language that compiles to SQL. I call it catlang. That’s not to say that I’ve written a new query language. It’s a programming language, whose compiler spits out one giant statement. When you run that query in postgres, you get the output of your program. Why have I done this? Because I needed a funny compilation target to test out the actual features of the language, which is that its intermediary language is a bunch of abstract category theory nonsense. Which I’ll get to. But I’m sure you first want to see this bad boy in action. Behold, the function that returns 100 regardless of what input you give it. But it does it with the equivalent of a while loop: If you’re familiar with arrow notation, you’ll notice the above looks kinda like one big block. This is not a coincidence (because nothing is a coincidence). I figured if I were to go through all of this work, we might as well get a working arrow desugarer out of the mix. But I digress; that’s a story for another time. Anyway, what’s going on here is we have an arrow, which takes a single argument . We then loop, starting from the value of . Inside the loop, we now have a new variable , which we do some voodoo on to compute —the current value of the loop variable. Then we subtract 100 from , and take the absolute value. The function here is a bit odd; it returns if the input was negative, and otherwise. Then we branch on the output of , where and have been renamed and respectively. If was less than zero, we find ourselves in the case, where we add 1 to and wrap the whole thing in —which the loop interprets as “loop again with this new value.” Otherwise, was non-negative, and so we can return directly. Is it roundabout? You bet! 
The obtuseness here is not directly a feature; I was just looking for conceptually simple things I could do which would be easy to desugar into category-theoretical stuff. Which brings us to the intermediary language. After desugaring the source syntax for above, we’re left with this IL representation: We’ll discuss all of this momentarily, but for now, just let your eyes glaze over the pretty unicode. The underlying idea here is that each of these remaining symbols has very simple and specific algebraic semantics. For example, means “do and pipe the result into .” By giving a transformation from this categorical IL into other domains, it becomes trivial to compile catlang to all sorts of weird compilation targets. Like SQL. You’re probably wondering what the generated SQL looks like. Take a peek if you dare. It’s not pretty. Rather amazingly, running the above query in postgres 17 will in fact return a single row with a single column whose value is 100. And you’d better believe it does it by actually looping its way up to 100. If you don’t believe me, make the following change: which will instead return a row for each step of the iteration. There are some obvious optimizations I could make to the generated SQL, but it didn’t seem worth my time, since that’s not the interesting part of the project. Let’s take some time to discuss the underlying category theory here. I am by no means an expert, but what I have learned after a decade of bashing my head against this stuff is that a little goes a long way. For our intents and purposes, we have types, and arrows (functions) between types. We always have the identity “do nothing” arrow: and we can compose arrows by lining up one end to another: 1 Unlike Haskell (or really any programming language, for that matter), we DO NOT have the notion of function application. That is, there is no arrow: You can only compose arrows, you can’t apply them. 
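This composition-only discipline is exactly what Haskell's `Control.Arrow` operators encourage for plain functions. A small sketch (the specific pipeline is invented, loosely echoing the subtract-100/absolute-value/sign chain from the example program above):

```haskell
import Control.Arrow ((>>>))

-- Point-free pipeline: inside the definition, arrows are only ever
-- composed with (>>>), never applied to a value.
pipeline :: Int -> Int
pipeline = subtract 100 >>> abs >>> signum
```

Application only happens at the very edge, when the whole composed arrow is finally run: `pipeline 97` evaluates `signum (abs (97 - 100))`, i.e. `1`.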
That’s why we call these things “arrows” rather than “functions.” There are a bundle of arrows for working with product types. The two projection functions correspond to and , taking individual components out of pairs: How do we get things into pairs in the first place? We can use the “fork” operation, which takes two arrows computing and , and generates a new arrow which generates a pair of : If you’re coming from a Haskell background, it’s tempting to think of this operation merely as the pair constructor. But you’ll notice from the type of the computation that there can be no data dependency between and , thus we are free to parallelize each side of the pair. In category theory, the distinction between left and right sides of an arrow is rather arbitrary. This gives rise to a notion called duality where we can flip the arrows around, and get cool new behavior. If we dualize all of our product machinery, we get the coproduct machinery, where a coproduct of and is “either or , but definitely not both nor neither.” Swapping the arrow direction of and , and replacing with gives us the following injections: and the following “join” operation for eliminating coproducts: Again, coming from Haskell this is just the standard function. It corresponds to a branch between one of two cases. As you can see, with just these eight operations, we already have a tremendous amount of expressivity. We can express data dependencies via and branching via . With we automatically encode opportunities for parallelism, and gain the ability to build complicated data structures, with and allowing us to get the information back out of the data structures. You’ll notice in the IL that there are no variable names anywhere to be found. The desugaring of the source language builds a stack (via the pattern), and replaces subsequent variable lookups with a series of projections on the stack to find the value again. 
On one hand, this makes the categorical IL rather hard to read, but it makes it very easy to re-target! Many domains do have a notion of grouping, but don’t have a native notion of naming. For example, in an electronic circuit, I can have a ribbon of 32 wires which represents an . If I have another ribbon of 32 wires, I can trivially route both ribbons into a 64-wire ribbon corresponding to a pair of . By eliminating names before we get to the IL, it means no compiler backend ever needs to deal with names. They can just work on a stack representation, and are free to special-case optimize series of projections if they are able to. Of particular interest to this discussion is how we desugar loops in catlang. The underlying primitive is : which magically turns an arrow on s into an arrow without the eithers. We obviously must run that arrow on eithers. If that function returns , then we’re happy and we can just output that. But if the function returns , we have no choice but to pass it back in to the eithered arrow. In Haskell, cochoice is implemented as: which as you can see, will loop until finally returns a . What’s neat about this formulation of a loop is that we can statically differentiate between our first and subsequent passes through the loop body. The first time through is , while for all other times it is . We don’t take advantage of it in the original program, but how many times have you written loop code that needs to initialize something on its first time through? So that’s the underlying theory behind the IL. How can we compile this to SQL now? As alluded to before, we simply need to give SQL implementations for each of the operations in the intermediary language. As a simple example, compiles to , where is the input of the arrow. The hardest part here was working out a data representation. It seems obvious to encode each element of a product as a new column, but what do we do about coproducts? 
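The inline code for the cochoice loop was lost above, so here is a reconstruction under the usual definition (the profunctors library calls this operation `unleft`; this sketch is mine, specialized to plain functions, and the `countTo` example is hypothetical):

```haskell
-- Cochoice for plain functions: turn an arrow on Eithers into an
-- arrow without them. Feed the input in as Left; whenever the body
-- returns Right, feed it back in; stop when it finally returns Left.
unleft :: (Either a d -> Either b d) -> (a -> b)
unleft f = go . Left
  where
    go x = case f x of
      Left  b -> b             -- done: output the result
      Right d -> go (Right d)  -- not done: pass it back in

-- The Left/Right split statically distinguishes the first pass
-- (Left) from every subsequent pass (Right) through the loop body.
countTo :: Int -> Int -> Int
countTo limit = unleft step
  where
    step (Left n)  = if n >= limit then Left n else Right n
    step (Right n) = if n + 1 >= limit then Left (n + 1) else Right (n + 1)
```

`countTo 100 0` loops its way up to 100, mirroring the behavior of the post's generated SQL query.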
After much thought, I decided to flatten out the coproducts. So, for example, the type: would be represented as three columns: with the constraint that exactly one of or would be at any given point in time. With this hammered out, almost everything else is pretty trivial. Composition corresponds to a nested query. Forks are s which concatenate the columns of each sub-query. Joins are s, where we add a clause to enforce we’re looking at the correct coproduct constructor. Cochoice is the only really tricky thing, but it corresponds to a recursive CTE . Generating a recursive CTE table for the computation isn’t too hard, but getting the final value out of it was surprisingly tricky. The semantics of SQL tables is that they are multisets with no inherent ordering, so “the last row” is an arbitrary element. Which is to say, you need an ordering column structured in a relevant way in order to query the final result. Due to some quirks in what postgres accepts, and in how I structured my queries, it was prohibitively hard to insert a “how many times have I looped” column and order by that. So instead I cheated and added a column which looks at the processor clock and ordered by that. This is clearly a hack, and presumably will cause problems if I ever add some primitives which generate more than one row, but again, this is just for fun and who cares. Send me a pull request if you’re offended by my chicanery! I’ve run out of vacation time to work on this project, so I’m probably not going to get around to the meta-circular stupidity I was planning. The compiler still needs a few string-crunching primitives (which are easy to add), but then it would be simple to write a little brainfuck interpreter in catlang. Which I could then compile to SQL. Now we’ve got a brainfuck interpreter running in postgres. Of course, this has been done by hand before, but to my knowledge, never via compilation. There exist C to brainfuck compilers. And postgres is written in C. 
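To illustrate the flattening with hypothetical Haskell types that mirror the column layout (these names are mine, not the compiler's): an `Either Int Bool` becomes a tag column plus one nullable payload column per constructor, with the invariant that exactly one payload is populated:

```haskell
-- Hypothetical mirror of the SQL encoding: one tag column, and one
-- nullable (Maybe) column per coproduct constructor.
data Flat = Flat
  { isLeftTag :: Bool        -- which constructor is live
  , leftCol   :: Maybe Int   -- NULL unless the Left side is live
  , rightCol  :: Maybe Bool  -- NULL unless the Right side is live
  }

toFlat :: Either Int Bool -> Flat
toFlat (Left n)  = Flat True  (Just n) Nothing
toFlat (Right b) = Flat False Nothing  (Just b)

fromFlat :: Flat -> Either Int Bool
fromFlat (Flat True  (Just n) _) = Left n
fromFlat (Flat False _ (Just b)) = Right b
fromFlat _ = error "invariant violated: wrong payload populated"
```

A join over this representation then only needs to check the tag column to know it is looking at the right constructor.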
So in a move that would make Xzibit proud, we could run postgres in postgres. And of course, it would be fun to run brainfuck in brainfuck. That’d be a cool catlang backend if someone wanted to contribute such a thing. I am not the first person to do anything like this. The source language of catlang is heavily inspired by Haskell’s arrow syntax , which in turn is essentially a desugaring algorithm for Arrows . Arrows are slightly the wrong abstraction because they require an operation —which requires you to be able to embed Haskell functions in your category, something which is almost never possible. Unfortunately, arrow syntax in Haskell desugars down to for almost everything it does, which in turn makes arrow notation effectively useless. In an ideal world, everything I described in this blog post would be a tiny little Haskell library, with arrow notation doing the heavy lifting. But that is just not the world we live in. Nor am I the first person to notice that there are categorical semantics behind programming languages. I don’t actually know whom to cite on this one, but it is well-established folklore that the lambda calculus corresponds to cartesian-closed categories . The “closed” part of “cartesian-closed” means we have an operation , but everyone and their dog has implemented the lambda calculus, so I thought it would be fun to see how far we can get without it. This is not a limitation on catlang’s turing completeness (since gives us everything we need.) I’ve been thinking about writing a category-first programming language for the better part of a decade, ever since I read Compiling to Categories . That paper takes Haskell and desugars it back down to categories. I stole many of the tricks here from that paper. Anyway. All of the code is available on github if you’re interested in taking a look. The repo isn’t up to my usual coding standards, for which you have my apologies. 
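For reference, the Arrow operation in question is Control.Arrow's `arr`, and the "closed" structure can be rendered as a plain-function application arrow (this is illustrative Haskell, not part of catlang):

```haskell
import Control.Arrow (Arrow, arr, (>>>))

-- `arr` embeds an arbitrary Haskell function into the category: the
-- operation the post objects to, since targets like SQL or circuits
-- generally cannot run arbitrary Haskell.
embedded :: Arrow p => p Int Int
embedded = arr (+ 1) >>> arr (* 2)

-- The "closed" part of cartesian-closed that catlang omits: an
-- application arrow, here from a (function, argument) pair.
apply :: (a -> b, a) -> b
apply (f, x) = f x
```

Specialized to plain functions `embedded` is harmless, but for any more exotic Arrow instance the `arr` calls are exactly where desugared arrow notation becomes impossible to implement.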
Of note is the template-haskell backend which can spit out Haskell code; meaning it wouldn’t be very hard to make a quasiquoter to compile catlang into what Haskell’s arrow desugaring ought to be. If there’s enough clamor for such a thing, I’ll see about turning this part into a library. When looking at the types of arrows in this essay, we make the distinction that are arrows that we can write in catlang, while exist in the metatheory. ↩︎

Ahmad Alfy 4 months ago

How Functional Programming Shaped (and Twisted) Frontend Development

A friend called me last week. Someone who’d built web applications for a long time before moving exclusively to backend and infra work. He’d just opened a modern React codebase for the first time in over a decade. “What the hell is this?” he asked. “What are all these generated class names? Did we just… cancel the cascade? Who made the web work this way?” I laughed, but his confusion cut deeper than he realized. He remembered a web where CSS cascaded naturally, where the DOM was something you worked with, where the browser handled routing, forms, and events without twenty abstractions in between. To him, our modern frontend stack looked like we’d declared war on the platform itself. He asked me to explain how we got here. That conversation became this essay. A disclaimer before we begin: This is one perspective, shaped by having lived through the first browser war. I applied to make 24-bit PNGs work in IE6. I debugged hasLayout bugs at 2 AM. I wrote JavaScript when you couldn’t trust to work the same way across browsers. I watched jQuery become necessary, then indispensable, then legacy. I might be wrong about some of this. My perspective is biased for sure, but it also comes with the memory that the web didn’t need constant reinvention to be useful. There’s a strange irony at the heart of modern web development. The web was born from documents, hyperlinks, and a cascading stylesheet language. It was always messy, mutable, and gloriously side-effectful. Yet over the past decade, our most influential frontend tools have been shaped by engineers chasing functional programming purity: immutability, determinism, and the elimination of side effects. This pursuit gave us powerful abstractions. React taught us to think in components. Redux made state changes traceable. TypeScript brought compile-time safety to a dynamic language. But it also led us down a strange path. One where we fought against the platform instead of embracing it. 
We rebuilt the browser’s native capabilities in JavaScript, added layers of indirection to “protect” ourselves from the DOM, and convinced ourselves that the web’s inherent messiness was a problem to solve rather than a feature to understand. The question isn’t whether functional programming principles have value. They do. The question is whether applying them dogmatically to the web (a platform designed around mutability, global scope, and user-driven chaos) made our work better, or just more complex. To understand why functional programming ideals clash with web development, we need to acknowledge what the web actually is. The web is fundamentally side-effectful. CSS cascades globally by design. Styles defined in one place affect elements everywhere, creating emergent patterns through specificity and inheritance. The DOM is a giant mutable tree that browsers optimize obsessively; changing it directly is fast and predictable. User interactions arrive asynchronously and unpredictably: clicks, scrolls, form submissions, network requests, resize events. There’s no pure function that captures “user intent.” This messiness is not accidental. It’s how the web scales across billions of devices, remains backwards-compatible across decades, and allows disparate systems to interoperate. The browser is an open platform with escape hatches everywhere. You can style anything, hook into any event, manipulate any node. That flexibility and that refusal to enforce rigid abstractions is the web’s superpower. When we approach the web with functional programming instincts, we see this flexibility as chaos. We see globals as dangerous. We see mutation as unpredictable. We see side effects as bugs waiting to happen. And so we build walls. Functional programming revolves around a few core principles: functions should be pure (same inputs → same outputs, no side effects), data should be immutable, and state changes should be explicit and traceable. 
These ideas produce code that’s easier to reason about, test, and parallelize, in the right context of course. These principles had been creeping into JavaScript long before React. Underscore.js (2009) brought map, reduce, and filter to the masses. Lodash and Ramda followed with deeper FP toolkits including currying, composition and immutability helpers. The ideas were in the air: avoid mutation, compose small functions, treat data transformations as pipelines. React itself started with class components and , hardly pure FP. But the conceptual foundation was there: treat UI as a function of state, make rendering deterministic, isolate side effects. Then came Elm, a purely functional language created by Evan Czaplicki that codified the “Model-View-Update” architecture. When Dan Abramov created Redux, he explicitly cited Elm as inspiration. Redux’s reducers are directly modeled on Elm’s update functions: . Redux formalized what had been emerging patterns. Combined with React Hooks (which replaced stateful classes with functional composition), the ecosystem shifted decisively toward FP. Immutability became non-negotiable. Pure components became the ideal. Side effects were corralled into . Through this convergence (library patterns, Elm’s rigor, and React’s evolution) Haskell-derived ideas about purity became mainstream JavaScript practice. In the early 2010s, as JavaScript applications grew more complex, developers looked to FP for salvation. jQuery spaghetti had become unmaintainable. Backbone’s two-way binding caused cascading updates (ironically, Backbone’s documentation explicitly advised against two-way binding saying “it doesn’t tend to be terribly useful in your real-world app” yet many developers implemented it through plugins). The community wanted discipline, and FP offered it: treat your UI as a pure function of state. Make data flow in one direction. Eliminate shared mutable state. React’s arrival in 2013 crystallized these ideals. 
It promised a world where : give it data, get back a component tree, re-render when data changes. No manual DOM manipulation. No implicit side effects. Just pure, predictable transformations. This was seductive. And in many ways, it worked. But it also set us on a path toward rebuilding the web in JavaScript’s image, rather than JavaScript in the web’s image. CSS was designed to be global. Styles cascade, inherit, and compose across boundaries. This enables tiny stylesheets to control huge documents, and lets teams share design systems across applications. But to functional programmers, global scope is dangerous. It creates implicit dependencies and unpredictable outcomes. Enter CSS-in-JS: styled-components, Emotion, JSS. The promise was component isolation. Styles scoped to components, no cascading surprises, no naming collisions. Styles become data , passed through JavaScript, predictably bound to elements. But this came at a cost. CSS-in-JS libraries generate styles at runtime, injecting them into tags as components mount. This adds JavaScript execution to the critical rendering path. Server-side rendering becomes complicated. You need to extract styles during the render, serialize them, and rehydrate them on the client. Debugging involves runtime-generated class names like . And you lose the cascade; the very feature that made CSS powerful in the first place. Worse, you’ve moved a browser-optimized declarative language into JavaScript, a single-threaded runtime. The browser can parse and apply CSS in parallel, off the main thread. Your styled-components bundle? That’s main-thread work, blocking interactivity. The web had a solution. It’s called a stylesheet. But it wasn’t pure enough. The industry eventually recognized these problems and pivoted to Tailwind CSS. Instead of runtime CSS generation, use utility classes. Instead of styled-components, compose classes in JSX. This was better, at least it’s compile-time, not runtime. 
No more blocking the main thread to inject styles. No more hydration complexity. But Tailwind still fights the cascade. Instead of writing once and letting it cascade to all buttons, you write on every single button element. You’ve traded runtime overhead for a different set of problems: class soup in your markup, massive HTML payloads, and losing the cascade’s ability to make sweeping design changes in one place. And here’s where it gets truly revealing: when Tailwind added support for nested selectors using (a feature that would let developers write more cascade-like styles), parts of the community revolted. David Khourshid (creator of XState) shared examples of using nested selectors in Tailwind, and the backlash was immediate. Developers argued this defeated the purpose of Tailwind, that it brought back the “problems” of traditional CSS, that it violated the utility-first philosophy. Think about what this means. The platform has cascade. CSS-in-JS tried to eliminate it and failed. Tailwind tried to work around it with utilities. And when Tailwind cautiously reintroduced a cascade-like feature, developers who were trained by years of anti-cascade ideology rejected it. We’ve spent so long teaching people that the cascade is dangerous that even when their own tools try to reintroduce platform capabilities, they don’t want them. We’re not just ignorant of the platform anymore. We’re ideologically opposed to it. React introduced synthetic events to normalize browser inconsistencies and integrate events into its rendering lifecycle. Instead of attaching listeners directly to DOM nodes, React uses event delegation. It listens at the root, then routes events to handlers through its own system. This feels elegant from a functional perspective. Events become data flowing through your component tree. You don’t touch the DOM directly. Everything stays inside React’s controlled universe. But native browser events already work. 
They bubble, they capture, they’re well-specified. The browser has spent decades optimizing event dispatch. By wrapping them in a synthetic layer, React adds indirection: memory overhead for event objects, translation logic for every interaction, and debugging friction when something behaves differently than the native API. Worse, it trains developers to avoid the platform. Developers learn React’s event system, not the web’s. When they need to work with third-party libraries or custom elements, they hit impedance mismatches. becomes a foreign API in their own codebase. Again: the web had this. The browser’s event system is fast, flexible, and well-understood. But it wasn’t controlled enough for the FP ideal of a closed system. The logical extreme of “UI as a pure function of state” is client-side rendering: the server sends an empty HTML shell, JavaScript boots up, and the app renders entirely in the browser. From a functional perspective, this is clean. Your app is a deterministic function that takes initial state and produces a DOM tree. From a web perspective, it’s a disaster. The browser sits idle while JavaScript parses, executes, and manually constructs the DOM. Users see blank screens. Screen readers get empty documents. Search engines see nothing. Progressive rendering, which is one of the browser’s most powerful features, goes unused. The industry noticed. Server-side rendering came back. But because the mental model was still “JavaScript owns the DOM,” we got hydration: the server renders HTML, the client renders the same tree in JavaScript, then React walks both and attaches event handlers. During hydration, the page is visible but inert. Clicks do nothing, forms don’t submit. This is architecturally absurd. The browser already rendered the page. It already knows how to handle clicks. 
But because the framework wants to own all interactions through its synthetic event system, it must re-create the entire component tree in JavaScript before anything works. The absurdity extends beyond the client. Infrastructure teams watch in confusion as every user makes double the number of requests: the server renders the page and fetches data, then the client boots up and fetches the exact same data again to reconstruct the component tree for hydration. Why? Because the framework can’t trust the HTML it just generated. It needs to rebuild its internal representation of the UI in JavaScript to attach event handlers and manage state. This isn’t just wasteful, it’s expensive. Database queries run twice. API calls run twice. Cache layers get hit twice. CDN costs double. And for what? So the framework can maintain its pure functional model where all state lives in JavaScript. The browser had the data. The HTML had the data. But that data wasn’t in the right shape. It wasn’t a JavaScript object tree, so we throw it away and fetch it again. Hydration is what happens when you treat the web like a blank canvas instead of a platform with capabilities. The web gave us streaming HTML, progressive enhancement, and instant interactivity. We replaced it with JSON, JavaScript bundles, duplicate network requests, and “please wait while we reconstruct reality.” Consider the humble modal dialog. The web has , a native element with built-in functionality: it manages focus trapping, handles Escape key dismissal, provides a backdrop, controls scroll-locking on the body, and integrates with the accessibility tree. It exists in the DOM but remains hidden until opened. No JavaScript mounting required. It’s fast, accessible, and battle-tested by browser vendors. Now observe what gets taught in tutorials, bootcamps, and popular React courses: build a modal with elements. Conditionally render it when is true. Manually attach a click-outside handler. 
Write an effect to listen for the Escape key. Add another effect for focus trapping. Implement your own scroll-lock logic. Remember to add ARIA attributes. Oh, and make sure to clean up those event listeners, or you’ll have memory leaks. You’ve just written 100+ lines of JavaScript to poorly recreate what the browser gives you for free. Worse, you’ve trained developers to not even look for native solutions. The platform becomes invisible. When someone asks “how do I build a modal?”, the answer is “install a library” or “here’s my custom hook,” never “use .” The teaching is the problem. When influential tutorial authors and bootcamp curricula skip native APIs in favor of React patterns, they’re not just showing an alternative approach. They’re actively teaching malpractice. A generation of developers learns to build inaccessible soup because that’s what fits the framework’s reactivity model, never knowing the platform already solved these problems. And it’s not just bootcamps. Even the most popular component libraries make the same choice: shadcn/ui builds its Dialog component on Radix UI primitives, which use instead of the native element. There are open GitHub issues requesting native support, but the implicit message is clear: it’s easier to reimplement the browser than to work with it. The problem runs deeper than ignorance or inertia. The frameworks themselves increasingly struggle to work with the platform’s evolution. Not because the platform features are bad, but because the framework’s architectural assumptions can’t accommodate them. Consider why component libraries like Radix UI choose over . The native element manages its own state: it knows when it’s open, it handles its own visibility, it controls focus internally. But React’s reactivity model expects all state to live in JavaScript, flowing unidirectionally into the DOM. When a native element manages its own state, React’s mental model breaks down. 
Keeping in your React state synchronized with the element’s actual open/closed state becomes a nightmare of refs, effects, and imperative calls. Precisely what React was supposed to eliminate. Rather than adapt their patterns to work with stateful native elements, library authors reimplement the entire behavior in a way that fits the framework. It’s architecturally easier to build a fake dialog in JavaScript than to integrate with the platform’s real one. But the conflict extends beyond architectural preferences. Even when the platform adds features that developers desperately want, frameworks can’t always use them. Accordions? The web has and . Tooltips? There’s attribute and the emerging API. Date pickers? . Custom dropdowns? The web now supports styling elements with and pseudo-elements. You can even put elements with images inside elements now. It eliminates the need for the countless JavaScript select libraries that exist solely because designers wanted custom styling. Frameworks encourage conditional rendering and component state, so these elements don’t get rendered until JavaScript decides they should exist. The mental model is “UI appears when state changes,” not “UI exists, state controls visibility.” Even when the platform adds the exact features developers have been rebuilding in JavaScript for years, the ecosystem momentum means most developers never learn these features exist. And here’s the truly absurd part: even when developers do know about these new platform features, the frameworks themselves can’t handle them. MDN’s documentation for customizable elements includes this warning: “ Some JavaScript frameworks block these features; in others, they cause hydration failures when Server-Side Rendering (SSR) is enabled. ” The platform evolved. The HTML parser now allows richer content inside elements. But React’s JSX parser and hydration system weren’t designed for this. They expect to only contain text. 
Updating the framework to accommodate the platform’s evolution takes time, coordination, and breaking changes that teams are reluctant to make. The web platform added features that eliminate entire categories of JavaScript libraries, but the dominant frameworks can’t use those features without causing hydration errors. The stack that was supposed to make development easier now lags behind the platform it’s built on. The browser has native routing: tags, the History API, forward/back buttons. It has native forms: elements, validation attributes, submit events. These work without JavaScript. They’re accessible by default. They’re fast. Modern frameworks threw them out. React Router, Next.js’s router, Vue Router; they intercept link clicks, prevent browser navigation, and handle routing in JavaScript. Why? Because client-side routing feels like a pure state transition: URL changes, state updates, component re-renders. No page reload. No “lost” JavaScript state. But you’ve now made navigation depend on JavaScript. Ctrl+click to open in a new tab? Broken, unless you carefully re-implement it. Right-click to copy link? The URL might not match what’s rendered. Accessibility tools that rely on standard navigation patterns? Confused. Forms got the same treatment. Instead of letting the browser handle submission, validation, and accessibility, frameworks encourage JavaScript-controlled forms. Formik, React Hook Form, uncontrolled vs. controlled inputs; entire libraries exist to manage what already does. The browser can validate instantly, with no JavaScript. But that’s not reactive enough, so we rebuild validation in JavaScript, ship it to the client, and hope we got the logic right. The web had these primitives. We rejected them because they didn’t fit our FP-inspired mental model of “state flows through JavaScript.” Progressive enhancement used to be a best practice: start with working HTML, layer on CSS for style, add JavaScript for interactivity. 
The page works at every level. Now, we start with JavaScript and work backwards, trying to squeeze HTML out of our component trees and hoping hydration doesn’t break. We lost built-in accessibility. Native HTML elements have roles, labels, and keyboard support by default. Custom JavaScript widgets require attributes, focus management, and keyboard handlers. All easy to forget or misconfigure. We lost performance. The browser’s streaming parser can render HTML as it arrives. Modern frameworks send JavaScript, parse JavaScript, execute JavaScript, then finally render. That’s slower. The browser can cache CSS and HTML aggressively. JavaScript bundles invalidate on every deploy. We lost simplicity. is eight characters. A client-side router is a dependency, a config file, and a mental model. is self-documenting. A controlled form with validation is dozens of lines of state management. And we lost alignment with the platform. The browser vendors spend millions optimizing HTML parsing, CSS rendering, and event dispatch. We spend thousands of developer-hours rebuilding those features in JavaScript, slower. This isn’t a story of incompetence. Smart people built these tools for real reasons. By the early 2010s, JavaScript applications had become unmaintainable. jQuery spaghetti sprawled across codebases. Two-way data binding caused cascading updates that were impossible to debug. Teams needed discipline, and functional programming offered it: pure components, immutable state, unidirectional data flow. For complex, stateful applications (like dashboards with hundreds of interactive components, real-time collaboration tools, data visualization platforms) React’s model was genuinely better than manually wiring up event handlers and tracking mutations. The FP purists weren’t wrong that unpredictable mutation causes bugs. They were wrong that the solution was avoiding the platform’s mutation-friendly APIs instead of learning to use them well. 
But in the chaos of 2013, that distinction didn’t matter. React worked. It scaled. And Facebook was using it in production. Then came the hype cycle. React dominated the conversation. Every conference had React talks. Every tutorial assumed React as the starting point. CSS-in-JS became “modern.” Client-side rendering became the default. When big companies like Facebook, Airbnb, Netflix and others adopted these patterns, they became industry standards. Bootcamps taught React exclusively. Job postings required React experience. The narrative solidified: this is how you build for the web now. The ecosystem became self-reinforcing through its own momentum. Once React dominated hiring pipelines and Stack Overflow answers, alternatives faced an uphill battle. Teams that had already invested in React (by training developers, building component libraries, establishing patterns) now faced enormous switching costs. New developers learned React because that’s what jobs required. Jobs required React because that’s what developers knew. The cycle fed itself, independent of whether React was the best tool for any particular job. This is where we lost the plot. Somewhere in the transition from “React solves complex application problems” to “React is how you build websites,” we stopped asking whether the problems we were solving actually needed these solutions. I’ve watched developers build personal blogs with Next.js. Sites that are 95% static content with maybe a contact form, because that’s what they learned in bootcamp. I’ve seen companies choose React for marketing sites with zero interactivity, not because it’s appropriate, but because they can’t hire developers who know anything else. The tool designed for complex, stateful applications became the default for everything, including problems the web solved in 1995 with HTML and CSS. A generation of developers never learned that most websites don’t need a framework at all. 
The question stopped being “does this problem need React?” and became “which React pattern should I use?” The platform’s native capabilities, like progressive rendering, semantic HTML, the cascade, and instant navigation, are now considered “old-fashioned.” Reinventing them in JavaScript became “best practices.” We chased functional purity on a platform that was never designed for it. And we built complexity to paper over the mismatch. The good news: we’re learning. The industry is rediscovering the platform. HTMX embraces HTML as the medium of exchange. Server sends HTML, browser renders it, no hydration needed. Qwik’s resumable architecture avoids hydration entirely, serializing only what’s needed. Astro defaults to server-rendered HTML with minimal JavaScript. Remix and SvelteKit lean into web standards: forms that work without JS, progressive enhancement, leveraging the browser’s cache. These tools acknowledge what the web is: a document-based platform with powerful native capabilities. Instead of fighting it, they work with it. This doesn’t mean abandoning components or reactivity. It means recognizing that is a useful model inside your framework, not a justification to rebuild the entire browser stack. It means using CSS for styling, native events for interactions, and HTML for structure and then reaching for JavaScript when you need interactivity beyond what the platform provides. The best frameworks of the next decade will be the ones that feel like the web, not in spite of it. In chasing functional purity, we built a frontend stack that is more complex, more fragile, and less aligned with the platform it runs on. We recreated CSS in JavaScript, events in synthetic wrappers, rendering in hydration layers, and routing in client-side state machines. We did this because we wanted predictability, control, and clean abstractions. But the web was never meant to be pure. 
It’s a sprawling, messy, miraculous platform built on decades of emergent behavior, pragmatic compromises, and radical openness. Its mutability isn’t a bug. It’s the reason a document written in 1995 still renders in 2025. Its global scope isn’t dangerous. It’s what lets billions of pages share a design language. Maybe the web didn’t need to be purified. Maybe it just needed to be understood. I want to thank my friend Ihab Khattab for reviewing this piece and providing invaluable feedback.

Neil Madden 5 months ago

Rating 26 years of Java changes

I first started programming Java at IBM back in 1999 as a Pre-University Employee. If I remember correctly, we had Java 1.1.8 installed at that time, but were moving to Java 1.2 (“Java 2”), which was a massive release—I remember engineers at the time grumbling that the ever-present “Java in a Nutshell” book had grown to over 600 pages. I thought I’d take a look back at 26 years of Java releases and rate some of the language and core library changes (Java SE only) that have occurred over this time. It’s a very different language to what I started out with! I can’t possibly cover every feature of those releases, as there are just way too many. So I’m just going to cherry-pick some that seemed significant at the time, or have been in retrospect. I’m not going to cover UI- or graphics-related stuff (Swing, Java2D etc), or VM/GC improvements. Just language changes and core libraries. And obviously this is highly subjective. Feel free to put your own opinions in the comments! The descriptions are brief and not intended as an introduction to the features in question: see the links from the Wikipedia page for more background. NB: later features are listed from when they were first introduced as a preview. The Collections Framework: before the collections framework, there were just raw arrays, Vector, and Hashtable. It gets the job done, but I don’t think anyone thinks the Java collections framework is particularly well designed. One of the biggest issues was a failure to distinguish between mutable and immutable collections, strange inconsistencies like why Iterator has a remove() method (but not, say, update or insert), and so on. Various improvements have been made over the years, and I do still use it in preference to pulling in a better alternative library, so it has stood the test of time in that respect. 4/10 The assert keyword: I remember being somewhat outraged at the time that they could introduce a new keyword!
I’m personally quite fond of asserts as an easy way to check invariants without having to do complex refactoring to make things unit-testable, but that is not a popular approach. I can’t remember the last time I saw an assert in any production Java code. 3/10 Regular expressions: Did I really have to wait 3 years to use regex in Java? I don’t remember ever having any issues with the implementation they finally went for. The Matcher class is perhaps a little clunky, but gets the job done. Good, solid, essential functionality. 9/10 “New” I/O (NIO): Provided non-blocking I/O for the first time, but really just a horrible API (still inexplicably using 32-bit signed integers for file sizes, limiting files to 2GB, confusing interface). I still basically never use these interfaces except when I really need to. I learnt Tcl/Tk at the same time that I learnt Java, and Java’s I/O always just seemed extraordinarily baroque for no good reason. Has barely improved in two and a half decades. 0/10 Also notable in this release were the new crypto APIs: the Java Cryptography Extensions (JCE) added encryption and MAC support to the existing signatures and hashes, and we got JSSE for SSL. Useful functionality, dreadful error-prone APIs. 1/10 Absolutely loads of changes in this release. This feels like the start of modern Java to me. Generics: as Go discovered on its attempt to speed-run Java’s mistakes all over again, if you don’t add generics from the start then you’ll have to retrofit them later, badly. They certainly have complicated the language, and there are plenty of rough edges (type erasure, reflection, etc), and the rapid and universal adoption of them shows what a success they’ve been. But God, I wouldn’t want to live without them. 8/10. Annotations: sometimes useful, sometimes overused. I know I’ve been guilty of abusing them in the past.
At the time it felt like they were ushering in a new age of custom static analysis, but that doesn’t really seem to be used much. Mostly just used to mark things as deprecated or when overriding a method. Meh. 5/10 Autoboxing: there was a time when, if you wanted to store an integer in a collection, you had to manually convert to and from the primitive int type and the Integer “boxed” class. Such conversion code was everywhere. Java 5 got rid of that, by getting the compiler to insert those conversions for you. Brevity, though no less inefficient under the hood. 7/10 Enums: I’d learned Haskell by this point, so I couldn’t see the point of introducing enums without going the whole hog and doing algebraic datatypes and pattern-matching. (Especially as Scala launched about this time.) Decent feature, and a good implementation, but underwhelming. 6/10 Vararg methods: these have done quite a lot to reduce verbosity across the standard library. A nice small improvement that’s been a good quality-of-life enhancement. I still never really know when to put @SafeVarargs annotations on things though. 8/10 The for-each loop: cracking, use it all the time. Still not a patch on Tcl’s foreach (which can loop over multiple collections at once), but still very good. Could be improved and has been somewhat replaced by Streams. 8/10 Static imports: Again, a good simple change. I probably would have avoided adding * imports for statics, but it’s quite nice for DSLs. 8/10 Doug Lea’s java.util.concurrent etc.: these felt really well designed. So well designed that everyone started using them in preference to the core collection classes, and they ended up back-porting a lot of the methods. 10/10 After the big bang of Java 5, Java 6 was mostly performance and VM improvements, I believe, so we had to wait until 2011 for more new language features. Strings in switch: seems like a code smell to me. Never use this, and never see it used. 1/10 Try-with-resources: made a huge difference in exception safety.
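A minimal sketch of the try-with-resources pattern (class and method names here are illustrative, not from the post):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TwrDemo {
    // Reads the first line of some text. The reader is closed automatically,
    // even if readLine() throws; a failure in close() is attached as a
    // suppressed exception instead of masking the original one.
    static String firstLine(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLine("hello\nworld")); // prints hello
    }
}
```

Before Java 7 this required a nested try/finally with its own null checks, and a throwing close() would silently replace the root-cause exception.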
Combined with the improvements in exception chaining (so root cause exceptions are not lost), this was a massive win. Still use it everywhere. 10/10 Diamond operator for type parameter inference: a good minor syntactic improvement to cut down the visual noise. 6/10 Binary literals and underscores in literals: again, minor syntactic sugar. Nice to have, rarely something I care about much. 4/10 Path and Filesystem APIs: I tend to use these over the older File APIs, but just because it feels like I should. I couldn’t really tell you if they are better or not. Still overly verbose. Still insanely hard to set file permissions in a cross-platform way. 3/10 Lambdas: somewhat controversial at the time. I was very in favour of them, but only use them sparingly these days, due to ugly stack traces and other drawbacks. Named method references provide most of the benefit without being anonymous. Deciding to exclude checked exceptions from the various standard functional interfaces was understandable, but also regularly a royal PITA. 4/10 Streams: Ah, streams. So much potential, but so frustrating in practice. I was hoping that Java would just do the obvious thing and put filter/map/reduce methods onto Collection and Map, but they went with this instead. The benefits of functional programming weren’t enough to carry the feature, I think, so they had to justify it by promising easy parallel computing. This scope creep enormously over-complicated the feature, makes it hard to debug issues, and yet I almost never see parallel streams being used. What I do still see quite regularly is resource leaks from people not realising that the stream returned from Files.lines() has to be close()d when you’re done—but doing so makes the code a lot uglier. 
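The Files.lines() leak just described, and its (uglier) fix, can be sketched as follows; the helper name and temp-file usage are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class LinesDemo {
    // Files.lines() holds an open file handle: unlike most streams it MUST
    // be closed, so it has to be wrapped in try-with-resources. Without the
    // try block, the handle leaks until finalization.
    static long countNonEmpty(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.filter(l -> !l.isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, List.of("a", "", "b"));
        System.out.println(countNonEmpty(tmp)); // prints 2
        Files.delete(tmp);
    }
}
```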
Combine that with ugly hacks around callbacks that throw checked exceptions, the non-discoverable API (where are the static helper functions I need for this method again?), and the large impact on lots of very common code, and I have to say I think this was one of the largest blunders in modern Java. I blogged what I thought was a better approach 2 years earlier, and I still think it would have been better. There was plenty of good research that different approaches were better, since at least Oleg Kiselyov’s work in the early noughties. 1/10 Java Time: Much better than what came before, but I have barely had to use much of this API at all, so I’m not in a position to really judge how good this is. Despite knowing how complex time and dates are, I do have a nagging suspicion that surely it doesn’t all need to be this complex? 8/10 Modules: I still don’t really know what the point of all this was. Enormous upheaval for minimal concrete benefit that I can discern. The general advice seems to be that modules are (should be) an internal detail of the JRE and best ignored in application code (apart from when they spuriously break things). Awful. -10/10 (that’s minus 10!) jshell: cute! A REPL! Use it sometimes. Took them long enough. 6/10 The start of time-based releases, and a distinct ramp-up of features from here on, trying to keep up with the kids. Local type inference (“var”): Some love this, some hate it. I’m definitely in the former camp. 9/10 New HTTP Client: replaced the old URL.openStream() approach by creating something more like Apache HttpClient. It works for most purposes, but I do find the interface overly verbose. 6/10 This release also added TLS 1.3 support, along with djb-suite crypto algorithms. Yay. 9/10 Switch expressions: another nice mild quality-of-life improvement. Not world changing, but occasionally nice to have. 6/10 Text blocks: on the face of it, what’s not to like about multi-line strings?
Well, apparently there’s a good reason that injection attacks remain high on the OWASP Top 10, as the JEP introducing this feature seemed intent on getting everyone writing SQL, HTML and JavaScript using string concatenation again. Nearly gave me a heart attack at the time, and still seems like a pointless feature. Text templates (later) are trying to fix this, but seem to be currently in limbo. 3/10 Pattern matching in instanceof: a little bit of syntactic sugar to avoid an explicit cast. But didn’t we all agree that using instanceof was a bad idea decades ago? I’m really not sure who was doing the cost/benefit analysis on these kinds of features. 4/10 Records: about bloody time! Love ‘em. 10/10 Better error messages for NullPointerExceptions: lovely. 8/10 Sealed classes: in principle I like these a lot. We’re slowly getting towards a weird implementation of algebraic datatypes. I haven’t used them very much so far. 8/10 EdDSA signatures: again, a nice little improvement in the built-in cryptography. Came with a rather serious bug though… 8/10 Vector (SIMD) API: this will be great when it is finally done, but still baking several years later. ?/10 Pattern matching in switch: another piece of the algebraic datatype puzzle. Seems somehow more acceptable than instanceof, despite being largely the same idea in a better form. 7/10 UTF-8 by default: Fixed a thousand encoding errors in one fell swoop. 10/10 Record patterns: an obvious extension, and I think we’re now pretty much there with ADTs? 9/10 Virtual threads: being someone who never really got on with async/callback/promise/reactive stream-based programming in Java, I was really happy to see this feature. I haven’t really had much reason to use them in anger yet, so I don’t know how well they’ve been done. But I’m hopeful! ?/10 String templates: these are exactly what I asked for in A few programming language features I’d like to see, based on E’s quasi-literal syntax, and they fix the issues I had with text blocks.
Unfortunately, the first design had some issues, and so they’ve gone back to the drawing board. Hopefully not for too long. I really wish they’d not released text blocks without this feature. 10/10 (if they ever arrive). Sequenced collections: a simple addition that adds a common super-type to all collections that have a defined “encounter order”: lists, deques, sorted sets, etc. It defines convenient getFirst() and getLast() methods and a way to iterate items in the defined order or in reverse order. This is a nice unification, and plugs what seems like an obvious gap in the collections types, if perhaps not the most pressing issue? 6/10 Wildcards in patterns: adds the familiar syntax from Haskell and Prolog etc of using _ as a non-capturing wildcard variable in patterns when you don’t care about the value of that part. 6/10 Simplified console applications: Java finally makes simple programs simple for beginners, about a decade after universities stopped teaching Java to beginners… Snark aside, this is a welcome simplification. 8/10 This release also adds support for KEMs, although in the simplest possible form only. Meh. 4/10 The only significant change in this release is the ability to have statements before a call to super() in a constructor. Fine. 5/10 Primitive types in patterns: plugs a gap in pattern matching. 7/10 Markdown javadoc comments: Does anyone really care about this? 1/10 The main feature here from my point of view as a crypto geek is the addition of post-quantum cryptography in the form of the newly standardised ML-KEM and ML-DSA algorithms, and support in TLS. Stable values: this is essentially support for lazily-initialised final variables. Lazy initialisation is often trickier than it should be in Java, so this is a welcome addition. Remembering Alice ML, I wonder if there is some overlap between the proposed StableValue and a Future? 7/10
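Taken together, several of the features rated above (records, sealed classes, and pattern-matching switch with record patterns) add up to something close to algebraic datatypes. A minimal sketch, assuming Java 21 and with hypothetical type names:

```java
public class ShapeDemo {
    // A tiny algebraic datatype: a sealed interface with record variants.
    // With all variants in the same file, the permits clause can be omitted.
    sealed interface Shape {}
    record Circle(double radius) implements Shape {}
    record Rect(double w, double h) implements Shape {}

    // Exhaustive pattern-matching switch with record deconstruction:
    // the compiler rejects a missing case, so no default branch is needed.
    static double area(Shape s) {
        return switch (s) {
            case Circle(double r) -> Math.PI * r * r;
            case Rect(double w, double h) -> w * h;
        };
    }

    public static void main(String[] args) {
        System.out.println(area(new Rect(3, 4))); // prints 12.0
    }
}
```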
PEM encoding of cryptographic objects is welcome from my point of view, but someone will need to tell me why this is not just Base64? Decoding support is useful though, as that’s a frequent reason I have to grab Bouncy Castle still. 7/10 Well, that brings us pretty much up to date. What do you think? Agree, disagree? Are you a passionate defender of streams or Java modules? Have at it in the comments.

Blargh 5 months ago

Ideal programming language

My last post about Go got some attention. In fact, two of my posts got attention that day, which broke my nginx since I was running livecount behind nginx, making me run out of file descriptors when thousands of people had the page open. It’s a shame that I had to turn off livecount, since it’d be cool to see the stats. But I was out of the country, with unreliable access to both Internet and even electricity in hotels, so I couldn’t implement the real fix until I got back, when it had already mostly died down. I knew this was a problem with livecount, of course, and I even allude to it in its blog post. Anyway, back to programming languages. The reactions to my post can be summarized as:

1. “Oh yes, these things are definite flaws in the language.”
2. “What you’re saying is true, but it’s not a problem. Your post is pointless.”
3. “You’re dumb. You don’t understand Go. Here let me explain your own blog post to you […]”

I respect the first two. The last one has to be from people who are too emotionally invested in their tools, and take articles like this as a personal attack of some sort. They go out of their way to be offended, and then start screaming “but I don’t fucking want guitar lessons!”. They want to counterattack against another programming language, thinking I would take it personally too. Maybe this heretic is a Java programmer, and that’s why he’s stupid? (bad guess). It also reminded me of PHP programmers back in the PHP 3.x days who would die on the hill of defending PHP as an awesome language, while admitting that they knew literally no other language. What should they know of England who only England know? I’m not offended. Those replies are not offensive; they’re boring. There’s nothing to learn from the comments, and probably not from the people making such comments in general, either. Also it seems that somebody managed to get my whole blog comment database deleted from Disqus. Either Disqus itself was hacked, or just my account with them. Or someone tricked Disqus into deleting it. They managed to restore it, though.
Keeping this third type of commenter in mind, I got an email a few days later asking what programming languages are “closer to ideal”, and if I maybe have a blog post in me about that. I don’t know who’s asking, so I replied with the long version, while still taking the question at face value. Ideal… Well, this is getting into the space of “what is the best programming language”, which doesn’t have a perfect answer. To do what? To make an Android app (something I’m not an expert in; I’ve just made one simple one), I think Kotlin seems nice. But I don’t know it very well. For web development, something I also don’t do much, it’s probably TypeScript. For maximum portability for systems programming, C or C++ (depending on whether all your target platforms (e.g. embedded stuff) support C++) is probably best. But these are practical answers. Some people like Lisp. Others like Haskell. Rust strikes a good balance between practicality, low-level control, safety, and a high-level type system. If the Rust compiler supports your platform, then it’s pretty much as portable as C/C++. I’ve written about the deficiencies of Java and Go because the ways they are deficient are interesting. I don’t find the ways C++ is deficient to be interesting. C++ is what it is. I happen to enjoy coding C++. I also have thoughts about the trap of accidentally writing too much in bash. I have a draft of things wrong with Rust, but so far I think they are all fixable (e.g. no placement syntax has been defined yet). But I have no interest in writing a blog post about the lack of memory safety of C++. It is what it is. Java’s deficiencies are interesting because they are the best guesses of the future that 1990 had to offer. And those guesses were almost all wrong. Go’s deficiencies are interesting/frustrating because it (almost entirely) is the best that the 1980s-early 1990s had to offer. And yet it launched in 2009. I don’t enjoy coding JavaScript.
So I’m experimenting with writing frontend stuff in Rust and compiling it to WASM. So far it works for me, but it is not something I’d recommend for anyone who wants to get anything done. But no, I won’t be writing up which programming language is “ideal”, because it’s one of those “it depends”.

qouteall notes 6 months ago

Higher-Level Design Patterns

Higher-level software design patterns: These patterns and ideas are often deeply connected and used together. It involves two different aspects: The benefit of turning computation (logic and action) into data: The benefit of turning execution state into explicit data: The distinction between computation and execution state is blurry. A closure can capture data. An execution state can be seen as a continuation, which is also a computation. Algebraic effect: An effect handler executes some code in a scope. Some code is executed under an effect handler. When it performs an effect, the control flow jumps to the effect handler, and the execution state (delimited continuation) up to the effect handler's scope is also saved. The effect handler can then resume using the execution state. A simple introduction to Algebraic effects Delimited continuation is the execution state turned into data. It's delimited because the execution state only includes the stack frames within the effect handler's scope. The continuation (without "delimited") contains the whole execution state of the whole program (assuming the program is single-threaded). Delimited continuation is "local". Continuation is "global". The "local" one is more fine-grained and useful. Continuation-passing style (CPS) is a way of representing programs. In CPS, each function accepts a continuation. Returning becomes calling the continuation; calling the continuation continues execution. The output of the continuation is the "final output of the whole program" (if IO or mutable state is involved, the "final output of the whole program" can be empty). See also: Configuration complexity clock When you try to handle many different business requests, one solution is to create a flexible rules engine. Configuring the rules engine can handle all of the requests. However, a tradeoff then becomes salient: DSLs are useful when they are high in abstraction level and new requirements mostly follow the abstraction. System calls are expensive.
Replacing system calls with data can improve performance: Mutation can be represented as data. Data can be interpreted as mutation. The benefits: About rollback: Sometimes, to improve user experience, we need to replay conflicted changes instead of rolling them back. It's more complex. In some places, we specify a new state and need to compute the mutation (diff). Examples: Mutate-by-recreate: Keep data immutable. Change it by recreating the whole data. In multi-threading, for read-heavy data, it's often beneficial to make the data structure immutable, but keep one mutable atomic reference to it. Updating recreates the whole data structure and atomically changes the reference. This is called read-copy-update (RCU) or copy-on-write (COW). In pure functional languages (e.g. Haskell), there is no direct way of mutating things. Mutation can only be simulated by recreating. Bitemporal modelling: Store two time records. One records the data and the time it was written to the database. The other records the data and the time at which it reflects reality. (Sometimes reality changes but the database isn't updated immediately. Sometimes the database contains wrong information that's corrected later.) Conflict-free replicated data type (CRDT): The mutations can be combined, and the result doesn't depend on the order of combining. It allows a distributed system to reach eventual consistency without immediate communication. In CRDT, the operator for combining mutations, ∗: Examples of CRDT: For example, in a multiplayer game, there is a door. The door's state can be open or closed (a boolean). Each operation is a tuple (state, timestamp). Combination is max-by-timestamp (for two operations, pick the higher-timestamp one). Considering that multiple players can act at exactly the same timestamp, we add the player ID as a tie-breaker. The operation now becomes (state, timestamp, player ID). Combination is max-by (timestamp, player ID); if the timestamps are equal, the larger player ID wins. Note that typical multiplayer game implementations don't use CRDT. The server holds the source-of-truth game state.
Clients send actions to servers. The server validates actions, changes game state and broadcasts to all clients. Drawing solid triangles to a framebuffer can also be seen as a CRDT. The whole framebuffer can be seen as an operation. Each pixel in the framebuffer has a depth value. Combining two framebuffers takes the lowest-depth pixel at each position. (Two framebuffers may have the same depth at the same pixel with different colors. We can use a unique triangle ID as a tie-breaker.) Note that actual rasterization in GPUs works with one centralized framebuffer, not using CRDT. In a collaborative text editing system, each character has an ID. It supports two kinds of operations: There is a "root character" at the beginning of the document. It's invisible and cannot be deleted. For two insertions after the same character, the tie-breaker is (timestamp, user ID). Higher-timestamp ones appear first. For the same timestamp, higher user ID ones appear first. It forms a tree. Each character is a node, containing a visibility boolean flag. Each operation is an edge pointing to a new character. The document is formed by traversing the tree in depth-first order (edges ordered by the tie-breaker) while hiding invisible characters. 3 4 Only compute some parts of the data, and keep the information of the remaining computation for future use. Deferred (async) computation vs immediate computation: A computation, an optimization, or a safety check can be done in: Most computations that are done at compile time can be done at runtime (with extra performance cost). But if you want to avoid the performance cost by doing it at compile time, it becomes harder. Rust and C++ have Statics-Dynamics Biformity (see also): most runtime computation methods cannot be easily used at compile time. Using compile-time mechanisms often requires data to be encoded in types, which then requires type gymnastics.
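The last-writer-wins combination rule from the door example earlier can be sketched as follows; class and field names are illustrative. The merge is commutative, associative, and idempotent, which is exactly what the CRDT laws require:

```java
public class LwwDemo {
    // A last-writer-wins register operation: the door state plus
    // (timestamp, playerId), compared lexicographically, as the total order.
    record Op(boolean open, long timestamp, int playerId) {}

    // merge() is commutative, associative, and idempotent: the result is
    // always the max-by-(timestamp, playerId) operation.
    static Op merge(Op a, Op b) {
        if (a.timestamp() != b.timestamp())
            return a.timestamp() > b.timestamp() ? a : b;
        // tie-breaker: larger player ID wins
        return a.playerId() >= b.playerId() ? a : b;
    }

    public static void main(String[] args) {
        Op x = new Op(true, 5, 1);
        Op y = new Op(false, 5, 2);
        // the order of merging doesn't matter
        System.out.println(merge(x, y).equals(merge(y, x))); // prints true
    }
}
```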
The ways that solve (or partially solve) the biformity between compile-time and runtime computation: Related: Symbolic execution, Program search using superposition Views in SQL databases are "fake" tables that represent the result of a query. The view's data is derived from other tables (or views). The generalized concept of view: A view takes one information model and presents it as another information model, allowing you to operate on it as that other model. The generalized view can be understood as: Examples of the generalized view concept: More generally: Dynamically-typed languages also have "types". The "type" here is the mapping between in-memory data and information. Even in dynamic languages, the data still has "shape" at runtime. The program only works with specific "shapes" of data. For example, in Python, if a function accepts a list of strings, but you pass it one string, then it treats each character as a string, which is wrong. Mainstream languages often have relatively simple and less expressive type systems. Some "shapes" of data are complex and cannot be easily expressed in mainstream languages' type systems (without type erasure). Dynamic languages' benefits: But the statically-typed languages and IDEs are improving. More expressive type systems reduce the friction of typing. Types can help catch mistakes, help with understanding code and help IDE functionality. Type inference and IDE completion save the time spent writing out types. That's why mainstream dynamic languages (JS, Python) are embracing type annotations. A view can be backed by either storage or computation (or a combination of storage and computation). Modern highly-parallel computation is often bottlenecked by IO and synchronization. Adding new computation hardware units is easy. Making information flow efficiently between these hardware units is hard. When memory IO becomes the bottleneck, re-computing rather than storing can be beneficial.
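The read-copy-update / copy-on-write pattern described under mutate-by-recreate earlier can be sketched with an atomic reference; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class RcuDemo {
    // Readers get an immutable snapshot (wait-free); writers copy, modify,
    // and atomically swap the reference. compare-and-set retries avoid
    // lost updates when two writers race.
    private final AtomicReference<List<String>> ref =
            new AtomicReference<>(List.of());

    List<String> snapshot() { return ref.get(); }

    void add(String item) {
        while (true) {
            List<String> old = ref.get();
            List<String> next = new ArrayList<>(old);
            next.add(item);
            if (ref.compareAndSet(old, List.copyOf(next))) return;
        }
    }

    public static void main(String[] args) {
        RcuDemo d = new RcuDemo();
        d.add("a");
        d.add("b");
        System.out.println(d.snapshot()); // prints [a, b]
    }
}
```

A reader holding an old snapshot keeps seeing a consistent view; it is never invalidated mid-read, which is the point of the pattern for read-heavy data.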
Most algorithms use the idea of producing an invariant, growing the invariant and maintaining the invariant: About the transitive rule: if X and Y both follow the invariant, then the result of "merging" X and Y also follows the invariant. "Transitive" is why the invariant can grow without re-checking the whole data. When the invariant is enforced at the language level, it can be "contagious". For example, invariants in business logic: Invariants in data: The timing of maintaining the invariant: The responsibility of maintaining the invariant: In the second case (application code maintains the invariant), to make it less error-prone, we can encapsulate the data and the invariant-maintaining code, and ensure that any usage of the encapsulated API won't violate the invariant. If some usages of the API can break the invariant and a developer can only know this by considering the implementation, then it's a leaky abstraction. For example, one common invariant to maintain is consistency between derived data and base data (source-of-truth). There are many solutions: In real-world legacy code, invariants are often not documented. They are implicit in the code. A developer not knowing an invariant can easily break it. Type systems also help maintain invariants. But a simple type system can only maintain simple invariants. Complex invariants require complex types to maintain. If it becomes too complex, types may be longer than the execution code, and type errors become harder to resolve. It's a tradeoff. Concentration and fat-tail distributions (80/20) are common in the software world: Many optimizations are based on assuming the high-probability case happens: GoF design patterns Other GoF design patterns briefly explained: It's possible to use separate threads for suspendable computation. However, OS threads are expensive and context switches are expensive. A manually-implemented state machine is faster. ↩ It's possible to use polling to fully avoid system calls after initial setup, but with costs. ↩ There are optimizations.
To avoid storing a unique ID for each character, it can store many immutable text blocks and use a (block, offset) pair as the character ID. Consecutive insertions and deletions can be merged. The tree keeps growing, and needs to be merged. The exact implementation is complex. ↩ Another way of doing collaborative text editing is Operational Transformation. It uses an index as the cursor. The server can transform a cursor to the latest version of the document: if there is an insertion before the index, the index increments accordingly; if there is a deletion, it decreases accordingly. This is also called index rebasing. ↩ Also: In hard disks, the magnetic field is viewed as bits. In CDs, the pits and lands are viewed as bits. In SSDs, the electrons' positions in floating gates are viewed as bits. In fiber optics, light pulses are viewed as bits. In quantum computers, the quantum state (like the spin of an electron) can be viewed as bits. ...... ↩ Specifically, bytes are viewed as code units, code units are viewed as code points, code points are viewed as strings. Code points can also be viewed as grapheme clusters. ↩ For beginners, a common misconception is that "if the software shows things on screen, then it's 90% done". In reality, a proof-of-concept is often just 20% done. There are so many corner cases in real usage. Not handling one corner case is a bug. Most code is for handling corner cases, not common cases. Although each specific corner case has a small probability of triggering, the probability of triggering at least one of the many corner cases is high. ↩ Computation-data duality. Mutation-data duality. Partial computation and multi-stage computation. Generalized View. Invariant production, growth, and maintenance. Turn computation (logic and action) into data. Turn execution state into explicit data. Closure (lambda expression, function value). A function along with captured data. It allows reusing a piece of code along with captured data.
It can help abstraction: separate the generation of computation (creating function values) and the execution of computation (executing the function). (It's related to partial computation and multi-stage computation.) Composition. Computation that's turned into data can be more easily composed. Functional programming encourages having simple building blocks and composing them into complex logic. Flexibility. Computation that's turned into data can be changed and rebuilt dynamically. Inspection: Explicit execution state is easier to inspect and display. (Machine code can be optimized and is platform-dependent, so the machine-code execution position and runtime stack are harder to inspect and manipulate than explicit data.) Serialization: Explicit execution state can be serialized and deserialized, thus stored to a database and sent across the network. (Example: Restate) Suspension: Explicit execution state allows temporarily suspending execution and resuming it later. Suspending a thread is harder and less efficient 1 . Modification: Explicit execution state can be modified. It makes cancellation and rollback easier. (Modifying the execution stack and execution state directly is harder, and it's not supported by many mainstream languages.) Forking: Allows forking control flow, which can be useful in some kinds of simulations. If the rules engine is high in abstraction level, doing a lot of predefined things under the hood, then it will be unadaptive when a new requirement clashes with predefined behavior. Simple interface = hardcoded defaults = less customizability. If the rules engine is low in abstraction level, then doing things will require more configuration. It's not more convenient than just coding. It becomes a new DSL. The new DSL is often worse than mainstream languages because: The new DSL often has poor debug support. The new DSL often has no IDE support. There is no existing library ecosystem, so you need to reinvent wheels. The new DSL is often less battle-tested and more buggy.
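The core "turn computation into data" idea can be sketched as a tiny expression tree (a hypothetical example, assuming Java 21 record patterns): building the tree and evaluating it are separate stages, which is what makes the computation inspectable, composable, and rebuildable before it ever runs.

```java
public class ExprDemo {
    // Computation represented as data: an expression tree. The tree can be
    // inspected, transformed, or serialized before it is interpreted.
    sealed interface Expr {}
    record Lit(int value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}
    record Mul(Expr left, Expr right) implements Expr {}

    // A separate stage interprets the data as computation.
    static int eval(Expr e) {
        return switch (e) {
            case Lit(int v) -> v;
            case Add(Expr l, Expr r) -> eval(l) + eval(r);
            case Mul(Expr l, Expr r) -> eval(l) * eval(r);
        };
    }

    public static void main(String[] args) {
        // built now, executed later: (1 + 2) * 4
        Expr program = new Mul(new Add(new Lit(1), new Lit(2)), new Lit(4));
        System.out.println(eval(program)); // prints 12
    }
}
```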
io_uring: allows submitting many IO tasks by writing into memory, then using one system call to submit them. 2 Graphics APIs: Old OpenGL used system calls to change state and dispatch draw calls. New graphics APIs like Vulkan, Metal and WebGPU all use command buffers. Operations are turned into data in a command buffer, then one system call submits many commands. Instead of just doing in-place mutation, we can enqueue a command (or event) to do the mutation later. The command is then processed to do the actual mutation. (This is also moving computation between stages.) Layered filesystems (in Docker). Mutating or adding a file creates a new layer. The unchanged previous layers can be cached and reused. Event sourcing. Derive the latest state from events (a log of mutations). Express the latest state as a view of old state + mutations. The idea is adopted by database WALs, data replication, the Lambda architecture, etc. Command Query Responsibility Segregation. The system has two facades: the query facade doesn't allow mutation, and the command facade only accepts commands and doesn't return data. Easier to inspect, audit and debug mutations, because mutations are explicit data, not implicit execution history. Easier to audit and replay. Can replay mutations and roll back easily. Can replicate (sync) data changes without sending the full data. Transactional databases allow rolling back an uncommitted transaction. (MySQL InnoDB does in-place mutation on disk but writes an undo log and redo log. PostgreSQL MVCC writes are append-only on disk.) Editing software often needs to support undo. It's often implemented by storing the previous step's data, while sharing unchanged substructure to optimize. A multiplayer game client that does server-state prediction (to reduce visible latency) needs to roll back when the prediction is invalidated by the server's message. CPUs do branch prediction and speculative execution. If branch prediction fails or there is some other failure, they internally roll back.
(The Spectre and Meltdown vulnerabilities are caused by rollback not cancelling side effects in the cache, which can be measured via access timing.)

- Git. Computes diffs based on file snapshots. The diff can then be manipulated (e.g. merging, rebasing, cherry-picking).
- React. Computes a diff from a virtual data structure and applies it to the actual DOM, syncing changes from the virtual data structure to the actual DOM.
- Kubernetes. You configure what pods/volumes/... should exist. Kubernetes observes the diff between reality and configuration, then takes actions (e.g. launching a new pod, destroying a pod) to cover the diff.

- It must be commutative: a * b = b * a. The order of the operands doesn't matter.
- It must be associative: a * (b * c) = (a * b) * c. The grouping doesn't matter.
- It must be idempotent: a * a = a. Duplicating a mutation won't affect the result. (Idempotence is not needed if you ensure exactly-once delivery.)

- Insertion inserts a new character after the character with a given id. The id is globally unique.
- Deletion only marks the character's invisible flag (keeping the tombstone).

In lazy evaluation, unobserved data is not computed.

- Deferred mutation. Relates to mutation-data duality.
- Replace immediately executed code with data (an expression tree, a DSL, etc.) that will be executed (interpreted) later. Relates to computation-data duality.
- In multi-stage programming, some data is fixed while some data is unknown. The fixed data can be used for optimization. It can be seen as runtime constant folding. JIT can be seen as treating bytecode as a runtime constant and folding it into the interpreter code.
- Replacing a value with a function or expression tree helps handle currently-unknown data.
- Using a future (promise) object to represent a pending computation.
- In Idris, having a hole and inspecting the type of the hole can help with proving.

Immediately freeing memory is immediate computation.
GC is deferred computation.

- Stream processing is immediate computation; batch processing is deferred computation.
- Most of PyTorch's matrix operations are asynchronous. The GPU computes in the background. A tensor object's content may not yet be known (the CPU will wait for the GPU when you try to read its content).
- PostgreSQL and SQLite require a deferred "vacuum" that rearranges storage space.
- Mobile GPUs often do tiled rendering. After the vertex shader runs, the triangles are not immediately rasterized but dispatched to tiles (one triangle can go to multiple tiles). Each tile then rasterizes and runs the pixel shader separately. This can reduce memory bandwidth requirements and power consumption.

- Pre-compile stage. (Code generation, IDE linting, etc.)
- Compile stage. (Compile-time computation, macros, dependent-type theorem proving, etc.)
- Runtime stage. (Runtime checks, JIT compilation, etc.)
- After first run. (Offline profile-guided optimization, etc.)

- Zig compile-time computation and reflection. See also dependently-typed languages (e.g. Idris, Lean).
- Scala multi-stage programming. See also. It's at runtime, not compile time, but its purpose is similar to macros and code generation: dynamically compose code at runtime, which then gets JITed.

Encapsulating information: hiding the true underlying information from you and exposing only derived information. "Faking" information.

- Bits are views of voltage in circuits. 5
- Integers are views of bits.
- Characters are views of integers.
- Strings are views of characters. 6
- Other complex data structures are views of binary data. (A pointer can be seen as a view of the pointed-to data.)
- A map (dictionary) can be viewed as a function.
- Lookup acceleration structures (e.g. database indices) are also views of the underlying data.
- A cache is a view of the underlying data/computation.
- Lazy evaluation provides a view of the computation result.
- Virtual memory is a view of physical memory.
- A file system is a view of data on disk.
- Not-on-disk data can also be viewed as files (the Unix everything-is-a-file philosophy).
- A symbolic link in a file system is a view of another point in the file system.
- Databases provide generalized views of on-disk/in-memory data.
- Linux namespaces, hypervisors, sandboxes, etc. provide views of aspects of the system.
- Proxies, NAT, firewalls, virtualized networking, etc. provide manipulated views of the network.
- Transaction isolation in databases provides views of data (e.g. snapshot isolation).
- Replicated data and redundant data are views of the original data.
- Multi-tier storage systems. From small-and-fast to large-and-slow: register, cache, memory, disk, cloud storage.
- The previously mentioned computation-data duality and mutation-data duality can also be seen as viewing.
- Transposing in PyTorch (by default) doesn't change the underlying matrix data; it only changes how the data is viewed.

The mapping between binary data and information is a view. Information is bits + context. The context is how the bits are mapped to information. A type encodes the viewing from binary data to information. Abstraction involves viewing different things as the same thing.

- Avoid the shackles of an unexpressive type system.
- Avoid syntax inconvenience related to type erasure (type erasure in typed languages requires inconvenient things like type conversion).
- Can quickly iterate by changing one part of the program before the changes work with the other parts (in statically typed languages, you need to resolve all compile errors, even in code you aren't using right now). This is a double-edged sword: broken code that was not tested tends to get missed.
- Save some time typing types and type definitions.

- Reference in GC languages. It may be implemented with a pointer, a colored pointer, or a handle (object ID). The pointer may be changed by a moving GC, but the in-language semantics don't change after moving.
- ID. All kinds of IDs (string IDs, integer IDs, UUIDs, etc.) can be seen as references to an object.
The ID may still exist after the referenced object is removed; the ID then becomes a "dangling ID".

- A special kind of ID is a path. For example, a file path points to a file, a URL points to a web resource, a permission path points to a permission, etc. They are the "pointers" to a node in a hierarchical (tree-like) structure.
- Content-addressable ID. Using the hash of an object as the ID of the object. This is used in Git, blockchains, IPFS and languages like Unison.
- Iterator. An iterator can be seen as a pointer to an element in a container.
- Zipper. A zipper contains two things: 1. a container with a hole; 2. the element at the position of the hole. Unlike an iterator, a zipper contains the information of the whole container. It's often used in pure functional languages.

- Produce invariant. Create the invariant at the smallest scale, in the simplest case.
- Grow invariant. Combine or expand small invariants to make them larger. This often utilizes the transitive rule. Do this until the invariant becomes big enough to finish the task.
- Maintain invariant. Every mutation to a data structure needs to maintain its invariant.

- Merge sort. Create sorted sub-sequences at the smallest scale (e.g. two elements). Then merge two sorted sub-sequences into a bigger one, and continue. The invariant of sorted-ness grows until it covers the whole sequence.
- Quick sort. Select a pivot, then partition the sequence into a part smaller than the pivot and a part larger than the pivot (and a part equal to the pivot). Partitioning creates the invariant LeftPartElements < Pivot < RightPartElements. By recursively creating such invariants down to the smallest scale (individual elements), the whole sequence becomes sorted.
- Binary search tree.
It creates the invariant LeftSubtreeElements ≤ ParentNode ≤ RightSubtreeElements. When there is only one node, the invariant is produced at the smallest scale. Every insertion then follows that invariant, growing and maintaining it.

- Dijkstra's algorithm. The visited nodes are the nodes whose shortest path from the source node is known. Using the nodes whose shortest paths are known, it "expands" on the graph, learning new nodes' shortest paths from the source. The algorithm iteratively adds new nodes into the invariant until it expands to the destination node.
- Dynamic programming. The problem is separated into sub-problems. There are no cyclic dependencies between sub-problems. One problem's result can be quickly calculated from its sub-problems' results (e.g. max, min).
- Querying a hash map can skip data because hash(a) ≠ hash(b) implies a ≠ b.
- Querying an ordered search tree can skip data because (a < b) ∧ (b < c) implies a < c.
- Parallelization often utilizes associativity: a * (b * c) = (a * b) * c. For example, a * (b * (c * d)) = (a * b) * (c * d), where a * b and c * d don't depend on each other and can be computed in parallel. Examples: sum, product, max, min, max-by, min-by, list concat, set union, function composition, logical and, logical or. (Associativity with an identity element is a monoid.)

- A user name cannot be duplicated.
- A bank account balance should never be negative. No over-spending.
- A product inventory count should never be negative. No over-selling.
- One hotel room cannot be booked twice with overlapping times.
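One of the invariants above, "a bank account balance should never be negative", can illustrate invariant maintenance: every mutating operation checks the invariant before committing. This is an illustrative sketch (the `Account` class is made up for this example):

```python
# Maintain an invariant on every mutation: the balance never goes
# negative. Over-spend attempts are rejected before any state changes.

class Account:
    def __init__(self, balance=0):
        assert balance >= 0  # invariant holds at the smallest scale
        self.balance = balance

    def deposit(self, amount):
        assert amount > 0
        self.balance += amount  # cannot violate the invariant

    def withdraw(self, amount):
        assert amount > 0
        if amount > self.balance:
            # Reject the mutation; the invariant is preserved.
            raise ValueError("over-spend rejected")
        self.balance -= amount

acct = Account(100)
acct.withdraw(30)
print(acct.balance)  # 70
try:
    acct.withdraw(1000)
except ValueError:
    print("rejected")  # balance is still 70
```

Here the invariant is enforced inside the data structure itself, so no caller can break it; the error-prone alternative is to trust every call site to check.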
- The product should be shipped after the order is paid.
- No lost or duplicated notifications.
- A user can't view or change information outside their permissions.
- A user cannot use a functionality after their subscription ends.
- Redundant data, derived data and acceleration data structures (indices, caches) should stay consistent with the base data (the source of truth).
- Client-side data should be consistent with server-side data.
- Memory safety invariants. A pointer should point to valid data. No use-after-free; free only once; etc. Unused memory should be freed.
- Thread safety invariants. Many operations involve 3 stages: read-compute-write. They malfunction when another thread mutates the data between the read and the write, because the previous read result is no longer valid. Some operations involve more stages (many reads and writes); if the partially-modified state is exposed, the invariant is also violated. Some non-thread-safe data structures should not be shared between threads; some must be accessed under a lock.
- The modifications of one action should be cancelled after a subsequent action fails. (Ad-hoc transactions, application-managed rollback.)

- Immediate invariant maintenance.
- Delayed invariant maintenance (tolerating stale data: caches, batched processing).

- The database/framework/OS/language etc. is responsible for maintaining the invariant. For example, a database maintains the validity of indices and materialized views. If these tools don't have bugs, the invariant won't be violated.
- The application code is responsible for maintaining the invariant. This is more error-prone.

- Make the derived data always compute-on-demand. There is no longer any need to manually maintain the invariant, but it may cost performance. Caching immutable computation results can make it faster while still preserving compute-on-demand semantics.
- Store the derived data as mutable state and manually keep it consistent with the source of truth. This is the most error-prone solution.
All modifications to base data should "notify" the derived data to update accordingly. Sometimes notifying means calling a function; sometimes it involves networking.

- A more complex case: the derived data needs to be modified in a way that is reflected back to the base data. This violates the single source of truth and is even more error-prone.
- Even more complex: the client side needs to reduce visible latency by predicting server-side data, and wrong predictions need to be corrected by server-side data. This not only violates the single source of truth but also often requires rollback-and-replay mechanisms.
- Relying on other tools (database/framework/OS/language etc.) to maintain the invariant, as previously mentioned.

- Most users use only a few common features of a software.
- Most complexity (and bugs) comes from very few feature requirements.
- Most CPU time is spent executing a small amount of hot code.
- Most data accesses target a small amount of hot data (caches require this to be effective).
- Most branches are biased toward one side in execution (branch prediction requires this to be effective).
- Most developers use few languages, libraries and frameworks. (The Matthew effect of ecosystems.)
- Most code and development effort goes into fixing edge cases; little goes into main-case handling. 7
- Most bugs that users see are caused by a few easy-to-trigger bugs.
- Only a small portion of the transistors in hardware are in use most of the time. Many transistors are rarely used. (Many transistors are for rarely-used instructions. Hardware defects related to them have a higher probability of evading tests. See also: Silent Data Corruptions at Scale, Cores that don't count.)

- Branch prediction assumes that the high-probability branch will be taken. If it predicts wrongly, speculative execution rolls back.
- Caches assume that hot data will be accessed. Accesses outside the hot data miss the cache.
- Optimistic concurrency control assumes there will be no concurrency conflict. If there is a conflict, it rolls back and retries.
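The optimistic scheme can be sketched with a version-checked compare-and-swap: read a version, compute, then write only if the version is unchanged; on conflict, discard the work and retry. This is an illustrative single-process sketch (the `VersionedCell` class and `occ_update` helper are made-up names, not any library's API):

```python
# Optimistic concurrency control sketch: writes succeed only if the
# version seen at read time is still current; otherwise retry.

class VersionedCell:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def try_write(self, new_value, expected_version):
        if self.version != expected_version:
            return False  # conflict: someone wrote in between
        self.value = new_value
        self.version += 1
        return True

def occ_update(cell, fn):
    while True:  # optimistic: assume no conflict; retry if wrong
        value, version = cell.read()
        if cell.try_write(fn(value), version):
            return

cell = VersionedCell(10)
occ_update(cell, lambda v: v + 5)
print(cell.read())  # (15, 1)

value, version = cell.read()
cell.try_write(99, version)          # a concurrent writer commits first
print(cell.try_write(100, version))  # False -- stale version, must retry
```

The trade-off is as described: no locks are held while computing, so there is little waiting, but under heavy contention the retries become wasted work.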
It requires less waiting and communication than pessimistic concurrency control (locking), unless there is much contention.

Computation-data duality:

- Factory pattern. Turns object-creation code into a factory object.
- Prototype pattern. Turns object-creation code into copying a prototype.
- Chain of Responsibility pattern. Multiple processor objects process command objects in a pipeline.
- Command pattern. Turns a command (action) into an object.
- Interpreter pattern. Interprets data as computation.
- Iterator pattern / generator. Turns iteration code into a state machine.
- Strategy pattern. Turns a strategy into an object.
- Observer pattern. Turns event-handling code into an observer.

Mutation-data duality:

- Command pattern. It also involves computation-data duality; the command can represent a mutation, an action or a computation.

Partial computation and multi-stage computation:

- Builder pattern. Turns the process of creating an object into multiple stages; each stage builds a part.
- Chain of Responsibility pattern. The processing is separated into a pipeline. (It also fits computation-data duality.)

Generalized view:

- Adapter pattern. Views one interface as another interface.
- Bridge pattern. Views different underlying implementations as one interface.
- Composite pattern. Views multiple objects as one object.
- Decorator pattern. Wraps an object, changing its behavior while viewing it as the same kind of object.
- Facade pattern. Views multiple complex interfaces as one simple interface.
- Proxy pattern. The proxy object provides a view into other things.
- Template method pattern. Views different implementations as the same interface methods.
- Flyweight pattern. Saves memory by sharing common data; views shared data as owned data.

Invariant production, growth and maintenance: there is no GoF pattern that tightly corresponds to it.

- Visitor pattern. Can be replaced by pattern matching.
- State pattern. Makes state a polymorphic object.
- Memento pattern. Backs up the state to allow rollback.
It exists mainly because in OOP data is tied to behavior; the separate memento is just data, decoupled from behavior. The memento pattern doesn't involve turning mutation into data, so it's not mutation-data duality.

- Singleton pattern. Similar to a global variable, but can be late-initialized, can be polymorphic, etc.
- Mediator pattern. One abstraction to centrally manage other abstractions.

1. It's possible to use separate threads for suspendable computation. However, OS threads are expensive and context switches are expensive. A manually-implemented state machine is faster. ↩
2. It's possible to use polling to fully avoid system calls after the initial setup, but with costs. ↩
3. There are optimizations. To avoid storing a unique ID for each character, it can store many immutable text blocks and use the block ID plus an offset as the character ID. Consecutive insertions and deletions can be merged. The tree keeps growing and needs to be merged. The exact implementation is complex. ↩
4. Another way of doing collaborative text editing is Operational Transformation. It uses an index as the cursor. The server can transform a cursor to the latest version of the document: if there is an insertion before the index, the index increments accordingly; if there is a deletion, the index decreases accordingly. This is also called index rebasing. ↩
5. Also: in a hard disk, the magnetic field is viewed as bits. In a CD, the pits and lands are viewed as bits. In an SSD, the electrons' positions in floating gates are viewed as bits. In fiber optics, light pulses are viewed as bits. In a quantum computer, the quantum state (like the spin of an electron) can be viewed as bits. ...... ↩
6. Specifically, bytes are viewed as code units, code units are viewed as code points, and code points are viewed as strings. Code points can also be viewed as grapheme clusters. ↩
7. For beginners, a common misconception is that "if the software shows things on screen, then it's 90% done". In reality, a proof-of-concept is often just 20% done. There are so many corner cases in real usage. Not handling one corner case is a bug.
Most code is for handling corner cases, not common cases. Although each specific corner case has a small probability of being triggered, triggering at least one of the many corner cases is highly probable. ↩
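The index-rebasing idea from the Operational Transformation footnote can be sketched as follows. This is illustrative code, not a full OT implementation, and `rebase_index` is a made-up name: a cursor index is transformed against each edit that happened before it in the document.

```python
# Index rebasing sketch: transform a cursor index against an edit.
# An insertion before the cursor shifts it right; a deletion before
# it shifts it left (clamped to the deletion's overlap).

def rebase_index(index, op):
    kind, pos, length = op
    if kind == "insert" and pos <= index:
        return index + length
    if kind == "delete" and pos < index:
        return index - min(length, index - pos)
    return index  # the edit is after the cursor: no change

cursor = 10
cursor = rebase_index(cursor, ("insert", 3, 2))  # 2 chars inserted at 3
print(cursor)  # 12
cursor = rebase_index(cursor, ("delete", 0, 4))  # 4 chars deleted at 0
print(cursor)  # 8
```

A real OT server applies such transforms for every concurrent edit, in order, so that every client's cursors and pending operations stay valid against the latest document version.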
