AWS re:Invent, Agents for AWS, Nova Forge
AWS re:Invent sought to present AI solutions in the spirit of AWS' original impact on startups; the real targets may be the startups from that era, not the current one.
OpenAI is declaring code red and doubling down on ChatGPT, highlighting the company's bear case. Then, AWS makes it easier to run AI workloads on other clouds.
In the interests of clarity: I am a former NASA engineer/scientist with a PhD in space electronics. I also worked at Google for 10 years, in various parts of the company including YouTube and the bit of Cloud responsible for deploying AI capacity, so I'm quite well placed to have an opinion here.

The short version: putting datacenters in space is an absolutely terrible idea, and makes zero sense whatsoever. There are multiple reasons for this, but they all amount to saying that the kind of electronics needed to make a datacenter work, particularly one deploying AI capacity in the form of GPUs and TPUs, is exactly the opposite of what works in space. If you've not worked specifically in this area before, I'll caution against making gut assumptions, because the reality of making hardware actually function in space is not intuitively obvious.

The first reason that seems to come up for doing this is abundant access to power in space. This really isn't the case. You basically have two options: solar and nuclear. Solar means deploying an array of photovoltaic cells – essentially equivalent to what I have on the roof of my house here in Ireland, just in space. It works, but it isn't somehow magically better than installing solar panels on the ground – you don't lose that much power through the atmosphere, so intuition about the area needed transfers pretty well. The biggest solar array ever deployed in space is that of the International Space Station (ISS), which at peak can deliver a bit over 200kW. It took several Shuttle flights and a lot of work to deploy – it measures about 2,500 square metres, roughly half the size of an American football field. Taking the NVIDIA H200 as a reference, per-GPU power requirements are on the order of 0.7kW per chip. These won't work on their own, and power conversion isn't 100% efficient, so in practice 1kW per GPU is a better baseline.
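This arithmetic is easy to sanity-check. A few lines of Python, using only the figures quoted in this piece (200kW peak for the ISS array, 1kW per GPU after conversion losses, 72 GPUs per preconfigured rack, 100,000 GPUs for the Norway site):

```python
# Back-of-envelope power budget, using the figures quoted above.
ISS_ARRAY_KW = 200        # peak output of the ISS solar array
KW_PER_GPU = 1.0          # H200 (~0.7kW) plus conversion losses
GPUS_PER_RACK = 72        # NVIDIA preconfigured rack
TARGET_GPUS = 100_000     # OpenAI's planned Norway datacenter

gpus_per_satellite = int(ISS_ARRAY_KW / KW_PER_GPU)
satellites_needed = TARGET_GPUS // gpus_per_satellite
racks_per_satellite = gpus_per_satellite / GPUS_PER_RACK

print(gpus_per_satellite)             # 200 GPUs per ISS-sized array
print(satellites_needed)              # 500 satellites to match Norway
print(round(racks_per_satellite, 1))  # ~2.8 ground racks per satellite
```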
A huge, ISS-sized array could therefore power roughly 200 GPUs. This sounds like a lot, but let's keep some perspective: OpenAI's upcoming Norway datacenter is intended to house 100,000 GPUs, probably each more power-hungry than the H200. To equal this capacity, you'd need to launch 500 ISS-sized satellites. For contrast, a single server rack (as sold by NVIDIA preconfigured) houses 72 GPUs, so each monster satellite is only equivalent to roughly three racks.

Nuclear won't help. We are not talking nuclear reactors here – we are talking about radioisotope thermoelectric generators (RTGs), which typically have a power output of about 50W-150W. Not enough to run even a single GPU, even if you can persuade someone to give you a subcritical lump of plutonium and not mind you having hundreds of chances to scatter it across a wide area when your launch vehicle explosively self-disassembles.

Thermal Regulation

I've seen quite a few comments on this concept where people say things like, "Well, space is cold, so that will make cooling really easy, right?" Really, really no.

Cooling on Earth is relatively straightforward. Air convection works pretty well – blowing air across a surface, particularly one designed to have a large surface-area-to-volume ratio like a heatsink, transfers heat from the heatsink to the air quite effectively. If you need more power density than can be directly cooled this way (and higher-power GPUs are definitely in that category), you can use liquid cooling to transfer heat from the chip to a larger radiator/heatsink elsewhere. In datacenters on Earth, it is common to set up cooling loops where machines are cooled by chilled coolant (usually water) pumped around racks, with the heat extracted and cold coolant returned to the loop. Typically the coolant is cooled via convection to the air, so one way or another this is how things work on Earth.

In space, there is no air.
The environment is close enough to a hard, total vacuum as makes no practical difference, so convection just doesn't happen. On the space engineering side, we typically think about thermal management, not just cooling. The thing is, space doesn't really have a temperature as such – only materials have a temperature. It may come as a surprise, but in the Earth-Moon system the average temperature of pretty much anything tends toward the average temperature of Earth, for the same reason Earth has that particular temperature: the balance between absorbed sunlight and radiated heat. If a satellite is rotating, a bit like a chicken on a rotisserie, it will tend toward a consistent temperature roughly similar to that of the Earth's surface. If it isn't rotating, the side pointing away from the Sun will get progressively colder, limited by the cosmic microwave background at around 4 Kelvin, just a little above absolute zero. On the sunward side, things can get a bit cooked, reaching hundreds of degrees Celsius.

Thermal management therefore requires very careful design, making sure that heat is carefully directed where it needs to go. Because there is no convection in a vacuum, this can only be achieved by conduction, or via some kind of heat pump.

I've designed hardware that has flown in space. In one particular case, I designed a camera system that needed to be very small and lightweight whilst still providing science-grade imaging capabilities. Thermal management was front and centre in the design process – it had to be, because power is scarce in small spacecraft, and thermal management has to be achieved whilst keeping mass to a minimum. So no heat pumps or fancy stuff for me – I went in the other direction, designing the system to draw a maximum of about 1 watt at peak, dropping to around 10% of that when the camera was idle.
All this electrical power turns into heat, so if I draw 1 watt only while capturing an image, then turn the image sensor off as soon as the data is in RAM, I can halve the consumption; once the image has been downloaded to the flight computer I can turn the RAM off and drop the power to a comparative trickle. The only thermal management needed was bolting the edge of the board to the chassis so the internal copper planes could conduct away any heat generated.

Cooling even a single H200 will be an absolute nightmare. Clearly a heatsink and fan won't do anything at all, but there is a liquid-cooled H200 variant. Say we used that. The heat would need to be transferred to a radiator panel – this isn't like the radiator in your car, no convection, remember? – which radiates heat into space. Let's assume we can point it away from the Sun. The Active Thermal Control System (ATCS) on the ISS is an example of such a system: very complex, using an ammonia cooling loop and a large radiator panel assembly. It has a dissipation limit of 16kW – roughly 16 H200 GPUs, a bit under a quarter of a ground-based rack. The radiator panel system measures 13.6m x 3.12m, i.e., roughly 42.5 square metres. If we use 200kW as a baseline and assume all of that power is fed to GPUs, we'd need a system 12.5 times bigger – roughly 531 square metres, about a fifth of the area of the 2,500-square-metre solar array on top of it. This is now going to be a very large satellite, with more deployed panel area than the entire ISS, all for the equivalent of three standard server racks on Earth.

Radiation Tolerance

This is getting into my PhD work now. Assuming you can both power and cool your electronics in space, you have the further problem of radiation tolerance. The first question is: where in space?
If you are in low Earth orbit (LEO), you are below the inner radiation belt, and the radiation dose is similar to that experienced by high-altitude aircraft – more than an airliner, but not terrible. Further out, in medium Earth orbit (MEO), where the GPS satellites live, you are not protected by the Van Allen belts – worse, this orbit is literally inside them. Outside the belts, you are essentially in deep space (details vary with how close to the Sun you happen to be, but the principles are similar).

There are two main sources of radiation in space: our own star, the Sun, and deep space. Both deliver charged particles moving at a substantial fraction of the speed of light, from electrons up to the nuclei of atoms with masses roughly up to that of oxygen. These can cause damage directly, by smashing into the material from which chips are made, or indirectly, by travelling through the silicon die without hitting anything but still leaving a trail of charge behind.

The most common consequence is a single-event upset (SEU): a direct impact or (more commonly) a passing particle briefly (around 600 picoseconds) causes a pulse where there shouldn't be one, and if that flips a stored bit, we call it an SEU. Other than corrupting data, SEUs don't cause permanent damage.

Worse is single-event latch-up. This happens when a pulse from a charged particle pushes a voltage outside the power rails powering the chip, causing a transistor essentially to turn on and stay on indefinitely. I'll skip the semiconductor physics involved, but the short version is that if this happens in a bad way, you can get a pathway between the power rails that shouldn't be there, burning out a gate permanently. This may or may not destroy the chip, but without mitigation it can render it unusable.
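One classic mitigation for SEUs is redundancy with majority voting – the same idea behind the circuit-level fine-grained redundancy mentioned later, though real rad-hard chips do it in gates, not software. A toy sketch of the voting idea (illustrative only, not flight code): keep three copies of a value and take a bitwise majority on read, so a single flipped bit in one copy is outvoted by the other two.

```python
# Toy triple-modular-redundancy (TMR) sketch: three copies of each
# word, bitwise majority vote on read. A single-event upset that
# flips bits in one copy is masked by the two intact copies.
def tmr_write(value: int) -> list[int]:
    return [value, value, value]

def tmr_read(copies: list[int]) -> int:
    a, b, c = copies
    # A result bit is 1 iff at least two of the three copies agree.
    return (a & b) | (a & c) | (b & c)

word = tmr_write(0b1011_0010)
word[1] ^= 0b0100_0000        # simulate an SEU flipping one bit
print(bin(tmr_read(word)))    # 0b10110010 – the upset is outvoted
```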
For longer-duration missions – which space-based datacenters would be, since they would be so expensive that they'd have to fly for a long time to be economically viable – it's also necessary to consider total dose effects. Over time, the performance of chips in space degrades, because repeated particle impacts make the tiny field-effect transistors switch more slowly and turn on and off less completely. In practice, maximum viable clock rates decay over time, and power consumption increases. Though not the hardest issue to deal with, this must still be mitigated, or a chip that was working fine at launch stops working because either the power supply or cooling has become inadequate, or the clock is running faster than the chip can cope with. It's therefore necessary to have a clock generator that can throttle down to a lower speed as needed – this can also be used to control power consumption, so rather than ceasing to function, the chip just gets slower.

The next FAQ: can't you just use shielding? No, not really – or only up to a point. Some kinds of shielding can make the problem worse: an impact on the shield can cause a shower of secondary particles that then cause multiple impacts at once, which is far harder to mitigate. The very strongest cosmic rays can go through an astonishing amount of solid lead, and since mass is always at a premium, it's rarely possible to deploy significant amounts of shielding – so radiation tolerance must be built into the system (often described as Radiation Hardening By Design, RHBD).

GPUs and TPUs, and the high-bandwidth RAM they depend on, are absolutely worst-case for radiation tolerance purposes. Small-geometry transistors are inherently much more prone both to SEUs and to latch-up. The very large silicon die area also makes impacts more frequent, since impact rate scales with area.
Chips genuinely designed to work in space are taped out with different gate structures and much larger geometries. The processors typically used have roughly the performance of a PowerPC from 2005. Bigger geometries are inherently more tolerant, both to SEUs and to total dose, and the different gate topologies are immune to latch-up whilst providing some degree of SEU mitigation via fine-grained redundancy at the circuit level. Taping out a GPU or TPU with this kind of approach is certainly possible, but the performance would be a tiny fraction of that of a current-generation Earth-based GPU/TPU.

There is also the you-only-live-once approach (my terminology!), where you launch the thing and hope for the best. This is commonplace in small cubesats – and it is also why small cubesats often fail after a few weeks on orbit. Caveat emptor!

Communications

Most satellites communicate with the ground via radio. It is difficult to get much more than about 1Gbps reliably. There is some interesting work using lasers to communicate with satellites, but this depends on good atmospheric conditions to be feasible. Contrast this with a typical server rack on Earth, where 100Gbps rack-to-rack interconnect would be considered the low end, and it's easy to see another significant gap.

Conclusions

I suppose this is just about possible if you really want to do it, but I think I've demonstrated that it would be extremely difficult to achieve, disproportionately costly in comparison with Earth-based datacenters, and offer mediocre performance at best. If you still think it's worth doing, good luck – space is hard. Myself, I think it's a catastrophically bad idea, but you do you.
Oasis: Pooling PCIe Devices Over CXL to Boost Utilization
Yuhong Zhong, Daniel S. Berger, Pantea Zardoshti, Enrique Saurez, Jacob Nelson, Dan R. K. Ports, Antonis Psistakis, Joshua Fried, and Asaf Cidon
SOSP'25

If you are like me, you've dabbled with software prefetching but never had much luck with it. Even if you care nothing about sharing PCIe devices across servers in a rack, this paper is still interesting because it shows a use case where software prefetching really matters.

I suppose it is common knowledge that CXL enables all of the servers in a rack to share a pool of DRAM. The unique insight of this paper is that once you've taken this step, sharing of PCIe devices (e.g., NICs, SSDs) can be implemented "at near-zero extra cost."

Fig. 2 shows how much SSD capacity and NIC bandwidth are stranded in Azure. A stranded resource is one that is underutilized because some other resource (e.g., CPU or memory) is the bottleneck. Customers may not be able to precisely predict the ratios of CPU:memory:SSD:NIC resources they will need. Even if they could, a standard VM size may not be available that exactly matches the desired ratio.

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Additionally, servers may have redundant components which are used in case of a hardware failure. The paper cites servers containing redundant NICs to avoid a server disappearing off the network if a NIC fails.

Pooling of PCIe devices could help with both of these problems. The VM placement problem is easier if resources can be allocated dynamically within a rack, rather than at the server level. Similarly, a rack could contain redundant devices available to any server in the rack that experiences a hardware failure.

Datapath

Fig. 4 shows the Oasis architecture:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

In this example, a VM or container running on host A uses the NIC located on host B.
An Oasis frontend driver on host A and a backend driver on host B make the magic happen. The communication medium is a shared pool of memory that both hosts can access over CXL. The shared memory pool stores both the raw network packets and message queues containing pointers to those packets.

A tricky bit in this design is the assumption that the CPU caches in the hosts do not have a coherent view of the shared memory pool (i.e., there is no hardware cache coherence support). This quote sums up the reasoning behind the assumption:

Although the CXL 3.0 specification introduces an optional cross-host hardware coherence flow [11], the implementation requires expensive hardware changes on both the processor and the device [74, 105, 143, 145]. To make Oasis compatible with hardware available today, we do not assume cache-coherent CXL devices.

Cache Coherency

Here is the secret sauce that Oasis uses to efficiently send a message from the frontend driver to the backend driver. Note that this scheme is used for the message channels (i.e., descriptors, packet metadata). The shared memory pool is mapped by both drivers as cacheable.

The frontend driver writes the message into shared memory, increments a tail pointer (also stored in shared CXL memory), and then forces the containing cache lines to be written to the shared memory pool with a cache-line write-back instruction.

The backend driver polls the tail pointer. If polling reveals no new messages, the driver invalidates the cache line containing the tail pointer (a cache-line flush followed by a fence). This handles the case where new messages actually are available, but the backend driver is reading a cached (stale) copy of the tail pointer. The backend driver then speculatively prefetches 16 cache lines of message data with software prefetch instructions. When the backend driver detects that the tail pointer has been incremented, it processes all new messages.
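The handshake is easier to see in miniature. Here is a minimal Python model (my sketch, not the paper's code): the CXL pool and each host's cache are plain dicts, with flush and invalidate modeled as explicit copies rather than real cache-management instructions.

```python
# Minimal model of message passing over non-coherent shared memory.
# `shared` stands in for the CXL pool; each Host's private dict
# stands in for its (non-coherent) CPU cache.
shared = {"tail": 0, "msg0": None}      # the CXL memory pool

class Host:
    def __init__(self):
        self.cache = {}                 # private cache, NOT coherent

    def read(self, key):                # may return a stale value
        if key not in self.cache:
            self.cache[key] = shared[key]
        return self.cache[key]

    def flush(self, key):               # write cached line to the pool
        shared[key] = self.cache[key]

    def invalidate(self, key):          # drop the cached copy
        self.cache.pop(key, None)

frontend, backend = Host(), Host()
backend.read("tail")                    # backend polls: caches tail == 0

# Frontend: write the message, bump the tail, flush both to the pool.
frontend.cache["msg0"] = "descriptor-0"
frontend.cache["tail"] = 1
frontend.flush("msg0")
frontend.flush("tail")

# Backend: its cached tail (0) is stale, so polling alone would spin
# forever. Invalidating before re-reading picks up the new value.
assert backend.read("tail") == 0        # stale!
backend.invalidate("tail")
if backend.read("tail") == 1:
    msg = backend.read("msg0")
    backend.invalidate("msg0")          # so a future prefetch refetches
print(msg)                              # descriptor-0
```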
Hopefully there is more than one message, and the software prefetch instructions will overlap computation with the transfer from the shared memory pool.

After processing the message(s), the backend driver invalidates the memory where those messages are stored. This is critical, because it allows subsequent prefetch instructions to work: a prefetch instruction does nothing if the target cache line is already cached (even though it may be stale). The speculative 16-cache-line prefetch suffers from the same issue. Say 4 of the 16 prefetched lines held valid messages and 12 did not. Those 12 cache lines are now in the backend CPU cache, and future prefetch instructions targeting them will do nothing. To solve this problem, the backend driver also invalidates speculatively prefetched cache lines that did not contain any messages.

Send and Receive Flows

Fig. 7 illustrates the end-to-end packet transmit flow:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Here are the steps:

The network stack running in the VM/container on host A writes packet data into the packet buffer in CXL memory. Note that the network stack doesn't "know" that it is writing network packets to shared memory.
The frontend driver writes a message into the queue stored in shared CXL memory.
The frontend driver flushes the cache lines associated with the network packet data, the message, and the message queue tail pointer.
The backend driver polls the tail pointer for new messages in the queue (using the prefetching tricks described previously).
The backend driver uses DPDK to cause the NIC on host B to transmit the packet. Note that the CPU cores on host B never need to read the packet data; the NIC uses DMA to read it directly from the shared memory pool.

The steps to receive a packet are similar:

The NIC on host B writes the packet data (via DMA) into the shared memory pool.
The backend driver uses DPDK to learn that a new packet has arrived.
The backend driver writes a message into the message queue in shared memory.
The frontend driver polls the message queue (using the prefetch tricks).
The network stack running in the VM/container on host A reads the packet data from shared memory.

One trick used here is flow tagging. This is a DPDK feature that enables the NIC to determine which host a message is destined for, without the backend driver having to inspect network packet headers.

Fig. 8 shows measurements of the overhead added by Oasis. The solid lines are the baseline; the dotted lines are Oasis. Each color represents a different latency bucket. The baseline uses a NIC which is local to the host running the benchmark. The overhead is measurable, but not excessive.

Source: https://dl.acm.org/doi/pdf/10.1145/3731569.3764812

Dangling Pointers

The paper doesn't touch on the complexities related to network virtualization in a pooled device scheme. It seems to me that solving these problems wouldn't affect performance but would require significant engineering.
Using IPv6 with Cloudflare to run multiple services on a single server without a reverse proxy
I haven't done a full-system backup since back in the olden days before Dropbox and Git. Every machine I now own is treated as a stateless, disposable unit that can be stolen, lost, or corrupted without consequence. The combination of full-disk encryption and distributed copies of all important data means there's just no stress if anything bad happens to the computer.

But don't mistake this for just an "everything is in the cloud" argument. Yes, I use Dropbox and GitHub to hold all the data that I care about, but the beauty of these systems is that they work with local copies of that data, so with a couple of computers here and there, I always have a recent version of everything, in case either syncing service should go offline (or away!).

The trick to making this regime work is to stick with it. This is especially true for Dropbox. It's where everything of importance needs to go: documents, images, whatever. And it's instantly distributed to all the machines I run. Everything outside of Dropbox is treated as a temporary directory that's fully disposable.

It's from this principle that I built Omarchy too. Given that I already had a way to restore all data and code onto a new machine in no time at all, it seemed unreasonable that the configuration needed for a fully functional system still took hours on end. Now it's all encoded in an ISO setup that installs in two minutes on a fast computer.

Now it's true that this method relies on both multiple computers and a fast internet connection. If you're stuck on a rock in the middle of nowhere, and you somehow haven't discovered the glory of Starlink, maybe just stick to your old full-disk backup ways. But if you live in the modern world, there ought to be no reason why a busted computer means a calamity of data loss or a long restore process.
Somewhere along the way, we stopped talking about servers. The word felt clunky, industrial, too tied to physical reality. Instead, we started saying "the cloud". It sounds weightless, infinite, almost magical. Your photos live in the cloud. Your documents sync through the cloud. Your company's entire infrastructure runs in the cloud.

I hated the term cloud. I wasn't alone – someone actually created a "cloud to butt" browser extension that was pretty fun and popular. But the world has adopted the term, and I had no choice but to oblige.

So what is the actual cloud? Why is it hiding behind this abstraction? Well, the cloud is rows upon rows of industrial machines, stacked in massive data centers, consuming electricity at a scale most of us can't even imagine. The cloud isn't floating above us. It's bolted to concrete floors, surrounded by cooling systems, and plugged into power grids that strain under its appetite.

I'm old enough to remember the crypto boom and the backlash that followed. Critics loved to point out that Bitcoin mining consumed as much electricity as entire countries – Argentina, the Netherlands, and many other nations were picked for comparison. But I was not outraged by it at all. My reaction at the time was simpler: why does it matter if they pay their electric bill? If you use electricity and compensate for it, isn't that just... how markets work?

Turns out, I was missing the bigger picture. And the AI boom has made it impossible to ignore. When new data centers arrive in a region, everyone's electric bill goes up, even if your personal consumption stays exactly the same. It has nothing to do with fairness and free markets. Infrastructure is not free. The power grids weren't designed for the sudden addition of facilities that consume megawatts continuously. When demand surges beyond existing capacity, utilities pass those infrastructure costs onto everyone.
New power plants get built, transmission lines get upgraded, and residential customers help foot the bill through rate increases. The person who never touches AI, never mines crypto, never even knows what a data center does – this person is now subsidizing the infrastructure boom through their monthly utility payment. The cloud, it turns out, has a very terrestrial impact on your wallet.

We've abstracted computing into its purest conceptual form: "compute." I have to admit, it's my favorite term in tech. "Let's buy more compute." "We need to scale our compute." It sounds frictionless, almost mathematical. Like adjusting a variable in an equation. Compute feels like a slider you can move up and down in your favorite cloud provider's interface. Need more? Click a button. Need less? Drag it down. The interface is clean, the metaphor is seamless, and completely disconnected from the physical reality.

But in the real world, "buying more compute" means someone is installing physical hardware in a physical building. It means racks of servers being assembled, hard drives being mounted, cables being routed. The demand has become so intense that some data center employees have one job and one job only: installing racks of new hard drives, day in and day out. It's like an industrial assembly line. Every gigabyte of "cloud storage" occupies literal space. Every AI query runs on actual processors that generate actual heat. The abstraction is beautiful, but the reality is concrete and steel.

The cloud metaphor served its purpose. It helped us think about computing as a utility: always available, scalable, detached from the messy details of hardware management. But metaphors shape how we think, and this one has obscured too much for too long. Servers are coming out of their shells.
The foggy cloud is lifting, and we're starting to see the machinery underneath: vast data centers claiming real estate, consuming real water for cooling, and drawing real power from grids shared with homes, schools, and hospitals.

This isn't an argument against cloud computing or AI. There's nothing to go back to. But we need to acknowledge their physical footprint. The cloud isn't a magical thing in the sky. It's industry. And like all industry, it needs land, resources, and infrastructure that we all share.
Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds
Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu
SOSP'25

Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn't that what an OS is for?

My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to harden a large OS to the point that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and introduce the hypervisor instead.

This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first way involves widespread changes to legacy code; the other involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions.

The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS.
The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs.
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dynamic Scheduling Headaches

Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren’t many packets flowing. Section 3.2 says this is a no-go. It would be great if there were more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured such that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock and then synchronously communicate with the accelerator. The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done asynchronously, but that would likely have ramifications for the entire control-plane software stack. This quote is telling: Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification.
Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU); the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality a Linux kernel module (5,800 lines of code) allows the vCPU to run only at times of low data-plane activity.
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This lets the system hide the latency of the context switch from vCPU to pCPU. Think of it like a group of students in a classroom with no teacher present. The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees the teacher (a network packet) returning, they shout “act responsible”, and everyone returns to their schoolwork (running data-plane code).
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Results

Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server.
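The gating-plus-lookahead idea is easy to sketch as a toy discrete-time simulation. Everything below (names, the per-microsecond loop, the arrival times) is illustrative and not taken from the paper's kernel module; only the roughly 3 microsecond probe lead time comes from the text above.

```python
# Toy simulation of the vCPU gating trick: a borrowed core runs
# control-plane work only while the hardware probe sees no packets.
# The ~3us lookahead is from the paper; all other details are made up.

PROBE_LEAD_US = 3  # the accelerator sees a packet ~3us before the ARM core must act

def run_core(packet_arrivals_us, total_us):
    """Return a per-microsecond schedule: 'data' or 'control'."""
    arrivals = set(packet_arrivals_us)
    schedule = []
    for t in range(total_us):
        # The probe looks ahead: if a packet lands within the lead window,
        # preempt the vCPU now so the pCPU is ready when the packet arrives.
        imminent = any((t + d) in arrivals for d in range(PROBE_LEAD_US + 1))
        schedule.append("data" if imminent else "control")
    return schedule

sched = run_core(packet_arrivals_us=[5, 6, 12], total_us=16)
# Control-plane work runs only in gaps comfortably clear of packet arrivals;
# the lookout (probe) hides the cost of switching back to the data-plane.
```

The point of the lookahead is that the vCPU-to-pCPU context switch happens during the 3 microsecond head start, so the data-plane never observes the switch latency.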
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dangling Pointers

To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase are tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.
It happened again. This time it’s Cloudflare. The last time I wrote about a single board computer, it was AWS that went down on the same day. Today, I wrote about the LattePanda IOTA. I’ll let y’all know once I plan on writing about another single board computer; it seems to be bad for the internet.
At the time of writing, 12:43 UTC on Tue 18 Nov, Cloudflare has taken many sites down. I'm trying to browse the web, but about half of the sites show an error. Most of these sites are not even that big; I expect maybe a few thousand visitors per month. This demonstrates once again a simple fact: if you put your site behind a centralized service, then this service is a single point of failure. Even large, established companies make mistakes and can go down. Most people use Cloudflare because they have been scared into the idea that you need DDoS protecti...
As with my Anthropic exclusive from a few weeks ago, though this feels like a natural premium piece, I decided it was better to publish it on my free newsletter so that you could all enjoy it. If you liked or found this piece valuable, please subscribe to my premium newsletter — here’s $10 off the first year of an annual subscription. I have put out over a hundred thousand words of coverage in the last three months, most of which is on my premium, and I’d really appreciate your support. I also did an episode of my podcast Better Offline about this. Before publishing, I discussed the data with a Financial Times reporter. Microsoft and OpenAI both declined to comment to the FT. If you ever want to share something with me in confidence, my Signal is ezitron.76, and I’d love to hear from you. What I’ll describe today will be a little more direct than usual, because I believe the significance of the information requires me to be as specific as possible. Based on documents viewed by this publication, I am able to report OpenAI’s inference spend on Microsoft Azure, in addition to its payments to Microsoft as part of its 20% revenue share agreement, which was reported in October 2024 by The Information. In simpler terms, Microsoft receives 20% of OpenAI’s revenue. I do not have OpenAI’s training spend, nor do I have information on the entire extent of OpenAI’s revenues, as it appears that Microsoft shares some percentage of its revenue from Bing, as well as 20% of the revenue it receives from selling OpenAI’s models. According to The Verge: Nevertheless, I am going to report what I’ve been told. One small note — for the sake of clarity, every time I mention a year going forward, I’ll be referring to the calendar year, and not Microsoft’s financial year (which ends in June). The numbers in this post differ from those that have been reported publicly.
For example, previous reports had said that OpenAI had spent $2.5 billion on “cost of revenue” - which I believe represents OpenAI’s inference costs - in the first half of CY2025. According to the documents viewed by this newsletter, OpenAI spent $5.02 billion on inference alone with Microsoft Azure in the first half of calendar year 2025 (CY2025). This pattern has continued through the end of September: by that point in CY2025 — three months later — OpenAI had spent $8.67 billion on inference. OpenAI’s inference costs have risen consistently over the last 18 months, too. For example, OpenAI spent $3.76 billion on inference in CY2024, meaning that OpenAI has already more than doubled its inference costs in CY2025 through September. Based on its reported revenues of $3.7 billion in CY2024 and $4.3 billion in revenue for the first half of CY2025, it seems that OpenAI’s inference costs easily eclipsed its revenues. As mentioned previously, I am also able to shed light on OpenAI’s revenues, as these documents also reveal the amounts that Microsoft takes as part of its 20% revenue share with OpenAI. Concerningly, extrapolating OpenAI’s revenues from this revenue share does not produce numbers that match those previously reported. According to the documents, Microsoft received $493.8 million in revenue share payments in CY2024 from OpenAI — implying revenues for CY2024 of at least $2.469 billion, or around $1.23 billion less than the $3.7 billion that has been previously reported. Similarly, for the first half of CY2025, Microsoft received $454.7 million as part of its revenue share agreement, implying OpenAI’s revenues for that six-month period were at least $2.273 billion, or around $2 billion less than the $4.3 billion previously reported. Through September, Microsoft’s revenue share payments totalled $865.9 million, implying OpenAI’s revenues are at least $4.329 billion. According to Sam Altman, OpenAI’s revenue is “well more” than $13 billion.
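The implied-revenue figures here are simple arithmetic: the revenue-share payment divided by the 20% rate gives a floor on revenue. A quick check of the figures in this piece (values in millions of dollars):

```python
# Back out a floor on OpenAI revenue from Microsoft's 20% revenue share.
# Payment figures (in $M) are the ones reported in this piece.
REV_SHARE_RATE = 0.20

def implied_revenue(share_payment_m):
    # If Microsoft takes 20%, total revenue is at least payment / 0.20.
    return share_payment_m / REV_SHARE_RATE

cy2024 = implied_revenue(493.8)        # Microsoft's CY2024 take
h1_cy2025 = implied_revenue(454.7)     # first half of CY2025
thru_q3_2025 = implied_revenue(865.9)  # through September 2025

# cy2024 -> 2469.0 ($2.469B), h1_cy2025 -> 2273.5 ($2.273B),
# thru_q3_2025 -> 4329.5 ($4.329B), matching the figures above.
```

These are floors rather than totals, since revenue not covered by the 20% agreement (if any exists) would not show up in the payments.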
I am not sure how to reconcile that statement with the documents I have viewed. The following numbers are calendar years. I will add that, where I have them, I will include OpenAI’s leaked or reported revenues. In some cases the numbers match up; in others they do not. Though I do not know for certain, the only way to reconcile this would be some sort of creative means of measuring “annualized” or “recurring” revenue. I am confident in saying that I have read every single story about OpenAI’s revenue ever written, and at no point does OpenAI (or anyone reporting on it) explain how the company defines “annualized” or “annual recurring revenue.” I must be clear that the following is me speaking in generalities, and not about OpenAI specifically, but you can get really creative with annualized revenue or annual recurring revenue. You can use a 30-day window, or a 28-day window, and you can even choose a period of time that isn’t a calendar month — so, say, the best 30 days of your company’s existence across two different months. I have no idea how OpenAI defines this metric, and default to saying that “annualized” or “ARR” means the stated figure divided by 12. The Financial Times reported on February 9, 2024 that OpenAI’s revenues had “surpassed $2 billion on an annualised basis” in December 2023, working out to $166.6 million in a month. The Information reported on June 12, 2024 that OpenAI had “more than doubled its annualized revenue to $3.4 billion in the last six months or so,” working out to around $283 million in a month, likely referring to this period. On September 27, 2024, the New York Times reported that “OpenAI’s monthly revenue hit $300 million in August…and the company expects about $3.7 billion in annual sales [in 2024],” according to a financial professional’s review of documents.
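To illustrate how elastic "annualized" can be (a general illustration of the window-picking point above, not a claim about how OpenAI actually computes it), here is how far the same revenue stream can move depending on which window you multiply out:

```python
# How "annualized revenue" varies with the chosen window.
# Hypothetical daily revenue for a lumpy two-month stretch, in $M/day:
# month 1 averages $8M/day, month 2 averages $12M/day.
daily = [8.0] * 30 + [12.0] * 30

# Plain average over the whole period, scaled to a year:
avg_annualized = sum(daily) / len(daily) * 365

# The most flattering 30-day window (here, all of month 2), scaled up:
best_30 = max(sum(daily[i:i + 30]) for i in range(len(daily) - 29))
best_30_annualized = best_30 * (365 / 30)

# Same company, same two months: $3,650M vs $4,380M "annualized" --
# a 20% swing purely from the choice of measurement window.
```

Nothing here requires dishonesty; it just shows why an "annualized" headline and a calendar-year total extrapolated from actual payments can legitimately disagree.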
On June 9, 2025, an OpenAI spokesperson told CNBC that it had hit “$10 billion annual recurring revenue,” excluding licensing revenue from OpenAI’s 20% revenue share and “large, one-time deals.” $10bn annualized revenue works out to around $833 million in a month. These numbers are inclusive of OpenAI’s revenue share payments to Microsoft and OpenAI’s inference spend. There could also potentially be royalty payments made to OpenAI as part of its deal to receive 20% of Microsoft’s sales of OpenAI’s models, or other revenue related to its revenue share with Bing. Due to the sensitivity and significance of this information, I am taking a far more blunt approach with this piece. Based on the information in this piece, OpenAI’s costs and revenues are potentially dramatically different from what we believed. The Information reported in October 2024 that OpenAI’s revenue could be $4 billion, and inference costs $2 billion, based on documents “which include financial statements and forecasts,” and specifically added the following: I do not know how to reconcile this with what I am reporting today. In the first half of CY2024, based on the information in the documents, OpenAI’s inference costs were $1.295 billion, and its revenues at least $934 million. Indeed, it is tough to reconcile what I am reporting with much of what has been reported about OpenAI’s costs and revenues. OpenAI’s inference spend with Microsoft Azure between CY2024 and Q3 CY2025 was $12.43 billion. That is an astonishing figure, one that dramatically dwarfs any and all previous reporting, which, based on my analysis, suggested that OpenAI spent $2 billion on inference in 2024 and $2.5 billion through H1 CY2025. In other words, inference costs are nearly triple what has been reported elsewhere. Similarly, OpenAI’s extrapolated revenues are dramatically different from those reported.
While we do not have a final tally for 2024, the indicators presented in the documents contrast starkly with the reported predictions from that year. Both reports of OpenAI’s 2024 revenues (CNBC, The Information) are from the same year and are projections of potential final totals, though The Information’s story about OpenAI’s H1 CY2025 revenues said that “OpenAI generated $4.3 billion in revenue in the first half of 2025, about 16% more than it generated all of last year,” which would bring us to $3.612 billion in revenue, or $1.145 billion more than is implied by OpenAI’s revenue share payments to Microsoft. I do not have an answer for inference, other than that I believe OpenAI is spending far more money on inference than we were led to believe, and that the currently reported numbers do not resemble those in the documents. Based on these numbers, it appears that OpenAI may be the single most cash-intensive startup of all time, and that the cost of running large language models may not be something that can be supported by revenues. Even if revenues were to match those that had been reported, OpenAI’s inference spend on Azure consumes them, and appears to grow faster than revenue. I also cannot reconcile these numbers with the reporting that OpenAI will have a cash burn of $9 billion in CY2025. On inference alone, OpenAI has already spent $8.67 billion through Q3 CY2025. Similarly, I cannot see a path for OpenAI to hit its projected $13 billion in revenue by the end of 2025, nor can I see on what basis Mr. Altman could state that OpenAI will make “well more” than $13 billion this year. I cannot and will not speak to the financial health of OpenAI in this piece, but I will say this: these numbers are materially different from what has been reported, and the significance of OpenAI’s inference spend alone makes me wonder about the larger cost picture for generative AI.
If it costs this much to run inference for OpenAI, I believe it costs this much for any generative AI firm to run on OpenAI’s models. If it does not, OpenAI’s costs are dramatically higher than the prices it charges its customers, which makes me wonder whether price increases will be necessary to begin making more money, or at the very least to lose less. Similarly, if OpenAI’s costs are this high, it makes me wonder about the margins of any frontier model developer.

Q1 CY2024
Inference: $546.8 million
Microsoft Revenue Share: $77.3 million
Implied OpenAI Revenue: at least $386.5 million

Q2 CY2024
Inference: $748.3 million
Microsoft Revenue Share: $109.5 million
Implied OpenAI Revenue: at least $547.5 million

Q3 CY2024
Inference: $1.005 billion
Microsoft Revenue Share: $139.2 million
Implied OpenAI Revenue: at least $696 million

Q4 CY2024
Inference: $1.467 billion
Microsoft Revenue Share: $167.8 million
Implied OpenAI Revenue: at least $839 million

Total inference spend for CY2024: $3.767 billion
Total implied revenue for CY2024: at least $2.469 billion
Reported (projected) revenue for CY2024: $3.7 billion, per CNBC in September 2024. The Information also reported that expected revenue could be as high as $4 billion in a piece from October 2024.
Reported inference costs for CY2024: $2 billion, per The Information.

Q1 CY2025
Inference: $2.075 billion
Microsoft Revenue Share: $206.4 million
Implied OpenAI Revenue: at least $1.032 billion

Q2 CY2025
Inference: $2.947 billion
Microsoft Revenue Share: $248.3 million
Implied OpenAI Revenue: at least $1.2415 billion

H1 CY2025 Inference: $5.022 billion
H1 CY2025 Revenue: at least $2.273 billion
Reported H1 CY2025 Revenue: $4.3 billion (per The Information)
Reported H1 CY2025 “Cost of Revenue”: $2.5 billion (per The Information)

Q3 CY2025
Inference: $3.648 billion
Microsoft Revenue Share: $411.1 million
Implied OpenAI Revenue: at least $2.056 billion
I’ve been curious about how far you can push object storage as a foundation for database-like systems. In previous posts, I explored moving JSON data from PostgreSQL to Parquet on S3 and building MVCC-style tables with constant-time deletes using S3’s conditional writes. These experiments showed that decoupling storage from compute unlocks interesting trade-offs: lower costs and simpler operations in exchange for higher cold-query latency. Search engines traditionally don’t fit this model.
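For flavor, the conditional-write primitive those MVCC experiments lean on is essentially compare-and-swap on object creation. Here is a dict-backed mock of the semantics only (real S3 exposes this as an `If-None-Match: *` precondition on PUT; this sketch deliberately does not use boto3):

```python
# A dict-backed mock of S3's conditional-put semantics: "create only if
# the key does not already exist". This primitive lets writers coordinate
# through the object store itself, with no separate lock service.

class PreconditionFailed(Exception):
    pass

class FakeBucket:
    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, body):
        # Mirrors a PUT with `If-None-Match: *`: fails if the key exists.
        if key in self._objects:
            raise PreconditionFailed(key)
        self._objects[key] = body

bucket = FakeBucket()
bucket.put_if_absent("table/manifest-00001.json", b"{}")
won_race = True
try:
    # A second writer races for the same manifest name and loses.
    bucket.put_if_absent("table/manifest-00001.json", b"{...}")
except PreconditionFailed:
    won_race = False
```

Because exactly one writer can claim a given manifest name, an optimistic commit protocol falls out naturally: build your new table version, try to create `manifest-N+1`, and retry from fresh state if someone beat you to it.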
For the past five months I've been leading the effort to rebuild Zed's cloud infrastructure. Our current backend—known as Collab—has been chugging along since basically the beginning of the company. We use Collab every day to work together on Zed in Zed. However, as Zed continues to grow and attracts more users, we knew that we needed a full reboot of our backend infrastructure to set us up for success in our future endeavors. Enter Zed Cloud. Like Zed itself, Zed Cloud is built in Rust 1 . This time around there is a slight twist: all of this is running on Cloudflare Workers, with our Rust code being compiled down to WebAssembly (Wasm). One of our goals with this rebuild was to reduce the amount of operational effort it takes to maintain our hosted services, so that we can focus more of our time and energy on building Zed itself. Cloudflare Workers allow us to easily scale up to meet demand without having to fuss over it too much. Additionally, Cloudflare offers an ever-growing set of managed services that cover anything you might need for a production web service. Here are some of the Cloudflare services we're using today: Another one of our goals with this rebuild was to build a platform that was easy to test. To achieve this, we built our own platform framework on top of the Cloudflare Workers runtime APIs. At the heart of this framework is a trait that allows us to write our code in a platform-agnostic way while still leveraging all of the functionality that Cloudflare Workers has to offer. Each of its associated types corresponds to some aspect of the platform that we'll want to have control over in a test environment. For instance, a service that needs to interact with the system clock and a Workers KV store declares exactly those dependencies. There are two implementors of the trait. One—as the name might suggest—is an implementation of the platform on top of the Cloudflare Workers runtime.
This implementation targets Wasm and is what we run when developing locally (using Wrangler) and in production. We have a crate 2 that contains bindings to the Cloudflare Workers JS runtime; you can think of it as the glue between those bindings and the idiomatic Rust APIs exposed by the trait. The second implementor is used when running tests, and allows for simulating almost every part of the system in order to effectively test our code. As an example, consider a test for ingesting a webhook from Orb: in it we're able to test the full end-to-end flow, and the call that advances the test simulator in this case runs the pending queue consumers. At the center of the test platform is a crate that powers our in-house async runtime. The scheduler is shared between GPUI—Zed's UI framework—and the platform used in tests. This shared scheduler enables us to write tests that span the client and the server. So we can have a test that starts in a piece of Zed code, flows through Zed Cloud, and then asserts on the state of something in Zed after it receives the response from the backend. The work being done on Zed Cloud now is laying the foundation to support our future work around collaborative coding with DeltaDB. If you want to work with me on building out Zed Cloud, we are currently hiring for this role. We're looking for engineers with experience building and maintaining web APIs and platforms, solid web fundamentals, and who are excited about Rust. If you end up applying, you can mention this blog post in your application. I look forward to hearing from you! The codebase is currently 70k lines of Rust code and 5.7k lines of TypeScript. This is essentially our own version of an existing crate; I'd like to switch to using it directly at some point.
Cloudflare services we're using today:
- Hyperdrive for talking to Postgres
- Workers KV for ephemeral storage
- Cloudflare Queues for asynchronous job processing

The end-to-end webhook flow under test:
- Receiving and validating an incoming webhook event to our webhook ingestion endpoint
- Putting the webhook event into a queue
- Consuming the webhook event in a background worker and processing it
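The testing pattern described here (service code written against an abstract platform, with a deterministic simulated implementation swapped in for tests) translates to any language. A minimal Python sketch under assumed names; `Platform`, `TestPlatform`, and `record_webhook` are all hypothetical illustrations, not Zed's actual Rust API:

```python
# Minimal sketch of the "platform trait" testing pattern: services depend
# only on an abstract platform, and tests swap in deterministic fakes.
# Every name here is illustrative.
from typing import Optional, Protocol

class Platform(Protocol):
    def now(self) -> float: ...
    def kv_get(self, key: str) -> Optional[str]: ...
    def kv_put(self, key: str, value: str) -> None: ...

class TestPlatform:
    """Deterministic in-memory platform: a fake clock plus a fake KV store."""
    def __init__(self):
        self._time = 0.0
        self._kv = {}
    def now(self):
        return self._time
    def advance(self, seconds):
        # Tests control time explicitly instead of sleeping.
        self._time += seconds
    def kv_get(self, key):
        return self._kv.get(key)
    def kv_put(self, key, value):
        self._kv[key] = value

def record_webhook(platform: Platform, event_id: str):
    # Service code sees only the abstract platform, never the real runtime.
    platform.kv_put(f"webhook:{event_id}", f"seen at {platform.now()}")

p = TestPlatform()
p.advance(42.0)
record_webhook(p, "evt_123")
# The test can now assert on exactly what was stored, and when.
```

The payoff is the one the post describes: because the fake controls time and storage, a test can drive a whole client-to-server flow deterministically, with no network and no real clock.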
I just came across the book _Solar Secrets_ (2014) by Peter Lindemann. It observes that most solar panels are optimized to perform on bright sunny days and barely perform on cloudy days, even though there are panels that capture light at lower wavelengths and are therefore much less affected by clouds. The author shows this in the following figure from page 17: Here it can be seen that Amorphous Silicon (a-Si) solar panels produce energy across the 500 to 700 nm wavelength range while Crystalline...
The latest Thoughtworks TechRadar is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry, where you can read more about Thoughtworks’ usage and opinions on it.

- Databricks Assistant
- Apache Paimon
- Delta Sharing
- Naive API-to-MCP conversion
- Standalone data engineering teams
- Text to SQL
Nextcloud. I really want to like it, but it’s making it really difficult. I like what Nextcloud offers with its feature set and how easily it replaces a bunch of services under one roof (files, calendar, contacts, notes, to-do lists, photos, etc.), but no matter how hard I try and how much I optimize its resources on my home server, it feels slow to use, even on hardware ranging from decent to good. Then I opened developer tools and found the culprit. It’s the JavaScript. On a clean page load, you will be downloading about 15-20 MB of JavaScript, which does compress down to about 4-5 MB in transit, but that is still a huge amount of JavaScript. For context, I consider 1 MB of JavaScript to be on the heavy side for a web page/app. Yes, that JavaScript will be cached in the browser for a while, but you will still be executing all of it on each visit to your Nextcloud instance, and that will take a long time due to the sheer amount of code your browser now has to execute on the page. A significant contributor to this heft seems to be a bundle which, based on its name, provides common functionality shared across the different Nextcloud apps one can install. It’s coming in at 4.71 MB at the time of writing. Then you want notifications, right? Another bundle has you covered, at 1.06 MB. Then there are the app-specific views. The Calendar app is taking up 5.94 MB to show a basic calendar view. The Files app includes a bunch of individual scripts weighing in at 1.77 MB, 1.17 MB, 1.09 MB, and 0.9 MB (the last for a feature I’ve never used!), plus many smaller ones. The Notes app with its basic bare-bones editor? 4.36 MB! This means that even on an iPhone 13 mini, opening the Tasks app (to-do list) will take a ridiculously long time. Imagine opening your shopping list at the store and having to wait 5-10 seconds before you see anything, even with a solid 5G connection. Sounds extremely annoying, right?
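If you want to reproduce this kind of measurement on your own instance, DevTools' "Save all as HAR" export makes totalling script weight straightforward. A small sketch; the field names follow the HAR 1.2 format, and the sample record is made up:

```python
# Sum JavaScript payload from a DevTools HAR export (HAR 1.2 field names).
# bodySize ~ bytes on the wire (usually compressed);
# content.size ~ decoded bytes the browser must parse and execute.
import json

def js_weight(har):
    wire = decoded = 0
    for entry in har["log"]["entries"]:
        resp = entry["response"]
        if "javascript" in resp["content"].get("mimeType", ""):
            # bodySize can be -1 when unknown; clamp to zero.
            wire += max(resp.get("bodySize", 0), 0)
            decoded += max(resp["content"].get("size", 0), 0)
    return wire, decoded

# Tiny made-up sample; for real use: har = json.load(open("nextcloud.har"))
sample = {"log": {"entries": [
    {"response": {"bodySize": 1_000,
                  "content": {"mimeType": "application/javascript", "size": 4_000}}},
    {"response": {"bodySize": 300,
                  "content": {"mimeType": "text/css", "size": 900}}},
]}}
wire, decoded = js_weight(sample)  # only the JS entry is counted
```

Tracking both numbers matters for exactly the reason above: transfer size is what your connection pays for, but decoded size is what your phone's CPU pays for on every visit, cached or not.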
I suspect that a lot of this is due to how Nextcloud is architected. There are bound to be some hefty common libraries and tools that allow app developers to provide a unified experience, but even then there is something seriously wrong with the end result: the functionality-to-bundle-size ratio is way off. As a result, I’ve started branching out some things from Nextcloud, such as replacing the Tasks app with a private Vikunja instance, and Photos with a private Immich instance. Vikunja is not perfect, but its 1.5 MB of JavaScript is an order of magnitude smaller than Nextcloud’s, making it feel incredibly fast in comparison. However, with other functionality I have to admit that the convenience of Nextcloud is enough to dissuade me from replacing it elsewhere, since its feature set compares well to the alternatives. I’m sure that there are legitimate reasons behind the current state, and overworked development teams and volunteers are unfortunately the norm in the industry, but that doesn’t take away from the fact that user experience and accessibility suffer as a result. I’d like to thank Alex Russell for writing about web performance and why it matters, with supporting evidence and actionable advice; it has changed how I view websites and web apps and has pushed me to be better in my own work. I highly suggest reading his content, starting with the performance inequality gap series. It’s educational, insightful, and incredibly irritating once you learn how crap most things are and how careless a lot of development teams are towards performance and accessibility.
This blog post explains how to effectively file abuse reports with cloud providers to stop malicious traffic. Key points:

Two IP types: residential (largely pointless to report) vs. commercial (worth a targeted report)

Why cloud providers: abusive cloud customers violate their provider's terms, making abuse reports actionable

Effective abuse reports should include:
- Time of the abusive requests
- IP/User-Agent identifiers
- robots.txt status
- A description of the impact on your systems
- Context about your service

Process:
- Use whois to find abuse contacts (look for "abuse-c" or "abuse-mailbox")
- Send detailed reports to all listed emails
- Expect a response within 2 business days

Note on "free VPNs": they often sell your bandwidth into botnets, and are not true public infrastructure.

The goal is to make scraping the cloud provider's problem, forcing them to address violations of their terms of service.
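The whois step is easy to script: run `whois <ip>` and pull out the contact fields the post mentions. A sketch; the sample record below is made up, and note that `abuse-c` is a handle that needs a follow-up lookup, while `abuse-mailbox` and ARIN's `OrgAbuseEmail` are direct addresses:

```python
# Extract abuse-report addresses from raw `whois` output by scanning for
# the direct-email fields (abuse-mailbox in RIPE-style records,
# OrgAbuseEmail in ARIN-style records).
import re

def abuse_contacts(whois_text):
    emails = set()
    for line in whois_text.splitlines():
        if re.match(r"(?i)\s*(abuse-mailbox|OrgAbuseEmail)\s*:", line):
            emails.update(re.findall(r"[\w.+-]+@[\w.-]+\.\w+", line))
    return sorted(emails)

# Made-up sample record for illustration (feed in real `whois` output):
sample = """\
inetnum:        203.0.113.0 - 203.0.113.255
abuse-c:        EX123-RIPE
abuse-mailbox:  abuse@example-cloud.net
OrgAbuseEmail:  abuse-reports@example-cloud.net
"""
contacts = abuse_contacts(sample)
```

Then send your report, with the timestamps, identifiers, and impact description listed above, to every address found.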
Looking at actual token demand growth, infrastructure utilization, and capacity constraints, the economics don't match the 2000s playbook the way people assume.
“The Kafka community is currently seeing an unprecedented situation with three KIPs (KIP-1150, KIP-1176, KIP-1183) simultaneously addressing the same challenge of high replication costs when running Kafka across multiple cloud availability zones.” — Luke Chen, The Path Forward for Saving Cross-AZ Replication Costs KIPs

At the time of writing, the Kafka project finds itself at a fork in the road, where choosing the right path for implementing S3 topics has implications for the long-term success of the project. Not just the next couple of years, but the next decade. Open-source projects live and die by these big decisions, and as a community we need to make sure we take the right one. This post explains the competing KIPs, but goes further and asks bigger questions about the future direction of Kafka. Before comparing proposals, we should step back and ask what kind of system we want Kafka to become. Kafka now faces two almost opposing forces. One force is stabilizing: the on-prem deployments and the low-latency workloads that depend on local disks and replication. Kafka must continue to serve those use cases. The other force is disrupting: the elastic, cloud-native workloads that favor stateless compute and shared object storage. Relaxed-latency workloads such as analytics have seen a shift in system design, with durability increasingly delegated to shared object storage, freeing the compute layer to be stateless, elastic, and disposable. Many systems now scale by adding stateless workers rather than rebalancing stateful nodes. In a stateless compute design, the bottleneck shifts from data replication to metadata coordination. Once durability moves to shared storage, sequencing and metadata consistency become the new limits of scalability. That brings us to the current moment, with three competing KIPs defining how to integrate object storage directly into Kafka.
While we evaluate these KIPs, it’s important to consider the motivations for building direct-to-S3 topics. Cross-AZ charges are typically what’s on people’s minds, but it’s a mistake to think of S3 simply as a cheaper disk or a networking cheat. The shift is also architectural, giving us an opportunity to achieve operational benefits such as elastic stateless compute. The devil is in the details: how each KIP enables Kafka to leverage object storage while also retaining Kafka’s soul and what made it successful in the first place. With that in mind, while three KIPs have been submitted, it comes down to two different paths:

Revolutionary: Choose a direct-to-S3 topic design that maximizes the benefits of an object-storage architecture, with greater elasticity and lower operational complexity. However, in doing so, we may increase the implementation cost and possibly the long-term maintenance burden by keeping two very different topic models in the same project (leader-based replication and direct-to-S3).

Evolutionary: Shoot for an evolutionary design that makes use of existing components to reduce the need for large refactoring or duplication of logic. However, by coupling to the existing architecture, we forfeit the extra benefits of object storage, focusing primarily on networking cost savings (in AWS and GCP). Through this coupling, we also run the risk of achieving the opposite: harder-to-maintain code, by bending and contorting a second workload into an architecture optimized for something else.

In this post I will explain the two paths in this forked road, how the various KIPs map onto them, and invite the whole community to think through what they want for Apache Kafka over the next decade. Note that I do not include KIP-1183, as it looks dead in the water and is not a serious contender. The KIP proposes AutoMQ’s storage abstractions without the accompanying implementation.
This, perhaps cynically, seems to benefit AutoMQ were it ever adopted, leaving the community to rewrite the entire storage subsystem again. If you want a quick summary of the three KIPs (including KIP-1183), you can read Luke Chen’s The Path Forward for Saving Cross-AZ Replication Costs KIPs or Anton Borisov’s summary of the three KIPs. This post is structured as follows:

- The term “diskless” vs “direct-to-S3”
- The Common Parts: approaches shared across multiple implementations and proposals
- Revolutionary: KIPs and real-world implementations
- Evolutionary: Slack’s KIP-1176
- The Hybrid: balancing revolution with evolution
- Deciding Kafka’s future

I used the term “diskless” in the title as that is the current hype word. But it is clear that not all designs are actually diskless in the same spirit as “serverless”. Serverless implies that users no longer need to consider or manage servers at all, not that there are no servers. In the world of open source, where you run stuff yourself, diskless would have to mean literally “no disks”, else you will be configuring disks as part of your deployment. But all the KIPs (in their current state) depend on disks to some extent, even KIP-1150, which was proposed as diskless. In most cases, disk behavior continues to influence performance, and therefore correct disk provisioning will be important. So I’m not a fan of “diskless”; I prefer “direct-to-S3”, which encompasses all designs that treat S3 (and other object stores) as the only source of durability. The main commonality between all direct-to-S3 Kafka implementations and design proposals is the uploading of objects that combine the data of multiple topics. The reasons are two-fold. First, avoiding the small-file problem: most designs are leaderless for producer traffic, allowing any server to receive writes to any topic. To avoid uploading a multitude of tiny files, servers accumulate batches in a buffer until ready for upload.
Before upload, the buffer is sorted by topic id and partition, to make compaction and some reads more efficient by ensuring that data of the same topic and same partition are in contiguous byte ranges. Pricing . The pricing of many (but not all) cloud object storage services penalize excessive requests, so it can be cheaper to roll-up whatever data has been received in the last X milliseconds and upload it with a single request. In the leader-based model, the leader determines the order of batches in a topic partition. But in the leaderless model, multiple brokers could be simultaneously receiving produce batches of the same topic partition, so how do we order those batches? We need a way of establishing a single order for each partition and we typically use the word “sequencing” to describe that process. Usually there is a central component that does the sequencing and metadata storage, but some designs manage the sequencing in other ways. WarpStream was the first to demonstrate that you could hack the metadata step of initiating a producer to provide it with broker information that would align the producer with a zone-local broker for the topics it is interested in. The Kafka client is leader-oriented, so we just pass it a zone-local broker and tell the client “this is the leader”. This is how all the leaderless designs ensure producers write to zone-local brokers. It’s not pretty, and we should make a future KIP to avoid the need for this kind of hack. Consumer zone-alignment heavily depends on the particular design, but two broad approaches exist: Leaderless: The same way that producer alignment works via metadata manipulation or using KIP-392 (fetch from follower) which can be used in a leaderless context. Leader-based: Zone-aware consumer group assignment as detailed in KIP-881: Rack-aware Partition Assignment for Kafka Consumers . 
The idea is to use consumer-to-partition assignment to ensure consumers are only assigned zone-local partitions (where the partition leader is located). KIP-392 (fetch-from-follower) , which is effective for designs that have followers (which isn’t always the case). Given almost all designs upload combined objects, we need a way to make those mixed objects more read optimized. This is typically done through compaction, where combined objects are ultimately separated into per-topic or even per-partition objects. Compaction could be one-shot or go through multiple rounds. The “revolutionary” path draws a new boundary inside Kafka by separating what can be stateless from what must remain stateful. Direct-to-S3 traffic is handled by a lightweight, elastic layer of brokers that simply serve producers and consumers. The direct-to-S3 coordination (sequencing/metadata) is incorporated into the stateful side of regular brokers where coordinators, classic topics and KRaft live. I cover three designs in the “revolutionary” section: WarpStream (as a reference, a kind of yardstick to compare against) KIP-1150 revision 1 Aiven Inkless (a Kafka-fork) Before we look at the KIPs that describe possible futures for Apache Kafka, let’s look at a system that was designed from scratch with both cross-AZ cost savings and elasticity (from object storage) as its core design principles. WarpStream was unconstrained by an existing stateful architecture, and with this freedom, it divided itself into: Leaderless, stateless and diskless agents that handle Kafka clients, as well as compaction/cleaning work. Coordination layer : A central metadata store for sequencing, metadata storage and housekeeping coordination. Fig WS-A. The WarpStream stateless/stateful split architecture. As per the Common Parts section, the zone alignment and sorted combined object upload strategies are employed. Fig WS-B. The WarpStream write path. 
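The combined-object upload strategy described above can be sketched in a few lines of Python. This is a toy model with illustrative names (not WarpStream code or code from any KIP): buffered batches from many partitions are sorted by (topic id, partition) and concatenated so that each partition occupies one contiguous byte range, alongside an index that would be committed to the metadata store after upload.

```python
# Illustrative sketch of a combined-object (shared segment) upload buffer.
# Sorting by (topic_id, partition) ensures each partition's data is one
# contiguous byte range, which makes compaction and reads cheaper.

def build_combined_object(buffered_batches):
    """buffered_batches: list of (topic_id, partition, payload_bytes)."""
    # Sort so batches of the same topic/partition become adjacent.
    ordered = sorted(buffered_batches, key=lambda b: (b[0], b[1]))
    payload = bytearray()
    index = {}  # (topic_id, partition) -> [start_offset, length]
    for topic_id, partition, data in ordered:
        key = (topic_id, partition)
        if key not in index:
            index[key] = [len(payload), 0]
        index[key][1] += len(data)
        payload.extend(data)
    # The index (the "batch coordinates") is what gets committed to the
    # metadata store after the object is uploaded to S3.
    return bytes(payload), {k: tuple(v) for k, v in index.items()}
```

Uploading one such object per X milliseconds, rather than one object per batch, is what avoids both the small file problem and per-request pricing penalties.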
On the consumer side, which again is leaderless and zone-local, a per-zone shared cache is implemented (which WarpStream dubbed distributed mmap). Within a zone, this shared cache assigns each agent a portion of the partition-space. When a consumer fetch arrives, an agent will download the object byte range itself if it is responsible for that partition, else it will ask the responsible agent to do that on its behalf. That way, we ensure that multiple agents per zone are not independently downloading the same data, thus reducing S3 costs. Fig WS-C. Shared per-zone read cache to reduce S3 throughput and request costs. WarpStream implements agent roles (proxy, job) and agent groups to separate the work of handling producer, consumer traffic from background jobs such as compaction, allowing for independent scaling for each workload. The proxy role (producer/consumer Kafka client traffic) can be further divided into proxy-producer and proxy-consumer. Agents can be deployed as dedicated agent groups, which allows for further separation of workloads. This is useful for avoiding noisy neighbor issues, running different groups in different VPCs and scaling different workloads that hit the same topics. For example, you could use one proxy group for microservices, and a separate proxy-consumer group for an analytics workload. Fig WS-D. Agent roles and groups can be used for shaping traffic, independent scaling and deploying different workloads into separate VPCs. Being a pure Direct-to-S3 system allowed WarpStream to choose a design that clearly separates traffic serving work into stateless agents and the coordination logic into one central metadata service. The traffic serving layer is highly elastic, with relatively simple agents that require no disks at all. Different workloads benefit from independent and flexible scaling via the agent roles and groups. 
The point of contention is the metadata service, which needs to be carefully managed and scaled to handle the read/write metadata volume of the stateless agents. Confluent Freight Clusters follow a largely similar design principle of splitting stateless brokers from central coordination. I will write about the Freight design sometime soon in the future. Apache Kafka is a stateful distributed system, but next we’ll see how KIP-1150 could fulfill much of the same capabilities as WarpStream. KIP-1150 has continued to evolve since it was first proposed, undergoing two subsequent major revisions between the KIP itself and mailing list discussion . This section describes the first version of the KIP, created in April 2025. KIP-1150 revision 1 uses a leaderless architecture where any broker can serve Kafka producers and consumers of any topic partition. Batch Coordinators (replicated state machines like Group and Transaction Coordinators), handle coordination (object sequencing and metadata storage). Brokers accumulate Kafka batches and upload shared objects to S3 (known as Shared Log Segment Objects, or SLSO). Metadata that maps these blocks of Kafka batches to SLSOs (known as Batch Coordinates) is then sent to a Batch Coordinator (BC) that sequences the batches (assigning offsets) to provide a global order of those batches per partition. The BC acts as sequencer and batch coordinate database for later lookups, allowing for the read path to retrieve batches from SLSOs. The BCs will also apply idempotent and transactional producer logic (not yet finalized). Fig KIP-1150-rev1-A. Leaderless brokers upload shared objects, then commit and sequence them via the Batch Coordinator. The Batch Coordinator (BC) is the stateful component akin to WarpStream’s metadata service. Kafka has other coordinators such as the Group Coordinator (for the consumer group protocol), the Transaction Coordinator (for reliable 2PC) and Share Coordinator (for queues aka share groups). 
The coordinator concept, for centralizing some kind of coordination, has a long history in Kafka. The BC is the source of truth about uploaded objects, Direct-to-S3 topic partitions, and committed batches. This is the component that contains the most complexity, with challenges around scaling, failovers, reliability as well as logic for idempotency, transactions, object compaction coordination and so on.

A Batch Coordinator has the following roles:
Sequencing. Chooses the total ordering for writes, assigning offsets without gaps or duplicates.
Metadata storage. Stores all metadata that maps partition offset ranges to S3 object byte ranges.
Serving lookup requests. Serving requests for log offsets and for batch coordinates (S3 object metadata).
Partition CRUD operations. Serving requests for atomic operations (creating partitions, deleting topics, records, etc.)
Data expiration. Managing data expiry and soft deletion, and coordinating physical object deletion (performed by brokers).

The Group and Transaction Coordinators use internal Kafka topics to durably store their state and rely on the KRaft Controller for leader election (coordinators are highly available with failovers). This KIP does not specify whether the Batch Coordinator will be backed by a topic, or use some other option such as a Raft state machine (based on KRaft state machine code). It also proposes that it could be pluggable. For my part, I would prefer all future coordinators to be KRaft state machines, as the implementation is rock solid and can be used like a library to build arbitrary state machines inside Kafka brokers. When a producer starts, it sends a Metadata request to learn which brokers are the leaders of each topic partition it cares about. The Metadata response contains a zone-local broker. The producer sends all Produce requests to this broker. The receiving broker accumulates batches of all partitions and uploads a shared log segment object (SLSO) to S3.
The broker commits the SLSO by sending the metadata of the SLSO (the metadata is known as the batch coordinates) to the Batch Coordinator. The coordinates include the S3 object metadata and the byte ranges of each partition in the object. The Batch Coordinator assigns offsets to the written batches, persists the batch coordinates, and responds to the broker. The broker sends all associated acknowledgements to the producers. Brokers can parallelize the uploading of SLSOs to S3, but commit them serially to the Batch Coordinator. When batches are first written, a broker does not assign offsets as would happen with a regular topic, as this is done after the data is written, when it is sequenced and indexed by the Batch Coordinator. In other words, the batches stored in S3 have no offsets stored within the payloads, as is normally the case. On consumption, the broker must inject the offsets into the Kafka batches, using metadata from the Batch Coordinator. The consumer sends a Fetch request to the broker. The broker checks its cache and, on a miss, queries the Batch Coordinator for the relevant batch coordinates. The broker downloads the data from object storage. The broker injects the computed offsets and timestamps into the batches (the offsets being part of the batch coordinates). The broker constructs and sends the Fetch response to the consumer. Fig KIP-1150-rev1-B. The consume path of KIP-1150 (ignoring any caching logic here). The KIP notes that broker roles could be supported in the future. I believe that KIP-1150 revision 1 becomes really powerful with roles akin to WarpStream’s. That way we can separate out direct-to-S3 topic serving traffic and object compaction work onto proxy brokers, which become the elastic serving and background job layer. Batch Coordinators would remain hosted on standard brokers, which are already stateful.
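The Batch Coordinator’s sequencing role can be sketched in a few lines of Python. This is a toy model with illustrative names, not code from the KIP: commits arrive with per-partition record counts, and the coordinator assigns the next contiguous offsets and stores the coordinates for later lookups.

```python
# Toy model of the Batch Coordinator's sequencing role (KIP-1150 rev 1):
# commits are processed serially; each partition gets gap-free, duplicate-free
# offsets assigned after the data is already in S3.

class BatchCoordinatorSketch:
    def __init__(self):
        self.next_offset = {}   # partition -> next offset to assign
        self.coordinates = []   # committed (partition, base_offset, count, object_id, byte_range)

    def commit(self, object_id, batch_coords):
        """batch_coords: list of (partition, record_count, byte_range)."""
        assigned = []
        for partition, count, byte_range in batch_coords:
            base = self.next_offset.get(partition, 0)
            self.next_offset[partition] = base + count  # total order per partition
            entry = (partition, base, count, object_id, byte_range)
            self.coordinates.append(entry)  # stored for later consumer lookups
            assigned.append(entry)
        return assigned  # brokers use these offsets to acknowledge producers
```

Because offsets are assigned here rather than by the broker, the batches in S3 carry no offsets, which is why the read path must inject them before serving fetches.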
With broker roles, we can see how Kafka could implement its own WarpStream-like architecture, which makes full use of disaggregated storage to enable better elasticity. Fig KIP-1150-rev1-C. Broker roles would bring the stateless/stateful separation that would unlock elasticity in the high throughput workload of many Direct-to-S3 deployments. Things like supporting idempotency, transactions, caching and object compaction are left to be decided later. While taxing to design, these things look doable within the basic framework of the design. But as I already mentioned in this post, this will be costly in effort to develop but also may come with a long term code maintenance overhead if complex parts such as transactions are maintained twice. It may also be possible to refactor rather than do wholesale rewrites. Inkless is Aiven’s direct-to-S3 fork of Apache Kafka. The Inkless design shares the combined object upload part and metadata manipulation that all designs across the board are using. It is also leaderless for direct-to-S3 topics. Inkless is firmly in the revolutionary camp. If it were to implement broker roles, it would make this Kafka-fork much closer to a WarpStream-like implementation (albeit with some issues concerning the coordination component as we’ll see further down). While Inkless is described as a KIP-1150 implementation, we’ll see that it actually diverges significantly from KIP-1150, especially the later revisions (covered later). Inkless eschewed the Batch Coordinator of KIP-1150 in favor of a Postgres instance, with coordination being executed through Table Valued Functions (TVF) and row locking where needed. Fig Inkless-A. The write path. On the read side, each broker must discover what batches exist by querying Postgres (using a TVF again), which returns the next set of batch coordinates as well as the high watermark. Now the broker knows where the next batches are located, it requests those batches via a read-through cache. 
On a cache miss, it fetches the byte ranges from the relevant objects in S3. Fig Inkless-B. The read path. Inkless bears some resemblance to KIP-1150 revision 1, except the difficult coordination bits are delegated to Postgres. Postgres does all the sequencing and metadata storage, as well as coordination for compaction and file cleanup. For example, compaction is coordinated via Postgres, with a TVF, periodically called by each broker, that finds a set of files which together exceed a size threshold and places them into a merge work order (tables file_merge_work_items and file_merge_work_item_files) that the broker claims. Once carried out, the original files are marked for deletion (which is another job that can be claimed by a broker). Fig Inkless-C. Direct-to-S3 traffic uses a leaderless broker architecture with coordination owned by Postgres. Inkless doesn’t implement transactions, and I don’t think Postgres could take on the role of transaction coordinator, as the coordinator does more than sequencing and storage. Inkless will likely have to implement some kind of coordinator for that.

The Postgres data model is based on the following tables:
logs. Stores the Log Start Offset and High Watermark of each topic partition, with the primary key of topic_id and partition.
files. Lists all the objects that host the topic partition data.
batches. Maps Kafka batches to byte ranges in files.
producer_state. All the producer state needed for idempotency.
Some other tables for housekeeping, and the merge work items.

Fig Inkless-D. Postgres data model. The Commit File TVF, which sequences and stores the batch coordinates, works as follows: A broker opens a transaction and submits a table as an argument, containing the batch coordinates of the multiple topics uploaded in the combined file.
The TVF logic creates a temporary table (logs_tmp) and fills it via a SELECT on the logs table, with the FOR UPDATE clause, which obtains a row lock on each topic partition row in the logs table that matches the list of partitions being submitted. This ensures that other brokers competing to add batches to the same partition(s) queue up behind this transaction. This is a critical barrier that avoids inconsistency. These locks are held until the transaction commits or aborts. Next, inside a loop, partition by partition, the TVF: updates the producer state; updates the high watermark of the partition (a row in the logs table); and inserts the batch coordinates into the batches table (sequencing and storing them). Finally, it commits the transaction. Apache Kafka would not accept a Postgres dependency of course, and KIP-1150 has not proposed centralizing coordination in Postgres either. But the KIP has suggested that the Batch Coordinator be pluggable, which might leave it open for using Postgres as a backing implementation. As a former database performance specialist, I find the Postgres locking a little concerning. It blocks on the logs table rows scoped to the topic id and partition. An ORDER BY prevents deadlocks, but because the row locks are held until the transaction commits, I imagine that with enough contention it could cause a convoy effect of blocking. This blocking is fair, that is to say, First Come First Served (FCFS) for each individual row. For example, with 3 transactions: T1 locks rows 11-15; T2 wants to lock 6-11, but only manages 6-10 as it blocks on row 11. Meanwhile T3 wants to lock 1-6, but only manages 1-5 as it blocks on row 6. We now have a dependency chain where T1 blocks T2 and T2 blocks T3. Once T1 commits, the others get unblocked, but under sustained load this kind of locking and blocking can quickly cascade, such that once contention starts, it rapidly expands.
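To make the convoy dynamics concrete, here is a toy Python model (illustrative only, not Inkless code) of the fair row-lock queuing: each transaction acquires rows in sorted order (the ORDER BY that prevents deadlocks) and blocks at the first row already held by an earlier transaction, while keeping the rows it has already acquired.

```python
# Toy model of FOR UPDATE row-lock queuing. Transactions arrive in order,
# lock rows in sorted order, and block at the first contended row. Rows
# already acquired by a blocked transaction stay held, which is what lets
# blocking chain from one transaction to the next.

def blocking_chain(transactions):
    """transactions: list of (txn_name, row_iterable) in arrival order.
    Returns {blocked_txn: txn_it_waits_on}."""
    held = {}       # row -> txn currently holding its lock
    waits_on = {}
    for name, rows in transactions:
        for row in sorted(rows):
            if row in held:
                waits_on[name] = held[row]  # queue behind the holder (FCFS)
                break                       # stop acquiring further rows
            held[row] = name
    return waits_on
```

Running it on the example from the text (T1 holds 11-15, T2 wants 6-11, T3 wants 1-6) reproduces the T1 → T2 → T3 dependency chain.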
This contention is sensitive to the number of concurrent transactions and the number of partitions per commit. A common pattern with this kind of locking is that up until a certain transaction throughput everything is fine, but at the first hint of contention, the whole thing slows to a crawl. Contention breeds more contention. I would therefore caution against the use of Postgres as a Batch Coordinator implementation. The following is a very high-level look at Slack’s KIP-1176 , in the interests of keeping this post from getting too detailed. There are three key points to this KIP’s design: Maintain leader-based topic partitions (producers continue to write to leaders), but replace Kafka replication protocol with a per-broker S3-based write-ahead-log (WAL). Try to preserve existing partition replica code for idempotency and transactions. Reuse existing tiered storage for long-term S3 data management. Fig Slack-KIP-1176-A. Leader-based architecture retained, replication replaced by an S3 WAL. Tiered storage manages long-term data. The basic idea is to preserve the leader-based architecture of Kafka, with each leader replica continuing to write to an active local log segment file, which it rotates periodically. A per-broker write-ahead-log (WAL) replaces replication. A WAL Combiner component in the Kafka broker progressively (and aggressively) tiers portions of the local active log segment files (without closing them), combining them into multi-topic objects uploaded to S3. Once a Kafka batch has been written to the WAL, the broker can send an acknowledgment to its producer. This active log segment tiering does not change how log segments are rotated. Once an active log segment is rotated out (by closing it and creating a new active log segment file), it can be tiered by the existing tiered storage component, for the long-term. Fig Slack-KIP-1176-B. 
Produce batches are written to the page cache as usual, but active log segment files are aggressively tiered to S3 (possibly Express One Zone) in combined log segment files. The WAL acts as write-optimized S3 storage and the existing tiered storage uploads closed log segment files for long-term storage. Once all data of a given WAL object has been tiered, it can be deleted. The WAL only becomes necessary during topic partition leader failovers, where the new leader replica bootstraps itself from the WAL. Alternatively, each topic partition can have one or more followers which actively reconstruct local log segments from the WAL, providing a faster failover. The general principle is to keep as much of Kafka unchanged as possible, only changing from the Kafka replication protocol to an S3 per-broker WAL. The priority is to avoid the need for heavy rework or reimplementation of logic such as idempotency, transactions and share groups integration. But it gives up elasticity and the additional architectural benefits that come from building on disaggregated storage. Having said all of the above, there are a lot of missing or hacky details that currently detract from the evolutionary goal. There is a lot of hand-waving when it comes to correctness too. It is not clear that this KIP will be able to deliver a low-disruption evolutionary design that is also correct, highly available and durable. Discussion in the mailing list is ongoing. Luke Chen remarked: “the current availability story is weak… It’s not clear if the effort is still small once details on correctness, cost, cleanness are figured out”, and I have to agree. The second revision of KIP-1150 replaces the future object compaction logic by delegating long-term storage management to the existing tiered storage abstraction (like Slack’s KIP-1176). The idea is to: Remove Batch Coordinators from the read path.
Avoid separate object compaction logic by delegating long-term storage management to tiered storage (which already exists). Rebuild per-partition log segments from combined objects in order to: submit them for long-term tiering (which works as a form of object compaction too), and serve consumer fetch requests.

The entire write path becomes a three-stage process:
Stage 1 – Produce path, synchronous. Uploads multi-topic WAL Segments to S3 and sequences the batches, acknowledging to producers once committed. This is unchanged from revision 1 except SLSOs are now called WAL Segments.
Stage 2 – Per-partition log segment file construction, asynchronous. Each broker is assigned a subset of topic partitions. The brokers download WAL segment byte ranges that host these assigned partitions and append to on-disk per-partition log segment files.
Stage 3 – Tiered storage, asynchronous. The tiered storage architecture tiers the locally cached topic partition log segment files as normal.

Stage 1 – The produce path
The produce path is the same as revision 1, but SLSOs are now called WAL Segments. Fig KIP-1150-rev2-A. Write path stage 1 (the synchronous part). Leaderless brokers upload multi-topic WAL Segments, then commit and sequence them via the Batch Coordinator.

Stage 2 – Local per-partition segment caching
The second stage is preparation for both stage 3 (segment tiering via tiered storage) and the read pathway for tailing consumers. Fig KIP-1150-rev2-B. Stage 2 of the write path, where assigned brokers download WAL segments and append to local log segments. Each broker is assigned a subset of topic partitions and polls the BC to learn of new WAL segments. Each WAL segment that hosts any of the broker’s assigned topic partitions will be downloaded (at least the byte range of its assigned partitions). Once the download completes, the broker will inject record offsets as determined by the batch coordinator, and append the finalized batch to a local (per topic partition) log segment on-disk.
At this point, a log segment file looks like a classic topic partition segment file. The difference is that it is not a source of durability, only a source for tiering and consumption. WAL segments remain in object storage until all batches of a segment have been tiered via tiered storage. Then WAL segments can be deleted.

Stage 3 – Tiered storage
Tiered storage continues to work as it does today (KIP-405), based on local log segments. Ideally it knows nothing of the Direct-to-S3 components and logic. Tiered segment metadata is stored in KRaft, which also allows WAL segment deletion to be handled outside the scope of tiered storage. Fig KIP-1150-rev2-C. Tiered storage works as-is, based on local log segment files. Data is consumed from S3 topics from either: the local on-disk segments populated in stage 2, or tiered log segments (the traditional tiered storage read pathway).

End-to-end latency of any given batch is therefore based on:
Produce batch added to buffer.
WAL Segment containing that batch written to S3.
Batch coordinates submitted to the Batch Coordinator for sequencing.
Producer request acknowledged.
Tail, untiered (fast path for tailing consumers) -> the replica downloads the WAL Segment slice, appends the batch to a local (per topic partition) log segment, and serves consumer fetch requests from that local log segment.
Tiered (slow path for lagging consumers) -> the Remote Log Manager downloads the tiered log segment and the replica serves consumer fetch requests from the downloaded segment.

Avoiding excessive reads to S3 will be important in stage 2 (when a broker is downloading WAL segment files for its assigned topic partitions). This KIP should standardize how topic partitions are laid out inside every WAL segment and perform partition-broker assignments based on that same order: Pick a single global topic partition order (based on a permutation of a topic partition id).
Partition that order into contiguous slices, giving one slice per broker as its assignment (a broker may get multiple topic partitions, but they must be adjacent in the global order). Lay out every WAL segment in that same global order. That way, each broker’s partition assignment will occupy one contiguous block per WAL Segment, so each read broker needs only one byte-range read per WAL segment object (possibly empty if none of its partitions appear in that object). This reduces the number of range reads per broker when reconstructing local log segments. By integrating with tiered storage and reconstructing log segments, revision 2 moves to a more diskful design, where disks form a step in the write path to long-term storage. It is also more stateful and sticky than revision 1, given that each topic partition is assigned a specific broker for log segment reconstruction, tiering and consumer serving. Revision 2 remains leaderless for producers, but leaderful for consumers. Therefore, to avoid cross-AZ traffic for consumers, it will rely on KIP-881: Rack-aware Partition Assignment for Kafka Consumers to ensure zone-local consumer assignment. This makes revision 2 a hybrid. By delegating responsibility to tiered storage, more of the direct-to-S3 workload must be handled by stateful brokers. It is less able to benefit from the elasticity of disaggregated storage. But it reuses more of existing Kafka. The third revision, which at the time of writing is a loose proposal in the mailing list, ditches the Batch Coordinator (BC) altogether. Most of the complexity of KIP-1150 centers around Batch Coordinator efficiency, failovers, scaling as well as idempotency, transactions and share groups logic. Revision 3 proposes to replace the BCs with “classic” topic partitions. The classic topic partition leader replicas will do the work of sequencing and storing the batch coordinates of their own data.
The data itself would live in SLSOs (rev1)/WAL Segments (rev2) and ultimately, as tiered log segments (tiered storage). To make this clear, if as a user you create the topic Orders with one partition, then an actual topic Orders will be created with one partition. However, this will only be used for the sequencing of the Orders data and the storage of its metadata. The benefit of this approach is that all the idempotency and transaction logic can be reused in these “classic-ish” topic partitions. There will be code changes but less than Batch Coordinators. All the existing tooling of moving partitions around, failovers etc works the same way as classic topics. So replication continues to exist, but only for the metadata. Fig KIP-1150-rev3-A. Replicated partitions continue to exist, acting as per-partition sequencers and metadata stores (replacing batch coordinators). One wrinkle this adds is that there is no central place to manage the clean-up of WAL Segments. Therefore a WAL File Manager component would have responsibility for background cleanup of those WAL segment files. It would periodically check the status of tiering to discover when a WAL Segment can get deleted. Fig KIP-1150-rev3-B. The WAL File Manager is responsible for WAL Segment clean up The motivation behind this change to remove Batch Coordinators is to simplify implementation by reusing existing Kafka code paths (for idempotence, transactions, etc.). However, it also opens up a whole new set of challenges which must be discussed and debated, and it is not clear this third revision solves the complexity. Revision 3 now depends on the existing classic topics, with leader-follower replication. It moves a little further again towards the evolutionary path. It is curious to see Aiven productionizing its Kafka fork “Inkless”, which falls under the “revolutionary” umbrella, while pushing towards a more evolutionary stateful/sticky design in these later revisions of KIP-1150. 
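Stepping back briefly to revision 2’s WAL-segment layout proposal: the contiguous-slice idea can be sketched as follows (a toy Python model with illustrative names, not code from the KIP). A single global partition order is split into one contiguous slice per broker, and because every WAL segment is laid out in that same order, each broker needs at most one byte-range read per object.

```python
# Toy model of KIP-1150 rev 2's contiguous-slice layout: one global partition
# order, one contiguous slice per broker, WAL segments laid out in that order.

def assign_slices(global_order, brokers):
    """Split the global partition order into one contiguous slice per broker."""
    n, k = len(global_order), len(brokers)
    size = -(-n // k)  # ceiling division
    return {b: global_order[i * size:(i + 1) * size] for i, b in enumerate(brokers)}

def broker_byte_range(broker_partitions, segment_index):
    """segment_index: {partition: (start, length)}, laid out in global order.
    Returns the single (start, end) byte range covering this broker's
    partitions present in the WAL segment, or None if none appear."""
    ranges = [segment_index[p] for p in broker_partitions if p in segment_index]
    if not ranges:
        return None
    start = min(s for s, _ in ranges)
    end = max(s + length for s, length in ranges)
    # Contiguous precisely because the segment layout follows the global order.
    return (start, end)
```

Since each broker's partitions are adjacent in every segment, one range read per WAL segment object suffices when reconstructing local log segments.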
Apache Kafka is approaching a decision point with long-term implications for its architecture and identity. The ongoing discussions around KIP-1150 revisions 1-3 and KIP-1176 are nominally framed around replication cost reduction, but the underlying issue is broader: how should Kafka evolve in a world increasingly shaped by disaggregated storage and elastic compute? At its core, the choice comes down to two paths. The evolutionary path seeks to fit S3 topics into Kafka’s existing leader-follower framework, reusing current abstractions such as tiered storage to minimize disruption to the codebase. The revolutionary path instead prioritizes the benefits of building directly on object storage. By delegating to shared object storage, Kafka can support an S3 topic serving layer which is stateless, elastic, and disposable, with scaling achieved by adding and removing stateless workers rather than rebalancing stateful nodes, while Kafka’s existing workloads remain on classic leader-follower topics. While the intentions and goals of the KIPs clearly fall on a continuum from revolutionary to evolutionary, the reality in the mailing list discussions makes everything much less clear. The devil is in the details, and as the discussion advances, the arguments of “simplicity through reuse” start to strain. The reuse strategy is, ironically, a retrofitting strategy that could make the codebase harder to maintain in the long term. Kafka’s existing model is deeply rooted in leader-follower replication, with much of its core logic built around that assumption. Retrofitting direct-to-S3 into this model forces some “unnatural” design choices: choices that would not be made if designing a cloud-native solution from scratch. My own view aligns with the more revolutionary path in the form of KIP-1150 revision 1. It doesn’t simply reduce cross-AZ costs, but fully embraces the architectural benefits of building on object storage.
With additional broker roles and groups, Kafka could ultimately achieve a similar elasticity to WarpStream (and Confluent Freight Clusters). The approach demands more upfront engineering effort and may increase long-term maintenance complexity, but it avoids tight coupling to the existing leader-follower architecture. Much depends on what kind of refactoring is possible to avoid the duplication of idempotency, transactions and share group logic. I believe the benefits justify the upfront cost and will help keep Kafka relevant in the decade ahead. In theory, both directions are defensible; ultimately it comes down to the specifics of each KIP. The details really matter. Goals define direction, but it’s the engineering details that determine the system’s actual properties. We know the revolutionary path involves big changes, but the evolutionary path comes with equally large challenges, where retrofitting may ultimately be more costly while simultaneously delivering less. The committers who maintain Kafka are cautious about large refactorings and code duplication, but are equally wary of hacks and complex code serving two needs. We need to let the discussions play out. My aim with this post has been to take a step back from "how to implement direct-to-S3 topics in Kafka", and think more about what we want Kafka to be. The KIPs represent the how, the engineering choices. Framed that way, I believe it is easier for the wider community to understand the KIPs, the stakes and the eventual decision of the committers, whichever way they ultimately decide to go.
Evolutionary: Shoot for an evolutionary design that makes use of existing components to reduce the need for large refactoring or duplication of logic. However, by coupling to the existing architecture, we forfeit the extra benefits of object storage, focusing primarily on networking cost savings (in AWS and GCP). Through this coupling, we also run the risk of achieving the opposite: harder-to-maintain code, by bending and contorting a second workload into an architecture optimized for something else.

The rest of this post covers:
- The term "Diskless" vs "Direct-to-S3"
- The common parts
- Revolutionary: KIPs and real-world implementations
- Evolutionary: Slack's KIP-1176
- The hybrid: balancing revolution with evolution
- Deciding Kafka's future

The Common Parts

Some approaches are shared across multiple implementations and proposals.

Avoiding the small file problem. Most designs are leaderless for producer traffic, allowing any server to receive writes for any topic. To avoid uploading a multitude of tiny files, servers accumulate batches in a buffer until ready for upload. Before upload, the buffer is sorted by topic id and partition, making compaction and some reads more efficient by ensuring that data of the same topic and partition sits in contiguous byte ranges.

Pricing. The pricing of many (but not all) cloud object storage services penalizes excessive requests, so it can be cheaper to roll up whatever data has been received in the last X milliseconds and upload it with a single request.

Keeping reads zone-local:
- Leaderless: consumer alignment works the same way that producer alignment does, via metadata manipulation, or via KIP-392 (fetch from follower), which can also be used in a leaderless context.
- Leader-based: zone-aware consumer group assignment, as detailed in KIP-881: Rack-aware Partition Assignment for Kafka Consumers. The idea is to use consumer-to-partition assignment to ensure consumers are only assigned zone-local partitions (where the partition leader is located).
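The small-file avoidance described above (buffer batches, sort by topic id and partition, upload as a single object) can be sketched as follows. This is a minimal illustration; the class and field names are my own, not from any of the actual implementations.

```python
from dataclasses import dataclass, field

@dataclass
class Batch:
    topic_id: str
    partition: int
    payload: bytes

@dataclass
class UploadBuffer:
    """Accumulates producer batches, then emits one object sorted by
    (topic_id, partition) so each partition occupies a contiguous byte range."""
    batches: list = field(default_factory=list)

    def add(self, batch: Batch) -> None:
        self.batches.append(batch)

    def build_object(self):
        # Sort so that data for the same topic and partition is contiguous.
        ordered = sorted(self.batches, key=lambda b: (b.topic_id, b.partition))
        blob, index, pos = b"", {}, 0
        for b in ordered:
            key = (b.topic_id, b.partition)
            index.setdefault(key, [pos, pos])
            index[key][1] = pos + len(b.payload)  # extend the [start, end) range
            blob += b.payload
            pos += len(b.payload)
        return blob, index  # index: (topic_id, partition) -> [start, end) byte range

buf = UploadBuffer()
buf.add(Batch("orders", 1, b"aaa"))
buf.add(Batch("clicks", 0, b"bb"))
buf.add(Batch("orders", 1, b"cc"))
blob, index = buf.build_object()  # one object, partitions in contiguous ranges
```

The byte-range index is what later allows a reader (or compactor) to fetch only the slice of the object belonging to one partition, rather than the whole file.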
- KIP-392 (fetch-from-follower), which is effective for designs that have followers (which isn't always the case).

Revolutionary: KIPs and real-world implementations

The designs covered here are:
- WarpStream (as a reference, a kind of yardstick to compare against)
- KIP-1150 revision 1
- Aiven Inkless (a Kafka fork)

These designs share a common shape:
- Leaderless, stateless and diskless agents that handle Kafka clients, as well as compaction/cleaning work.
- A coordination layer: a central metadata store for sequencing, metadata storage and housekeeping coordination.

The coordination layer is responsible for:
- Sequencing. Choosing the total ordering of writes, assigning offsets without gaps or duplicates.
- Metadata storage. Storing all metadata that maps partition offset ranges to S3 object byte ranges.
- Serving lookup requests. Serving requests for log offsets and for batch coordinates (S3 object metadata).
- Partition CRUD operations. Serving requests for atomic operations (creating partitions, deleting topics, records, etc.).
- Data expiration. Managing data expiry and soft deletion, and coordinating physical object deletion (performed by brokers).

The produce path:
1. When a producer starts, it sends a Metadata request to learn which brokers are the leaders of each topic partition it cares about. The Metadata response contains a zone-local broker.
2. The producer sends all Produce requests to this broker.
3. The receiving broker accumulates batches of all partitions and uploads a shared log segment object (SLSO) to S3.
4. The broker commits the SLSO by sending the metadata of the SLSO (known as the batch coordinates) to the Batch Coordinator. The coordinates include the S3 object metadata and the byte ranges of each partition within the object.
5. The Batch Coordinator assigns offsets to the written batches, persists the batch coordinates, and responds to the broker.
6. The broker sends all associated acknowledgements to the producers.

The fetch path:
1. The consumer sends a Fetch request to the broker.
2. The broker checks its cache; on a miss, it queries the Batch Coordinator for the relevant batch coordinates.
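The Batch Coordinator's sequencing role in the produce path, assigning gapless offsets as brokers commit batch coordinates, can be sketched as below. This is a toy single-process model under my own naming; a real coordinator persists its state durably and handles failure, which is omitted here.

```python
from dataclasses import dataclass, field

@dataclass
class BatchCoordinates:
    """Metadata a broker commits after uploading a shared object to S3."""
    object_key: str
    topic: str
    partition: int
    byte_start: int
    byte_end: int
    record_count: int

@dataclass
class BatchCoordinator:
    """Toy coordinator: assigns a contiguous offset range to each committed
    batch and records the mapping from offsets to S3 byte ranges."""
    next_offset: dict = field(default_factory=dict)  # (topic, partition) -> next offset
    index: dict = field(default_factory=dict)        # (topic, partition) -> [(base_offset, coords)]

    def commit(self, coords: BatchCoordinates) -> int:
        key = (coords.topic, coords.partition)
        base = self.next_offset.get(key, 0)
        # Offsets are assigned centrally, not by the uploading broker, so the
        # log has no gaps or duplicates even with many brokers uploading concurrently.
        self.next_offset[key] = base + coords.record_count
        self.index.setdefault(key, []).append((base, coords))
        return base  # base offset returned to the broker for producer acks

coord = BatchCoordinator()
a = coord.commit(BatchCoordinates("obj-1", "orders", 0, 0, 100, record_count=3))
b = coord.commit(BatchCoordinates("obj-2", "orders", 0, 0, 80, record_count=2))
```

The same index also answers the fetch-path lookup: given an offset range, the coordinator returns the object keys and byte ranges the broker needs to download.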
The broker downloads the data from object storage, injects the computed offsets and timestamps into the batches (the offsets being part of the batch coordinates), then constructs and sends the Fetch response to the consumer.

The main tables:
- logs. Stores the Log Start Offset and High Watermark of each topic partition, with the primary key of topic_id and partition.
- files. Lists all the objects that host the topic partition data.
- batches. Maps Kafka batches to byte ranges in files.
- producer_state. All the producer state needed for idempotency.
- Some other tables for housekeeping and the merge work items.

The commit flow works as follows. A broker opens a transaction and submits a table as an argument, containing the batch coordinates of the multiple topics uploaded in the combined file. The TVF logic creates a temporary table (logs_tmp) and fills it via a SELECT on the logs table with a FOR UPDATE clause, which obtains a row lock on each row in the logs table that matches the list of partitions being submitted. This ensures that other brokers competing to add batches to the same partition(s) queue up behind this transaction. This is a critical barrier that avoids inconsistency, and the locks are held until the transaction commits or aborts. Next, inside a loop, partition by partition, the TVF:
1. Updates the producer state.
2. Updates the high watermark of the partition (a row in the logs table).
3. Inserts the batch coordinates into the batches table (sequencing and storing them).
Finally, it commits the transaction.

Goals of this design:
- Maintain leader-based topic partitions (producers continue to write to leaders), but replace Kafka's replication protocol with a per-broker S3-based write-ahead log (WAL).
- Try to preserve existing partition replica code for idempotency and transactions.
- Reuse existing tiered storage for long-term S3 data management.
- Remove Batch Coordinators from the read path.
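The commit transaction described above can be loosely simulated in memory. In the real system the barrier is a Postgres row lock taken via SELECT ... FOR UPDATE; here a per-partition mutex stands in for it, and all names are illustrative rather than taken from the Inkless schema.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class PartitionState:
    lock: threading.Lock = field(default_factory=threading.Lock)  # stands in for a FOR UPDATE row lock
    high_watermark: int = 0

class CommitSimulator:
    """In-memory stand-in for the commit flow: lock each affected partition
    row, then per partition update state and append batch metadata."""
    def __init__(self):
        self.logs = {}      # (topic, partition) -> PartitionState   (the "logs" table)
        self.batches = []   # the "batches" table: (topic, partition, base_offset, object_key)

    def commit_file(self, object_key, coords_list):
        # coords_list: [(topic, partition, record_count), ...] for one combined file
        keys = sorted({(t, p) for t, p, _ in coords_list})
        rows = [self.logs.setdefault(k, PartitionState()) for k in keys]
        # Acquire locks in a consistent order: competing writers queue up here,
        # mirroring the SELECT ... FOR UPDATE barrier held until commit/abort.
        for row in rows:
            row.lock.acquire()
        try:
            for topic, partition, record_count in coords_list:
                row = self.logs[(topic, partition)]
                base = row.high_watermark
                # (A real implementation also updates producer state here.)
                row.high_watermark = base + record_count
                self.batches.append((topic, partition, base, object_key))
        finally:
            for row in rows:
                row.lock.release()

sim = CommitSimulator()
sim.commit_file("file-1", [("orders", 0, 2), ("clicks", 3, 5)])
sim.commit_file("file-2", [("orders", 0, 4)])
```

Sorting the lock order is the standard way to avoid deadlocks when two commits touch overlapping partition sets, which the database would otherwise have to detect and abort.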
- Avoid separate object compaction logic by delegating long-term storage management to tiered storage (which already exists).
- Rebuild per-partition log segments from combined objects in order to:
  - Submit them for long-term tiering (which works as a form of object compaction too).
  - Serve consumer fetch requests.

The write path has three stages:
- Stage 1 – Produce path, synchronous. Uploads multi-topic WAL segments to S3 and sequences the batches, acknowledging to producers once committed. This is unchanged, except SLSOs are now called WAL segments.
- Stage 2 – Per-partition log segment file construction, asynchronous. Each broker is assigned a subset of topic partitions. The brokers download the WAL segment byte ranges that host their assigned partitions and append them to on-disk per-partition log segment files.
- Stage 3 – Tiered storage, asynchronous. The tiered storage architecture tiers the locally cached topic partition log segment files as normal.

The read pathway for tailing consumers is served from two sources:
- The local on-disk segments, populated in stage 2.
- Tiered log segments (the traditional tiered storage read pathway).

The produce flow:
1. Produce batch added to buffer.
2. WAL segment containing that batch written to S3.
3. Batch coordinates submitted to the Batch Coordinator for sequencing.
4. Producer request acknowledged.

The read flows:
- Tail, untiered (fast path for tailing consumers): the replica downloads a WAL segment slice, appends the batch to a local (per topic partition) log segment, and serves consumer fetch requests from that local log segment.
- Tiered (slow path for lagging consumers): the Remote Log Manager downloads the tiered log segment and the replica serves consumer fetch requests from the downloaded segment.

The partition-to-broker assignment works as follows:
- Pick a single global topic partition order (based on a permutation of the topic partition id).
- Partition that order into contiguous slices, giving one slice per broker as its assignment (a broker may get multiple topic partitions, but they must be adjacent in the global order).
- Lay out every WAL segment in that same global order.
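The assignment scheme above can be sketched as follows. The function names are my own, and the hash-based permutation is just one illustrative way to fix a global order; what matters is that the order is deterministic and shared by writers and readers.

```python
import hashlib

def global_order(partitions):
    """Fix one global order over topic partitions via a deterministic
    permutation (here: a hash of the partition id, an illustrative choice)."""
    return sorted(partitions, key=lambda tp: hashlib.sha256(repr(tp).encode()).hexdigest())

def assign_slices(ordered, num_brokers):
    """Split the global order into contiguous slices, one per broker.
    Because every WAL segment is laid out in this same order, each broker's
    assigned partitions form one contiguous byte range per segment."""
    n = len(ordered)
    base, extra = divmod(n, num_brokers)
    slices, start = [], 0
    for b in range(num_brokers):
        size = base + (1 if b < extra else 0)  # spread the remainder evenly
        slices.append(ordered[start:start + size])
        start += size
    return slices

partitions = [("orders", p) for p in range(4)] + [("clicks", p) for p in range(4)]
order = global_order(partitions)
slices = assign_slices(order, 3)  # three brokers, three adjacent slices
```

The payoff is on the read side of stage 2: a broker can fetch its share of a WAL segment with a single ranged GET instead of many scattered reads.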