Posts in Cloud (20 found)

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization

Yuhong Zhong, Daniel S. Berger, Pantea Zardoshti, Enrique Saurez, Jacob Nelson, Dan R. K. Ports, Antonis Psistakis, Joshua Fried, and Asaf Cidon. SOSP'25.

If you are like me, you've dabbled with software prefetching but never had much luck with it. Even if you care nothing about sharing PCIe devices across servers in a rack, this paper is still interesting because it shows a use case where software prefetching really matters.

I suppose it is common knowledge that CXL enables all of the servers in a rack to share a pool of DRAM. The unique insight of this paper is that once you've taken this step, sharing PCIe devices (e.g., NICs, SSDs) can be implemented "at near-zero extra cost."

Fig. 2 shows how much SSD capacity and NIC bandwidth are stranded in Azure. A stranded resource is one that is underutilized because some other resource (e.g., CPU or memory) is the bottleneck. Customers may not be able to precisely predict the ratios of CPU:memory:SSD:NIC resources they will need, and even if they could, there may be no standard VM size that exactly matches the desired ratio.

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Additionally, servers may carry redundant components for use in case of a hardware failure. The paper cites servers containing redundant NICs to keep a server from disappearing off the network if a NIC fails.

Pooling PCIe devices could help with both problems. The VM placement problem is easier if resources can be allocated dynamically within a rack rather than at the server level. Similarly, a rack could contain redundant devices available to any server in the rack that experiences a hardware failure.

Datapath

Fig. 4 shows the Oasis architecture:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

In this example, a VM or container running on host A uses the NIC located on host B.
An Oasis frontend driver on host A and a backend driver on host B make the magic happen. The communication medium is a pool of memory that both hosts can access over CXL. The shared memory pool stores both the raw network packets and the message queues which contain pointers to those packets.

A tricky bit in this design is the assumption that the CPU caches in the hosts do not have a coherent view of the shared memory pool (i.e., there is no hardware cache coherence support). This quote sums up the reasoning behind this assumption:

Although the CXL 3.0 specification introduces an optional cross-host hardware coherence flow [11], the implementation requires expensive hardware changes on both the processor and the device [74, 105, 143, 145]. To make Oasis compatible with hardware available today, we do not assume cache-coherent CXL devices.

Cache Coherency

Here is the secret sauce that Oasis uses to efficiently send a message from the frontend driver to the backend driver. Note that this scheme is used for the message channels (i.e., descriptors and packet metadata). The shared memory pool is mapped by both drivers as cacheable.

The frontend driver writes the message into shared memory, increments a tail pointer (also stored in shared CXL memory), and then forces the containing cache lines to be written to the shared memory pool by executing a cache-line writeback instruction.

The backend driver polls the tail pointer. If polling reveals that there are no new messages, the driver invalidates the cache line containing the tail pointer (with a cache-line flush followed by a fence). This handles the case where there actually are new messages available but the backend driver is reading a cached (stale) copy of the tail pointer. The backend driver then speculatively prefetches 16 cache lines of message data (with software prefetch instructions). When the backend driver detects that the tail pointer has been incremented, it processes all new messages.
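To make the writeback/invalidate/prefetch discipline concrete, here is a minimal C sketch of a one-producer, one-consumer message channel in the style described above. It is illustrative only, not Oasis's code: cxl_writeback and cxl_invalidate are hypothetical stand-ins for the cache-maintenance instructions (e.g., clwb and clflushopt plus a fence on x86), and on an ordinary coherent test machine they can safely be no-ops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_SLOTS 64
#define LINE 64

/* One message per cache line, as in a descriptor ring. */
struct msg { char payload[LINE]; };

/* Both drivers map this region cacheable; simulated here with ordinary
 * coherent memory. */
struct channel {
    struct msg ring[MSG_SLOTS];
    volatile uint64_t tail;          /* advanced only by the frontend */
};

/* Hypothetical stand-ins for cache-maintenance instructions. On real
 * non-coherent CXL memory these would write back / flush-and-fence the
 * touched lines; on a coherent test machine they can be no-ops. */
static void cxl_writeback(const volatile void *p, size_t n) { (void)p; (void)n; }
static void cxl_invalidate(const volatile void *p, size_t n) { (void)p; (void)n; }

/* Frontend: write the message, push its lines to the pool, then publish
 * the new tail pointer. */
static void send_msg(struct channel *ch, const char *data) {
    struct msg *m = &ch->ring[ch->tail % MSG_SLOTS];
    strncpy(m->payload, data, LINE - 1);
    m->payload[LINE - 1] = '\0';
    cxl_writeback(m, sizeof *m);
    ch->tail = ch->tail + 1;
    cxl_writeback(&ch->tail, sizeof ch->tail);
}

/* Backend: poll the tail. When idle, invalidate the tail line so the next
 * read fetches fresh data, and speculatively prefetch upcoming message
 * lines; after consuming a message, invalidate its line so future
 * prefetches are not silently ignored. */
static int poll_msg(struct channel *ch, uint64_t *head, char *out) {
    if (*head == ch->tail) {
        cxl_invalidate(&ch->tail, sizeof ch->tail);
        __builtin_prefetch(&ch->ring[*head % MSG_SLOTS]);  /* speculative */
        return 0;                                          /* nothing yet */
    }
    struct msg *m = &ch->ring[*head % MSG_SLOTS];
    memcpy(out, m->payload, LINE);
    cxl_invalidate(m, sizeof *m);    /* re-arm future prefetches */
    (*head)++;
    return 1;
}
```

The invalidate-after-consume call is the detail the paper stresses: without it, the line sits (possibly stale) in the consumer's cache and later prefetches of that slot do nothing.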
Hopefully there is more than one message, so the software prefetch instructions can overlap computation with transfers from the shared memory pool.

After processing the message(s), the backend driver invalidates the memory where those messages are stored. This is critical, because it allows subsequent prefetch instructions to work: a prefetch instruction does nothing if the target cache line is already cached (even though it may be stale). The speculative 16-cache-line prefetch suffers from the same issue. Say 4 of the 16 prefetched lines had valid messages and 12 did not. Those 12 cache lines are now in the backend CPU cache, and future prefetch instructions targeting them will do nothing. To solve this problem, the backend driver also invalidates speculatively prefetched cache lines that did not contain any messages.

Send and Receive Flows

Fig. 7 illustrates the end-to-end packet transmit flow:

Source: https://dl.acm.org/doi/10.1145/3731569.3764812

Here are the steps:

The network stack running in the VM/container on host A writes packet data into the packet buffer in CXL memory. Note that the network stack doesn't "know" that it is writing network packets to shared memory.
The frontend driver writes a message in the queue stored in shared CXL memory.
The frontend driver flushes the cache lines associated with the network packet data, the message, and the message queue tail pointer.
The backend driver polls the tail pointer for new messages in the queue (using the prefetching tricks described previously).
The backend driver uses DPDK to cause the NIC on host B to transmit the packet. Note that the CPU cores on host B do not need to actually read the network packet data; the NIC uses DMA to read it directly from the shared memory pool.

The steps to receive a packet are similar:

The NIC on host B writes the packet data (via DMA) into the shared memory pool.
The backend driver uses DPDK to learn that a new packet has arrived.
The backend driver writes a message into the message queue in shared memory.
The frontend driver polls the message queue (using the prefetch tricks).
The network stack running in the VM/container on host A reads the packet data from shared memory.

One trick used here is flow tagging. This is a DPDK feature that enables the NIC to determine which host a packet is destined for without the backend driver having to inspect network packet headers.

Fig. 8 shows measurements of the overhead added by Oasis. The solid lines are the baseline; the dotted lines are Oasis. Each color represents a different latency bucket. The baseline uses a NIC which is local to the host running the benchmark. The overhead is measurable, but not excessive.

Source: https://dl.acm.org/doi/pdf/10.1145/3731569.3764812

Dangling Pointers

The paper doesn't touch on the complexities related to network virtualization in a pooled device scheme. It seems to me that solving these problems wouldn't affect performance but would require significant engineering.

0 views
DHH 6 days ago

No backup, no cry

I haven't done a full-system backup since back in the olden days before Dropbox and Git. Every machine I now own is treated as a stateless, disposable unit that can be stolen, lost, or corrupted without consequences. The combination of full-disk encryption and distributed copies of all important data means there's just no stress if anything bad happens to the computer.

But don't mistake this for just an "everything is in the cloud" argument. Yes, I use Dropbox and GitHub to hold all the data that I care about, but the beauty of these systems is that they work with local copies of that data. So with a couple of computers here and there, I always have a recent version of everything, in case either syncing service should go offline (or away!).

The trick to making this regime work is to stick with it. This is especially true for Dropbox. It's where everything of importance needs to go: documents, images, whatever. And it's instantly distributed on all the machines I run. Everything outside of Dropbox is essentially treated as a temporary directory that's fully disposable.

It's from this principle that I built Omarchy too. Given that I already had a way to restore all data and code onto a new machine in no time at all, it seemed so unreasonable that the configuration needed for a fully functional system still took hours on end. Now it's all encoded in an ISO setup that installs in two minutes on a fast computer.

Now it's true that this method relies on both multiple computers and a fast internet connection. If you're stuck on a rock in the middle of nowhere, and you somehow haven't discovered the glory of Starlink, maybe just stick to your old full-disk backup ways. But if you live in the modern world, there ought to be no reason why a busted computer means either data loss or a long restore process.

1 view
iDiallo 1 week ago

The real cost of Compute

Somewhere along the way, we stopped talking about servers. The word felt clunky, industrial, too tied to physical reality. Instead, we started saying "the cloud". It sounds weightless, infinite, almost magical. Your photos live in the cloud. Your documents sync through the cloud. Your company's entire infrastructure runs in the cloud.

I hated the term cloud. I wasn't alone; someone actually created a "cloud to butt" browser extension that was pretty fun and popular. But the world has adopted the term, and I had no choice but to go along with it.

So what is the actual cloud? Why is it hiding behind this abstraction? Well, the cloud is rows upon rows of industrial machines, stacked in massive data centers, consuming electricity at a scale most of us can't even imagine. The cloud isn't floating above us. It's bolted to concrete floors, surrounded by cooling systems, and plugged into power grids that strain under its appetite.

I'm old enough to remember the crypto boom and the backlash that followed. Critics loved to point out that Bitcoin mining consumed as much electricity as entire countries. Argentina, the Netherlands, and many other nations were picked for comparison. But I was not outraged by it at all. My reaction at the time was simpler. Why does it matter if they pay their electric bill? If you use electricity and compensate for it, isn't that just... how markets work?

Turns out, I was missing the bigger picture. And the AI boom has made it impossible to ignore. When new data centers arrive in a region, everyone's electric bill goes up, even if your personal consumption stays exactly the same. It has nothing to do with fairness and free markets. Infrastructure is not free. The power grids weren't designed for the sudden addition of facilities that consume megawatts continuously. When demand surges beyond existing capacity, utilities pass those infrastructure costs onto everyone. New power plants get built, transmission lines get upgraded, and residential customers help foot the bill through rate increases. The person who never touches AI, never mines crypto, and never even knows what a data center does is now subsidizing the infrastructure boom through their monthly utility payment. The cloud, it turns out, has a very terrestrial impact on your wallet.

We've abstracted computing into its purest conceptual form: "compute." I have to admit, it's my favorite term in tech. "Let's buy more compute." "We need to scale our compute." It sounds frictionless, almost mathematical. Like adjusting a variable in an equation. Compute feels like a slider you can move up and down in your favorite cloud provider's interface. Need more? Click a button. Need less? Drag it down. The interface is clean, the metaphor is seamless, and completely disconnected from the physical reality.

But in the real world, "buying more compute" means someone is installing physical hardware in a physical building. It means racks of servers being assembled, hard drives being mounted, cables being routed. The demand has become so intense that some data center employees have one job and one job only: installing racks of new hard drives, day in and day out. It's like an industrial assembly line. Every gigabyte of "cloud storage" occupies literal space. Every AI query runs on actual processors that generate actual heat. The abstraction is beautiful, but the reality is concrete and steel.

The cloud metaphor served its purpose. It helped us think about computing as a utility: always available, scalable, detached from the messy details of hardware management. But metaphors shape how we think, and this one has obscured too much for too long. Servers are coming out of their shells. The foggy cloud is lifting, and we're starting to see the machinery underneath: vast data centers claiming real estate, consuming real water for cooling, and drawing real power from grids shared with homes, schools, and hospitals.

This isn't an argument against cloud computing or AI. There's nothing to go back to. But we need to acknowledge their physical footprint. The cloud isn't a magical thing in the sky. It's industry. And like all industry, it needs land, resources, and infrastructure that we all share.

0 views

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu. SOSP'25.

Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn't that what an OS is for?

My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to harden a large OS to the point that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor.

This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first involves widespread changes to legacy code; the other involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions.

The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS.
The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling: some cores are reserved for data-plane tasks, while others are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dynamic Scheduling Headaches

Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren't many packets flowing. Section 3.2 says this is a no-go. It would be great if there were more detail here.

The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured so that it routes packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock and then synchronously communicate with the accelerator.

The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done asynchronously, but that would likely have ramifications for the entire control-plane software stack. This quote is telling:

Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification.
Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU); the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality a Linux kernel module (5,800 lines of code) allows the vCPU to run only at times of low data-plane activity.

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parse packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This lets the system hide the latency of the context switch from vCPU to pCPU.

Think of it like a group of students in a classroom without any teachers (i.e., no network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees a teacher (a network packet) returning, they shout "act responsible," and everyone returns to their schoolwork (running data-plane code).

Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Results

Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server.
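The lookout idea boils down to one scheduling decision driven by a probe of the accelerator's queue. Here is a toy C sketch of that decision, not Tai Chi's actual kernel module: accel_queue_depth() is a hypothetical stand-in for the hardware workload probe (on real hardware it would read an accelerator counter; here it is fed by a plain variable), and the "scheduler" is a function call rather than a vCPU context switch.

```c
#include <assert.h>

/* Hypothetical probe: how many packets the hardware accelerator has queued
 * for the ARM cores. Fed by the test harness instead of real hardware. */
static int pending_pkts = 0;
static int accel_queue_depth(void) { return pending_pkts; }

/* Counters so we can observe which side got to run. */
static int data_pkts_handled = 0;
static int cp_slices_run = 0;

static void run_dataplane(void) { pending_pkts--; data_pkts_handled++; }
static void run_cp_slice(void)  { cp_slices_run++; } /* one vCPU timeslice */

/* One scheduling decision: the "lookout". If the probe sees work coming,
 * the pCPU runs data-plane code; only when the coast is clear does the
 * vCPU get a control-plane timeslice. */
static void schedule_once(void) {
    if (accel_queue_depth() > 0)
        run_dataplane();
    else
        run_cp_slice();
}
```

The point of probing the accelerator rather than the ARM core's own receive queue is the 3-microsecond head start: the decision to switch back to data-plane work can be made before the packet ever reaches software.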
Source: https://dl.acm.org/doi/10.1145/3731569.3764851

Dangling Pointers

To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.

0 views
./techtipsy 1 week ago

Every time I write about a single board computer, half the internet goes down

It happened again. This time it's Cloudflare. The last time I wrote about a single board computer, it was AWS that went down on the same day. Today, I wrote about the LattePanda IOTA. I'll let y'all know once I plan on writing about another single board computer; it seems to be bad for the internet.

0 views
Rik Huijzer 1 week ago

Do Not Put Your Site Behind Cloudflare if You Don't Need To

At the time of writing, 12:43 UTC on Tue 18 Nov, Cloudflare has taken many sites down. I'm trying to browse the web, but about half of the sites show an error: ![cloudflare.webp](/files/45b312b038ccdc65) Most of these sites are not even that big; I expect maybe a few thousand visitors per month. This demonstrates once again a simple fact: if you put your site behind a centralized service, then that service is a single point of failure. Even large, established companies make mistakes and can go down. Most people use Cloudflare because they have been scared into the idea that you need DDoS protecti...

0 views

Exclusive: Here's How Much OpenAI Spends On Inference and Its Revenue Share With Microsoft

As with my Anthropic exclusive from a few weeks ago, though this feels like a natural premium piece, I decided it was better to publish on my free one so that you could all enjoy it. If you liked or found this piece valuable, please subscribe to my premium newsletter — here's $10 off the first year of an annual subscription. I have put out over a hundred thousand words of coverage in the last three months, most of which is on my premium, and I'd really appreciate your support. I also did an episode of my podcast Better Offline about this.

Before publishing, I discussed the data with a Financial Times reporter. Microsoft and OpenAI both declined to comment to the FT. If you ever want to share something with me in confidence, my Signal is ezitron.76, and I'd love to hear from you.

What I'll describe today will be a little more direct than usual, because I believe the significance of the information requires me to be as specific as possible. Based on documents viewed by this publication, I am able to report OpenAI's inference spend on Microsoft Azure, in addition to its payments to Microsoft as part of its 20% revenue share agreement, which was reported in October 2024 by The Information. In simpler terms, Microsoft receives 20% of OpenAI's revenue.

I do not have OpenAI's training spend, nor do I have information on the entire extent of OpenAI's revenues, as it appears that Microsoft shares some percentage of its revenue from Bing, as well as 20% of the revenue it receives from selling OpenAI's models. According to The Verge:

Nevertheless, I am going to report what I've been told. One small note — for the sake of clarity, every time I mention a year going forward, I'll be referring to the calendar year, and not Microsoft's financial year (which ends in June).

The numbers in this post differ from those that have been reported publicly.
For example, previous reports had said that OpenAI had spent $2.5 billion on "cost of revenue" - which I believe means OpenAI's inference costs - in the first half of CY2025.

According to the documents viewed by this newsletter, OpenAI spent $5.02 billion on inference alone with Microsoft Azure in the first half of calendar year 2025 (CY2025). The pattern continued through the end of September: by that point in CY2025 — three months later — OpenAI had spent $8.67 billion on inference.

OpenAI's inference costs have risen consistently over the last 18 months, too. OpenAI spent $3.76 billion on inference in CY2024, meaning that OpenAI has already more than doubled its inference costs in CY2025 through September. Based on its reported revenues of $3.7 billion in CY2024 and $4.3 billion in revenue for the first half of CY2025, it seems that OpenAI's inference costs easily eclipsed its revenues.

Yet, as mentioned previously, I am also able to shed light on OpenAI's revenues, as these documents also reveal the amounts that Microsoft takes as part of its 20% revenue share with OpenAI. Concerningly, extrapolating OpenAI's revenues from this revenue share does not produce numbers that match those previously reported.

According to the documents, Microsoft received $493.8 million in revenue share payments in CY2024 from OpenAI — implying revenues for CY2024 of at least $2.469 billion, or around $1.23 billion less than the $3.7 billion that has been previously reported. Similarly, for the first half of CY2025, Microsoft received $454.7 million as part of its revenue share agreement, implying OpenAI's revenues for that six-month period were at least $2.273 billion, or around $2 billion less than the $4.3 billion previously reported. Through September, Microsoft's revenue share payments totalled $865.9 million, implying OpenAI's revenues are at least $4.329 billion.

According to Sam Altman, OpenAI's revenue is "well more" than $13 billion.
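The extrapolation behind those "implied revenue" floors is simple division: if Microsoft receives 20% of OpenAI's revenue, each share payment divided by 0.20 gives a lower bound on the revenue that flowed through the agreement. A minimal sketch (figures in millions, taken from the payments described above):

```c
#include <assert.h>

/* Microsoft revenue-share payments reported in this piece ($ millions). */
static const double SHARE_CY2024    = 493.8;
static const double SHARE_H1_CY2025 = 454.7;
static const double SHARE_THRU_Q3   = 865.9;

/* A 20% share payment implies total revenue of payment / 0.20. This is a
 * floor, since any revenue not covered by the share agreement is unseen. */
static double implied_revenue(double share_payment_m) {
    return share_payment_m / 0.20;
}
```

Plugging in the payments yields at least $2,469 million for CY2024, $2,273.5 million for H1 CY2025, and $4,329.5 million through Q3 CY2025, matching the floors quoted above.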
I am not sure how to reconcile that statement with the documents I have viewed. The following numbers are calendar years. I will add that, where I have them, I will include OpenAI's leaked or reported revenues. In some cases the numbers match up; in others they do not.

Though I do not know for certain, the only way to reconcile this would be some sort of creative means of measuring "annualized" or "recurring" revenue. I am confident in saying that I have read every single story about OpenAI's revenue ever written, and at no point does OpenAI (or the documents reporting anything) explain how the company defines "annualized" or "annual recurring revenue."

I must be clear that the following is me speaking in generalities, and not about OpenAI specifically, but you can get really creative with annualized revenue or annual recurring revenue. You can say 30 days, 28 days, and you can even choose a period of time that isn't a calendar month too — so, say, the best 30 days of your company's existence across two different months. I have no idea how OpenAI defines this metric, and default to assuming that an "annualized" or "ARR" figure divided by 12 gives the monthly number.

The Financial Times reported on February 9, 2024 that OpenAI's revenues had "surpassed $2 billion on an annualised basis" in December 2023, working out to $166.6 million in a month.

The Information reported on June 12, 2024 that OpenAI had "more than doubled its annualized revenue to $3.4 billion in the last six months or so," working out to around $283 million in a month, likely referring to this period.

On September 27, 2024, the New York Times reported that "OpenAI's monthly revenue hit $300 million in August…and the company expects about $3.7 billion in annual sales [in 2024]," according to a financial professional's review of documents.
On June 9, 2025, an OpenAI spokesperson told CNBC that it had hit "$10 billion annual recurring revenue," excluding licensing revenue from OpenAI's 20% revenue share and "large, one-time deals." $10 billion in annualized revenue works out to around $833 million in a month.

These numbers are inclusive of OpenAI's revenue share payments to Microsoft and OpenAI's inference spend. There could also be royalty payments made to OpenAI as part of its deal to receive 20% of Microsoft's sales of OpenAI's models, or other revenue related to its revenue share with Bing.

Due to the sensitivity and significance of this information, I am taking a far more blunt approach with this piece. Based on the information in this piece, OpenAI's costs and revenues are potentially dramatically different from what we believed. The Information reported in October 2024 that OpenAI's revenue could be $4 billion, and inference costs $2 billion, based on documents "which include financial statements and forecasts," and specifically added the following:

I do not know how to reconcile this with what I am reporting today. In the first half of CY2024, based on the information in the documents, OpenAI's inference costs were $1.295 billion, and its revenues at least $934 million. Indeed, it is tough to reconcile what I am reporting with much of what has been reported about OpenAI's costs and revenues.

OpenAI's inference spend with Microsoft Azure between CY2024 and Q3 CY2025 was $12.43 billion. That is an astonishing figure, one that dramatically dwarfs any and all reporting, which, based on my analysis, suggested that OpenAI spent $2 billion on inference in 2024 and $2.5 billion through H1 CY2025. In other words, inference costs are nearly triple those reported elsewhere. Similarly, OpenAI's extrapolated revenues are dramatically different from those reported.
While we do not have a final tally for 2024, the indicators presented in the documents viewed contrast starkly with the reported predictions from that year. Both reports of OpenAI’s 2024 revenues (CNBC, The Information) are from the same year and are projections of potential final totals, though The Information’s story about OpenAI’s H1 CY2025 revenues said that “OpenAI generated $4.3 billion in revenue in the first half of 2025, about 16% more than it generated all of last year,” which would bring us to $3.612 billion in revenue, or $1.145 billion more than is implied by OpenAI’s revenue share numbers paid to Microsoft.

I do not have an answer for inference, other than I believe that OpenAI is spending far more money on inference than we were led to believe, and that the current numbers reported do not resemble those in the documents.

Based on these numbers, it appears that OpenAI may be the single most cash-intensive startup of all time, and that the cost of running large language models may not be something that can be supported by revenues. Even if revenues were to match those that had been reported, OpenAI’s inference spend on Azure consumes them, and appears to scale linearly above revenue.

I also cannot reconcile these numbers with the reporting that OpenAI will have a cash burn of $9 billion in CY2025. On inference alone, OpenAI has already spent $8.67 billion through Q3 CY2025. Similarly, I cannot see a path for OpenAI to hit its projected $13 billion in revenue by the end of 2025, nor can I see on what basis Mr. Altman could state that OpenAI will make “well more” than $13 billion this year.

I cannot and will not speak to the financial health of OpenAI in this piece, but I will say this: these numbers are materially different to what has been reported, and the significance of OpenAI’s inference spend alone makes me wonder about the larger cost picture for generative AI.
If it costs this much to run inference for OpenAI, I believe it costs this much for any generative AI firm running on OpenAI’s models. If it does not, OpenAI’s costs are dramatically higher than the prices it is charging its customers, which makes me wonder whether price increases could be necessary to begin making more money, or at the very least losing less. Similarly, if OpenAI’s costs are this high, it makes me wonder about the margins of any frontier model developer.

Q1 CY2024
- Inference: $546.8 million
- Microsoft Revenue Share: $77.3 million
- Implied OpenAI Revenue: at least $386.5 million

Q2 CY2024
- Inference: $748.3 million
- Microsoft Revenue Share: $109.5 million
- Implied OpenAI Revenue: at least $547.5 million

Q3 CY2024
- Inference: $1.005 billion
- Microsoft Revenue Share: $139.2 million
- Implied OpenAI Revenue: at least $696 million

Q4 CY2024
- Inference: $1.467 billion
- Microsoft Revenue Share: $167.8 million
- Implied OpenAI Revenue: at least $839 million

CY2024 totals
- Total inference spend: $3.767 billion
- Total implied revenue: at least $2.469 billion
- Reported (projected) revenue: $3.7 billion, per CNBC in September 2024. The Information also reported that expected revenue could be as high as $4 billion in a piece from October 2024.
- Reported inference costs: $2 billion, per The Information.

Q1 CY2025
- Inference: $2.075 billion
- Microsoft Revenue Share: $206.4 million
- Implied OpenAI Revenue: $1.032 billion

Q2 CY2025
- Inference: $2.947 billion
- Microsoft Revenue Share: $248.3 million
- Implied OpenAI Revenue: $1.2415 billion

H1 CY2025 totals
- Inference: $5.022 billion
- Revenue: at least $2.273 billion
- Reported revenue: $4.3 billion (per The Information)
- Reported “Cost of Revenue”: $2.5 billion (per The Information)

Q3 CY2025
- Inference: $3.648 billion
- Microsoft Revenue Share: $411.1 million
- Implied OpenAI Revenue: at least $2.056 billion
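The arithmetic behind the "implied revenue" figures above can be sketched in a few lines: if Microsoft's revenue share is 20% of OpenAI's revenue, implied revenue is simply the share divided by 0.20. This is a minimal illustration using the CY2024 quarterly figures quoted in this piece (the 20% rate and the per-quarter shares come from the article; everything else is scaffolding):

```python
# Implied revenue = Microsoft revenue share / 20%, per the article's method.
REV_SHARE_RATE = 0.20

# Microsoft revenue share per quarter of CY2024, in $ millions (from the piece).
rev_share_cy2024 = {"Q1": 77.3, "Q2": 109.5, "Q3": 139.2, "Q4": 167.8}

implied = {q: share / REV_SHARE_RATE for q, share in rev_share_cy2024.items()}

print(round(implied["Q1"], 1))                 # → 386.5  ($ millions)
print(round(sum(implied.values()), 1))         # → 2469.0 (i.e. $2.469bn for CY2024)
```

The "at least" qualifier in the article matters here: this lower-bounds revenue, since any revenue not subject to the Microsoft share would come on top.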

Shayon Mukherjee 2 weeks ago

A hypothetical search engine on S3 with Tantivy and warm cache on NVMe

I’ve been curious about how far you can push object storage as a foundation for database-like systems. In previous posts, I explored moving JSON data from PostgreSQL to Parquet on S3 and building MVCC-style tables with constant-time deletes using S3’s conditional writes. These experiments showed that decoupling storage from compute unlocks interesting trade-offs: lower costs and simpler operations in exchange for higher cold query latency. Search engines traditionally don’t fit this model.

maxdeviant.com 2 weeks ago

Head in the Zed Cloud

For the past five months I've been leading the efforts to rebuild Zed 's cloud infrastructure. Our current backend—known as Collab—has been chugging along since basically the beginning of the company. We use Collab every day to work together on Zed in Zed. However, as Zed continues to grow and attracts more users, we knew that we needed a full reboot of our backend infrastructure to set us up for success for our future endeavors. Enter Zed Cloud. Like Zed itself, Zed Cloud is built in Rust 1 . This time around there is a slight twist: all of this is running on Cloudflare Workers , with our Rust code being compiled down to WebAssembly (Wasm). One of our goals with this rebuild was to reduce the amount of operational effort it takes to maintain our hosted services, so that we can focus more of our time and energy on building Zed itself. Cloudflare Workers allow us to easily scale up to meet demand without having to fuss over it too much. Additionally, Cloudflare offers an ever-growing amount of managed services that cover anything you might need for a production web service. Here are some of the Cloudflare services we're using today: Another one of our goals with this rebuild was to build a platform that was easy to test. To achieve this, we built our own platform framework on top of the Cloudflare Workers runtime APIs. At the heart of this framework is the trait: This trait allows us to write our code in a platform-agnostic way while still leveraging all of the functionality that Cloudflare Workers has to offer. Each one of these associated types corresponds to some aspect of the platform that we'll want to have control over in a test environment. For instance, if we have a service that needs to interact with the system clock and a Workers KV store, we would define it like this: There are two implementors of the trait: and . —as the name might suggest—is an implementation of the platform on top of the Cloudflare Workers runtime. 
This implementation targets Wasm and is what we run when developing locally (using Wrangler ) and in production. We have a crate 2 that contains bindings to the Cloudflare Workers JS runtime. You can think of as the glue between those bindings and the idiomatic Rust APIs exposed by the trait. The is used when running tests, and allows for simulating almost every part of the system in order to effectively test our code. Here's an example of a test for ingesting a webhook from Orb : In this test we're able to test the full end-to-end flow of: The call to advances the test simulator, in this case running the pending queue consumers. At the center of the is the , a crate that powers our in-house async runtime. The scheduler is shared between GPUI —Zed's UI framework—and the used in tests. This shared scheduler enables us to write tests that span the client and the server. So we can have a test that starts in a piece of Zed code, flows through Zed Cloud, and then asserts on the state of something in Zed after it receives the response from the backend. The work being done on Zed Cloud now is laying the foundation to support our future work around collaborative coding with DeltaDB . If you want to work with me on building out Zed Cloud, we are currently hiring for this role. We're looking for engineers with experience building and maintaining web APIs and platforms, solid web fundamentals, and who are excited about Rust. If you end up applying, you can mention this blog post in your application. I look forward to hearing from you! The codebase is currently 70k lines of Rust code and 5.7k lines of TypeScript. This is essentially our own version of . I'd like to switch to using directly, at some point. 
Cloudflare services in use:
- Hyperdrive for talking to Postgres
- Workers KV for ephemeral storage
- Cloudflare Queues for asynchronous job processing

The webhook test exercises the full flow of:
- Receiving and validating an incoming webhook event at the webhook ingestion endpoint
- Putting the webhook event into a queue
- Consuming the webhook event in a background worker and processing it
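A platform trait of the kind described above — associated types for each facet of the platform (clock, KV store, and so on), with a simulated implementation for tests — might look roughly like this sketch. All names here are hypothetical, invented for illustration; this is not Zed's actual code, whose identifiers were not shown in the post:

```rust
use std::collections::HashMap;

// Hypothetical facets of the platform.
trait Clock {
    fn now_unix_ms(&self) -> u64;
}

trait KvStore {
    fn put(&mut self, key: &str, value: String);
    fn get(&self, key: &str) -> Option<&String>;
}

// The platform trait: one associated type per facet we want to
// control in tests (clock, KV store, queues, ...).
trait Platform {
    type Clock: Clock;
    type Kv: KvStore;
    fn clock(&self) -> &Self::Clock;
    fn kv(&mut self) -> &mut Self::Kv;
}

// Test implementation: a simulated clock and an in-memory KV store.
struct TestClock { now_ms: u64 }
impl Clock for TestClock {
    fn now_unix_ms(&self) -> u64 { self.now_ms }
}

struct TestKv { map: HashMap<String, String> }
impl KvStore for TestKv {
    fn put(&mut self, key: &str, value: String) { self.map.insert(key.to_string(), value); }
    fn get(&self, key: &str) -> Option<&String> { self.map.get(key) }
}

struct TestPlatform { clock: TestClock, kv: TestKv }
impl Platform for TestPlatform {
    type Clock = TestClock;
    type Kv = TestKv;
    fn clock(&self) -> &TestClock { &self.clock }
    fn kv(&mut self) -> &mut TestKv { &mut self.kv }
}

// Service code is written against the trait, so it runs unchanged on
// the real (Workers-backed) platform or the test platform.
fn record_heartbeat<P: Platform>(p: &mut P) {
    let ts = p.clock().now_unix_ms();
    p.kv().put("last_heartbeat", ts.to_string());
}

fn main() {
    let mut platform = TestPlatform {
        clock: TestClock { now_ms: 1_700_000_000_000 },
        kv: TestKv { map: HashMap::new() },
    };
    record_heartbeat(&mut platform);
    println!("{}", platform.kv.get("last_heartbeat").unwrap()); // prints the simulated timestamp
}
```

In the real system, a second implementor of the trait would wrap the Cloudflare Workers bindings; the test implementor lets the simulator control time and storage deterministically, which is what enables the end-to-end webhook test described above.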

Rik Huijzer 3 weeks ago

Solar Cell Spectral Sensitivity

I just came across the book _Solar Secrets_ (2014) by Peter Lindemann. It observes that most solar panels are optimized to perform on bright sunny days whereas they barely perform on cloudy days. Meanwhile, there are panels that capture light of lower wavelengths and are therefore much less affected by clouds. The author shows this in the following figure from page 17: ![Solar Cell Spectral Sensitivity a-Si solar versus c-Si solar](/files/e452ae9cbcc36dba) Here it can be seen that Amorphous Silicon (a-Si) solar panels produce energy from the 500 to 700 nm wavelength range while Crystalline...

Robin Moffatt 3 weeks ago

Tech Radar (Nov 2025) - data blips

The latest Thoughtworks Tech Radar is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry where you can read more information about Thoughtworks’ usage of and opinions on it.

- Databricks Assistant
- Apache Paimon
- Delta Sharing
- Naive API-to-MCP conversion
- Standalone data engineering teams
- Text to SQL

./techtipsy 3 weeks ago

Why Nextcloud feels slow to use

Nextcloud. I really want to like it, but it’s making it really difficult. I like what Nextcloud offers with its feature set and how easily it replaces a bunch of services under one roof (files, calendar, contacts, notes, to-do lists, photos etc.), but no matter how hard I try and how much I optimize its resources on my home server, it feels slow to use, even on hardware ranging from decent to good.

Then I opened developer tools and found the culprit. It’s the Javascript. On a clean page load, you will be downloading about 15-20 MB of Javascript, which does compress down to about 4-5 MB in transit, but that is still a huge amount of Javascript. For context, I consider 1 MB of Javascript to be on the heavy side for a web page/app. Yes, that Javascript will be cached in the browser for a while, but you will still be executing all of it on each visit to your Nextcloud instance, and that will take a long time due to the sheer amount of code your browser now has to execute on the page.

A significant contributor to this heft seems to be the bundle which, based on its name, seems to provide some common functionality that’s shared across different Nextcloud apps that one can install. It’s coming in at 4.71 MB at the time of writing. Then you want notifications, right? That’s covered too, at 1.06 MB. Then there are the app-specific views. The Calendar app is taking up 5.94 MB to show a basic calendar view. The Files app includes a bunch of individual scripts (1.77 MB, 1.17 MB, 1.09 MB, 0.9 MB for a feature I’ve never used, and many smaller ones). The Notes app with its basic bare-bones editor? 4.36 MB!

This means that even on an iPhone 13 mini, opening the Tasks app (to-do list) will take a ridiculously long time. Imagine opening your shopping list at the store and having to wait 5-10 seconds before you see anything, even with a solid 5G connection. Sounds extremely annoying, right?
I suspect that a lot of this is due to how Nextcloud is architected. There’s bound to be some hefty common libraries and tools that allow app developers to provide a unified experience, but even then there is something seriously wrong with the end result: the functionality-to-bundle-size ratio is way off.

As a result, I’ve started branching out some things from Nextcloud, such as replacing the Tasks app with a private Vikunja instance, and Photos with a private Immich instance. Vikunja is not perfect, but its 1.5 MB of Javascript is an order of magnitude smaller than Nextcloud’s, making it feel incredibly fast in comparison. However, with other functionality I have to admit that the convenience of Nextcloud is enough to dissuade me from replacing it elsewhere, due to the available feature set comparing well to alternatives.

I’m sure that there are some legitimate reasons behind the current state, and overworked development teams and volunteers are unfortunately the norm in the industry, but that doesn’t take away from the fact that user experience and accessibility suffer as a result.

I’d like to thank Alex Russell for writing about web performance and why it matters, with supporting evidence and actionable advice; it has changed how I view websites and web apps and has pushed me to be better in my own work. I highly suggest reading his content, starting with the performance inequality gap series. It’s educational, insightful and incredibly irritating once you learn how crap most things are and how careless a lot of development teams are towards performance and accessibility.

Xe Iaso 1 months ago

Taking steps to end traffic from abusive cloud providers

This blog post explains how to effectively file abuse reports against cloud providers to stop malicious traffic. Key points:

Two IP Types: Residential (ineffective to report) vs. Commercial (targeted reports)

Why Cloud Providers: Cloud customers violate provider terms, making abuse reports actionable

Effective Abuse Reports Should Include:
- Time of abusive requests
- IP/User-Agent identifiers
- robots.txt status
- System impact description
- Service context

Process:
- Use whois to find abuse contacts (look for "abuse-c" or "abuse-mailbox")
- Send detailed reports to all listed emails
- Expect a response within 2 business days

Note on "Free VPNs": Often sell your bandwidth as part of botnets, not true public infrastructure

The goal is to make scraping the cloud provider's problem, forcing them to address violations against their terms of service.
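The whois lookup in the process above can be sketched as follows. This is an illustrative example: 203.0.113.5 is a reserved documentation address (substitute the offending IP from your logs), and the sample record is invented to show what the grep matches:

```shell
# Find where to send an abuse report for an offending IP.
# Real usage (output format varies by registry):
#   whois 203.0.113.5
# RIPE-style records expose "abuse-c:" and "abuse-mailbox:" fields,
# while ARIN records use "OrgAbuseEmail:".
# Filtering a saved whois response down to the abuse contact lines:
sample='abuse-c:        AR12345-RIPE
abuse-mailbox:  abuse@example.net'
printf '%s\n' "$sample" | grep -iE 'abuse-c|abuse-mailbox|abuseemail'
```

The surviving lines give you the mailbox (here, abuse@example.net) to which the detailed report — timestamps, IPs, user agents, robots.txt status, impact — should be sent.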

Jack Vanlightly 1 months ago

A Fork in the Road: Deciding Kafka’s Diskless Future

“The Kafka community is currently seeing an unprecedented situation with three KIPs (KIP-1150, KIP-1176, KIP-1183) simultaneously addressing the same challenge of high replication costs when running Kafka across multiple cloud availability zones.” — Luke Chen, The Path Forward for Saving Cross-AZ Replication Costs KIPs

At the time of writing, the Kafka project finds itself at a fork in the road, where choosing the right path forward for implementing S3 topics has implications for the long-term success of the project. Not just the next couple of years, but the next decade. Open-source projects live and die by these big decisions and as a community, we need to make sure we take the right one. This post explains the competing KIPs, but goes further and asks bigger questions about the future direction of Kafka.

Before comparing proposals, we should step back and ask what kind of system we want Kafka to become. Kafka now faces two almost opposing forces. One force is stabilizing: the on-prem deployments and the low latency workloads that depend on local disks and replication. Kafka must continue to serve those use cases. The other force is disrupting: the elastic, cloud-native workloads that favor stateless compute and shared object storage. Relaxed-latency workloads such as analytics have seen a shift in system design with durability increasingly delegated to shared object storage, freeing the compute layer to be stateless, elastic, and disposable. Many systems now scale by adding stateless workers rather than rebalancing stateful nodes. In a stateless compute design, the bottleneck shifts from data replication to metadata coordination. Once durability moves to shared storage, sequencing and metadata consistency become the new limits of scalability. That brings us to the current moment, with three competing KIPs defining how to integrate object storage directly into Kafka.
While we evaluate these KIPs, it’s important to consider the motivations for building direct-to-S3 topics. Cross-AZ charges are typically what’s on people’s minds, but it’s a mistake to think of S3 simply as a cheaper disk or a networking cheat. The shift is also architectural, providing us an opportunity to achieve operational benefits such as elastic stateless compute. The devil is in the details: how each KIP enables Kafka to leverage object storage while also retaining Kafka’s soul and what made it successful in the first place.

With that in mind, while three KIPs have been submitted, it comes down to two different paths:

Revolutionary: Choose a direct-to-S3 topic design that maximizes the benefits of an object-storage architecture, with greater elasticity and lower operational complexity. However, in doing so, we may increase the implementation cost and possibly the long-term code maintenance too, by maintaining two very different topic models in the same project (leader-based replication and direct-to-S3).

Evolutionary: Shoot for an evolutionary design that makes use of existing components to reduce the need for large refactoring or duplication of logic. However, by coupling to the existing architecture, we forfeit the extra benefits of object storage, focusing primarily on networking cost savings (in AWS and GCP). Through this coupling, we also run the risk of achieving the opposite: harder-to-maintain code by bending and contorting a second workload into an architecture optimized for something else.

In this post I will explain the two paths in this forked road, how the various KIPs map onto those paths, and invite the whole community to think through what they want for Apache Kafka for the next decade. Note that I do not include KIP-1183 as it looks dead in the water, and not a serious contender. The KIP proposes AutoMQ’s storage abstractions without the accompanying implementation.
Which, perhaps cynically, seems to benefit AutoMQ were it ever adopted, leaving the community to rewrite the entire storage subsystem again. If you want a quick summary of the three KIPs (including KIP-1183), you can read Luke Chen’s The Path Forward for Saving Cross-AZ Replication Costs KIPs or Anton Borisov’s summary of the three KIPs.

This post is structured as follows:
- The term “Diskless” vs “Direct-to-S3”
- The Common Parts. Some approaches are shared across multiple implementations and proposals.
- Revolutionary: KIPs and real-world implementations
- Evolutionary: Slack’s KIP-1176
- The Hybrid: balancing revolution with evolution
- Deciding Kafka’s future

I used the term “diskless” in the title as that is the current hype word. But it is clear that not all designs are actually diskless in the same spirit as “serverless”. Serverless implies that users no longer need to consider or manage servers at all, not that there are no servers. In the world of open-source, where you run stuff yourself, diskless would have to mean literally “no disks”, else you will be configuring disks as part of your deployment. But all the KIPs (in their current state) depend on disks to some extent, even KIP-1150 which was proposed as diskless. In most cases, disk behavior continues to influence performance and therefore correct disk provisioning will be important. So I’m not a fan of “diskless”; I prefer “direct-to-S3”, which encompasses all designs that treat S3 (and other object stores) as the only source of durability.

The main commonality between all Direct-to-S3 Kafka implementations and design proposals is the uploading of objects that combine the data of multiple topics. The reasons are two-fold:

Avoiding the small file problem. Most designs are leaderless for producer traffic, allowing for any server to receive writes to any topic. To avoid uploading a multitude of tiny files, servers accumulate batches in a buffer until ready for upload.
Before upload, the buffer is sorted by topic id and partition, to make compaction and some reads more efficient by ensuring that data of the same topic and same partition are in contiguous byte ranges.

Pricing. The pricing of many (but not all) cloud object storage services penalizes excessive requests, so it can be cheaper to roll up whatever data has been received in the last X milliseconds and upload it with a single request.

In the leader-based model, the leader determines the order of batches in a topic partition. But in the leaderless model, multiple brokers could be simultaneously receiving produce batches of the same topic partition, so how do we order those batches? We need a way of establishing a single order for each partition, and we typically use the word “sequencing” to describe that process. Usually there is a central component that does the sequencing and metadata storage, but some designs manage the sequencing in other ways.

WarpStream was the first to demonstrate that you could hack the metadata step of initiating a producer to provide it with broker information that would align the producer with a zone-local broker for the topics it is interested in. The Kafka client is leader-oriented, so we just pass it a zone-local broker and tell the client “this is the leader”. This is how all the leaderless designs ensure producers write to zone-local brokers. It’s not pretty, and we should make a future KIP to avoid the need for this kind of hack.

Consumer zone-alignment heavily depends on the particular design, but two broad approaches exist:

Leaderless: The same way that producer alignment works, via metadata manipulation or using KIP-392 (fetch from follower), which can be used in a leaderless context.

Leader-based: Zone-aware consumer group assignment as detailed in KIP-881: Rack-aware Partition Assignment for Kafka Consumers.
The idea is to use consumer-to-partition assignment to ensure consumers are only assigned zone-local partitions (where the partition leader is located). KIP-392 (fetch-from-follower) is effective for designs that have followers (which isn’t always the case).

Given almost all designs upload combined objects, we need a way to make those mixed objects more read optimized. This is typically done through compaction, where combined objects are ultimately separated into per-topic or even per-partition objects. Compaction could be one-shot or go through multiple rounds.

The “revolutionary” path draws a new boundary inside Kafka by separating what can be stateless from what must remain stateful. Direct-to-S3 traffic is handled by a lightweight, elastic layer of brokers that simply serve producers and consumers. The direct-to-S3 coordination (sequencing/metadata) is incorporated into the stateful side of regular brokers where coordinators, classic topics and KRaft live. I cover three designs in the “revolutionary” section:

- WarpStream (as a reference, a kind of yardstick to compare against)
- KIP-1150 revision 1
- Aiven Inkless (a Kafka fork)

Before we look at the KIPs that describe possible futures for Apache Kafka, let’s look at a system that was designed from scratch with both cross-AZ cost savings and elasticity (from object storage) as its core design principles. WarpStream was unconstrained by an existing stateful architecture, and with this freedom, it divided itself into:

- Leaderless, stateless and diskless agents that handle Kafka clients, as well as compaction/cleaning work.
- Coordination layer: A central metadata store for sequencing, metadata storage and housekeeping coordination.

Fig WS-A. The WarpStream stateless/stateful split architecture.

As per the Common Parts section, the zone alignment and sorted combined object upload strategies are employed.

Fig WS-B. The WarpStream write path.
On the consumer side, which again is leaderless and zone-local, a per-zone shared cache is implemented (which WarpStream dubbed distributed mmap). Within a zone, this shared cache assigns each agent a portion of the partition-space. When a consumer fetch arrives, an agent will download the object byte range itself if it is responsible for that partition, else it will ask the responsible agent to do that on its behalf. That way, we ensure that multiple agents per zone are not independently downloading the same data, thus reducing S3 costs. Fig WS-C. Shared per-zone read cache to reduce S3 throughput and request costs. WarpStream implements agent roles (proxy, job) and agent groups to separate the work of handling producer, consumer traffic from background jobs such as compaction, allowing for independent scaling for each workload. The proxy role (producer/consumer Kafka client traffic) can be further divided into proxy-producer and proxy-consumer.  Agents can be deployed as dedicated agent groups, which allows for further separation of workloads. This is useful for avoiding noisy neighbor issues, running different groups in different VPCs and scaling different workloads that hit the same topics. For example, you could use one proxy group for microservices, and a separate proxy-consumer group for an analytics workload. Fig WS-D. Agent roles and groups can be used for shaping traffic, independent scaling and deploying different workloads into separate VPCs. Being a pure Direct-to-S3 system allowed WarpStream to choose a design that clearly separates traffic serving work into stateless agents and the coordination logic into one central metadata service. The traffic serving layer is highly elastic, with relatively simple agents that require no disks at all. Different workloads benefit from independent and flexible scaling via the agent roles and groups. 
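The per-zone shared cache described above depends on every agent in a zone agreeing, without coordination, on which agent owns which slice of the partition space. One common way to get that property is rendezvous (highest-random-weight) hashing, sketched below. This is an illustrative stand-in, not WarpStream's actual assignment scheme; all names are invented:

```python
import hashlib

def owner_agent(topic: str, partition: int, zone_agents: list[str]) -> str:
    """Pick the agent responsible for caching this topic-partition.

    Rendezvous hashing: score every (agent, partition) pair and take the
    max. Every agent computes the same answer, and ownership only moves
    for the partitions whose winner was the agent that joined/left.
    """
    def weight(agent: str) -> int:
        key = f"{agent}/{topic}/{partition}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return max(zone_agents, key=weight)

agents = ["agent-a", "agent-b", "agent-c"]
# A fetch arriving at a non-owner agent is forwarded to the owner rather
# than triggering a duplicate S3 download of the same byte range.
print(owner_agent("orders", 7, agents))
```

Because the owner is a pure function of the partition and the agent set, an agent receiving a fetch for a partition it does not own can deterministically find the responsible peer and ask it for the bytes, which is the behavior the shared-cache design relies on.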
The point of contention is the metadata service, which needs to be carefully managed and scaled to handle the read/write metadata volume of the stateless agents. Confluent Freight Clusters follow a largely similar design principle of splitting stateless brokers from central coordination. I will write about the Freight design sometime soon.

Apache Kafka is a stateful distributed system, but next we’ll see how KIP-1150 could provide much of the same capabilities as WarpStream. KIP-1150 has continued to evolve since it was first proposed, undergoing two subsequent major revisions between the KIP itself and the mailing list discussion. This section describes the first version of the KIP, created in April 2025.

KIP-1150 revision 1 uses a leaderless architecture where any broker can serve Kafka producers and consumers of any topic partition. Batch Coordinators (replicated state machines like Group and Transaction Coordinators) handle coordination (object sequencing and metadata storage). Brokers accumulate Kafka batches and upload shared objects to S3 (known as Shared Log Segment Objects, or SLSOs). Metadata that maps these blocks of Kafka batches to SLSOs (known as Batch Coordinates) is then sent to a Batch Coordinator (BC) that sequences the batches (assigning offsets) to provide a global order of those batches per partition. The BC acts as sequencer and batch coordinate database for later lookups, allowing the read path to retrieve batches from SLSOs. The BCs will also apply idempotent and transactional producer logic (not yet finalized).

Fig KIP-1150-rev1-A. Leaderless brokers upload shared objects, then commit and sequence them via the Batch Coordinator.

The Batch Coordinator (BC) is the stateful component akin to WarpStream’s metadata service. Kafka has other coordinators such as the Group Coordinator (for the consumer group protocol), the Transaction Coordinator (for reliable 2PC) and the Share Coordinator (for queues, aka share groups).
The coordinator concept, for centralizing some kind of coordination, has a long history in Kafka. The BC is the source of truth about uploaded objects, Direct-to-S3 topic partitions, and committed batches. This is the component that contains the most complexity, with challenges around scaling, failovers and reliability, as well as logic for idempotency, transactions, object compaction coordination and so on. A Batch Coordinator has the following roles:

- Sequencing. Chooses the total ordering for writes, assigning offsets without gaps or duplicates.
- Metadata storage. Stores all metadata that maps partition offset ranges to S3 object byte ranges.
- Serving lookup requests. Serving requests for log offsets and for batch coordinates (S3 object metadata).
- Partition CRUD operations. Serving requests for atomic operations (creating partitions, deleting topics, records, etc.)
- Data expiration. Managing data expiry and soft deletion, and coordinating physical object deletion (performed by brokers).

The Group and Transaction Coordinators use internal Kafka topics to durably store their state and rely on the KRaft Controller for leader election (coordinators are highly available with failovers). This KIP does not specify whether the Batch Coordinator will be backed by a topic, or use some other option such as a Raft state machine (based on KRaft state machine code). It also proposes that it could be pluggable. For my part, I would prefer all future coordinators to be KRaft state machines, as the implementation is rock solid and can be used like a library to build arbitrary state machines inside Kafka brokers.

When a producer starts, it sends a Metadata request to learn which brokers are the leaders of each topic partition it cares about. The Metadata response contains a zone-local broker. The producer sends all Produce requests to this broker. The receiving broker accumulates batches of all partitions and uploads a shared log segment object (SLSO) to S3.
The broker commits the SLSO by sending the metadata of the SLSO (known as the batch coordinates) to the Batch Coordinator. The coordinates include the S3 object metadata and the byte ranges of each partition in the object. The Batch Coordinator assigns offsets to the written batches, persists the batch coordinates, and responds to the broker. The broker sends all associated acknowledgements to the producers. Brokers can parallelize uploading SLSOs to S3, but commit them serially to the Batch Coordinator.

When batches are first written, a broker does not assign offsets as would happen with a regular topic, as this is done after the data is written, when it is sequenced and indexed by the Batch Coordinator. In other words, the batches stored in S3 have no offsets stored within the payloads as is normally the case. On consumption, the broker must inject the offsets into the Kafka batches, using metadata from the Batch Coordinator:

1. The consumer sends a Fetch request to the broker.
2. The broker checks its cache and, on a miss, queries the Batch Coordinator for the relevant batch coordinates.
3. The broker downloads the data from object storage.
4. The broker injects the computed offsets and timestamps into the batches (the offsets being part of the batch coordinates).
5. The broker constructs and sends the Fetch response to the consumer.

Fig KIP-1150-rev1-B. The consume path of KIP-1150 (ignoring any caching logic here).

The KIP notes that broker roles could be supported in the future. I believe that KIP-1150 revision 1 starts becoming really powerful with roles akin to WarpStream. That way we can separate out direct-to-S3 topic serving traffic and object compaction work on proxy brokers, which become the elastic serving and background job layer. Batch Coordinators would remain hosted on standard brokers, which are already stateful.
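The commit/sequencing step described above can be sketched in miniature: brokers upload SLSOs first, then commit batch coordinates, and the coordinator assigns each partition's offsets in commit order. This is a toy illustration of the idea, not the KIP's actual API; all structure and field names are invented:

```python
from dataclasses import dataclass, field

@dataclass
class BatchCoordinate:
    topic_partition: str
    object_key: str        # key of the uploaded shared log segment object (SLSO)
    byte_start: int
    byte_end: int
    record_count: int
    base_offset: int = -1  # assigned by the coordinator at commit time

@dataclass
class BatchCoordinator:
    next_offset: dict = field(default_factory=dict)  # per-partition next offset
    log: list = field(default_factory=list)          # committed coordinates, in order

    def commit(self, coords: list[BatchCoordinate]) -> None:
        # Commits are serialized, so offset assignment is a simple per-partition
        # counter: a total order with no gaps or duplicates, even though the
        # batches were produced concurrently on different brokers.
        for c in coords:
            c.base_offset = self.next_offset.get(c.topic_partition, 0)
            self.next_offset[c.topic_partition] = c.base_offset + c.record_count
            self.log.append(c)

bc = BatchCoordinator()
# Two brokers commit SLSOs that both contain batches for the same partition.
bc.commit([BatchCoordinate("orders-0", "slso-17", 0, 4096, record_count=10)])
bc.commit([BatchCoordinate("orders-0", "slso-18", 512, 9000, record_count=5)])
print(bc.next_offset["orders-0"])  # → 15; the second batch starts at offset 10
```

This also shows why the offsets cannot live inside the S3 payloads: they only exist after sequencing, which is why the consume path has to inject them from the batch coordinates.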
With broker roles, we can see how Kafka could implement its own WarpStream-like architecture, which makes full use of disaggregated storage to enable better elasticity.

Fig KIP-1150-rev1-C. Broker roles would bring the stateless/stateful separation that would unlock elasticity in the high throughput workload of many Direct-to-S3 deployments.

Things like supporting idempotency, transactions, caching and object compaction are left to be decided later. While taxing to design, these things look doable within the basic framework of the design. But as I already mentioned in this post, this will be costly in effort to develop, and may also come with a long-term code maintenance overhead if complex parts such as transactions are maintained twice. It may also be possible to refactor rather than do wholesale rewrites.

Inkless is Aiven’s direct-to-S3 fork of Apache Kafka. The Inkless design shares the combined object upload part and metadata manipulation that all designs across the board are using. It is also leaderless for direct-to-S3 topics. Inkless is firmly in the revolutionary camp. If it were to implement broker roles, it would make this Kafka fork much closer to a WarpStream-like implementation (albeit with some issues concerning the coordination component, as we’ll see further down). While Inkless is described as a KIP-1150 implementation, we’ll see that it actually diverges significantly from KIP-1150, especially the later revisions (covered later).

Inkless eschewed the Batch Coordinator of KIP-1150 in favor of a Postgres instance, with coordination being executed through Table Valued Functions (TVFs) and row locking where needed.

Fig Inkless-A. The write path.

On the read side, each broker must discover what batches exist by querying Postgres (using a TVF again), which returns the next set of batch coordinates as well as the high watermark. Now that the broker knows where the next batches are located, it requests those batches via a read-through cache.
On a cache miss, it fetches the byte ranges from the relevant objects in S3.

Fig Inkless-B. The read path.

Inkless bears some resemblance to KIP-1150 revision 1, except the difficult coordination bits are delegated to Postgres. Postgres does all the sequencing and metadata storage, as well as coordination for compaction and file cleanup. For example, compaction is coordinated via Postgres, with a TVF that is periodically called by each broker, which finds a set of files that together exceed a size threshold and places them into a merge work order (tables file_merge_work_items and file_merge_work_item_files) that the broker claims. Once carried out, the original files are marked for deletion (which is another job that can be claimed by a broker).

Fig Inkless-C. Direct-to-S3 traffic uses a leaderless broker architecture with coordination owned by Postgres.

Inkless doesn’t implement transactions, and I don’t think Postgres could take on the role of transaction coordinator, as the coordinator does more than sequencing and storage. Inkless will likely have to implement some kind of coordinator for that. The Postgres data model is based on the following tables:
- logs. Stores the Log Start Offset and High Watermark of each topic partition, with the primary key of topic_id and partition.
- files. Lists all the objects that host the topic partition data.
- batches. Maps Kafka batches to byte ranges in files.
- producer_state. All the producer state needed for idempotency.
- Some other tables for housekeeping, and the merge work items.

Fig Inkless-D. Postgres data model.

The Commit File TVF, which sequences and stores the batch coordinates, works as follows. A broker opens a transaction and submits a table as an argument, containing the batch coordinates of the multiple topics uploaded in the combined file. 
The TVF logic creates a temporary table (logs_tmp) and fills it via a SELECT on the logs table with the FOR UPDATE clause, which obtains a row lock on each topic partition row in the logs table that matches the list of partitions being submitted. This ensures that other brokers competing to add batches to the same partition(s) queue up behind this transaction. This is a critical barrier that avoids inconsistency. These locks are held until the transaction commits or aborts. Next, inside a loop, partition by partition, the TVF:
- Updates the producer state.
- Updates the high watermark of the partition (a row in the logs table).
- Inserts the batch coordinates into the batches table (sequencing and storing them).
Finally, it commits the transaction.

Apache Kafka would not accept a Postgres dependency of course, and KIP-1150 has not proposed centralizing coordination in Postgres either. But the KIP has suggested that the Batch Coordinator be pluggable, which might leave it open for using Postgres as a backing implementation. As a former database performance specialist, the Postgres locking does concern me a bit. It blocks on the logs table rows scoped to the topic id and partition. An ORDER BY prevents deadlocks, but given the row locks are held until the transaction commits, I imagine that given enough contention, it could cause a convoy effect of blocking. This blocking is fair, that is to say, First Come First Serve (FCFS) for each individual row. For example, with 3 transactions: T1 locks rows 11-15; T2 wants to lock 6-11, but only manages 6-10 as it blocks on row 11; meanwhile T3 wants to lock 1-6, but only manages 1-5 as it blocks on 6. We now have a dependency chain where T1 blocks T2 and T2 blocks T3. Once T1 commits, the others get unblocked, but under sustained load, this kind of locking and blocking can quickly cascade, such that once contention starts, it rapidly expands. 
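To make the commit-file critical section concrete, here is a small simulation of its locking discipline, not the actual Inkless TVF: per-partition locks are acquired in one fixed global order (the analogue of SELECT … ORDER BY … FOR UPDATE) and held until "commit", so concurrent committers to overlapping partition sets serialize without deadlocking. All names here are invented.

```python
# Sketch (my own simulation, not the Inkless TVF) of per-partition row
# locking during a commit. Locks are taken in sorted order -- mirroring
# SELECT ... ORDER BY ... FOR UPDATE -- so two brokers committing
# overlapping partition sets queue up rather than deadlock.
import threading

row_locks = {}          # (topic, partition) -> threading.Lock ("logs" row lock)
high_watermark = {}     # (topic, partition) -> next offset
batches_table = []      # committed batch coordinates
state_guard = threading.Lock()  # protects the dicts themselves

def commit_file(object_key, batch_coords):
    """batch_coords: list of (topic, partition, record_count)."""
    partitions = sorted({(t, p) for t, p, _ in batch_coords})   # ORDER BY
    with state_guard:
        locks = [row_locks.setdefault(tp, threading.Lock()) for tp in partitions]
    for lock in locks:               # FOR UPDATE: block behind earlier committers
        lock.acquire()
    try:
        for topic, partition, count in batch_coords:
            hwm = high_watermark.get((topic, partition), 0)
            batches_table.append((object_key, topic, partition, hwm, hwm + count - 1))
            high_watermark[(topic, partition)] = hwm + count
    finally:
        for lock in reversed(locks): # COMMIT: release the row locks
            lock.release()

threads = [threading.Thread(target=commit_file,
                            args=(f"obj-{i}", [("t", 0, 2), ("t", 1, 1)]))
           for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

The simulation also shows why the convoy forms: every committer holds all of its row locks for the full duration of its transaction, so one slow commit stalls everyone queued behind any shared partition.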
This contention is sensitive to the number of concurrent transactions and the number of partitions per commit. A common pattern with this kind of locking is that up until a certain transaction throughput everything is fine, but at the first hint of contention, the whole thing slows to a crawl. Contention breeds more contention. I would therefore caution against the use of Postgres as a Batch Coordinator implementation.

The following is a very high-level look at Slack’s KIP-1176, in the interests of keeping this post from getting too detailed. There are three key points to this KIP’s design:
- Maintain leader-based topic partitions (producers continue to write to leaders), but replace the Kafka replication protocol with a per-broker S3-based write-ahead-log (WAL).
- Try to preserve existing partition replica code for idempotency and transactions.
- Reuse existing tiered storage for long-term S3 data management.

Fig Slack-KIP-1176-A. Leader-based architecture retained, replication replaced by an S3 WAL. Tiered storage manages long-term data.

The basic idea is to preserve the leader-based architecture of Kafka, with each leader replica continuing to write to an active local log segment file, which it rotates periodically. A per-broker write-ahead-log (WAL) replaces replication. A WAL Combiner component in the Kafka broker progressively (and aggressively) tiers portions of the local active log segment files (without closing them), combining them into multi-topic objects uploaded to S3. Once a Kafka batch has been written to the WAL, the broker can send an acknowledgment to its producer. This active log segment tiering does not change how log segments are rotated. Once an active log segment is rotated out (by closing it and creating a new active log segment file), it can be tiered by the existing tiered storage component for the long term.

Fig Slack-KIP-1176-B. 
Produce batches are written to the page cache as usual, but active log segment files are aggressively tiered to S3 (possibly Express One Zone) in combined log segment files.

The WAL acts as write-optimized S3 storage and the existing tiered storage uploads closed log segment files for long-term storage. Once all data of a given WAL object has been tiered, it can be deleted. The WAL only becomes necessary during topic partition leader failovers, where the new leader replica bootstraps itself from the WAL. Alternatively, each topic partition can have one or more followers which actively reconstruct local log segments from the WAL, providing a faster failover. The general principle is to keep as much of Kafka unchanged as possible, only changing from the Kafka replication protocol to an S3 per-broker WAL. The priority is to avoid the need for heavy rework or reimplementation of logic such as idempotency, transactions and share groups integration. But it gives up elasticity and the additional architectural benefits that come from building on disaggregated storage. Having said all of the above, there are a lot of missing or hacky details that currently detract from the evolutionary goal. There is a lot of hand-waving when it comes to correctness too. It is not clear that this KIP will be able to deliver a low-disruption evolutionary design that is also correct, highly available and durable. Discussion in the mailing list is ongoing. Luke Chen remarked: “the current availability story is weak… It’s not clear if the effort is still small once details on correctness, cost, cleanness are figured out.”, and I have to agree.

The second revision of KIP-1150 replaces the future object compaction logic by delegating long-term storage management to the existing tiered storage abstraction (like Slack’s KIP-1176). The idea is to:
- Remove Batch Coordinators from the read path. 
- Avoid separate object compaction logic by delegating long-term storage management to tiered storage (which already exists).
- Rebuild per-partition log segments from combined objects in order to: submit them for long-term tiering (this works as a form of object compaction too), and serve consumer fetch requests.

The entire write path becomes a three-stage process:
- Stage 1 – Produce path, synchronous. Uploads multi-topic WAL Segments to S3 and sequences the batches, acknowledging to producers once committed. This is unchanged, except SLSOs are now called WAL Segments.
- Stage 2 – Per-partition log segment file construction, asynchronous. Each broker is assigned a subset of topic partitions. The brokers download the WAL segment byte ranges that host these assigned partitions and append to on-disk per-partition log segment files.
- Stage 3 – Tiered storage, asynchronous. The tiered storage architecture tiers the locally cached topic partition log segment files as normal.

Stage 1 – The produce path

The produce path is the same, but SLSOs are now called WAL Segments.

Fig KIP-1150-rev2-A. Write path stage 1 (the synchronous part). Leaderless brokers upload multi-topic WAL Segments, then commit and sequence them via the Batch Coordinator.

Stage 2 – Local per-partition segment caching

The second stage is preparation for both:
- Stage 3, segment tiering (tiered storage).
- The read pathway for tailing consumers.

Fig KIP-1150-rev2-B. Stage 2 of the write path, where assigned brokers download WAL segments and append to local log segments.

Each broker is assigned a subset of topic partitions. Each broker polls the Batch Coordinator to learn of new WAL segments. Each WAL segment that hosts any of the broker’s assigned topic partitions will be downloaded (at least the byte range of its assigned partitions). Once the download completes, the broker will inject record offsets as determined by the Batch Coordinator, and append the finalized batch to a local (per topic partition) log segment on disk. 
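Stage 2 can be sketched as a filter over each WAL segment's batch coordinates: keep only the batches for the broker's assigned partitions, fetch just those byte ranges, inject the coordinator-assigned offsets, and append to a local per-partition segment. This is a hypothetical sketch; `stage2_append` and the coordinate fields are my names, not the KIP's.

```python
# Hypothetical sketch of stage 2 (invented names, not the KIP's API).
# An assigned broker filters a WAL segment's batch coordinates down to
# its own partitions, takes just those byte ranges, injects the
# coordinator-assigned offsets, and appends to local per-partition segments.

def stage2_append(assigned, wal_segment, local_segments):
    """assigned: set of (topic, partition).
    wal_segment: dict with 'object' and 'coords', where each coordinate has
    topic, partition, base_offset, byte_range and (here) an inline payload."""
    for coord in wal_segment["coords"]:
        tp = (coord["topic"], coord["partition"])
        if tp not in assigned:
            continue                        # not one of ours: skip this byte range
        batch = {
            "base_offset": coord["base_offset"],  # offset injection happens here
            "payload": coord["payload"],          # bytes fetched via a range read
        }
        local_segments.setdefault(tp, []).append(batch)

local = {}
segment = {"object": "wal-007", "coords": [
    {"topic": "orders", "partition": 0, "base_offset": 100,
     "byte_range": (0, 499), "payload": b"orders-batch"},
    {"topic": "users", "partition": 3, "base_offset": 40,
     "byte_range": (500, 899), "payload": b"users-batch"},
]}
stage2_append({("orders", 0)}, segment, local)
```

After this step the local segment holds finalized, offset-bearing batches, which is exactly what both tiered storage (stage 3) and tailing consumers need.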
At this point, a log segment file looks like a classic topic partition segment file. The difference is that these files are not a source of durability, only a source for tiering and consumption. WAL segments remain in object storage until all batches of a segment have been tiered via tiered storage. Then WAL segments can be deleted.

Stage 3 – Tiered storage

Tiered storage continues to work as it does today (KIP-405), based on local log segments. It hopefully knows nothing of the direct-to-S3 components and logic. Tiered segment metadata is stored in KRaft, which also allows for WAL segment deletion to be handled outside the scope of tiered storage.

Fig KIP-1150-rev2-C. Tiered storage works as-is, based on local log segment files.

Data of S3 topics is consumed from either:
- The local segments on disk, populated from stage 2.
- Tiered log segments (the traditional tiered storage read pathway).

End-to-end latency of any given batch is therefore based on:
- Produce batch added to buffer.
- WAL Segment containing that batch written to S3.
- Batch coordinates submitted to the Batch Coordinator for sequencing.
- Producer request acknowledged.
Then, tail, untiered (fast path for tailing consumers):
- Replica downloads the WAL Segment slice.
- Replica appends the batch to a local (per topic partition) log segment.
- Replica serves a consumer fetch request from the local log segment.
Or tiered (slow path for lagging consumers):
- Remote Log Manager downloads the tiered log segment.
- Replica serves a consumer fetch request from the downloaded log segment.

Avoiding excessive reads to S3 will be important in stage 2 (when a broker is downloading WAL segment files for its assigned topic partitions). This KIP should standardize how topic partitions are laid out inside every WAL segment and perform partition-broker assignments based on that same order:
- Pick a single global topic partition order (based on a permutation of a topic partition id). 
- Partition that order into contiguous slices, giving one slice per broker as its assignment (a broker may get multiple topic partitions, but they must be adjacent in the global order).
- Lay out every WAL segment in that same global order.

That way, each broker’s partition assignment will occupy one contiguous block per WAL Segment, so each reading broker needs only one byte-range read per WAL segment object (possibly empty if none of its partitions appear in that object). This reduces the number of range reads per broker when reconstructing local log segments. By integrating with tiered storage and reconstructing log segments, revision 2 moves to a more diskful design, where disks form a step in the write path to long-term storage. It is also more stateful and sticky than revision 1, given that each topic partition is assigned a specific broker for log segment reconstruction, tiering and consumer serving. Revision 2 remains leaderless for producers, but leaderful for consumers. Therefore, to avoid cross-AZ traffic for consumers, it will rely on KIP-881: Rack-aware Partition Assignment for Kafka Consumers to ensure zone-local consumer assignment. This makes revision 2 a hybrid. By delegating responsibility to tiered storage, more of the direct-to-S3 workload must be handled by stateful brokers. It is less able to benefit from the elasticity of disaggregated storage, but it reuses more of existing Kafka.

The third revision, which at the time of writing is a loose proposal in the mailing list, ditches the Batch Coordinator (BC) altogether. Most of the complexity of KIP-1150 centers around Batch Coordinator efficiency, failovers and scaling, as well as idempotency, transactions and share groups logic. Revision 3 proposes to replace the BCs with “classic” topic partitions. The classic topic partition leader replicas will do the work of sequencing and storing the batch coordinates of their own data. 
The data itself would live in SLSOs (rev 1)/WAL Segments (rev 2) and ultimately as tiered log segments (tiered storage). To make this clear: if, as a user, you create the topic Orders with one partition, then an actual topic Orders will be created with one partition. However, this will only be used for the sequencing of the Orders data and the storage of its metadata. The benefit of this approach is that all the idempotency and transaction logic can be reused in these “classic-ish” topic partitions. There will be code changes, but fewer than with Batch Coordinators. All the existing tooling for moving partitions around, failovers, etc. works the same way as with classic topics. So replication continues to exist, but only for the metadata.

Fig KIP-1150-rev3-A. Replicated partitions continue to exist, acting as per-partition sequencers and metadata stores (replacing Batch Coordinators).

One wrinkle this adds is that there is no central place to manage the clean-up of WAL Segments. Therefore a WAL File Manager component would have responsibility for background cleanup of those WAL segment files. It would periodically check the status of tiering to discover when a WAL Segment can be deleted.

Fig KIP-1150-rev3-B. The WAL File Manager is responsible for WAL Segment clean-up.

The motivation behind this change to remove Batch Coordinators is to simplify implementation by reusing existing Kafka code paths (for idempotence, transactions, etc.). However, it also opens up a whole new set of challenges which must be discussed and debated, and it is not clear this third revision solves the complexity problem. Revision 3 now depends on the existing classic topics, with leader-follower replication. It moves a little further again towards the evolutionary path. It is curious to see Aiven productionizing its Kafka fork “Inkless”, which falls under the “revolutionary” umbrella, while pushing towards a more evolutionary stateful/sticky design in these later revisions of KIP-1150. 
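The WAL File Manager's periodic check described in the revision 3 section (delete a WAL Segment only once everything it hosts has been tiered) reduces to a simple set comparison. The sketch below is entirely hypothetical; the KIP specifies no API, so every name here is invented.

```python
# Hypothetical sketch of a WAL File Manager cleanup pass (the KIP gives no
# API; all names are invented). A WAL segment is safe to delete only once
# every batch it hosts has been tiered via tiered storage.

def cleanup_pass(wal_segments, tiered_batches):
    """wal_segments: dict of object_key -> set of batch ids the object hosts.
    tiered_batches: set of batch ids already tiered.
    Returns the object keys that were deleted."""
    deletable = [key for key, batches in wal_segments.items()
                 if batches <= tiered_batches]   # every hosted batch is tiered
    for key in deletable:
        del wal_segments[key]                    # a real impl would issue S3 deletes
    return deletable

segments = {"wal-1": {"b1", "b2"}, "wal-2": {"b3", "b4"}}
deleted = cleanup_pass(segments, {"b1", "b2", "b3"})
```

The subtlety the real component must handle is the same one the set comparison hides: tiering status lives in KRaft per partition, so the manager has to join per-partition tiering progress against per-object contents before it can declare an object deletable.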
Apache Kafka is approaching a decision point with long-term implications for its architecture and identity. The ongoing discussions around KIP-1150 revisions 1-3 and KIP-1176 are nominally framed around replication cost reduction, but the underlying issue is broader: how should Kafka evolve in a world increasingly shaped by disaggregated storage and elastic compute? At its core, the choice comes down to two paths. The evolutionary path seeks to fit S3 topics into Kafka’s existing leader-follower framework, reusing current abstractions such as tiered storage to minimize disruption to the codebase. The revolutionary path instead prioritizes the benefits of building directly on object storage. By delegating to shared object storage, Kafka can support an S3 topic serving layer which is stateless, elastic, and disposable, with scaling coming from adding and removing stateless workers rather than rebalancing stateful nodes, all while maintaining Kafka’s existing workloads with classic leader-follower topics. While the intentions and goals of the KIPs clearly fall on a continuum from revolutionary to evolutionary, the reality in the mailing list discussions makes everything much less clear. The devil is in the details, and as the discussion advances, the arguments of “simplicity through reuse” start to strain. The reuse strategy is a retrofitting strategy which, ironically, could actually make the codebase harder to maintain in the long term. Kafka’s existing model is deeply rooted in leader-follower replication, with much of its core logic built around that assumption. Retrofitting direct-to-S3 into this model forces some “unnatural” design choices, choices that would not be made otherwise (if designing a cloud-native solution). My own view aligns with the more revolutionary path in the form of KIP-1150 revision 1. It doesn’t simply reduce cross-AZ costs, but fully embraces the architectural benefits of building on object storage. 
With additional broker roles and groups, Kafka could ultimately achieve a similar elasticity to WarpStream (and Confluent Freight Clusters). The approach demands more upfront engineering effort and may increase long-term maintenance complexity, but it avoids tight coupling to the existing leader-follower architecture. Much depends on what kind of refactoring is possible to avoid the duplication of idempotency, transactions and share group logic. I believe the benefits justify the upfront cost and will help keep Kafka relevant in the decade ahead. In theory, both directions are defensible; ultimately it comes down to the specifics of each KIP. The details really matter. Goals define direction, but it’s the engineering details that determine the system’s actual properties. We know the revolutionary path involves big changes, but the evolutionary path comes with equally large challenges, where retrofitting may ultimately be more costly while simultaneously delivering less. The committers who maintain Kafka are cautious about large refactorings and code duplication, but are equally wary of hacks and complex code serving two needs. We need to let the discussions play out. My aim with this post has been to take a step back from "how to implement direct-to-S3 topics in Kafka", and think more about what we want Kafka to be. The KIPs represent the how, the engineering choices. Framed that way, I believe it is easier for the wider community to understand the KIPs, the stakes and the eventual decision of the committers, whichever way they ultimately decide to go.

Revolutionary: Choose a direct-to-S3 topic design that maximizes the benefits of an object-storage architecture, with greater elasticity and lower operational complexity. However, in doing so, we may increase the implementation cost and possibly the long-term code maintenance burden too, by maintaining two very different topic models in the same project (leader-based replication and direct-to-S3). 
Evolutionary: Shoot for an evolutionary design that makes use of existing components to reduce the need for large refactoring or duplication of logic. However, by coupling to the existing architecture, we forfeit the extra benefits of object storage, focusing primarily on networking cost savings (in AWS and GCP). Through this coupling, we also run the risk of achieving the opposite: harder-to-maintain code, by bending and contorting a second workload into an architecture optimized for something else.

The Common Parts. Some approaches are shared across multiple implementations and proposals.

Avoiding the small file problem. Most designs are leaderless for producer traffic, allowing any server to receive writes to any topic. To avoid uploading a multitude of tiny files, servers accumulate batches in a buffer until ready for upload. Before upload, the buffer is sorted by topic id and partition, to make compaction and some reads more efficient by ensuring that data of the same topic and same partition are in contiguous byte ranges.

Pricing. The pricing of many (but not all) cloud object storage services penalizes excessive requests, so it can be cheaper to roll up whatever data has been received in the last X milliseconds and upload it with a single request.

Avoiding cross-AZ traffic:
- Leaderless: the same way that producer alignment works, via metadata manipulation, or using KIP-392 (fetch from follower), which can be used in a leaderless context.
- Leader-based: zone-aware consumer group assignment as detailed in KIP-881: Rack-aware Partition Assignment for Kafka Consumers. The idea is to use consumer-to-partition assignment to ensure consumers are only assigned zone-local partitions (where the partition leader is located). 
- KIP-392 (fetch-from-follower), which is effective for designs that have followers (which isn’t always the case).

Revolutionary designs covered: WarpStream (as a reference, a kind of yardstick to compare against), KIP-1150 revision 1, and Aiven Inkless (a Kafka fork). The WarpStream-style architecture separates two layers:
- Serving layer: leaderless, stateless and diskless agents that handle Kafka clients, as well as compaction/cleaning work.
- Coordination layer: a central metadata store for sequencing, metadata storage and housekeeping coordination. Its duties include:
  - Sequencing. Chooses the total ordering for writes, assigning offsets without gaps or duplicates.
  - Metadata storage. Stores all metadata that maps partition offset ranges to S3 object byte ranges.
  - Serving lookup requests. Serving requests for log offsets and for batch coordinates (S3 object metadata).
  - Partition CRUD operations. Serving requests for atomic operations (creating partitions, deleting topics, records, etc.).
  - Data expiration. Managing data expiry and soft deletion, and coordinating physical object deletion (performed by brokers).

When a producer starts, it sends a Metadata request to learn which brokers are the leaders of each topic partition it cares about. The Metadata response contains a zone-local broker. The producer sends all Produce requests to this broker. The receiving broker accumulates batches of all partitions and uploads a shared log segment object (SLSO) to S3. 
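The shared buffering trick that avoids the small-file problem (accumulate, sort by topic id and partition, upload one object) can be sketched as follows. `build_object` and the tuple layout are my own invention for illustration.

```python
# Sketch of the shared-object upload buffer (invented names). Batches from
# many partitions accumulate, are sorted by (topic_id, partition) so each
# partition occupies one contiguous byte range, then flushed as one object.

def build_object(buffered_batches):
    """buffered_batches: list of (topic_id, partition, payload_bytes).
    Returns (object_bytes, coordinates), where coordinates maps each batch
    to its inclusive byte range within the object."""
    ordered = sorted(buffered_batches, key=lambda b: (b[0], b[1]))
    blob, coords, pos = bytearray(), [], 0
    for topic_id, partition, payload in ordered:
        blob.extend(payload)
        coords.append((topic_id, partition, pos, pos + len(payload) - 1))
        pos += len(payload)
    return bytes(blob), coords

# Batches arrive interleaved across topics/partitions...
obj, coords = build_object([("t2", 0, b"xxx"), ("t1", 1, b"yy"), ("t1", 0, b"z")])
# ...but the object lays them out contiguously per (topic, partition).
```

The returned coordinates are exactly what gets handed to the coordination layer, and the contiguity is what later makes a single byte-range read per partition (or per broker slice) possible.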


A Fire is Not an Emergency for the Fire Department

Yesterday (Oct 20, 2025), AWS had a major incident . A lot of companies had a really bad day. I’ve been chatting with friends at impacted companies and a few had similar stories. Their teams were calm while their execs were not. There was a lot of “DO SOMETHING!” or “WHY AREN’T YOU FREAKING OUT?!” Incidents happen, but panic doesn’t do anything to solve them. Fires aren’t emergencies for the fire department and incidents shouldn’t be for engineering responders either. In this case, very little was likely in their control. For all but the most trivial of systems, you’re not going to migrate off of AWS in the middle of an incident and expect a good outcome. But the principle stands: the worst thing you can do is freak out, especially when the incident is self-inflicted . Calm leadership is going to be better than panic every time. Imagine you lit your stove on fire and called the fire department to put it out. Which of these two responses would you prefer after the fire chief hops out of the truck: I’m sure the first response would have you questioning who hired this person to be the fire chief and would not give you a lot of confidence that your whole house wasn’t going to burn down. Similarly, the second response might let you be able to exhale and leave things to the professionals. Like fires, not all incidents are created equally. A pan on fire on the stove is much more routine than a large wildfire threatening a major city. I would argue the need for calm leadership gets MORE important the bigger the impact of the incident. A small incident might not attract much attention, while a big incident certainly will. In yesterday’s AWS incident it would most definitely not be reassuring to see whoever was in charge of the response making an announcement about how bad it was and how they were on the verge of a panic attack. Same goes for wildfires, and same goes for your incidents. 
When you’re in a leadership position during an incident, it serves no useful purpose to yell and scream or jump up and down or tear your hair out. Your responders want normal operations to be restored as much as you do, especially if it’s off-hours and they’ve had their free time or sleep interrupted. Shouting isn’t going to make them solve the problem any faster.

Now and then someone will react negatively to the Incident Commander or the engineering management behaving in a calm and collected way during an incident. Usually it’s someone who is in the position to get yelled at by a customer, like an account rep or a non-technical executive. They’re under a lot of pressure because incidents can cost your customers money, can result in churn or lost sales opportunities, and all of those things can affect the bottom line or career prospects of people who have no control over the actual technical issue. So they’re stressed about a lot of factors that are not the incident per se, but might be downstream consequences of it, much in the same way you would be after setting your kitchen on fire. Sometimes this comes out as anger or frustration that you, the technology person responsible for this, are not similarly stressed. This is understandable.

Keep in mind too that in this moment, they are utterly dependent on you and the tech team. They don’t know how to fix whatever’s broken, but they’re still going to hear about it from customers and prospects who might not be nice about it. This can cause a feeling of powerlessness that results in expressions of frustration. This is not a reason to change your demeanour.

The best thing you can do is keep the updates flowing, even if there’s no new news. People find reassurance in knowing the team is still working the problem. Keep the stakeholder updates jargon-free and high-level. Show empathy that you understand the customer impact and the potential consequences of that, but reassure them that the focus right now is restoring normal operations and, if appropriate, you’ll be available to talk through what happened after the dust settles.

A negative reaction to the calm approach is usually born out of a fear that ‘nothing is happening’. The truth is that a big part of incident response is often waiting. Without the technical context to understand what’s really going on, some people just want to see something happening to reassure themselves. You can mitigate this with updates:

- Here’s what we’ve done.
- Here’s what we’re doing (or waiting on the results of).
- Here’s what we’ll likely do next.

Use your tools to broadcast this as often as you can and you’ll tamp down on a lot of that anxiety.

Big cloud provider incidents are fairly rare. Self-inflicted incidents are not. The responders trying to fix a self-inflicted problem are very likely to be the ones who took an action that caused or contributed to the outage. Even in a healthy, blameless culture this is going to cause some stress and shame, and in a blameful, scapegoating culture it’s going to be incredibly difficult. Having their manager, manager’s manager, or department head freaking out is only going to compound the problem. Highly stressed people can’t think straight and make mistakes. Set a tone of calm, methodical response and you’ll get out from under the problem a lot faster.

This might be culturally hard to achieve if you have very hands-on execs who want to be very involved, even when their involvement is counterproductive (hint: they don’t see it that way even if you do). But to the extent you can, have one communication channel for the responders and a separate one for updates for stakeholders. Mixing the two just distracts the responders and lengthens the time-to-resolution. The key to this being successful is not neglecting the updates. I recommend using a timer and giving an update every 10 minutes whether there’s news or not. People will appreciate it and your responders can focus on responding.

I don’t want this to be a “how to run a good incident” post. There are plenty of comprehensive materials out there that do a better job than I would here. I do want to focus on how to be a good leader in the context of incidents. Like many things, it comes down to culture and practice. Firefighters train to stay calm when responding to fires. They’re methodical and follow well-known good practices. It’s the difference between [Screaming] “Oh dear! A fire! Somebody get the hose! Agggh!” and “We see the problem, we know how to deal with it, we’ll handle it and give you an update as soon as we can.”

To make this as easy as possible on yourself:

- Set a good example when you’re in the incident commander seat (whether in an official or ad-hoc capacity).
- Run practice incidents where the stakes are low.
- Build up runbooks and checklists to make response activities automatic.
- Define your response process and roles.
- Define your communication channels so everyone knows where to find updates.

If you can, work “fires aren’t emergencies for the fire department” into your company lexicon and remind people that in the context of incidents, the engineering team are the fire department.

"Wildfire" by USFWS/Southeast is marked with Public Domain Mark 1.0.


This Is How Much Anthropic and Cursor Spend On Amazon Web Services

So, I originally planned for this to be on my premium newsletter, but decided it was better to publish on my free one so that you could all enjoy it. If you liked it, please consider subscribing to support my work. Here’s $10 off the first year of annual. I’ve also recorded an episode about this on my podcast Better Offline (RSS feed, Apple, Spotify, iHeartRadio); it’s a little different, but both handle the same information, just subscribe and it'll pop up.

Over the last two years I have written again and again about the ruinous costs of running generative AI services, and today I’m coming to you with real proof. Based on discussions with sources with direct knowledge of their AWS billing, I am able to disclose the amounts that AI firms are spending, specifically Anthropic and AI coding company Cursor, its largest customer. I can exclusively reveal today Anthropic’s spending on Amazon Web Services for the entirety of 2024, and for every month in 2025 up until September, and that Anthropic’s spend on compute far exceeds what was previously reported.

Furthermore, I can confirm that through September, Anthropic has spent more than 100% of its estimated revenue (based on reporting in the last year) on Amazon Web Services, spending $2.66 billion on compute against an estimated $2.55 billion in revenue. Additionally, Cursor’s Amazon Web Services bills more than doubled from $6.2 million in May 2025 to $12.6 million in June 2025, exacerbating a cash crunch that began when Anthropic introduced Priority Service Tiers, an aggressive rent-seeking measure that kicked off what I call the Subprime AI Crisis, where model providers begin jacking up the prices on their previously subsidized rates. Although Cursor obtains the majority of its compute from Anthropic — with AWS contributing a relatively small amount, and likely also taking care of other parts of its business — the data I’ve seen reveals an overall direction of travel, where the costs of compute only keep on going up.
Let’s get to it. In February of this year, The Information reported that Anthropic burned $5.6 billion in 2024, and made somewhere between $400 million and $600 million in revenue. While I don’t know about prepayment for services, I can confirm from a source with direct knowledge of billing that Anthropic spent $1.35 billion on Amazon Web Services in 2024, and has already spent $2.66 billion on Amazon Web Services through the end of September 2025. Assuming that Anthropic made $600 million in revenue, this means that Anthropic spent $6.2 billion in 2024, leaving $4.85 billion in costs unaccounted for.

The Information’s piece also brings up another point. Before I go any further, I want to be clear that The Information’s reporting is sound, and I trust that their source (I have no idea who they are or what information was provided) was operating in good faith with good data. However, Anthropic is telling people it spent $1.5 billion on just training when it has an Amazon Web Services bill of $1.35 billion, which heavily suggests that its actual compute costs are significantly higher than we thought, because, to quote SemiAnalysis, “a large share of Anthropic’s spending is going to Google Cloud.”

I am guessing, because I do not know, but with $4.85 billion of other expenses to account for, it’s reasonable to believe Anthropic spent an amount similar to its AWS spend on Google Cloud. I do not have any information to confirm this, but given the discrepancies mentioned above, this is an explanation that makes sense. I’ll also add that there is some sort of undisclosed cut that Amazon gets of Anthropic’s revenue, though it’s unclear how much. According to The Information, “Anthropic previously told some investors it paid a substantially higher percentage to Amazon [than OpenAI’s 20% revenue share with Microsoft] when companies purchase Anthropic models through Amazon.” I cannot confirm whether a similar revenue share agreement exists between Anthropic and Google.
This also makes me wonder exactly where Anthropic’s money is going. Anthropic has, based on what I can find, raised $32 billion in the last two years. It started out 2023 with a $4 billion investment from Amazon in September 2023 (bringing the total to $37.5 billion), where Amazon was named its “primary cloud provider” — nearly eight months after Anthropic announced Google was its “cloud provider” — which Google responded to a month later by investing another $2 billion on October 27 2023, “involving a $500 million upfront investment and an additional $1.5 billion to be invested over time,” bringing its total funding from 2023 to $6 billion.

In 2024, it would raise several more rounds — one in January for $750 million, another in March for $884.1 million, another in May for $452.3 million, and another $4 billion from Amazon in November 2024, which also saw it name AWS as Anthropic’s “primary cloud and training partner” — bringing its 2024 funding total to $6 billion. In 2025 so far, it’s raised a $1 billion round from Google, a $3.5 billion venture round in March, opened a $2.5 billion credit facility in May, and completed a $13 billion venture round in September, valuing the company at $183 billion. This brings its total 2025 funding to $20 billion.

While I do not have Anthropic’s 2023 numbers, its spend on AWS in 2024 — around $1.35 billion — leaves (as I’ve mentioned) $4.85 billion in costs that are unaccounted for. The Information reports that costs for Anthropic’s 521 research and development staff reached $160 million in 2024, leaving 394 other employees unaccounted for (for 915 employees total), and also adding that Anthropic expects its headcount to increase to 1,900 people by the end of 2025.
The Information also adds that Anthropic “expects to stop burning cash in 2027.” This leaves two unanswered questions. An optimist might argue that Anthropic is just growing its pile of cash so it’s got a warchest to burn through in the future, but I have my doubts. In a memo revealed by WIRED, Anthropic CEO Dario Amodei stated that “if [Anthropic wanted] to stay on the frontier, [it would] gain a very large benefit from having access to this capital,” with “this capital” referring to money from the Middle East.

Anthropic and Amodei’s sudden willingness to take large swaths of capital from the Gulf States suggests that it’s at least a little desperate for capital, especially given Anthropic has, according to Bloomberg, “recently held early funding talks with Abu Dhabi-based investment firm MGX” a month after raising $13 billion. In my opinion — and this is just my gut instinct — I believe that it is either significantly more expensive to run Anthropic than we know, or Anthropic’s leaked (and stated) revenue numbers are worse than we believe. I do not know one way or another, and will only report what I know.

So, I’m going to do this a little differently than you’d expect, in that I’m going to lay out how much these companies spent, and draw throughlines from that spend to reported revenue numbers and product announcements or events that may have caused compute costs to increase. I’ve only got Cursor’s numbers from January through September 2025, but I have Anthropic’s AWS spend for both the entirety of 2024 and through September 2025.

“Annualized revenue” is one of the most abused terms in the world of software, but in this case, I am sticking to the idea that it means “month times 12.” So, if a company made $10m in January, you would say that its annualized revenue is $120m.
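Since every revenue figure below is worked back from an annualized number, here is that arithmetic as a quick sketch (the figures are the ones used in this piece; the helper names are mine, purely for illustration):

```python
def annualized(monthly_rev: float) -> float:
    """Annualized revenue as used in this piece: the month's revenue times 12."""
    return monthly_rev * 12

def monthly(annualized_rev: float) -> float:
    """Work backward from a reported annualized figure to a single month."""
    return annualized_rev / 12

# The example from the text: $10m in January -> $120m annualized.
assert annualized(10) == 120

# Anthropic 2024: ~$1.359B AWS spend against a generous $600M revenue.
aws_share = 1_359 / 600  # both in $M
print(f"2024 AWS spend as a share of revenue: ~{aws_share:.0%}")
```

The same division is what turns, say, a reported $2 billion annualized figure into the $166 million monthly number used below.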
Obviously, there are a lot of (when you think about it, really obvious) problems with this kind of reporting — and thus, you only ever see it when it comes to pre-IPO firms — but that’s beside the point. I give you this explanation because, when contrasting Anthropic’s AWS spend with its revenues, I’ve had to work back from whatever annualized revenues were reported for that month.

Anthropic’s 2024 revenues are a little bit of a mystery, but, as mentioned above, The Information says it might be between $400 million and $600 million. Here’s its monthly AWS spend. I’m gonna be nice here and say that Anthropic made $600 million in 2024 — the higher end of The Information’s reporting — meaning that it spent around 226% of its revenue ($1.359 billion) on Amazon Web Services. [Editor's note: this copy originally had incorrect maths on the %. Fixed now.]

Thanks to my own analysis and reporting from outlets like The Information and Reuters, we have a pretty good idea of Anthropic’s revenues for much of the year. That said, July, August, and September get a little weirder, because we’re relying on “almosts” and “approachings,” as I’ll explain as we go. I’m also gonna do the analysis on a month-by-month basis, because it’s necessary to evaluate these numbers in context.

In January, Anthropic’s reported revenue was somewhere from $875 million to $1 billion annualized, meaning either $72.91 million or $83 million for the month. In February, as reported by The Information, Anthropic hit $1.4 billion annualized revenue, or around $116 million each month. In March, as reported by Reuters, Anthropic hit $2 billion in annualized revenue, or $166 million in revenue. Because February is a short month, and the launch took place on February 24 2025, I’m considering the launches of Claude 3.7 Sonnet and Claude Code’s research preview to be a cost burden in the month of March. And man, what a burden!
Costs increased by $59.1 million, primarily across compute categories, but with a large ($2 million since January) increase in monthly costs for S3 storage.

I estimate, based on a 22.4% compound growth rate, that Anthropic hit around $2.44 billion in annualized revenue in April, or $204 million in revenue. Interestingly, this was the month where Anthropic launched its $100- and $200-a-month “Max” plans, and it doesn’t seem to have dramatically increased its costs. Then again, Max is also the gateway to things like Claude Code, which I’ll get to shortly.

In May, as reported by CNBC, Anthropic hit $3 billion in annualized revenue, or $250 million in monthly average revenue. This was a big month for Anthropic, with two huge launches on May 22 2025 — its new, “more powerful” models Claude Sonnet 4 and Opus 4, as well as the general availability of its AI coding environment Claude Code. Eight days later, on May 30 2025, a page on Anthropic's API documentation appeared for the first time: “Service Tiers.” Accessing the priority tier requires you to make an up-front commitment to Anthropic, and said commitment is based on a number of months (1, 3, 6 or 12) and the number of input and output tokens you estimate you will use each minute.

As I’ll get into in my June analysis, Anthropic’s Service Tiers exist specifically for it to “guarantee” your company won’t face rate limits or any other service interruptions, requiring a minimum spend, minimum token throughput, and for you to pay higher rates when writing to the cache — which is, as I’ll explain, a big part of running an AI coding product like Cursor. Now, the jump in costs — $65.1 million or so between April and May — likely comes as a result of the final training for Sonnet and Opus 4, as well as, I imagine, some sort of testing to make sure Claude Code was ready to go.

In June, as reported by The Information, Anthropic hit $4 billion in annualized revenue, or $333 million.
Anthropic’s revenue spiked by $83 million this month, and its costs rose by $34.7 million.

I have, for a while, talked about the Subprime AI Crisis, where big tech and companies like Anthropic, after offering subsidized pricing to entice in customers, raise the rates on their customers to start covering more of their costs, leading to a cascade where businesses are forced to raise their prices to handle their new, exploding costs. And I was god damn right. Or, at least, it sure looks like I am. I’m hedging, forgive me. I cannot say for certain, but I see a pattern.

It’s likely the June 2025 spike in revenue came from the introduction of service tiers, which specifically target prompt caching, increasing the amount of tokens you’re charged for as an enterprise customer based on the term of the contract and your forecast usage. Per my reporting in July: Cursor, as Anthropic’s largest client (the second largest being GitHub Copilot), represents a material part of its revenue, and its surging popularity meant it was sending more and more revenue Anthropic’s way.

Anysphere, the company that develops Cursor, hit $500 million annualized revenue ($41.6 million a month) by the end of May, which Anthropic chose to celebrate by increasing its costs. On June 16 2025, Cursor launched a $200-a-month “Ultra” plan, as well as dramatic changes to its $20-a-month Pro pricing that, instead of offering 500 “fast” responses using models from Anthropic and OpenAI, now effectively provided you with “at least” whatever you paid a month (so $20-a-month got at least $20 of credit), massively increasing the costs for users, with one calling the changes a “rug pull” after spending $71 in a single day. As I’ll get to later in the piece, Cursor’s costs exploded from $6.19 million in May 2025 to $12.67 million in June 2025, and I believe this is a direct result of Anthropic’s sudden and aggressive cost increases.
Similarly, Replit, another AI coding startup, moved to “Effort-Based Pricing” on June 18 2025. I have not got any information around its AWS spend. I’ll get into this a bit later, but I find this whole situation disgusting.

In July, as reported by Bloomberg, Anthropic hit $5 billion in annualized revenue, or $416 million. While July wasn’t a huge month for announcements, it was allegedly the month that Claude Code was generating “nearly $400 million in annualized revenue,” or $33.3 million (according to The Information, who says Anthropic was “approaching” $5 billion in annualized revenue - which likely means LESS than that - but I’m going to go with the full $5 billion annualized for the sake of fairness). There’s roughly an $83 million bump in Anthropic’s revenue between June and July 2025, and I think Claude Code and its new rates are a big part of it. What’s fascinating is that cloud costs didn’t increase too much — by only $1.8 million, to be specific.

In August, according to Anthropic, its run-rate “reached over $5 billion,” or around $416 million for the month. I am not giving it anything more than $5 billion, especially considering that in July Bloomberg’s reporting said “about $5 billion.” Costs grew by $60.5 million this month, potentially due to the launch of Claude Opus 4.1, Anthropic’s more aggressively expensive model, though revenues do not appear to have grown much along the way. Yet what’s very interesting is that Anthropic — starting August 28 — launched weekly rate limits on its Claude Pro and Max plans. I wonder why? Oh fuck! Look at that massive cost explosion!

Anyway, according to Reuters, Anthropic’s run rate is “approaching $7 billion” in October, and for the sake of fairness, I am going to just say it has $7 billion annualized, though I believe this number to be lower. “Approaching” can mean a lot of different things — $6.1 billion, $6.5 billion — and because I already anticipate a lot of accusations of “FUD,” I’m going to err on the side of generosity.
If we instead assume a $6.5 billion annualized rate, that would make this month’s revenue $541.6 million, or 95.8% of its AWS spend.

Nevertheless, Anthropic’s costs exploded in the space of a month by $135.2 million (35%) - likely due to the fact that users, as I reported in mid-July, were costing it thousands or tens of thousands of dollars in compute, a problem it still faces to this day, with VibeRank showing a user currently spending $51,291 in a calendar month on a $200-a-month subscription. If there were other costs, they likely had something to do with the training runs for the launches of Sonnet 4.5 on September 29 2025 and Haiku 4.5 in October 2025.

While these costs only speak to one part of its cloud stack — Anthropic has an unknowable amount of cloud spend on Google Cloud, and the data I have only covers AWS — it is simply remarkable how much this company spends on AWS, and how rapidly its costs seem to escalate as it grows. Though things improved slightly over time — in that Anthropic is no longer burning over 200% of its revenue on AWS alone — these costs have still dramatically escalated, and done so in an aggressive and arbitrary manner.

So, I wanted to visualize this part of the story, because I think it’s important to see the various different scenarios. THE NUMBERS I AM USING ARE ESTIMATES CALCULATED BASED ON 25%, 50% AND 100% OF THE AMOUNTS THAT ANTHROPIC HAS SPENT ON AMAZON WEB SERVICES THROUGH SEPTEMBER. I apologize for all the noise, I just want it to be crystal clear what you see next.

As you can see, all it takes is for Anthropic to spend (I am estimating) around 25% of its Amazon Web Services bills on Google Cloud (for a total of around $3.33 billion in compute costs through the end of September) to savage any and all revenue ($2.55 billion) it’s making.
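The scenario arithmetic here is simple addition: the unknown Google Cloud bill is modeled as a fraction of the known AWS bill. A small sketch using the through-September figures from this piece (variable names are mine):

```python
AWS_SPEND_M = 2_660  # $M: Anthropic's AWS spend through September 2025
REVENUE_M = 2_550    # $M: estimated revenue over the same period

# Google Cloud spend is unknown, so model it at 25%, 50%, and 100%
# of the AWS bill and compare total compute cost to revenue.
for ratio in (0.25, 0.50, 1.00):
    total_m = AWS_SPEND_M * (1 + ratio)
    print(f"GCloud at {ratio:.0%} of AWS: ~${total_m / 1000:.2f}B compute "
          f"vs ~${REVENUE_M / 1000:.2f}B revenue")
```

Even the most charitable 25% scenario puts total compute (~$3.33 billion) above the estimated $2.55 billion in revenue; the 50% and 100% scenarios land at roughly $3.99 billion and $5.32 billion.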
Assuming Anthropic’s Google Cloud spend is half of its AWS spend, this number climbs to $3.99 billion, and if you assume - and to be clear, this is an estimate - that it spends around the same on both Google Cloud and AWS, Anthropic has spent $5.3 billion on compute through the end of September. I can’t tell you which it is, just that we know for certain that Anthropic is spending money on Google Cloud, and because Google owns 14% of the company — rivalling estimates saying Amazon owns around 15-19% — it’s fair to assume that there’s a significant spend.

I have sat with these numbers for a great deal of time, and I can’t find any evidence that Anthropic has any path to profitability outside of aggressively increasing the prices on its customers to the point that its services will become untenable for consumers and enterprise customers alike. As you can see from these estimated and reported revenues, Anthropic’s AWS costs appear to increase in a near-linear fashion with its revenues, meaning that the current pricing — including rent-seeking measures like Priority Service Tiers — isn’t working to meet the burden of its costs. We do not know its Google Cloud spend, but I’d be shocked if it was anything less than 50% of its AWS bill. If that’s the case, Anthropic is in real trouble - the cost of the services underlying its business increases the more money it makes. It’s becoming increasingly apparent that Large Language Models are not a profitable business.

While I cannot speak to Amazon Web Services’ actual costs, it’s making $2.66 billion from Anthropic, the second largest foundation model company in the world. Is that really worth $105 billion in capital expenditures? Is that really worth building a giant 1200 acre data center in Indiana with 2.2GW of electricity? What’s the plan, exactly? Let Anthropic burn money for the foreseeable future until it dies, and then pick up the pieces? Wait until Wall Street gets mad at you and then pull the plug?
Who knows. But let’s change gears and talk about Cursor — Anthropic’s largest client and, at this point, a victim of circumstance.

Amazon sells Anthropic’s models through Amazon Bedrock, and I believe that AI startups are compelled to spend some of their AI model compute costs through Amazon Web Services. Cursor also sends money directly to Anthropic and OpenAI, meaning that these costs are only one piece of its overall compute costs. In any case, it’s very clear that Cursor buys some degree of its Anthropic model spend through Amazon. I’ll also add that Tom Dotan of Newcomer reported a few months ago that an investor told him that “Cursor is spending 100% of its revenue on Anthropic.”

Unlike Anthropic, we lack thorough reporting of the month-by-month breakdown of Cursor’s revenues. I will, however, mention them in the months I have them. For the sake of readability — and because we really don’t have much information on Cursor’s revenues beyond a few months — I’m going to stick to a bullet point list.

As discussed above, Cursor announced (along with its price change and $200-a-month plan) several multi-year partnerships with xAI, Anthropic, OpenAI and Google, suggesting that it has direct agreements with Anthropic itself rather than one with AWS to guarantee “this volume of compute at a predictable price.” Based on its spend with AWS, I do not see a strong “minimum” spend that would suggest it has a similar deal with Amazon — likely because Amazon handles more of its infrastructure than just compute, but incentivizes it to spend on Anthropic’s models through AWS by offering discounts, something I’ve confirmed with a source.

In any case, here’s what Cursor spent on AWS. When I wrote that Anthropic and OpenAI had begun the Subprime AI Crisis back in July, I assumed that the increase in costs was burdensome, but having the information from its AWS bills, it seems that Anthropic’s actions directly caused Cursor’s costs to explode by over 100%.
While I can’t definitively say “this is exactly what did it,” the timelines match up exactly, the costs have never come down, Amazon offers provisioned throughput , and, more than likely, Cursor needs to keep a standard of uptime similar to that of Anthropic’s own direct API access. If this is what happened, it’s deeply shameful.  Cursor, Anthropic’s largest customer , in the very same month it hit $500 million in annualized revenue, immediately had its AWS and Anthropic-related costs explode to the point that it had to dramatically reduce the value of its product just as it hit the apex of its revenue growth.  It’s very difficult to see Service Tiers as anything other than an aggressive rent-seeking maneuver. Yet another undiscussed part of the story is that the launch of Claude 4 Opus and Sonnet — and the subsequent launch of Service Tiers — coincided with the launch of Claude Code , a product that directly competes with Cursor, without the burden of having to pay itself for the cost of models or, indeed, having to deal with its own “Service Tiers.” Anthropic may have increased the prices on its largest client at the time it was launching a competitor, and I believe that this is what awaits any product built on top of OpenAI or Anthropic’s models.  I realize this has been a long, number-stuffed article, but the long-and-short of it is simple: Anthropic is burning all of its revenue on compute, and Anthropic will willingly increase the prices on its customers if it’ll help it burn less money, even though that doesn’t seem to be working. What I believe happened to Cursor will likely happen to every AI-native company, because in a very real sense, Anthropic’s products are a wrapper for its own models, except it only has to pay the (unprofitable) costs of running them on Amazon Web Services and Google Cloud. As a result, both OpenAI and Anthropic can (and may very well!) devour the market of any company that builds on top of their models.  
OpenAI may have given Cursor free access to its GPT-5 models in August, but a month later on September 15 2025 it debuted massive upgrades to its competitive “Codex” platform.  Any product built on top of an AI model that shows any kind of success can be cloned immediately by OpenAI and Anthropic, and I believe that we’re going to see multiple price increases on AI-native companies in the next few months. After all, OpenAI already has its own priority processing product, which it launched shortly after Anthropic’s in June . The ultimate problem is that there really are no winners in this situation. If Anthropic kills Cursor through aggressive rent-seeking, that directly eats into its own revenues. If Anthropic lets Cursor succeed, that’s revenue , but it’s also clearly unprofitable revenue . Everybody loses, but nobody loses more than Cursor’s (and other AI companies’) customers.  I’ve come away from this piece with a feeling of dread. Anthropic’s costs are out of control, and as things get more desperate, it appears to be lashing out at its customers, both companies like Cursor and Claude Code customers facing weekly rate limits on their more-powerful models who are chided for using a product they pay for. Again, I cannot say for certain, but the spike in costs is clear, and it feels like more than a coincidence to me.  There is no period of time that I can see in the just under two years of data I’ve been party to that suggests that Anthropic has any means of — or any success doing — cost-cutting, and the only thing this company seems capable of doing is increasing the amount of money it burns on a monthly basis.  Based on what I have been party to, the more successful Anthropic becomes, the more its services cost. 
The cost of inference is clearly increasing for customers , but based on its escalating monthly costs, the cost of inference appears to be high for Anthropic too, though it’s impossible to tell how much of its compute is based on training versus running inference. In any case, these costs seem to increase with the amount of money Anthropic makes, meaning that the current pricing of both subscriptions and API access seems unprofitable, and must increase dramatically — from my calculations, a 100% price increase might work, but good luck retaining every single customer and their customers too! — for this company to ever become sustainable.  I don’t think that people would pay those prices. If anything, I think what we’re seeing in these numbers is a company bleeding out from costs that escalate the more that its user base grows. This is just my opinion, of course.  I’m tired of watching these companies burn billions of dollars to destroy our environment and steal from everybody. I’m tired that so many people have tried to pretend there’s a justification for burning billions of dollars every year, clinging to empty tropes about how this is just like Uber or Amazon Web Services , when Anthropic has built something far more mediocre.  Mr. Amodei, I am sure you will read this piece, and I can make time to chat in person on my show Better Offline. Perhaps this Friday? I even have some studio time on the books.  I do not have all the answers! I am going to do my best to go through the information I’ve obtained and give you a thorough review and analysis. This information provides a revealing — though incomplete — insight into the costs of running Anthropic and Cursor, but does not include other costs, like salaries and compute obtained from other providers. I cannot tell you (and do not have insight into) Anthropic’s actual private moves. 
Any conclusions or speculation I make in this article will be based on my interpretations of the information I’ve received, as well as other publicly-available information. I have used estimates of Anthropic’s revenue based on reporting across the last ten months. Any estimates I make are detailed, and they are brief.

These costs are inclusive of every product bought on Amazon Web Services, including EC2, storage and database services (as well as literally everything else they pay for). Anthropic works with both Amazon Web Services and Google Cloud for compute. I do not have any information about its Google Cloud spend. The reason I bring this up is that Anthropic’s revenue is already being eaten up by its AWS spend; it’s likely billions more in the hole from Google Cloud and other operational expenses.

I have confirmed with sources that every single number I give around Anthropic and Cursor’s AWS spend is the final cash paid to Amazon after any discounts or credits. While I cannot disclose the identity of my source, I am 100% confident in these numbers, and have verified their veracity with other sources.

Where is the rest of Anthropic’s money going? How will it “stop burning cash” when its operational costs explode as its revenue increases?

Anthropic’s monthly AWS spend in 2024:

- January 2024 - $52.9 million
- February 2024 - $60.9 million
- March 2024 - $74.3 million
- April 2024 - $101.1 million
- May 2024 - $100.1 million
- June 2024 - $101.8 million
- July 2024 - $118.9 million
- August 2024 - $128.8 million
- September 2024 - $127.8 million
- October 2024 - $169.6 million
- November 2024 - $146.5 million
- December 2024 - $176.1 million

Cursor’s monthly AWS spend:

January 2025 - $1.459 million

This, apparently, is the month that Cursor hit $100 million annualized revenue — or $8.3 million, meaning it spent 17.5% of its revenue on AWS.
February 2025 - $2.47 million

March 2025 - $4.39 million

April 2025 - $4.74 million

Cursor hit $200 million annualized ($16.6 million) at the end of March 2025, according to The Information, working out to spending 28% of its revenue on AWS.

May 2025 - $6.19 million

June 2025 - $12.67 million

So, Bloomberg reported that Cursor hit $500 million on June 5 2025, along with raising a $900 million funding round. Great news! Turns out it’d need to start handing a lot of that to Anthropic. This was, as I’ve discussed above, the month when Anthropic forced it to adopt “Service Tiers”. I go into detail about the situation here, but the long and short of it is that Anthropic increased the amount of tokens you burned by writing stuff to the cache (think of it like RAM in a computer), and AI coding startups are very cache heavy, meaning that Cursor immediately took on what I believed would be massive new costs. As I discuss in what I just linked, this led Cursor to aggressively change its product, thereby vastly increasing its customers’ costs if they wanted to use the same service.

That same month, Cursor’s AWS costs — which I believe are the minority of its cloud compute costs — exploded by 104% (or by $6.48 million), and never returned to their previous levels. It’s conceivable that this surge is due to the compute-heavy nature of the latest Claude 4 models released that month — or, perhaps, Cursor sending more of its users to other models that it runs on Bedrock.

July 2025 - $15.5 million

As you can see, Cursor’s costs continued to balloon in July, and I am guessing it’s because of the Service Tiers situation — which, I believe, indirectly resulted in Cursor pushing more users to models that it runs on Amazon’s infrastructure.

August 2025 - $9.67 million

So, I can only guess as to why there was a drop here. User churn? It could be the launch of GPT-5 on Cursor, which gave users a week of free access to OpenAI’s new models.
What’s also interesting is that this was the month when Cursor announced that its previously free “auto” model (where Cursor would select the best available premium model or its own model) would now bill at “competitive token rates,” by which I mean it went from charging nothing to $1.25 per million input tokens and $6 per million output tokens. This change would take effect on September 15 2025. On August 10 2025, Tom Dotan of Newcomer reported that Cursor was “well above” $500 million in annualized revenue, based on commentary from two sources.

September 2025 - $12.91 million

Per the above, this is the month when Cursor started charging for its “auto” model.
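The spend-to-revenue percentages above are simple arithmetic, but they are easy to sanity-check. Here is a quick sketch (mine, not from any source — it just converts annualized revenue to a monthly run rate, using only the figures cited above):

```python
def aws_share(monthly_spend_musd: float, annualized_revenue_musd: float) -> float:
    """AWS spend as a percentage of the monthly revenue run rate.

    Both arguments are in millions of dollars; annualized revenue is
    divided by 12 to get the monthly run rate.
    """
    monthly_revenue_musd = annualized_revenue_musd / 12
    return 100 * monthly_spend_musd / monthly_revenue_musd

# January 2025: $1.459M AWS spend against $100M annualized revenue
print(round(aws_share(1.459, 100), 1))  # -> 17.5

# April 2025: $4.74M AWS spend against $200M annualized revenue
print(round(aws_share(4.74, 200), 1))   # -> 28.4, roughly the 28% cited
```

The same function applied to June 2025 ($12.67 million against $500 million annualized) gives about 30%, which is why the Service Tiers month stands out so starkly.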


Interview with a new hosting provider founder

Most of us use infrastructure provided by companies like DigitalOcean and AWS. Some of us choose to work on that infrastructure. And some of us are really built different and choose to build all that infrastructure from scratch. This post is a real treat for me to bring you. I met Diana through a friend of mine, and I've gotten some peeks behind the curtain as she builds a new hosting provider. So I was thrilled that she agreed to an interview to let me share some of that with you all. So, here it is: a peek behind the curtain of a new hosting provider, in a very early stage. This is the interview as transcribed (any errors are mine), with a few edits as noted for clarity.

Nicole: Hi, Diana! Thanks for taking the time to do this. Can you start us off by just telling us a little bit about who you are and what your company does?

Diana: So I'm Diana, I'm trans, gay, AuDHD and I like to create, mainly singing and 3D printing. I also have dreams of being the change I want to see in the world. Ever since I graduated high school, infrastructure has been a passion of mine. Particularly networking and computer infrastructure, from your home internet connection to data centers and everything in between. This has led me to create Andromeda Industries and the dba Gigabit.Host. Gigabit.Host is a hosting service where the focus is affordable and performant hosting for individuals, communities, and small businesses.

Let's start out talking about the business a little bit. What made you decide to start a hosting company?

The lack of performance for a ridiculous price. The margins on hosting are ridiculous; it's why the majority of the big tech companies' revenue comes from their cloud offerings. So my thought has been: why not take that and use it more constructively? Instead of using the margins to crush competition while making the rich even more wealthy, use those margins for good.

What is the ethos of your company?
To use the net profits from the company to support and build third spaces and other low-return/high-investment-cost ventures. From my perspective, these are the types of ideas that can have the biggest impact on making the world a better place. So this is my way of adopting socialist economic ideas into the systems we currently have and implementing the changes.

How big is the company? Do you have anyone else helping out?

It’s just me for now, though the plan is to make it into a co-op or unionized business. I have friends and supporters of the project giving feedback and suggesting improvements.

What does your average day-to-day look like?

I go to my day job during the week, and work on the company in my spare time. I have alerts and monitors that warn me when something needs addressing; overall, operations are pretty hands-off.

You're a founder, and founders have to wear all the hats. How have you managed your work-life balance while starting this?

At this point it’s more about balancing my job, working on the company, and taking care of my cat. It's unfortunately another reason that I started this endeavor: there just aren't spaces I'd rather be than home, outside of a park or hiking. All of my friends are online and most say the same: where would I go?

Hosting businesses can be very capital intensive to start. How do you fund it?

Through my bonuses and stocks currently, and also by using more cost-effective brands that are still reliable and performant.

What has been the biggest challenge of operating it from a business perspective?

Getting customers. I'm not a huge fan of marketing and have been using word of mouth as the primary method of growing the business.

Okay, my part here then haha. If people want to sign up, how should they do that?

If people are interested in getting service, they can request an invite through this link: https://portal.gigabit.host/invite/request .

What has been the most fun part of running a hosting company?
Getting to actually be hands-on with the hardware and making it as performant as possible. It scratches an itch of eking out every last drop of performance. Also, not doing it because it's easy, but doing it because I thought it would be easy.

What has been the biggest surprise from starting Gigabit.Host?

How both complex and easy it has been at the same time. Also how much I've been learning and growing through starting the company.

What're some of the things you've learned?

It's been learning that wanting it to be perfect isn't realistic: taking the small wins, building upon them, and continuing to learn as you go. My biggest learning challenge was how to do frontend work with TypeScript and styling; the backend code has been easy for me. The frontend used to be my weakness. Now it could be better, and as I add new features I can see it continuing to get better over time.

Now let's talk a little bit about the tech behind the scenes. What does the tech stack look like?

Next.js and TypeScript for the front and backend. Temporal is used for provisioning and task automation. Supabase is handling user management. Proxmox for the hardware virtualization.

How do you actually manage this fleet of VMs?

For the customer side we only handle the initial provisioning; then the customer is free to use whatever tool they choose. The provisioning of the VMs is handled using Go and Temporal. For our internal services we use Ansible and automation scripts.

[Nicole: the code running the platform is open source, so you can take a look at how it's done in the repository !]

How do your technical choices and your values as a founder and company work together?

They are usually in sync; the biggest struggle has been minimizing the cost of hardware. While I would like to use more advanced networking gear, it's currently cost-prohibitive.

Which choices might you have made differently?

[I would have] gathered more capital before getting started.
Though that's me trying to be a perfectionist, when the reality is: buy as little as possible and use what you have when able.

This seems like a really hard business to be in since you need reliability out of the gate. How have you approached that?

Since I've been self-funding this endeavor, I've had to forgo high availability for now due to costs. To work around that I've gotten modern hardware for the critical parts of the infrastructure. This so far has enabled us to achieve 90%+ uptime, with the current goal to add redundancy as we're able to do so.

What have been the biggest technical challenges you've run into?

Power and colocation costs. Colocation is expensive in Seattle. Around 8x the cost of my previous colo in Atlanta, GA. Power has been the second challenge: running modern hardware means higher power requirements. Most data centers outside of hyperscalers are limited to 5 to 10 kW per rack. This limits the hardware and density; thankfully for now it [is] a future struggle.

Huge thanks to Diana for taking the time out of her very busy schedule for this interview! And thank you to a few friends who helped me prepare for the interview.

Shayon Mukherjee 1 month ago

Mutable atomic deletes with Parquet backed columnar tables on S3

In the previous post, I explored a Parquet-on-S3 design with tombstones for constant-time deletes and a CAS-updated manifest for snapshot isolation. This post extends that design. The focus is on delete operations where we replace a Parquet row group and publish a new footer using S3 Multipart Upload (MPU) and UploadPartCopy, without having to download and rebuild unchanged bytes. We preserve the same system model, with immutable data files and a manifest pointer that serializes visibility.
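As a rough illustration of the idea (my own sketch, not code from the post; the function and type names are invented): the new object is assembled as an MPU in which every unchanged byte range becomes a server-side UploadPartCopy part, and only the replacement row group and the rewritten footer are uploaded as fresh bytes.

```python
from dataclasses import dataclass

@dataclass
class Part:
    kind: str        # "copy" = UploadPartCopy from the old object; "upload" = fresh bytes
    start: int = -1  # inclusive byte range within the old object (copy parts only)
    end: int = -1

def replace_plan(rg_start: int, rg_end: int, footer_start: int) -> list[Part]:
    """Plan the MPU parts for swapping the row group at bytes [rg_start, rg_end]
    and appending a rewritten footer (the old footer began at footer_start)."""
    parts: list[Part] = []
    if rg_start > 0:
        parts.append(Part("copy", 0, rg_start - 1))               # row groups before it
    parts.append(Part("upload"))                                  # replacement row group
    if rg_end + 1 < footer_start:
        parts.append(Part("copy", rg_end + 1, footer_start - 1))  # row groups after it
    parts.append(Part("upload"))                                  # new footer, offsets updated
    return parts
```

Each "copy" part would map to an UploadPartCopy call with a CopySourceRange header, so the unchanged bytes never leave S3. One caveat worth knowing: every MPU part except the last must be at least 5 MiB, so in practice small unchanged ranges have to be coalesced into an uploaded part rather than copied on their own.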

Shayon Mukherjee 1 month ago

An MVCC-like columnar table on S3 with constant-time deletes

Parquet is excellent for analytical workloads: columnar layout, aggressive compression, predicate pushdown. But deletes require rewriting entire files. Systems like Apache Iceberg and Delta Lake solve this by adding metadata layers that track delete files separately from data files. But what if, for fun, we built something (arguably) simpler? S3 now has conditional writes (If-Match, If-None-Match) that enable atomic operations without external coordination. Let’s explore how we might build a columnar table format on S3 that gets most of Parquet’s benefits while supporting constant-time deletes.
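To make the compare-and-swap mechanics concrete, here is a toy sketch (mine, not the post's code): an in-memory stand-in for one S3 key with conditional puts, plus the retry loop that records a tombstone by swapping the manifest pointer. Against real S3, the same shape uses the ETag returned by GetObject and the If-Match / If-None-Match conditions on PutObject.

```python
import json

class ManifestStore:
    """Toy in-memory stand-in for one S3 key with conditional writes."""
    def __init__(self):
        self._obj = None      # (etag, body) once the manifest exists
        self._version = 0

    def get(self):
        return self._obj      # None until the manifest is created

    def put(self, body: str, if_match=None, if_none_match=False):
        """Return the new ETag, or None for a 412 Precondition Failed."""
        if if_none_match and self._obj is not None:
            return None       # key already exists
        if if_match is not None and (self._obj is None or self._obj[0] != if_match):
            return None       # another writer won the race
        self._version += 1
        etag = f'"{self._version}"'
        self._obj = (etag, body)
        return etag

def add_tombstone(store: ManifestStore, row_id: int) -> dict:
    """Constant-time delete: append a tombstone, then CAS the manifest pointer."""
    while True:
        etag, body = store.get()                   # read the current snapshot
        manifest = json.loads(body)
        manifest["tombstones"].append(row_id)
        if store.put(json.dumps(manifest), if_match=etag):
            return manifest                        # swap succeeded
        # otherwise: lost the race; re-read and retry

store = ManifestStore()
store.put(json.dumps({"files": ["part-0.parquet"], "tombstones": []}),
          if_none_match=True)                      # create-if-absent
print(add_tombstone(store, 42)["tombstones"])      # -> [42]
```

The delete never touches a data file: readers see the old snapshot until the single conditional put lands, which is what gives both the constant-time delete and the MVCC-style isolation the post describes.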
