Posts in Performance (20 found)

This blog post tells the time

Computer clock synchronization is a complicated process, requiring protocols like NTP and a specialized server to answer requests. In this post I explore a “serverless” method, which relies on widely available CDNs to distribute time. It’s the serverless time servers we didn’t realize we already had. This clock should display the correct time. If your device’s clock is set to the wrong time, it should tell you how far off the clock is set.

The page starts the process by requesting a tiny asset through the Cloudflare CDN. As Cloudflare builds the response, an HTTP header transform rule adds timing information, like http.request.timestamp.sec, to the response headers. The client waits for the response and then analyzes the network request using the fine-grained metrics provided by performance resource timers. Finally, some math is applied to adjust for network delay.

The PerformanceResourceTiming interface exposes detailed network timing information. This is similar to the web developer tools network tab, just accessible via JavaScript in the page. These metrics are extremely helpful to developers who are troubleshooting performance issues and will prove useful here. Notice that “sending” and “receiving” are both shown as zero milliseconds. The request and response used here are so small they likely fit in a single packet, so these events appear instantaneous. The only measurable part of the HTTP portion of the request is waiting for the network and the server to do their work.

These detailed performance timing metrics help us address a major challenge of time distribution: any server-provided timestamp has grown stale by the time it reaches the client. To account for this effect, the NTP protocol and similar software estimate the network round-trip time. A good estimate for the network delay experienced by the response is “half the round-trip time”, although this is not always accurate.
Additional adjustment based on server processing delay further improves accuracy. Cloudflare helps us out by providing server-side timing information. The client generally can’t distinguish between network delay and server delay, so this information helps us estimate when the server generated the timestamp. This data includes metrics like cf.timings.origin_ttfb_msec, which tells us how long the Cloudflare CDN waited on a response from Cloudflare Pages.

At the end of all the measurement and the math, the clock display is an estimate. We’re guessing how much the server-provided timestamp aged before it reached the web browser. It’s an educated guess, informed by a lot of metrics, but there is uncertainty here.

For a technique I’ve been calling serverless, I’ve sure talked about servers a lot. The term serverless really means that we’re not managing individual servers ourselves; the cloud hosting provider has abstracted those away. This setup uses Cloudflare Pages to host the tiny asset which this page fetches. The HTTP header transform rule is part of the CDN; we don’t even need Cloudflare Workers. So it’s just files I’ve pushed to GitLab, served by Cloudflare Pages, and some CDN configuration. Tons of servers, but abstracted away. Contrast this with NTP, where we’d need to run the NTP daemon and perhaps manage the underlying operating system. It feels “serverless” in comparison.

The clock display includes error bounds, which describe the precision of the provided time. Network latency plays a big part here, as we don’t know how long it took to reach Cloudflare or how long it took to get back. Network paths could be asymmetric, or packet loss could cause unexpected delay in a single direction. While in normal cases we’d expect the server to process the request (and generate the timestamp) in the middle of our waiting period, in extreme cases those events could fall far to one side.
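The math described above can be sketched in a few lines of Python. This is my own simplification, not the page’s actual code: the midpoint assumption and the parameter names are mine, with server_processing_ms standing in for a value like cf.timings.origin_ttfb_msec.

```python
def estimate_clock(request_start, response_start, server_timestamp,
                   server_processing_ms=0.0):
    """NTP-style midpoint estimate of the local clock's offset.

    request_start / response_start: PerformanceResourceTiming-style marks
    on the local clock (ms). server_timestamp: the header-injected server
    time (ms). server_processing_ms: known server-side delay, which
    shrinks the error bound because that portion of the wait is accounted for.
    """
    # Assume the server stamped the response midway through our wait.
    midpoint = (request_start + response_start) / 2.0
    offset = server_timestamp - midpoint
    # The stamp could really have been made anywhere in the round trip
    # that wasn't known server work: half that span is the error bound.
    uncertainty = (response_start - request_start - server_processing_ms) / 2.0
    return offset, uncertainty
```

With an 80 ms wait and 20 ms of known server processing, the bound comes out to +/-30 ms; without the server-side data it would be +/-40 ms.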
This uncertainty, and the associated error bounds, are reduced when the network latency is lower, which plays into a strength of Cloudflare: their CDN points-of-presence are located geographically near major population centers. For most of us, network latency to Cloudflare is quite small. The performance resource timers also help us precisely estimate when Cloudflare processes the HTTP request, as we can eliminate delays caused by DNS resolution, the TCP handshake, and TLS session initialization. Precision could be improved by performing multiple requests and applying statistical analysis, but this page makes no such effort.

In my testing I’ve often seen 60ms error bounds shown in the web clock. NTP clients, like the command-line ntpdig, produce a much tighter estimate, closer to 6ms. This is an order of magnitude difference.

While this method provides decent synchronization with the clock on Cloudflare CDN servers, we’ve got to consider how well synchronized that clock is with the official time. After all, if Cloudflare’s CDN servers provide the wrong timestamp, it doesn’t matter how precisely we’ve synchronized; we’ll display the wrong time. Cloudflare’s CDN is not formally a time server, so we need to tread carefully when using it this way.

I checked the accuracy against a couple of sources. When I collected the ntpdig output shown above, my web clock reported I was behind by 130±70ms. These measurements are within each other’s error bounds, which shows agreement. I also checked using a GPS debugging app on my phone. GPS provides extremely accurate clocks and is likely the most accurate clock I can access. The clocks appeared to update in lock-step, again showing agreement. In this screenshot, notice that my phone’s clock is ahead of the other clocks and this offset is detected by the web clock. In any case, it seems risky to depend on an unwitting time server, so without specific promises from Cloudflare I’d just consider this a demo.
After all, the Internet doesn’t always know the right time. So far I’ve tested this on my laptop and phone, but I’d be interested to see how well it works for others. You can use tools like ntpdig or GPS debugging apps to compare. I’ve built a standalone web clock for that sort of testing. You may be surprised by how inaccurate your system time can be; slightly offset clocks are quite common. This is especially true when a device sleeps, suspends, or hibernates. When a computer’s CMOS battery is missing or failed, clocks can fall very far out of sync. I’d be curious to see what people discover (contact info at bottom).

While the precision of this CDN-based method is relatively poor for a time synchronization protocol, it does offer some attractive features over current solutions. First and foremost: it’s web-native! NTP’s lack of security has been a growing concern. One replacement, Network Time Security (NTS), cryptographically authenticates information sent by the time server. The authenticated encryption of HTTPS similarly protects the CDN-based web clock approach. This avoids situations where an attacker-in-the-middle tampers with insecure NTP responses, messing up your system’s clock.

There are a lot of hazards here, unfortunately. Alternate time synchronization protocols have a history of mistakes, so it’s wise to be wary. Microsoft tried a TLS-based synchronization approach via Secure Time Seeding (STS). Their approach relied on time metadata in TLS connections, but most servers actually provide random data in the relevant field. This caused clocks to reset to random times. In either case, this underscores the risks of getting a clock reference from systems that don’t realize they are being used as time servers.

Closing on a more nostalgic note, NIST’s time.gov has a wonderfully retro clock widget. Unfortunately they no longer allow you to host it on your own site, probably due to server load. Here’s my own 88x31 badge, which is hereby MIT licensed.
It makes use of SVG’s questionable ability to embed scripts in images.


MagiCache: A Virtual In-Cache Computing Engine

Renhao Fan, Yikai Cui, Weike Li, Mingyu Wang, and Zhaolin Li. ISCA'25.

This paper presents an implementation of the RISC-V vector extensions where all vector computation occurs in the cache (i.e., SRAM-based in-memory computation). It contains an accessible description of in-SRAM computation, and some novel extensions.

Recall that SRAM is organized as a 2D array of bits. Each row represents a word, and each column represents a single bit location in many words. A traditional read operation occurs by activating a single row. Analog values are read out from each bit and placed onto shared bit lines. There are two bit lines per column (one holding the value, one holding the complement). Values flow down to sense amplifiers that output digital values.

Prior work has shown that this basic structure can be augmented to perform computation. Rather than activating a single row, two rows are activated simultaneously (let’s call the values of these rows A and B). The shared bit lines perform computation in the analog domain, which results in two expressions appearing on the output of the sense amplifiers: (A AND B) and (A NOR B). Fig. 1(a) shows a diagram of such an SRAM array: Source: https://dl.acm.org/doi/10.1145/3695053.3731113

If you slap some digital logic at the end of the sense amplifiers, then you can generate other functions like OR, XOR, XNOR, NAND, shift, and add. Shift and add involve horizontal connections. Fig. 4(c) shows a hardware diagram of this additional logic at the end of the sense amplifiers. Note that the resulting value can be written back into the SRAM array for future use. Multiplication is not directly supported but can be implemented with a sequence of shift and add operations. Source: https://dl.acm.org/doi/10.1145/3695053.3731113

Virtual Engine

The innovation in this paper is to dynamically share a fixed amount of on-chip SRAM for two separate purposes: caching and a vector register file.
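The gate derivation above, where everything else is built from the AND and NOR that the array produces natively, can be sketched as plain boolean identities. This is my own software simulation, not anything from the paper; the 8-bit row width is an arbitrary choice.

```python
MASK = 0xFF  # model each SRAM row as holding 8 bits

def in_sram_read(a, b):
    """The two outputs a dual-row in-SRAM activation yields directly."""
    return a & b, ~(a | b) & MASK        # (A AND B), (A NOR B)

def derived_gates(a, b):
    """Everything else falls out of simple logic after the sense amps."""
    and_ab, nor_ab = in_sram_read(a, b)
    or_ab = ~nor_ab & MASK               # OR   = NOT(NOR)
    nand_ab = ~and_ab & MASK             # NAND = NOT(AND)
    xor_ab = or_ab & nand_ab             # XOR  = OR AND NOT(AND)
    xnor_ab = ~xor_ab & MASK             # XNOR = NOT(XOR)
    return {"or": or_ab, "nand": nand_ab, "xor": xor_ab, "xnor": xnor_ab}
```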
The logical vector register file capacity required for a particular algorithm depends on the number of architectural registers used and the width of each architectural register (the RISC-V vector extensions allow software to configure a logical vector width). Note that this hardware does not have separate vector ALUs; the computation is performed directly in the SRAM arrays. Fig. 6 illustrates how the hardware dynamically allocates SRAM space between generic cache storage and vector registers (with in-memory compute). The unit of allocation is a segment. The width of a vector register determines how many segments it requires. Source: https://dl.acm.org/doi/10.1145/3695053.3731113

Initially, all SRAM space is dedicated to caching. When the hardware processes an instruction that writes to an uninitialized vector register, the hardware allocates segments to hold data for that register (evicting cached data if necessary). This system assumes an enlightened compiler which will emit an instruction to hint to the hardware when it has reached a point in the instruction stream where no vector register has valid content. The hardware can use this hint to reallocate all memory back to being used for caching.

Fig. 8 shows performance results normalized against prior work (labeled here). This shows a 20%-60% performance improvement, which is pretty good considering that the baseline offers an order-of-magnitude improvement over a standard in-order vector processor. Source: https://dl.acm.org/doi/10.1145/3695053.3731113

Dangling Pointers

I wonder how this would compare to hardware that did not have a cache, but rather a scratchpad with support for in-memory computing.
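The segment allocation scheme described above can be modeled as a toy bookkeeping class. The segment size, class name, and methods are all invented for illustration; real hardware would also track which cached lines get evicted.

```python
class SegmentPool:
    """Toy model of MagiCache-style SRAM sharing: any segment not owned
    by a vector register is available for caching."""

    def __init__(self, total_segments, segment_bits=512):
        self.total = total_segments
        self.segment_bits = segment_bits
        self.vregs = {}  # vector register name -> segments allocated

    def cache_segments(self):
        # Whatever the vector registers don't use belongs to the cache.
        return self.total - sum(self.vregs.values())

    def write_vreg(self, name, vector_width_bits):
        # First write to an uninitialized register allocates segments,
        # evicting cached data if necessary (eviction not modeled here).
        if name not in self.vregs:
            needed = -(-vector_width_bits // self.segment_bits)  # ceil div
            if needed > self.cache_segments():
                raise MemoryError("not enough SRAM segments")
            self.vregs[name] = needed

    def all_registers_dead(self):
        # The compiler-emitted hint: every segment returns to caching.
        self.vregs.clear()
```

For example, with 16 segments of 512 bits, writing a 2048-bit vector register consumes 4 segments until the compiler's "all dead" hint hands them back to the cache.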

./techtipsy 1 week ago

You can fake SSD-like disk speeds in any Linux VM, but it's unsafe (literally)

Do you have a need for speed (read: really fast disk performance)? Are you unwilling or unable to buy/rent a machine with fast, NVMe-based storage? Are you OK with potential data corruption? Then the unsafe disk cache mode is the solution to all your problems!

We had an interesting conundrum at work recently. Our platform does not use a lot of resources, but there are bursts of activity that require a lot of CPU and performant disk IO from our infrastructure. This was previously handled by manually starting some expensive AWS EC2 instances to cope with the load, but this manual process was error-prone due to the human factor (which did end up causing an actual production outage once), and AWS is stupidly expensive for what you get in return. Around this time I also learned about a Proxmox server that we were underutilizing. My goal was to investigate the resources that we had available and to ensure that we didn’t have to think about taking any manual actions, while at the same time not relying on AWS and its expensive resources.

I set up a few VM-s on the Proxmox machine, and did some testing. CPU, memory, that was all fine, but the IO-bound workloads that we had to run during those bursty periods would still be relatively slow. Not much slower than the main infrastructure provider that we were using, but slow enough for a beefy machine to not be able to handle more than a few parallel IO-heavy workloads running at the same time.

We exhausted a few other wild-ass ideas during the investigation:

- Docker on a RAM-backed storage drive: online resources did not inspire confidence in this working well, so we didn’t try this
- optimizing the workload to not be IO-heavy: unsuccessful after spending a few hours on it; the high IO was a consequence of making an intentional trade-off to reduce CPU load, and the IO requirement was much more manageable
- putting certain folders in the container itself on RAM-backed storage: highly container specific, and did not yield the desired results

Then one day I was browsing around Proxmox and noticed an interesting option on the virtual storage drives: setting the cache mode to unsafe. With this one trick, your VM will see really fast disk speeds up to a certain point, and it’s invisible from the perspective of your workloads, no customization needed. In a way, this is like one of the RAM-backed storage options, but for the whole VM. The major trade-off is that an unexpected shutdown of the VM or the VM host will likely result in data corruption. This is because you’re writing everything to memory first, and the writes only eventually end up on persistent storage, whenever the disks catch up with you. If something happens while changes are still only in memory, they are lost.

In our case, the data corruption risk is completely OK, as the workloads are ephemeral, the results of the work are sent to another machine immediately after completion, and the configuration of the machine is largely automated with Ansible. One instance of our workload would usually result in writing 50 MB to disk, and we observed about 300-500 IOPS of performance from HDD-backed storage. The disks were not able to handle more than one at a time if we cared about execution time. With the cache-mode trick, and on some relatively old hardware (assume DDR3 memory), we saw numbers as high as 15K IOPS and disk throughput of 500+ MB/s. This was more than enough to handle peak loads, and the resources were always on and available on a rented server with a stable price that compared extremely well to AWS.

Cloud service providers have their benefits, sure, but when all you need is raw speed and configurability to make it happen, then owning a physical Linux server (or a few of them for redundancy) is a no-brainer, slam-dunk decision, as long as you have someone in your team that knows how to manage one. Since you’re working with Linux VM-s already in the cloud, then you already have that person in your team, don’t you? :)
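The underlying trade-off is the same one every program faces with buffered file IO: acknowledging a write once it hits memory is fast, while forcing it to stable storage is slow but durable. A hedged, Proxmox-independent sketch in Python:

```python
import os
import tempfile

def write_record(path, data, durable):
    """Write data to path, forcing it onto stable storage only if durable.

    durable=False mimics an unsafe cache mode: the write is considered
    done once it reaches memory, and a crash before the OS flushes its
    page cache loses the data.
    """
    with open(path, "wb") as f:
        f.write(data)             # lands in Python's user-space buffer
        f.flush()                 # ...then in the OS page cache (still RAM)
        if durable:
            os.fsync(f.fileno())  # ...and only now on the actual disk

path = os.path.join(tempfile.mkdtemp(), "record.bin")
write_record(path, b"job results", durable=True)
```

Issuing an fsync per write is precisely the cost that the fast-but-risky cache mode skips.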

Martin Fowler 1 week ago

Principles of Mechanical Sympathy

Modern hardware is remarkably fast, but software often fails to leverage it. Caer Sanders has found it valuable to guide their work with mechanical sympathy - the practice of creating software that is sympathetic to its underlying hardware. They distill this practice into everyday principles: predictable memory access, awareness of cache lines, the single-writer principle, and natural batching.
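The predictable-memory-access and cache-line principles can be illustrated with the classic row-major vs. column-major traversal. This sketch is my own, not from the article; the effect is modest in Python and far more dramatic in compiled languages.

```python
import time

N = 1000
grid = [[1.0] * N for _ in range(N)]

def sum_row_major(g):
    # Visits elements in the order each row is laid out; consecutive
    # reads tend to hit data already pulled into the cache.
    return sum(x for row in g for x in row)

def sum_col_major(g):
    # Hops to a different row on every read, touching a fresh region of
    # memory almost every time and defeating the prefetcher.
    return sum(g[r][c] for c in range(N) for r in range(N))

t0 = time.perf_counter(); a = sum_row_major(grid); t1 = time.perf_counter()
b = sum_col_major(grid); t2 = time.perf_counter()
# Same answer either way; the row-major walk is usually the faster one.
```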


FlexGuard: Fast Mutual Exclusion Independent of Subscription

Victor Laforet, Sanidhya Kashyap, Călin Iorgulescu, Julia Lawall, and Jean-Pierre Lozi. SOSP'25.

This paper presents an interesting use of eBPF to effectively add an OS feature: coordination between user space locking code and the kernel thread scheduler to improve locking performance. The paper describes most lock implementations as spin-then-park locks (e.g., busy wait in user space for some time, then give up and call the OS to block the waiting thread). A big problem with busy waiting is the performance cliff under oversubscription. Oversubscription occurs when there are more active threads than cores. In this case, busy waiting can be harmful, because it wastes CPU cycles when there is other useful work to do. The worst case occurs when a thread acquires a lock and then is preempted by the OS scheduler while many other threads are busy waiting. If the OS thread scheduler were smart, it would preempt one of the busy waiters and let the lock holder keep running. But alas, that level of coordination isn’t available … until now.

In the good old days, researchers would have modified Linux scheduling code and tested their modified kernel. The modern (easier) way to achieve this is to use eBPF. The authors wrote an eBPF program that runs (in kernel space) each time a context switch occurs. This program is called the Preemption Monitor. The Preemption Monitor works in conjunction with a custom user space lock implementation. The net result is that the Preemption Monitor can reliably detect when the OS scheduler preempts a thread that is holding a lock. When this occurs, the eBPF program writes information to a variable that user space code can read.

The locking algorithm is as follows: First, try to acquire the lock with a simple atomic compare-and-swap. If that fails, then busy wait.
Similar to Hapax locks, this busy waiting avoids contention on one cache line by forcing all threads to agree on the order they will acquire the lock and letting each thread spin on per-thread variables. During busy waiting, the variable written by the Preemption Monitor is checked. If this variable indicates that there currently exists a thread which has acquired a lock and has been preempted by the OS, then threads stop busy waiting and instead call the OS to block until the lock is released (using the same system call that a futex would use).

Fig. 2 has performance results. The x-axis shows thread count (which varies over time). The green line is FlexGuard. The idea is that it gives great performance when there is no oversubscription (i.e., fewer than 150 threads) and offers performance similar to a purely blocking lock (the dark blue line) when there is oversubscription. Source: https://dl.acm.org/doi/10.1145/3731569.3764852

Dangling Pointers

This problem seems ripe for overengineering. In some sick world, the compiler, OS, and hardware could all coordinate to support a “true critical section”. All pages accessed inside this critical section would be pinned into main memory (or even closer to the CPU), and the OS would try extremely hard not to preempt threads inside of the critical section. This would require some upper bound on the critical section working set and running time.
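The spin-then-park shape of the algorithm can be sketched in user-space Python. This is a toy, not the paper's implementation: a guard lock stands in for the atomic compare-and-swap, an Event stands in for the futex, and the holder_preempted flag simulates the variable the Preemption Monitor would publish from kernel space.

```python
import threading

class SpinThenParkLock:
    SPIN_LIMIT = 1000  # arbitrary; real locks tune or adapt this

    def __init__(self):
        self._held = False
        self._guard = threading.Lock()   # stands in for an atomic CAS
        self._waiters = []               # events of parked threads (FIFO)
        self.holder_preempted = False    # stands in for the eBPF-published flag

    def acquire(self):
        spins = 0
        while True:
            with self._guard:            # "CAS": take the lock if free
                if not self._held:
                    self._held = True
                    return
            # Busy wait for a bounded number of iterations, but give up
            # immediately if the (simulated) Preemption Monitor reports
            # that the lock holder was descheduled.
            spins += 1
            if spins < self.SPIN_LIMIT and not self.holder_preempted:
                continue
            # Park: block until a release wakes us (futex-like behaviour).
            event = threading.Event()
            with self._guard:
                if not self._held:       # lock freed while we prepared
                    self._held = True
                    return
                self._waiters.append(event)
            event.wait()
            spins = 0                    # woken: go back to trying

    def release(self):
        with self._guard:
            self._held = False
            if self._waiters:
                self._waiters.pop(0).set()  # wake one parked waiter
```

The one design point the sketch does preserve is the decision rule: spin while the holder is presumed to be running, park as soon as the kernel says it isn't.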

DHH 1 week ago

Panther Lake is the real deal

Intel really delivered with Panther Lake. A 2026 Dell XPS 14 using this chipset with an IPS screen can hit just 1.4 watts of idle power draw on Omarchy. That's good enough for over 47 hours!! And in real-world mixed use on another 74-Wh machine, I've seen around 16 hours of battery life. That's a huge jump over the ~6 hours I was getting over the past two years from AMD-powered Framework laptops.

Technically, Intel already had something close to Panther Lake on efficiency with the Lunar Lake chips from last year, but those were quite slow on any multi-core workloads (like a developer would need). With Panther Lake (358H), I'm getting 17,500 on Geekbench 6, which is about 10% faster than the already excellent AMD HX370, and a match for Apple's M5. Apple remains ahead on single-core performance, but even there, Panther Lake is on par with an M3. And I don't remember anyone complaining that those were too slow. What everyone has been pining for was better battery life, and now we've got it. On a machine with excellent integrated graphics that are good enough to play a ton of triple-A games, no less!

But we're getting more than that. The PC makers are getting their act together on all fronts. Haptic touchpads on a level with Apple's are now standard on both high-end Dell and Asus laptops. Many of the new machines also have tandem OLED screens that blow even the nice mini-LED options from Apple out of the water. And PCs are now somehow both sleeker and slimmer than the MacBooks. Jonathan Ive knew this; he was just a bit ahead of the components, and he was willing to sacrifice reliability to get to what wasn't possible back then. But now it is, and the PC makers are taking full advantage.

Now I know that any comparison between Macs and PCs is moot for most people. There's not a lot of cross-shopping going on these days. If you're locked into the Apple walled garden, it's hard to untangle yourself, so most just continue to buy whatever their team offers.
But for the few who are either fed up with Apple in general, macOS Tahoe in particular, or just want to try a whole new way of computing with Omarchy, it's fantastic that battery life is no longer a blocker. It's been the #1 reason cited by folks who've been interested in trying Omarchy, but felt like they couldn't let go of Apple's efficiency advantage. Now that's largely gone. I also just love a good turnaround story. Intel had been on the ropes for years. Now they have a fantastic integrated GPU that's compatible with all the tens of thousands of PC games on the market, a super-efficient CPU that's a match for an M5 on multi-core and an M3 on single-core performance, and a range of PC makers finally taking the fight directly to Apple on touchpads, build quality, and weight. These new Panther Lake CPUs are made in Arizona too, btw. With the world as it is, I think any American should breathe a sigh of relief that if things get spicy with Taiwan, there's more to frontier computing than a TSMC plant within a short reach of China. There's still more work to be done on that front (as Intel CPU cores still come from TSMC!), but it's a huge step in the right direction. Personally, I'm just thrilled that competition is lifting all boats. Apple gave the entire laptop industry a huge wake-up call in 2020 with the introduction of the M chips. Intel's former CEO, Pat Gelsinger, saw the threat clearly, kicked off the 18A plan, but sadly didn't last long enough in the top seat to see his bet pay off with Panther Lake. The rest of us now benefit from his boldness. I'm also thrilled to see both Dell and Intel leaning into Linux. Omarchy 3.5 ships with every possible tweak to make these Panther Lake chips perform at their best, and that was only possible because Michael Dell assigned a team to work on it. So much love to Mr Dell for letting us borrow the brains and commits from senior engineers within both his company and Intel to ship this big new release. 
If you've been waiting on the sidelines for a laptop that can run Omarchy and still get amazing battery life, now is your magic moment. Give the new Dell XPS series, or any of the other laptops shipping with Panther Lake, a try. I think you'll be as impressed as I've been.

The Jolly Teapot 1 week ago

Browsing the web with JavaScript turned off

Some time ago, I tried to use my web browser with JavaScript turned off by default. The experiment didn’t last long, and my attempt at a privacy-protecting, pain-free web experience failed. Too many websites rely on JavaScript, which made this type of web browsing rather uncomfortable. I’ve kept a Safari extension like StopTheScript around, on top of a content blocker like Wipr, just in case I needed to really “trim the fat” of the occasional problematic webpage. * 1 Recently, I’ve given this setup a new chance to shine, and even described it in a post. The results are in: the experiment failed yet again. But I’m not done. Even if this exact setup isn’t the one I currently rely on, JavaScript-blocking is nevertheless still at the heart of my web browsing hygiene on the Mac today.

For context, this need for fine-tuning comes from the fact that my dear old MacBook Air from early 2020, rocking an Intel chip, is starting to show its age. Sure, it already felt like a 10-year-old computer the moment the M1 MacBook Air was released, merely six months after I bought it, but let’s just say that a lot of webpages make this laptop choke. My goal of making this computer last one more year can only be reached if I manage not to throw the laptop through the window every time I want to open more than three tabs.

On my Mac, JavaScript is now blocked by default on all pages via StopTheScript. Leaving JavaScript on, meaning giving websites a chance, sort of defeated the purpose of my setup (performance and privacy). Having JS turned off effectively blocks 99% of ads and trackers (I think, don’t quote me on that) and makes browsing the web a very enjoyable experience. The fan barely activates, and everything is as snappy and junk-free as expected. For websites that require JavaScript — meaning frequently visited sites like YouTube, or ones where I need to be logged in, like LanguageTool — I turn off StopTheScript permanently via the Websites > Extensions menu in the Safari Settings.
I try to keep this list to a bare minimum, even if this means I have to accept a few annoyances like not having access to embedded video players or comments on some websites. For instance, I visit the Guardian multiple times daily, yet I won’t add it to the exception list, even if I’m a subscriber and therefore not exposed to the numerous “please subscribe” modals. I can no longer hide some categories on the home page, nor watch embedded videos: a small price to pay for a quick and responsive experience, and a minimal list of exceptions. For the few times when I actually need to watch a video on the Guardian, comment on a blog post, or for the occasional site that needs JavaScript simply to appear on my screen (more on that later), what I do is quickly open the URL in a new private window. There, StopTheScript is disabled by default (so that JavaScript is enabled: sorry, I know this is confusing). Having to reopen a page in a different browser window is an annoying process, yes. Even after a few weeks it still feels like a chore, but it seems to be the quickest way on the Mac to get a site to work without having to mess around with permissions and exceptions, which can be even more annoying on Safari. Again, a small price to pay to make this setup work. * 2 Another perk of that private browsing method is that the ephemeral session doesn’t save cookies and the main tracking IDs disappear when I close the window. I think. The problem I had at first was that these sessions tended to display the webpages as intended by the website owners: loaded with JavaScript, ads, modals, banners, trackers, &c. Most of the time, it is a terrible mess. Really, no one should ever experience the general web without any sort of blocker. To solve this weakness of my setup, I switched from Quad9 to Mullvad DNS to block a good chunk of ads and trackers (using the “All” profile ). 
Now, the private window only lets through the functional part of the JavaScript, plus a few cookie banners and Google login prompt annoyances, but at least I am not welcomed by privacy-invading and CPU-consuming ads and trackers every time my JS-free attempt fails. I know I could use a regular content blocker instead of a DNS resolver, but keeping it active all the time when JS is turned off feels a bit redundant and too much of an extension overlap. More importantly, I don’t want to be tempted to manage yet another exception list on top of the StopTheScript one (been there, done that, didn’t work). Also, with Safari I don’t think it’s possible to activate an extension in Private Mode only.

John Gruber, in a follow-up reaction to The 49MB Web Page article from Shubham Bose, which highlights the disproportionate weight of webpages relative to their content, wrote:

One of the most controversial opinions I’ve long espoused, and believe today more than ever, is that it was a terrible mistake for web browsers to support JavaScript. Not that they should have picked a different language, but that they supported scripting at all. That decision turned web pages — which were originally intended as documents — into embedded computer programs. There would be no 49 MB web pages without scripting. There would be no surveillance tracking industrial complex. The text on a page is visible. The images and video embedded on a page are visible. You see them. JavaScript is invisible. That makes it seem OK to do things that are not OK at all.

Amen to that. But if JavaScript is indeed mostly used for this “invisible” stuff, why are some websites built to use it for the most basic stuff? Video streaming services, online stores, social media platforms, I get it: JavaScript makes sense. But text-based sites? Blogs? Why? The other day I wanted to read this article, and only the website header showed up in my browser. Even Reader Mode didn’t make the article appear.
When I opened the link in a private window, where StopTheScript is disabled, lo and behold, the article finally appeared. For some obscure reason, on that website (and others) JavaScript is needed to load text on a freaking web page. Even if you want your website to have a special behaviour regarding loading speeds, design subtleties, or whatever you use JavaScript for, please, use a noscript tag, either to display the article in its most basic form, or at least to show a message saying “JavaScript needed for no apparent reason at all. Sorry.” * 3

This is what I do on my phone, as managing Safari extensions on iOS is a painful process. Quiche Browser is a neat solution and a great way for me to have the “turn off JavaScript” menu handy, but without a way to sync bookmarks, history or open tabs with the Mac, I still prefer to stick to Safari, at least for now.

I still wish StopTheScript had a one-touch feature to quickly reload a page with JavaScript turned on until the next refresh or for an hour or so, but it doesn’t.

This is what I do for this site’s search engine, where PageFind requires JavaScript to operate. Speaking of search engines, DuckDuckGo works fine in HTML-only mode (the only main search engine to offer this, I believe).


RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations

Hongrui Zhang, Yunan Zhang, and Hung-Wei Tseng, ISCA'25. I recall a couple of decades ago when Pat Hanrahan said something like “all hardware wants to be programmable”. You can find a similar sentiment here : With most SGI machines, if you opened one up and looked at what was actually in there—processing vertexes in particular, but for some machines, processing the fragments—it was a programmable engine. It’s just that it was not programmable by you; it was programmable by me. And now, twenty years later, GPU companies have bucked the programmability trend and added dedicated ray tracing hardware to their chips. Little did they know, users would find a way to utilize this hardware for applications that have nothing to do with graphics. The task at hand is multiplying two (very) sparse matrices, A and B. Each matrix can be partitioned into a 2D grid, where most cells in the grid contain all 0’s. Cells in A with non-zero entries must be multiplied by specific cells in B with non-zero entries (using a dense matrix multiplication for each product of two cells). The core idea is elegantly simple, and is illustrated in Fig. 5: Source: https://dl.acm.org/doi/full/10.1145/3695053.3731072 The steps are: (1) build a ray tracing acceleration structure corresponding to the non-zero cells in B; (2) for each non-zero cell in A, trace a ray through B to determine if there are any non-zero cells in B that need to be multiplied by the current cell in A. In Fig. 5 the coordinates of the non-zero cells in matrix A are: [(2, 1) (2, 3) (3, 3) (7, 1)]. The figure shows rays overlaid on top of the result matrix, but I find it easier to think of the rays traced through matrix B. The ray corresponding to the cell in A at (2, 1) has a column index of 1, so the algorithm traces a ray horizontally through B at row 1. The ray tracing hardware will find that this ray intersects with the cell from B at coordinate (1, 4).
So, these cells are multiplied together to determine their contribution to the result. Fig. 7 has benchmark results. All results are normalized to the performance of a CPU baseline (i.e., values greater than one represent a speedup). The CPU baseline corresponds to the Intel MKL library running on a Core i7 14700K processor. The “w/o RT cores” bars show results from the same algorithm with ray tracing implemented in general CUDA code rather than using the ray tracing accelerators. It is amazing that this beats the CPU baseline across the board. Source: https://dl.acm.org/doi/full/10.1145/3695053.3731072 Dangling Pointers It seems like the core problem to be solved here is pointer-chasing. I wonder if a more general-purpose processor that is located closer to off-chip memory could provide similar benefits.
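The two-step recipe above can be sketched in plain software. This is only an illustrative analogue: the paper's contribution is mapping the "which tiles of B does this tile of A touch?" query onto ray tracing hardware, whereas here it is a simple per-row index lookup.

```python
# Software analogue of the RTSpMSpM idea (illustrative only; tile names
# and structure are my own, not from the paper's implementation).
def dense_mul_add(acc, a, b):
    """acc += a @ b for small dense tiles stored as lists of lists."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(n))

def tiled_spmspm(a_tiles, b_tiles, tile):
    """a_tiles/b_tiles: {(row, col): dense tile}. Returns non-zero result tiles."""
    # "Acceleration structure": index B's non-zero tiles by row, so a
    # horizontal "ray" through row r finds every non-zero tile in that row.
    b_by_row = {}
    for (r, c), blk in b_tiles.items():
        b_by_row.setdefault(r, []).append((c, blk))
    out = {}
    for (ar, ac), a_blk in a_tiles.items():
        # A's column index selects which row of B to trace the "ray" through.
        for bc, b_blk in b_by_row.get(ac, []):
            acc = out.setdefault((ar, bc), [[0] * tile for _ in range(tile)])
            dense_mul_add(acc, a_blk, b_blk)
    return out
```

With A holding a tile at (2, 1) and B a tile at (1, 4), as in the Fig. 5 walkthrough, the only work done is one dense multiply contributing to result tile (2, 4).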

Den Odell 2 weeks ago

You're Looking at the Wrong Pretext Demo

Pretext, a new JavaScript library from Cheng Lou, crossed 7,000 GitHub stars in its first three days. If you've been anywhere near frontend engineering circles in that time, you've seen the demos: a dragon that parts text like water, fluid smoke rendered as typographic ASCII, a wireframe torus drawn through a character grid, multi-column editorial layouts with animated orbs displacing text at 60fps. These are visually stunning and they're why the library went viral. But they aren't the reason this library matters. The important thing Pretext does is predict the height of a block of text without ever reading from the DOM. This means you can position text nodes without triggering a single layout recalculation. The text stays in the DOM, so screen readers can read it and users can select it, copy it, and translate it. The accessibility tree remains intact, the performance gain is real, and the user experience is preserved for everyone. This is the feature that will change how production web applications handle text, and it's the feature almost nobody is demonstrating. The community has spent three days building dragons. It should be building chat interfaces. And the fact that the dragons went viral while the measurement engine went unnoticed tells us something important about how the frontend community evaluates tools: we optimize for what we can see, not for what matters most to the people using what we build. The problem is forced layout recalculation, where the browser has to pause and re-measure the page layout before it can continue. When a UI component needs to know the height of a block of text, the standard approach is to measure it from the DOM. You call getBoundingClientRect() or read offsetHeight, and the browser synchronously calculates layout to give you an answer. Do this for 500 text blocks in a virtual list and you've forced 500 of these pauses. This pattern, called layout thrashing, remains a leading cause of visual stuttering in complex web applications.
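The core of "predict height without reading the DOM" is just arithmetic over cached glyph widths. Here is a minimal sketch of that idea under assumed names (this is not Pretext's actual API, and a real engine also handles segmentation, bidi, and kerning, which is why it measures whole segments via canvas):

```python
# Hypothetical sketch: measure each word once (modeled by the width_of
# lookup table), then predict a block's height with pure arithmetic,
# never touching layout.
def predict_height(words, width_of, max_width, space_w, line_h):
    """Greedy word wrap over cached widths; returns predicted block height."""
    lines, line_w = 1, 0.0
    for w in words:
        ww = width_of[w]  # cached measurement, paid once per unique word
        needed = ww if line_w == 0 else line_w + space_w + ww
        if needed > max_width and line_w > 0:
            lines += 1    # break before this word
            line_w = ww
        else:
            line_w = needed
    return lines * line_h
```

Running this for 500 chat messages is a few hundred thousand additions and comparisons, which is why the fast path lands in fractions of a millisecond while 500 DOM reads force 500 layout passes.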
Pretext's insight is that canvas text measurement uses the same font engine as DOM rendering but operates outside the browser's layout process entirely. Measure a word via canvas, cache the width, and from that point forward layout becomes pure arithmetic: walk cached widths, track running line width, and insert breaks when you exceed the container's maximum. No slow measurement reads, and no synchronous pauses. The architecture separates this into two phases. A preparation phase does the expensive work once: normalize whitespace, segment the text with a locale-aware segmenter for word boundaries, handle bidirectional text (such as mixing English and Arabic), measure segments with canvas, and return a reusable reference. Measurement is then pure calculation over cached widths, taking about 0.09ms for a 500-text batch against roughly 19ms for the equivalent DOM reads. Cheng Lou himself calls the 500x comparison "unfair" since it excludes the one-time cost, but that cost is only paid once and spread across every subsequent call. It runs once when the text appears, and every subsequent resize takes the fast path, where the performance boost is real and substantial. The core idea traces back to Sebastian Markbåge's research at Meta, where Cheng Lou implemented the earlier prototype that proved canvas font metrics could substitute for DOM measurement. Pretext builds on that foundation with production-grade internationalization, bidirectional text support, and the two-phase architecture that makes the fast path so fast. Lou has a track record here: react-motion and ReasonML both followed the same pattern of identifying a constraint everyone accepted as given and removing it with a better abstraction. The first use case Pretext serves, and the one I want to make the case for, is measuring text height so you can render DOM text nodes in exactly the right position without ever asking the browser how tall they are. This isn't a compromise path; it's the most capable thing the library does. Consider a virtual scrolling list of 500 chat messages.
To render only the visible ones, you need to know each message's height before it enters the viewport. The traditional approach is to insert the text into the DOM, measure it, and then position it, paying the layout cost for every message. Pretext lets you predict the height mathematically and then render the text node at the right position. The text itself still lives in the DOM, so the accessibility model, selection behavior, and find-in-page all work exactly as they would with any other text node. In practice this takes two function calls: the first measures and caches, the second predicts height through calculation. No layout cost, yet the text you render afterward is a standard DOM node with full accessibility. The shrinkwrap demo is the clearest example of why this path matters. CSS sizes a container to the widest wrapped line, which wastes space when the last line is short. There's no CSS property that says "find the narrowest width that still wraps to exactly N lines." Pretext calculates the optimal width mathematically, and the result is a tighter chat bubble rendered as a standard DOM text node. The performance gain comes from smarter measurement, not from abandoning the DOM. Nothing about the text changes for the end user. Accordion sections whose heights are calculated from Pretext, and masonry layouts with height prediction instead of DOM reads: these both follow the same model of fast measurement feeding into standard DOM rendering. There are edge cases worth knowing about, starting with the fact that the prediction is only as accurate as the font metrics available at measurement time, so fonts need to be loaded before measurement runs or results will drift. Ligatures (where two characters merge into one glyph, like "fi"), advanced font features, and certain CJK composition rules can introduce tiny differences between canvas measurement and DOM rendering.
These are solvable problems and the library handles many of them already, but acknowledging them is part of taking the approach seriously rather than treating it as magic. Pretext also supports manual line layout for rendering to Canvas, SVG, or WebGL. These APIs give you exact line coordinates so you can paint text yourself rather than letting the DOM handle it. This is the path that went viral, and the one that dominates every community showcase. The canvas demos are impressive and they're doing things the DOM genuinely can't do at 60fps. But they're also painting pixels, and when you paint text as canvas pixels, the browser has no idea those pixels represent language. Screen readers like VoiceOver, NVDA, and JAWS derive their understanding of a page from the accessibility tree, which is itself built from the DOM, so canvas content is invisible to them. Browser find-in-page and translation tools both skip canvas pixels entirely. Native text selection is tied to DOM text nodes and canvas has no equivalent, so users can't select, copy, or navigate the content by keyboard. A canvas element is also a single tab stop, meaning keyboard users can't move between individual words or paragraphs within it, even if it contains thousands of words. In short, everything that makes text behave as text rather than an image of text disappears. None of this means the canvas path is automatically wrong. There are legitimate contexts where canvas text rendering is the right choice: games, data visualizations, creative installations, and design tools that have invested years in building their own accessibility layer on top of canvas. For SVG rendering, the trade-offs are different again, since SVG text elements do participate in the accessibility tree, making it a middle ground between DOM and canvas. But the canvas path is not the breakthrough, because canvas text rendering has existed for fifteen or more years across dozens of libraries.
What none of them offered was a way to predict DOM text layout without paying the layout cost. Pretext's measurement APIs do exactly that, and it's genuinely new. This pattern often repeats across the frontend ecosystem, and I understand why. A dragon parting text like water is something you can record as a GIF, post to your socials, and collect thousands of impressions. A virtual scrolling list that pre-calculates text heights looks identical to one that doesn't. The performance difference is substantial but invisible to the eye. Nobody makes a showcase called "works flawlessly with VoiceOver" or "scrolls 10,000 messages without a single forced layout" because these things look like nothing. They look like a web page working the way web pages are supposed to work. This is Goodhart's Law applied to web performance: once a metric becomes a target, it ceases to be a good measure. Frame rate and layout cost are proxies for "does this work well for users." GitHub stars are a proxy for "is this useful." When the proxy gets optimized instead, in this case by visually impressive demos that happen to use the path with the steepest accessibility trade-offs, the actual signal about what makes the library important gets lost. The library's identity gets set by its most visually impressive feature in the first 72 hours, and the framing becomes "I am drawing things" rather than "I am measuring things faster than anyone has before." Once that framing is set, it's hard to shift. The best text-editing libraries on the web, CodeMirror, Monaco, and ProseMirror, all made the deliberate choice to stay in the DOM even when leaving it would have been faster, because the accessibility model isn't optional. Pretext's DOM measurement path belongs in that tradition but goes further: those editors still read from the DOM when they need to know how tall something is. Pretext eliminates that step entirely, predicting height through arithmetic before the node is ever rendered.
It's the next logical step in the same philosophy: keep text where it belongs, but stop paying the measurement cost to do so. I've been thinking about performance engineering as a discipline for most of my career, and what strikes me about Pretext is that the real innovation is the one that is hardest to see. Predicting how text will lay out before it reaches the page, while keeping the text in the DOM and preserving everything that makes it accessible, is a genuinely new capability on the web platform. It's the kind of foundational improvement that every complex text-heavy application can adopt immediately. If you're reaching for Pretext this week, reach for the measurement path first. Build something that keeps text in the DOM and predicts its height without asking the browser. Ship an interface that every user can read, select, search, and navigate. Nobody else has done this yet, and it deserves building. Performance engineering is at its best when it serves everyone without asking anyone to give something up. Faster frame rates that don't make someone nauseous. Fewer layout pauses that mean a page responds when someone with motor difficulties needs it to. Text that is fast and readable and selectable and translatable and navigable by keyboard and comprehensible to a screen reader. The dragons are fun. The measurement engine is important. Let's try not to confuse the two.


Walking backwards into the future – A look at descriptor heap in Granite

It seems like I can never quite escape the allure of fiddling with bits more efficiently every passing year. I recently went through the process of porting over Granite’s Vulkan backend to use VK_EXT_descriptor_heap. There wasn’t exactly a burning need to do this work, but science demands I sacrifice my limited free time for these experiments. My name may or may not be on the extension summary, and it’s important to eat your own dog food. In this post, I want to explore ways in which we can port over an old school binding model to newer APIs should the need arise. Granite’s binding model is designed for really old Vulkan. The project started in January 2017 after all, at which point Vulkan was in its infancy. Bindless was not really a thing yet, and I had to contend with really old mobile hardware. Slot-based bindings have been with us since OpenGL and early D3D. I still think it’s a fine model from a user’s perspective. I have no problem writing plain bind-this-resource-to-that-slot code. It’s very friendly to tooling and validation and I just find it easy to use overall. GPU performance is great too since vendors have maximal flexibility in how to implement the API. The major downside is the relatively heavy CPU cost associated with it since there are many API calls to make. In my projects, it’s rarely a concern, but when doing heavy CPU-bound workloads like PS2 GS emulation, it did start to matter quite a bit. When SPIR-V shaders are consumed in Granite, they are automatically reflected, based on the set and binding decorations in the GLSL. I automatically generate a VkDescriptorSetLayout for each unique set, and combine these into a VkPipelineLayout as one does. VkDescriptorSetLayouts are hash’n’cached into a DescriptorSetAllocator. The implicit assumption by shaders I write is that low-frequency updates have lower set values. This matches Vulkan’s pipeline layout compatibility rules too. Given the hardcore descriptor churn this old model can incur, UBOs originally used VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC.
Since linearly allocating new UBOs per draw is a hot path, I wanted to avoid having to allocate and write new descriptor sets all the time. This is precisely what the dynamic buffer types were designed for. I did not use it for SSBOs since DYNAMIC has some unfortunate interactions with descriptor size, since you cannot change the size, only the offset. The size of UBOs is somewhat irrelevant, and I just hardcoded in a 64K window. There are two main strategies for allocating sets from a VkDescriptorPool, both of which are kinda bad. The typical model I believe most use is the “jumbo” allocator where you create a big pool with many sets and many descriptors with different descriptor types and pray for the best. When the pool is OOM-ed, allocate another. One unfortunate thing about the jumbo pool is that you can’t really know up front exactly how to balance the descriptor types properly. It will always be a shaky heuristic. In raw Vulkan 1.0, it was straight up illegal to allocate any further once a limit had been reached, causing even more headaches. The very first maintenance extension to Vulkan fixed this and added OUT_OF_POOL_MEMORY, which allows applications to just keep going until the pool is exhausted. Fun fact: some vendors would never exhaust the pool and just straight up ignore what you pass into vkCreateDescriptorPool, so that’s fun. Granite went the route of a slab allocator per VkDescriptorSetLayout instead, one allocator per thread. Allocate a group of like 64 VkDescriptorSets in one go and parcel them out as needed. The main advantage here was no need to keep calling vkAllocateDescriptorSets over and over, and in the early years, I even hash’n’cached the descriptor sets. The primary reason for doing that was that some early mobile drivers were extreeeeeeeemely slow at vkUpdateDescriptorSets for some reason. Not a great time. This slab approach led to memory bloat though.
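The slab strategy above can be modeled in a few lines. This is an illustrative Python sketch of the idea (not Granite's actual C++): one allocator per set layout, grabbing 64 sets per vkAllocateDescriptorSets-style batch so the per-draw cost is just popping a free list.

```python
# Hypothetical model of a per-layout slab allocator for descriptor sets.
class SlabSetAllocator:
    SLAB = 64

    def __init__(self, allocate_batch):
        # allocate_batch(n) models one vkAllocateDescriptorSets call
        # returning n opaque set handles from a pool.
        self.allocate_batch = allocate_batch
        self.free = []
        self.api_calls = 0

    def get_set(self):
        # Hot path: usually a plain list pop, no "API call" at all.
        if not self.free:
            self.free = list(self.allocate_batch(self.SLAB))
            self.api_calls += 1
        return self.free.pop()

    def recycle(self, handles):
        # Once a frame's sets are no longer in flight, reuse them.
        self.free.extend(handles)
```

The memory bloat the post mentions falls out naturally: every layout on every thread holds up to a slab's worth of sets it may never hand out again.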
At some point VK_KHR_descriptor_update_template was added, which aims to accelerate vkUpdateDescriptorSets. Instead of having the driver parse the structs and switch on the descriptorType to write descriptors, the update template allows drivers in theory to “precompile” a highly optimized function that updates descriptors based on the template that is provided in vkCreateDescriptorUpdateTemplate. This was a nice incremental thing to add to Granite. I don’t think the promise of update templates really worked out in the end though. Most drivers I think just resorted to parsing the original template instead, leading to no speedup. Push descriptors were designed quite early on in Vulkan’s life, but their adoption was … spotty at best. They didn’t make it into core until Vulkan 1.4! Push descriptors solved some issues for us slot and binding troglodytes since there was simply no need to mess around with allocating sets and pools when we could just push descriptors and the driver would deal with it. The major downside is that only one descriptor set can be a push set, but in Granite’s case, I could design for that limitation when writing shaders. The last set index in a VkPipelineLayout would get assigned as a push set. After going push descriptors, I dropped the old UBO_DYNAMIC path, since push descriptors are not compatible with it, and the UBO_DYNAMIC wins were … questionable at best anyway. It took a while to move to this model though. The AMD Windows driver was infamously dragging its feet for years before finally accepting reality, and at that point I was ready to move over. It’s still not a hard requirement in Granite due to mobile concerns, but then the driver hits the slow path, and I don’t really care anymore. At some point, any modern renderer has to deal with bindless, and Granite hit this wall with clustered shading, where an array of shadow maps became a hard necessity.
I’m not a big fan of “everything is bindless” myself, since I think it makes debugging way more annoying and stresses tooling and validation more than it should, but sometimes the scissor juggling is necessary. When Granite reflects a shader that declares a runtime-sized array of sampled images, the set layout is converted into an UPDATE_AFTER_BIND set with VARIABLE_COUNT array length. There is also a special helper function to aid in allocating these bindless sets, where the API mostly turns into a single call that allocates and writes the whole variable-count set. The CPU overhead of this isn’t quite trivial either, but with the set and pool model, it’s not easy to escape this reality without a lot of rewrites. For now, I only support sampled images with bindless and I never really had any need or desire to add more. For bindless buffers, there is the glorious buffer_device_address instead. This model has served and keeps serving Granite well. Once this model is in place, the only real reason to go beyond this for my use cases is performance (and curiosity). VK_EXT_descriptor_buffer asks the question of what happens when we just remove the worst parts of the descriptor API: sets are now backed by a slice of memory, and pools are replaced by a big descriptor buffer that is bound to a command buffer. Some warts remain however, as VkDescriptorSetLayout and VkPipelineLayout persist. If you’re porting from the legacy model like I was, this poses no issues at all, and actually reduces the friction. Descriptor buffers are a perfectly sound middle-ground alternative for those who aren’t a complete bindless junkie yet, but want some CPU gains along the way. In the ideal use case for descriptor buffers, we have one big descriptor buffer that is always bound. This is allocated with PCI-e BAR on dGPUs, so DEVICE_LOCAL | HOST_VISIBLE. Instead of allocating descriptor sets, the command buffer performs a linear allocation which is backed by slices allocated from the global descriptor buffer. No API calls needed.
The size to allocate for a VkDescriptorSet is queried from the set layout itself, and each descriptor is assigned an offset that the driver controls. There is a wart in the spec where the min-spec for sampler descriptor buffers is very small (4K samplers). In this case, there is a risk that just linearly allocating out of the heap will trivially OOM the entire thing and we have to allocate new sampler descriptor buffers all the time. In practice, this limitation is completely moot. Granite only opts into descriptor buffers if the limits are reasonable. There is supposed to be a performance hit to rebinding descriptor buffers, but in practice, no vendor actually ended up implementing descriptor buffers like that. However, since VK_EXT_descriptor_heap will be way more strict about these kinds of limitations, I designed the descriptor_buffer implementation around the single global heap model to avoid rewrites later. There is certainly a risk of going OOM when linearly allocating like this, but I’ve never hit close to the limits. It’s not hard to write an app that would break Granite in half though, but I consider that a “doctor, my GPU hurts when I allocate like this” kind of situation. This is where we should have a major win, but it’s not all that clear. For each descriptor type, I have different strategies on how to deal with them. The basic idea of descriptor buffers is that we can call vkGetDescriptorEXT to build a descriptor in raw bytes. This descriptor can now be copied around freely by the CPU with e.g. memcpy, or even on the GPU in shaders (but that’s a level of scissor juggling I am not brave enough for). Sampled images and samplers are the simplest ones to contend with. Descriptor buffers still retain the VkImageView and VkSampler objects. The main addition I made was to allocate a small payload up front and write the descriptor once at view creation time. Binding then replaces vkUpdateDescriptorSets with a trivial memcpy of that payload.
The memcpy functions are function pointers that resolve the byte count. This is a nice optimization since the memcpy functions can unroll to perfectly unrolled SIMD load-stores. Allocating bindless sets of sampled images with this method becomes super efficient, since it boils down to one batched copy of pre-built descriptor payloads into the set’s slice of the buffer. Texel buffers: I rarely use these, but they are also quite neat in descriptor buffers. VkBufferView is gone now, so we just need to create a descriptor payload once from a VkDeviceAddress and it’s otherwise the same as above. Combined image samplers: this descriptor type is somewhat of a relic these days, but anyone coming from a GL/GLES background instead of D3D will likely use it out of old habit, me included. The API here is slightly more unfortunate, since there is no obvious way to create these descriptors up-front. We don’t necessarily know all the samplers an image will be combined with, so we have to do it last minute, calling vkGetDescriptorEXT to create the combined descriptor. We cannot meaningfully pre-create descriptors for UBOs and SSBOs either, so we’re in a similar situation where we have to call vkGetDescriptorEXT for each buffer last-minute. Unfortunately, there is no array-of-descriptors version of vkGetDescriptorEXT, so in the extreme cases, descriptor buffers can actually have worse CPU overhead than the legacy model. DXVK going via winevulkan .dll <-> .so translation overhead has been known to hit this, but for everyone else I’d expect the difference to be moot. Since descriptor buffer is an incremental improvement over the legacy model, we retain optional support for push descriptors. This can be useful in some use cases (it’s critical for vkd3d-proton), but Granite does need it. Once we’re in descriptor buffer land, we’re locked in. Descriptor buffers are battle tested and very well supported at this point. Perhaps not on very old mobile drivers, but slightly newer devices tend to have it, so there’s that! RenderDoc has solid support these days as well.
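The write-once-then-memcpy flow described above can be sketched as a bump allocator over one big byte buffer. This is an illustrative model only; real code uses the driver-reported descriptor sizes from the descriptor buffer properties, and the payload contents are opaque driver data, not an ID.

```python
# Hypothetical model of "build descriptor bytes once, memcpy at bind time".
class DescriptorArena:
    def __init__(self, size, descriptor_size):
        self.buf = bytearray(size)   # models the big HOST_VISIBLE buffer
        self.head = 0
        self.dsize = descriptor_size

    def build_payload(self, view_id):
        # Models vkGetDescriptorEXT: done once, at image-view creation.
        return view_id.to_bytes(4, "little").ljust(self.dsize, b"\0")

    def alloc_set(self, payloads):
        # Linear allocation plus plain byte copies; no per-bind API calls.
        offset = self.head
        self.head += len(payloads) * self.dsize
        for i, p in enumerate(payloads):
            start = offset + i * self.dsize
            self.buf[start:start + self.dsize] = p
        return offset
```

Binding a set then reduces to handing the returned offset to the command buffer, which is why the per-draw cost collapses to a push_back and a memcpy.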
At a quick glance, descriptor heap looks very similar to D3D12 (and it is), but there are various additions on top to make it more compatible with the various binding models that exist out there in the wild, especially for people who come from a GL/Vulkan 1.0 kind of engine design. The normal D3D12 model has some flaws if you’re not fully committed to bindless all day every day. The split between a large resource heap and a small sampler heap matches how some hardware works, nothing too complicated. I allocate the supported ~1 million resource descriptors and 4096 samplers. There is a reserved region for descriptors as well, which is new to this extension. In D3D12 this is all abstracted away since applications don’t have direct access to the descriptor heap memory. For the resource heap, we have a 512 K descriptor area which can be freely allocated from, like we did with descriptor buffer. Unlike descriptor buffer where we hammer this arena allocator all the time, we will only rarely need to touch it with descriptor heap. The next ~500k or so descriptors are dedicated to holding the descriptor payloads for VkImageView, VkSampler and VkBufferView. All of these objects are now obsolete. When Granite creates a Vulkan::ImageView, it internally allocates a free slab index from this upper region, writes the descriptor there and stores the heap index instead. This enables “true” bindless in a performant way. We could have done this before if we wanted to, but in descriptor buffer we would have eaten a painful indirection on a lot of hardware, which is not great. Some Vulkan drivers actually work just like this internally. You can easily tell, because some drivers report that an image descriptor is just sizeof(uint32_t). We’d have our index into the “heap”, which gets translated into yet another index into the “true” (hidden) heap. Chasing pointers is bad for perf as we all know.
We keep a copy of the descriptor payload in CPU memory too, in case we have to write to the arena-allocated portion of the heap later. The upper region of ~10k descriptors or so (depends on the driver) is just a reserved region we bind and never touch. It’s there so that drivers can deal with CmdResolveImage, CmdBlitImage and other such special APIs that internally require descriptors. For samplers, there is no arena allocator. The sampler heap is so tiny. Instead, when creating a sampler, we allocate a slab index and return a dummy handle by just pointer-casting the index instead. We’ll make good use of the mapping APIs later to deal with this lack of arena allocation. In fact, we will never have to copy sampler descriptor payloads around, and we don’t have to mess around with static samplers either, neat! For the static sampler crowd, there is full support for embedded samplers which function just like D3D12 static samplers, so there’s that, but Granite doesn’t use it. It was a non-trivial amount of code to get to this point, but hey, that’s what happens when you try to support 3 descriptor models at once I guess … Core Vulkan 1.0 settled on 128 bytes of push constants being the limit. This was raised in Vulkan 1.4, but Granite keeps the old limit (I could probably live with 32 or 64 bytes to be fair). Push data expands this to 256 bytes as a minimum, and the main idea behind descriptor heap is that pipeline layouts are completely gone, and we get to decide how the driver should interpret the push data space. This is similar to D3D12 root parameters except it’s not abstracted behind a SetRootParameter() kind of interface that is called one at a time. In Vulkan, we can call CmdPushDataEXT once. VkPipelineLayout and VkDescriptorSetLayout are just gone now, poof, do not exist at all. This is huge for usability. Effectively, we can pretend that the VkPipelineLayout is now just a push constant range of 256 bytes, and that’s it.
If you’re fully committed to going bindless, we could just do the equivalent of SM 6.6 ResourceDescriptorHeap and SamplerDescriptorHeap and buffer_device_address to get everything done. However, Granite is still a good old slot based system, so I need to use the mapping features to tell the driver how to translate set/binding into actual descriptors. This mapping can be different per-shader too, which fixes a lot of really annoying problems with EXT_graphics_pipeline_library and EXT_shader_object if I feel like going down that path in the future. The natural thing to do for me was to split up the space into a maximum of 128 bytes of push constants, then 32 bytes per descriptor set (I support 4 sets, the Vulkan 1.0 min-spec). It’s certainly possible to parcel out the data more intelligently, but that causes some issues with set compatibility which I don’t want to deal with. For every set, I split it up into buffers and images and decide on a strategy for each. Buffers are decided first since they have the largest impact on performance in my experience. Inline buffer addresses are very simple. If there are 3 or fewer buffers in a set (24 bytes), we can just stuff the raw pointers into push data and tell the driver to use that pointer. This is D3D12 root descriptors in a nutshell. Especially for UBOs, this is very handy for performance. We lose robustness here, but I never rely on buffer robustness anyway. The push data block for the set then simply holds the raw VkDeviceAddresses back to back. Indirect addresses are a new Vulkan speciality. Without modifying the shaders, we can tell the driver to load a buffer device address from a pointer in push data instead. This way we don’t have to allocate from the descriptor heap itself; we can just do a normal linear UBO allocation, write some VkDeviceAddresses in there and have fun. Given the single indirection to load the “descriptor” here, this looks a lot like Vulkan 1.0 descriptor sets, except there’s no API necessary to write them. Allocating from the heap isn’t the ideal path, but sometimes we’re forced to do it.
This can happen in a handful of cases, for example when a set needs more buffer descriptors than the inline push data budget can hold. This is pretty much D3D12’s root tables, but in Vulkan we can be a bit more optimal with memory since buffer descriptors tend to be smaller than image descriptors and we can pack them tightly. D3D12 has one global stride for any resource descriptor while Vulkan exposes separate sizes that applications can take advantage of. vkWriteResourceDescriptorsEXT is required here to write the SSBO descriptors. After buffers are parceled out for a descriptor set, we have some space left for images. At minimum, we have 8 bytes left (32 – 3 * sizeof(VkDeviceAddress)). Inline heap indices are the common and ideal case. If we don’t have any arrays of images, we can just have a bunch of uint32_t indices directly into the heap. At image view and buffer view creation time, we already allocated a persistent index into the heap that we can refer to. No API calls required when emitting commands. Combined image samplers work quite well in this model, because Vulkan adds a special mapping mode that packs both the sampler index and the image index together. This fixes one of the annoying issues in EXT_descriptor_buffer. If we cannot use the simple inline indices, we have two options. The preferred one right now is to just allocate space in the descriptor heap like the descriptor buffer path, because I’m quite concerned with unnecessary indirections when possible. At least we get to copy the payloads around without API commands. This path is also used for bindless sets. Unlike the descriptor buffer path, there is a major problem, which is that linearly allocating from the sampler heap is not viable. The sampler heap is really small now, just like in D3D12. In this case, Vulkan has an answer: a special feature that functions like an indirect root table. It is similar to INDIRECT_ADDRESS in that we don’t have to allocate anything from the heap directly and we can just stuff heap indices straight into a UBO.
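The push data carve-up described above is just offset arithmetic. Here is a sketch of my reading of the scheme (not Granite's actual code): 128 bytes of push constants, then 32 bytes per descriptor set for 4 sets, and within a set up to three 8-byte buffer addresses inline with the remainder holding 4-byte heap indices for images.

```python
# Hypothetical layout calculator for the 256-byte push data block.
PUSH_DATA_SIZE = 256
PUSH_CONSTANT_SIZE = 128
SET_STRIDE = 32
ADDR_SIZE = 8    # sizeof(VkDeviceAddress), inline buffer pointer
INDEX_SIZE = 4   # sizeof(uint32_t), inline heap index for an image

def set_layout(set_index, num_buffers):
    """Return (set_offset, inline buffer offsets, inline image index slots)."""
    assert 0 <= set_index < 4 and num_buffers <= 3
    base = PUSH_CONSTANT_SIZE + set_index * SET_STRIDE
    buffer_offsets = [base + i * ADDR_SIZE for i in range(num_buffers)]
    remaining = SET_STRIDE - num_buffers * ADDR_SIZE
    return base, buffer_offsets, remaining // INDEX_SIZE
```

A set with the full three inline buffers keeps 8 bytes for images, i.e. two u32 heap indices, which matches the "at minimum, we have 8 bytes left" arithmetic above.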
Overall, I think these new mapping types allow us to reuse old shaders quite effectively and it’s possible to start slowly rewriting shaders to take full advantage of descriptor_heap once this machinery is in place. For GPU performance, it seemed to be on-par with the other descriptor models on NVIDIA and AMD, which was expected. Granite does not really hit the cases where descriptor_heap should meaningfully improve GPU performance over descriptor_buffer, but I only did a rough glance. For CPU performance, things were a bit more interesting, and I learned that Granite has quite significant overhead on its own, which is hardly surprising. That’s the cost of an old-school slot and binding model after all, and I never did a serious optimization pass over it. A more forward-looking rendering abstraction can eliminate most, if not all, of this overhead. The numbers here are for RADV, but it’s using the pending merge request for descriptor_heap support. ~27 us to write 4096 image descriptors on a Ryzen 3950x with a RX 6800. This is basically exactly the same. ~13 us. This is really just a push_back and memcpy bench at this point. This case hits the optimal inline BDA case for heap. ~ 279 ns per dispatch. Doesn’t feel very impressive. Basically same perf, but lots of overhead has now shifted over to Granite. Certainly things can be optimized further. GetDescriptorEXT is somehow much faster than UpdateDescriptorSetWithTemplate though. ~ 157 ns / dispatch now, and most of the overhead is now in Granite itself, which is ideal. I added an extra buffer descriptor per set which hits the INDIRECT_ADDRESS path. Heap regressed significantly, but it’s all in Granite code at least. Likely related to having to page in new UBO blocks, but I didn’t look too closely. ~ 375 ns / dispatch, hnnnnnng. The other paths don’t change much, as expected. About ~ 310 ns / dispatch for legacy and descriptor buffer models. This is the happy path for descriptor heap. ~ 161 ns / dispatch ~ 166 ns.
Quite interesting that it got slower. The slab allocator for legacy sets seems to be doing its job very well. The actual descriptor copying vanished from the top list at least. ~ 145 ns. A very modest gain, and most of the overhead is now just Granite jank. All the paths look very similar now. ~ 170 ns or so. On RTX 4070 with 595 drivers. The improvements, especially for buffers, are quite large on NV, interestingly enough. For the legacy buffer tests, it’s heavily biased towards driver overhead: For the image tests the gains are modest, which is somewhat expected given how NV implements image descriptors before descriptor heap. It’s just some trivial u32 indices. Overall, it’s interesting how well the legacy Vulkan 1.0 model holds up here, at least on RADV with my implementation. Descriptor buffer and heap cannot truly shine unless the abstraction using it is written with performance in mind. This sentiment is hardly new. Just porting OpenGL-style code over to Vulkan doesn’t give amazing gains, just like porting old and crusty binding models won’t magically perform with newer APIs either. Either way, this level of performance is good enough for my needs, and the days of spamming out 100k draw calls are kinda over anyway, since it’s all GPU driven with large bindless data sets these days. Adding descriptor buffer and heap support to Granite was generally motivated by curiosity rather than a desperate need for perf, but I hope this post serves as an example of what can be done. There’s a lot of descriptor heap that hasn’t been explored here. GPU performance for heavily bindless workloads is another topic entirely, and I also haven’t really touched on how it would be more practical to start writing code like: which would side-step almost all Granite overhead. Overall, I quite like what we’ve got now with descriptor heap as an API, a bastard child of descriptor buffer and D3D12 that gets the job done.
As tooling and driver support matures, I will likely just delete the descriptor buffer path, keeping the legacy stuff around for compatibility.

- VkDescriptorSet
- VkDescriptorPool
- vkUpdateDescriptorSets (kinda)
- VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE
- VK_DESCRIPTOR_TYPE_STORAGE_IMAGE
- VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT
- VK_DESCRIPTOR_TYPE_SAMPLER
- VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER
- VK_DESCRIPTOR_TYPE_STORAGE_BUFFER
- VK_DESCRIPTOR_TYPE_ACCELERATION_STRUCTURE_KHR

You very quickly end up having to call CopyDescriptorsSimple a LOT to shuffle descriptors into the heap. Since this is a call into the driver just to copy a few bytes around, it can quickly be a source of performance issues. In vkd3d-proton, we went to hell and back to optimize this case because in many titles, it was the number 1 performance overhead. Dealing with samplers is a major pain. The 2K sampler heap limit can be rather limiting since there is no good way to linearly allocate on such a small heap. Static samplers are quite common as a result, but they have other problems. Recompiling shaders because you change Aniso 4x to 8x in the settings menu is kinda a hilarious situation to be in, but some games have been known to do just that …

- The shader is using OpArrayLength on an SSBO. We need real descriptors in this case. The current implementation just scans the SPIR-V shader module for this instruction, but could be improved in theory.
- The shader is using an array of descriptors. For buffers, this should be very rare, but the PUSH_ADDRESS and INDIRECT_ADDRESS interfaces do not support this.
- Robustness is enabled.

- Test #1: Write 4096 image descriptors: 17.6 us (copies u32 indices)
- Test #2: 693 ns
- Test #3: 726 ns
- Test #4: 377 ns
- Test #5: 408 ns

- Test #1: 10.2 us (copies u32 indices)
- Test #2: 434 ns
- Test #3: 479 ns
- Test #4: 307 ns
- Test #5: 315 ns

- Test #1: 11 us (copies real 32 byte descriptors)
- Test #2: 389 ns
- Test #3: 405 ns
- Test #4: 321 ns
- Test #5: 365 ns

Simon Willison 2 weeks ago

Vibe coding SwiftUI apps is a lot of fun

I have a new laptop - a 128GB M5 MacBook Pro, which early impressions show to be very capable for running good local LLMs. I got frustrated with Activity Monitor and decided to vibe code up some alternative tools for monitoring performance and I'm very happy with the results. This is my second experiment with vibe coding macOS apps - the first was this presentation app a few weeks ago . It turns out Claude Opus 4.6 and GPT-5.4 are both very competent at SwiftUI - and a full SwiftUI app can fit in a single text file, which means I can use them to spin something up without even opening Xcode. I’ve built two apps so far: Bandwidther, which shows me what apps are using network bandwidth, and Gpuer, which shows me what’s going on with the GPU. At Claude’s suggestion both of these are now menu bar icons that open a panel full of information. I built this app first, because I wanted to see what Dropbox was doing. It looks like this: I’ve shared the full transcript I used to build the first version of the app. My prompts were pretty minimal: Show me how much network bandwidth is in use from this machine to the internet as opposed to local LAN (My initial curiosity was to see if Dropbox was transferring files via the LAN from my old computer or was downloading from the internet.) mkdir /tmp/bandwidther and write a native Swift UI app in there that shows me these details on a live ongoing basis This got me the first version, which proved to me this was worth pursuing further. git init and git commit what you have so far Since I was about to start adding new features. Now suggest features we could add to that app, the goal is to provide as much detail as possible concerning network usage including by different apps The nice thing about having Claude suggest features is that it has a much better idea for what’s possible than I do.
We had a bit of back and forth fixing some bugs, then I sent a few more prompts to get to the two column layout shown above: add Per-Process Bandwidth, relaunch the app once that is done now add the reverse DNS feature but make sure original IP addresses are still visible too, albeit in smaller typeface redesign the app so that it is wider, I want two columns - the per-process one on the left and the rest on the right OK make it a task bar icon thing, when I click the icon I want the app to appear, the icon itself should be a neat minimal little thing The source code and build instructions are available in simonw/bandwidther . While I was building Bandwidther in one session I had another session running to build a similar tool for seeing what the GPU was doing. Here’s what I ended up with: Here's the transcript . This one took even less prompting because I could use the in-progress Bandwidther as an example: I want to know how much RAM and GPU this computer is using, which is hard because stuff on the GPU and RAM does not seem to show up in Activity Monitor This collected information using and and gave me an answer - more importantly it showed me this was possible, so I said: Look at /tmp/bandwidther and then create a similar app in /tmp/gpuer which shows the information from above on an ongoing basis, or maybe does it better After a few more changes to the Bandwidther app I told it to catch up: Now take a look at recent changes in /tmp/bandwidther - that app now uses a sys tray icon, imitate that This remains one of my favorite tricks for using coding agents: having them recombine elements from other projects. The code for Gpuer can be found in simonw/gpuer on GitHub. These two apps are classic vibe coding: I don't know Swift and I hardly glanced at the code they were writing. More importantly though, I have very little experience with macOS internals such as the values these tools are measuring. 
I am completely unqualified to evaluate if the numbers and charts being spat out by these tools are credible or accurate! I've added warnings to both GitHub repositories to that effect. This morning I caught Gpuer reporting that I had just 5GB of memory left when that clearly wasn't the case (according to Activity Monitor). I pasted a screenshot into Claude Code and it adjusted the calculations and the new numbers look right, but I'm still not confident that it's reporting things correctly. I only shared them on GitHub because I think they're interesting as an example of what Claude can do with SwiftUI. Despite my lack of confidence in the apps themselves, I did learn some useful things from these projects:

- A SwiftUI app can get a whole lot done with a single file of code - here's GpuerApp.swift (880 lines) and BandwidtherApp.swift (1063 lines).
- Wrapping various terminal commands in a neat UI with Swift is easily achieved.
- Claude has surprisingly good design taste when it comes to SwiftUI applications.
- Turning an app into a menu bar app is just a few lines of extra code as well.
- You don't need to open Xcode to build this kind of application!

These two apps took very little time to build and have convinced me that building macOS apps in SwiftUI is a new capability I should consider for future projects.

Farid Zakaria 2 weeks ago

Does anyone actually use the large code-model?

I have been focused lately on trying to resolve relocation overflows when compiling large binaries in the small & medium code-models. Often when talking to others about the problem, they are quick to offer the idea of using the large code-model. Despite the performance downsides of using the large code-model from the instructions generated, it’s true that its intent was to support arbitrarily large binaries. However, does anyone actually use it? It turns out that large binaries do not only affect the instructions generated in the section but may also have effects on other sections within the ELF file such as (exception handling information), (optimized binary search table for ), and even . Let’s take and as an example. They specifically allow various encodings for the data within them ( or for 4 bytes and 8 bytes respectively) irrespective of the code-model used. However, it looks like the userland has terrible support for it! If we look at the format, we can see how these encodings are applied in practice. The entries in this column are the ones that actually resolve to specific DWARF exception header encoding formats (like , , , etc.) depending on the values provided in the preceding fields. format [ ref ]: Note: The values for and dictate their byte size and format. For example, if is set to , the field will be processed as an (signed 4-byte) value. Up until very recently ( pull#179089 ), LLVM’s linker would crash if it tried to link exception data ( ) beyond 2GiB. This section is always generated to help stack searching algorithms avoid linear search. Once we fix that though, it looks like ( gcc-patch@ ) and ( pull#964 ) explicitly either crash on or avoid the binary search table completely, reverting back to linear search. How devastating is linear search here? If you have a lot of exceptions, which you theoretically might for the large code-model, I had benchmarks that started at ~13s and improved to ~18ms, a ~700x speedup .
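The gap between the two lookup strategies is easy to picture with a toy sketch: binary search over sorted function start addresses (what a search table enables) versus the linear scan unwinders fall back to without one. The addresses below are hypothetical; this is not real unwind-table parsing:

```python
import bisect

# Hypothetical sorted function start addresses (one entry per "FDE").
starts = list(range(0, 1_000_000, 16))

def linear_lookup(pc):
    """O(n): walk every entry until we pass pc."""
    last = None
    for s in starts:
        if s <= pc:
            last = s
        else:
            break
    return last

def binary_lookup(pc):
    """O(log n): bisect into the sorted table."""
    i = bisect.bisect_right(starts, pc) - 1
    return starts[i] if i >= 0 else None

assert linear_lookup(123_456) == binary_lookup(123_456) == 123_456
```

With a lot of entries, the O(n) scan per exception is what turns into the multi-second benchmark times mentioned above.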
Other fun failure modes that exist: Note: Don’t let confuse you, it’s actually 32-bit: It seems like the large code-model “exists” but no one is using it for its intended purpose, which was to build large binaries. I am working to make massive binaries possible without the large code-model while retaining much of the performance characteristics of the small code-model. You can read more about it in the x86-64-abi Google group where I have also posted an RFC.

Max Bernstein 2 weeks ago

Using Perfetto in ZJIT

Originally published on Rails At Scale . Look! A trace of slow events in a benchmark! Hover over the image to see it get bigger. Now read on to see what the slow events are and how we got this pretty picture. The first rule of just-in-time compilers is: you stay in JIT code. The second rule of JIT is: you STAY in JIT code! When control leaves the compiled code to run in the interpreter—what the ZJIT team calls either a “side-exit” or a “deopt”, depending on who you talk to—things slow down. In a well-tuned system, this should happen pretty rarely. Right now, because we’re still bringing up the compiler and runtime system, it happens more than we would like. We’re reducing the number of exits over time. We can track our side-exit reduction progress with , which, on process exit, prints out a tidy summary of the counters for all of the bad stuff we track. It’s got side-exits. It’s got calls to C code. It’s got calls to slow-path runtime helpers. It’s got everything. Here is a chopped-up sample of stats output for the Lobsters benchmark, which is a large Rails app: (I’ve cut out significant chunks of the stats output and replaced them with because it’s overwhelming the first time you see it.) The first thing you might note is that the thing I just described as terrible for performance is happening over twelve million times . The second thing you might notice is that despite this, we’re staying in JIT code seemingly a high percentage of the time. Or are we? Is 80% high? Is a 4.5% class guard miss ratio high? What about 11% for shapes? It’s hard to say. The counters are great because they’re quick and they’re reasonably stable proxies for performance. There’s no substitute for painstaking measurements on a quiet machine but if the counter for Bad Slow Thing goes down (and others do not go up), we’re probably doing a good job. But they’re not great for building intuition. For intuition, we want more tangible feeling numbers. We want to see things. 
The third thing is that you might ask yourself “self, where are these exits coming from?” Unfortunately, counters cannot tell you that. For that, we want stack traces. This lets us know what in the guest (Ruby) code triggers an exit. Ideally, we would also want some notion of time: we would want to know not just where these events happen but also when. Are the exits happening early, at application boot? At warmup? Even during what should be steady state application time? Hard to say. So we need more tools. Thankfully, Perfetto exists. Perfetto is a system for visualizing and analyzing traces and profiles that your application generates. It has both a web UI and a command-line UI. We can emit traces for Perfetto and visualize them there. Take a look at this sample ZJIT Perfetto trace generated by running Ruby with 1 . What do you see? I see a couple arrows on the left. Arrows indicate “instant” point-in-time events. Then I see a mess of purple to the right of that until the end of the trace. Hover over an arrow. Find out that each arrow is a side-exit. Scream silently. But it’s a friendly arrow. It tells you what the side-exit reason is. If you click it, it even tells you the stack trace in the pop-up panel on the bottom. If we click a couple of them, maybe we can learn more. We can also zoom by mousing over the track, holding Ctrl, and scrolling. That will let us look closer. But there are so many… Fortunately, Perfetto also provides a SQL interface to the traces. We can write a query to aggregate all of the side exit events from the table and line them up with the topmost method from the backtrace arguments in the table: This pulls up a query box at the bottom showing us that there are a couple big hotspots: It even has a helpful option to export the results as a Markdown table so I can paste (an edited version) into this blog post: Looks like we should figure out why we’re having shape misses so much and that will clear up a lot of exits.
(Hint: it’s because once we make our first guess about what we think the object shape will be, we don’t re-assess… yet .) This has been a taste of Perfetto. There’s probably a lot more to explore. Please join the ZJIT Zulip and let us know if you have any cool tracing or exploring tricks. Now I’ll explain how you too can use Perfetto from your system. Adding support to ZJIT was pretty straightforward. The first thing is that you’ll need some way to get trace data out of your system. We write to a file with a well-known location ( ), but you could do any number of things. Perhaps you can stream events over a socket to another process, or to a server that aggregates them, or store them internally and expose a webserver that serves them over the internet, or… anything, really. Once you have that, you need a couple lines of code to emit the data. Perfetto accepts a number of formats. For example, in his excellent blog post , Tristan Hume opens with such a simple snippet of code for logging Chromium Trace JSON-formatted events (lightly modified by me): This snippet is great. It shows, end-to-end, writing a stream of one event. It is a complete (X) event, as opposed to either:

- two discrete timestamped begin (B) and end (E) events that book-end something, or
- an instant (i) event that has no duration, or
- a couple other event types in the Chromium Trace Event Format doc

It was enough to get me started. Since it’s JSON, and we have a lot of side exits, the trace quickly ballooned to 8GB for a several-second benchmark. Not great. Now, part of this is our fault—we should side exit less—and part of it is just the verbosity of JSON. Thankfully, Perfetto ingests more compact binary formats, such as the Fuchsia trace format . In addition to being more compact, FXT even supports string interning. After modifying the tracer to emit FXT, we ended up with closer to 100MB for the same benchmark. We can reduce further by sampling —not writing every exit to the trace, but instead every Kth exit (for some (probably prime) K). This is why we provide the option. Check out the trace writer implementation from the point this article was written.

We could trace:

- When methods get compiled
- How big the generated code is
- How long each compile phase takes
- When (and where) invalidation events happen
- When (and where) allocations happen from JITed code
- Garbage collection events

Visualizations are awesome. Get your data in the right format so you can ask the right questions easily. Thanks for Perfetto! Also, looks like visualizations are now available in Perfetto canary. Time to go make some fun histograms… This is also sampled/strobed, so not every exit is in there. This is just 1/K of them for some K that I don’t remember.  ↩
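Not ZJIT's actual writer (that one emits FXT), but a minimal sketch, in Python for brevity, of the complete (X) Chromium Trace Event JSON records discussed above. The event names and args here are made up for illustration:

```python
import json

def complete_event(name, ts_us, dur_us, pid=1, tid=1, args=None):
    """One complete (X) event: begin timestamp + duration in a single record.
    ts and dur are in microseconds per the Chrome trace format."""
    return {"name": name, "ph": "X", "ts": ts_us, "dur": dur_us,
            "pid": pid, "tid": tid, "args": args or {}}

events = [
    complete_event("compile_iseq", ts_us=100, dur_us=250),
    complete_event("side_exit", ts_us=400, dur_us=0,
                   args={"reason": "shape_mismatch"}),
]
# Chrome's tracing UI and Perfetto both accept a bare JSON array of events.
trace = json.dumps(events)
assert json.loads(trace)[1]["args"]["reason"] == "shape_mismatch"
```

Writing one dict per event like this is exactly why JSON traces balloon; the binary formats trade this readability for size.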


Uptime of GitHub Pages alternatives

Many software developers feel we are at a source control inflection point. GitHub has reigned for over fifteen years, and we may be in the early days of an exodus. Developers have become increasingly disappointed with GitHub’s service, features, and overall direction. Blame is often directed at the buy-out by Microsoft, migration to Azure, Azure itself, or the intense focus on AI. Whatever the underlying reason, people are thinking about switching. This post measures the static website hosting uptime of various alternatives. First, a little background. Many source control hosting providers support website hosting. The core concept is that you can deploy a static website with little more than a git push. The ease of use is central to this product: developers just want a static website with minimal fuss. Since they already track their source code in a git repo, it’s easiest to launch a website from the same provider. Ease of use was central to my decision to host this blog on GitHub Pages. This post explores: All of these services provide a static website hosting free tier. I wanted to understand the reliability of these services before I migrated my content, so I created a simple test. I signed up for accounts on each service and deployed a test web page on each platform. The web pages are completely static, so they can be served from disk as-is or from a CDN cache. Finally, I created uptime monitors on UptimeRobot to detect downtime of these test pages. It’s been running for almost two years. The monitoring status page is public , so you can track how well these platforms perform over time. Here’s the monitoring status for each platform over the last ninety days: Some quick notes about the monitoring. Checks are performed at five minute intervals, so an outage that is shorter than that duration would either not be detected or would be reported as a five minute outage.
The response timing for my test webpage on GitHub Pages was the best with an average response time over 100ms faster than all the others. The minimum response time was 6ms, which suggests that UptimeRobot is in the same data center as GitHub Pages. My monitor detected three outages over the last 23 months. Two were 404 Not Found errors, both happening on November 27th, 2024 and lasting ten minutes each. There was also a five minute DNS-related outage. GitHub was not to blame in this instance as I use a custom domain name and a third-party DNS provider. Focusing on the last full year, 2025, there were zero outages I could attribute to GitHub Pages. So my assessment of GitHub Pages test webpage uptime in 2025 is 100% . I was kind of surprised that GitHub Pages did so well here. Microsoft’s own status report shows occasional issues with GitHub Pages. My custom monitor did not detect these. One explanation for the disagreement between these measurements is the presence of a third-party CDN. GitHub serves static assets for GitHub Pages through the Fastly CDN . I never change the test web pages, so I’m not testing the reliability of deployments. So in this instance, my custom monitor is really measuring Fastly, not any Microsoft-operated systems. GitLab Pages was the slowest platform I tested, with average response times over 300ms slower than GitHub Pages. GitLab had one large outage of twenty-five minutes and a short five minute outage. GitLab Pages appeared to have 99.994% uptime in 2025. This “four-nines” availability is excellent and is suitable for most websites. Bitbucket Cloud response times were middle-of-the-road. UptimeRobot detected twenty-eight periods of downtime for the Bitbucket Cloud test webpage. Nineteen of these were connection timeouts. The rest were 500-series HTTP status codes. Over 2025, the Bitbucket Cloud test webpage availability was measured as 99.936% uptime . This “three-nines” availability is excellent and is suitable for most websites.
The Codeberg Pages test webpage had the second fastest response times. The Codeberg Pages test webpage had the worst availability with 489 periods of downtime. The longest of these nearly reached seventeen hours. Over 2025, the Codeberg Pages test webpage availability was measured as 98.358% . This “one-nine” uptime is below availability targets of many websites. GitHub Pages took the top spot in this analysis, which wasn’t what I expected. Depending on your sensitivity to slow response times and availability, you may rank GitLab Pages or Bitbucket Cloud as the best alternative. It seems reasonable to measure GitLab Cloud latency from other locations, as the slow response times could be an artifact of the network path between GitLab and UptimeRobot. Codeberg Pages had the worst availability and appears unsuitable for all but the most outage tolerant of websites. If you need to use it, you could add a CDN of your own on top. Many CDNs are able to serve your websites even when the origin is down, thus hiding availability problems. This adds additional complexity, can impact privacy, and may carry extra costs. The platforms tested: GitHub Pages; Bitbucket Cloud; Codeberg Pages; and GitLab Pages.
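As a sanity check, the measured 2025 uptime percentages translate into annual downtime like this (minutes are rounded; the GitLab figure lines up with the 25-minute plus 5-minute outages reported above):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(uptime_pct):
    """Minutes of downtime per year implied by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

# Measured 2025 figures from the monitors above:
assert round(downtime_minutes(99.994)) == 32    # GitLab Pages: ~half an hour
assert round(downtime_minutes(99.936)) == 336   # Bitbucket Cloud: ~5.6 hours
assert round(downtime_minutes(98.358)) == 8630  # Codeberg Pages: ~6 days
```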


Dissecting and Modeling the Architecture of Modern GPU Cores

Dissecting and Modeling the Architecture of Modern GPU Cores Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, and Antonio Gonzalez MICRO'25 The purpose of this paper is to understand the microarchitecture of recent NVIDIA GPUs, to be able to update architectural simulators that are used for research purposes. The authors uncovered lots of interesting tidbits. Take this information with a grain of salt; it is derived from careful experimentation rather than NVIDIA documentation. The paper uses the term sub-core to represent the hardware module which can execute warp-wide instructions. Each SM comprises four sub-cores. Fig. 3 illustrates the components within a sub-core and shows how 4 sub-cores share instruction and data caches: Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Instruction Issue The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware. There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards: The instruction encoding allows any instruction to set the value of a per-warp stall counter. When the hardware issues such an instruction, it sets the stall counter to the specified value. On each clock cycle thereafter, the counter is decremented by one. The hardware will not issue more instructions for the warp until the counter reaches zero. This is useful for handling hazards with a fixed latency. Variable-latency hazards are resolved with dependence counters . The hardware tracks the value of six dependence counters per warp. The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued. One of these counters is decremented when the instruction writes to the register file, and the other is decremented when the instruction reads from the register file (to resolve WAR hazards). 
Additionally, the compiler can specify that a given instruction cannot issue until the value of specific dependence counters are zero. In fig. 2 above, the values of these counters are checked in the block, and the counters are incremented in the block. The warp scheduler prefers to pick a warp and stick with it (e.g., it is not a round-robin scheduler). If the current warp cannot be scheduled (e.g., the stall counter is greater than zero, or there was a cache miss), then the scheduler switches to another warp. The warp scheduler issues instructions in program order (within a warp). There is no out-of-order execution support. The register file has a limited number of ports, and instructions must be controlled to avoid attempting too many reads or writes in parallel. Register file port contention is not handled by the warp scheduler, instead it is handled further down the pipe. For example, the stage in fig. 2 will stall fixed-latency instructions until register file read ports are available. The register file cache (RFC) is a hardware component that reduces contention on the register file read ports. The RFC has storage for 6 vectors (and tags). The compiler can mark a source operand of an instruction such that the hardware will store the source operand in the cache for a subsequent operation to use. Note that the RFC does not store per-warp values and is only useful for caching data within one warp. This plays nicely with the “pick a warp and stick to it” scheduling policy. Listing 4 has some example code sequences demonstrating how the compiler can direct the operation of the RFC (e.g., ): Source: https://dl.acm.org/doi/10.1145/3725843.3756041 Memory Access Most of the resources that are shared between sub-cores are shared for efficiency reasons. A single sub-core will not generate memory requests at a high throughput, and there is locality of reference between the memory accesses in multiple sub-cores. The block in fig. 
3 is shared in order to properly support thread group shared memory (as a thread group is spread across all sub-cores in an SM). The shared memory access modules can handle one request every two cycles. That means if all 4 sub-cores are contending on memory, each one can make a request every 8 cycles. There is a FIFO of depth ~4 between each sub-core and the shared memory structures. Typical read-after-write latency in shared memory is between 20-40 cycles. The authors built a simulation model based on their experiments. Mean absolute percentage error (MAPE) is one metric for measuring how accurate a simulation model is compared to real hardware. Table 4 shows that the model derived from the findings in this paper is a better performance model for recent NVIDIA GPUs than the baseline ( ): Source: https://dl.acm.org/doi/10.1145/3725843.3756041
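A toy Python model of the per-warp stall counter described earlier (not the simulator from the paper): an issued instruction can set the counter, and the warp is not eligible to issue again until the counter decays to zero, one tick per clock cycle:

```python
class Warp:
    """Toy model of the fixed-latency hazard mechanism: an issued
    instruction may set a per-warp stall counter; the scheduler skips
    this warp until the counter counts down to zero."""
    def __init__(self):
        self.stall = 0

    def can_issue(self):
        return self.stall == 0

    def issue(self, stall_cycles):
        assert self.can_issue()
        self.stall = stall_cycles

    def tick(self):
        """Advance one clock cycle."""
        if self.stall > 0:
            self.stall -= 1

w = Warp()
w.issue(stall_cycles=2)  # compiler knows the result is ready in 2 cycles
assert not w.can_issue()
w.tick(); w.tick()
assert w.can_issue()
```

The dependence counters work analogously but are decremented by register-file events rather than clock ticks, which is what makes them suitable for variable-latency hazards.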

The Coder Cafe 3 weeks ago

Working on Complex Systems

☕ Welcome to The Coder Cafe! Today, I’m sharing the talk I gave at the Monster SCALE Summit 2026 on working on complex systems. Get cozy, grab a coffee, and let’s begin! Introduction If you’ve been a subscriber since mid-2025, you were already here when I published the post that performed best on my newsletter: Working on Complex Systems . I really loved writing it, and I always had in mind to revisit it at some point. So, when someone at ScyllaDB reached out to invite me to speak at Monster SCALE Summit, I saw the perfect opportunity to turn it into a talk. The video isn’t a 1:1 mapping of the original content. I expanded it with more examples and new ideas. In it, I define what complex systems are, then discuss their common characteristics, and finally explore patterns for navigating them. Hope you will enjoy it! Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. Latency and User Experience Probabilistic Increment Bloom Filters Monster SCALE Summit - 2026 Tech Talks

0 views
Loren Stewart 3 weeks ago

ChatGPT, Claude, and Gemini Render Markdown in the Browser. I Do the Opposite

The big AI chat apps ship heavy rendering libraries to every device. Cheddy Chat renders markdown server-side and streams finished HTML, eliminating 160-440KB of client JavaScript while keeping the main thread free.
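The server-side approach the post describes can be sketched as a generator that turns markdown lines into finished HTML chunks as they are produced. This is a toy subset (headings and paragraphs only) under assumed names, not Cheddy Chat's actual implementation; a real server would use a full markdown library, but the streaming shape is the same:

```python
import html
from typing import Iterator


def render_markdown_stream(md_lines: Iterator[str]) -> Iterator[str]:
    """Convert a stream of markdown lines into HTML chunks, one at a time.

    Toy subset: '#' headings and plain paragraphs only. Each yielded chunk
    is finished HTML, so no markdown parser ever ships to the browser.
    """
    for line in md_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines between blocks
        if line.startswith("# "):
            yield f"<h1>{html.escape(line[2:])}</h1>"
        else:
            yield f"<p>{html.escape(line)}</p>"


# The client receives these chunks as they stream in, e.g. over a
# chunked HTTP response, instead of raw markdown plus a renderer.
chunks = list(render_markdown_stream(iter(["# Hello", "", "A plain paragraph"])))
```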

0 views
alikhil 3 weeks ago

What is a CDN and Why It Matters?

With the rapid growth of GenAI solutions and the continuous launch of new applications, understanding the fundamental challenges and solutions of the web is becoming increasingly important. One of the core challenges is delivering content quickly to the end user . This is where a CDN comes into play. A CDN stands for Content Delivery Network . Let’s break it down. (Note: Modern CDN providers often bundle additional services such as WAF, DDoS protection, and bot management. Here, we focus on static content delivery.) Content refers to any asset that needs to be loaded on the user’s device: images, audio/video files, JavaScript, CSS, and more. Delivery means that this content is not only available but also delivered efficiently and quickly. A CDN is a network of distributed nodes that cache content. Instead of fetching files directly from the origin server, users receive them from the nearest node, minimizing latency. Consider an online marketplace for digital assets, such as a photo stock or NFT platform. The application stores thousands of images on a central server. Whenever users open the app, those images must load quickly. If the application server is hosted in Paris, users in Paris will experience minimal ping. However: users in Spain may see about 2× the ping time, users in the USA 6×, and users in Australia 12×. These numbers only reflect simple ICMP ping times. Actual file delivery involves additional overhead such as TCP connections and TLS handshakes, which increases delays even further. With a CDN, each user connects to the nearest edge node instead of the origin server. This is typically achieved via GeoDNS. Importantly, only the CDN knows the actual address of the origin server, which also improves security by reducing exposure to direct DDoS attacks. CDN providers usually operate edge nodes in major world cities. When a request is made: if the requested file is already cached on the edge node ( cache hit ), it is delivered instantly. If not ( cache miss ), the edge node requests it from the CDN shield .
If the shield has the file cached, it is returned to the edge and then served to the user. If not, the shield fetches it from the origin server, caching it along the way. For popular websites, the cache hit rate approaches but rarely reaches 100% due to purges, new files, or new users. The shield node plays a critical role. Without it, each cache miss from any edge node would hit the origin server directly, increasing load. Many providers offer shields as an optional feature, and enabling them can significantly reduce origin stress. Beyond cache hits and misses, performance can be measured with concrete indicators: Time to First Byte (TTFB): how long it takes for the first data to arrive after a request; CDNs usually reduce TTFB by terminating connections closer to the user. Latency reduction: the difference in round-trip time between delivery from the origin versus delivery from an edge node. Cache hit ratio: the percentage of requests served directly from edge caches. These KPIs provide a real, measurable view of CDN efficiency rather than theoretical assumptions. The closer the edge node is to the end user, the faster the content loads. The key questions are: Where are the users located? Which CDN providers have the best edge coverage for those locations? But don’t rely on maps alone. Measure real performance with Real User Monitoring (RUM) using metrics like TTFB and Core Web Vitals. There are plenty of ready-made tools available. If you’re interested in building your own RUM system, leave a comment or reaction – I can cover that in a follow-up post.
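The edge → shield → origin lookup described above can be sketched as a two-tier cache. The class and tier names here are illustrative, not any provider's API; the point is that a shield absorbs edge misses so only its own misses reach the origin:

```python
from typing import Dict, List, Optional


class CacheTier:
    """One tier of a CDN hierarchy (edge or shield) with a simple dict cache."""

    def __init__(self, name: str, parent: Optional["CacheTier"] = None):
        self.name = name
        self.parent = parent  # next tier toward the origin (None for the shield)
        self.store: Dict[str, bytes] = {}

    def fetch(self, path: str, origin: Dict[str, bytes], trace: List[str]) -> bytes:
        if path in self.store:                 # cache hit: serve locally
            trace.append(f"{self.name}:hit")
            return self.store[path]
        trace.append(f"{self.name}:miss")      # cache miss: go upstream
        if self.parent is not None:
            body = self.parent.fetch(path, origin, trace)
        else:
            trace.append("origin:fetch")       # only shield misses hit the origin
            body = origin[path]
        self.store[path] = body                # cache on the way back down
        return body


origin = {"/logo.png": b"\x89PNG..."}
shield = CacheTier("shield")
edge = CacheTier("edge-paris", parent=shield)

t1: List[str] = []
edge.fetch("/logo.png", origin, t1)  # cold: edge miss -> shield miss -> origin
t2: List[str] = []
edge.fetch("/logo.png", origin, t2)  # warm: edge hit, origin untouched
```

A second edge node (say, `edge-sydney`) sharing the same shield would now get a shield hit on its first request, which is exactly the origin-offload effect the post attributes to shields.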

0 views
Ash's Blog 3 weeks ago

NumKong: 2'000 Mixed Precision Kernels For All 🦍

Around 2'000 SIMD kernels for mixed-precision BLAS-like numerics — dot products, batched GEMMs, distances, geospatial, ColBERT MaxSim, and mesh alignment — from Float6 to Float118, leveraging RISC-V, Intel AMX, Arm SME, and WebAssembly Relaxed SIMD, in 7 languages and 5 MB.

0 views

Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture

Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture Rohan Juneja, Pranav Dangi, Thilini Kaushalya Bandara, Tulika Mitra, and Li-Shiuan Peh MICRO'25 This paper presents an implementation of the Active Message (AM) architecture, as an alternative to FPGA/CGRA architectures. AM architectures have been studied for a while; this was my first exposure. An accelerator implemented on an FPGA or CGRA typically uses a spatial computing paradigm. Each “instruction” in the algorithm is pinned to a physical location on the chip, and data flows between the instructions. I prefer to think of the data in motion as the local variables associated with threads that also move (using a specialized memory consistency model ). The active message architecture flips that script around. Data structures are pinned, while instructions move to the relevant data . Fig. 5 shows two processing elements (PEs), each of which contains two active messages (AMs). An active message looks a lot like an instruction: it contains an opcode, source operands, and a result operand. Throughout the computation, AMs move between PEs. PEs have a local ALU and local memory. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 The AM at the top of the figure has a load opcode together with its operands; one operand is being carried around for future use. The AM will make its way through the chip until it arrives at the PE which contains the data to be loaded. At this point, the load operation will execute, and a new AM will be created. In the figure above, the new AM is the one at the bottom of PE0, carrying a multiply opcode and its operands. Op1 is forwarded unchanged from the predecessor AM; the other operand holds the value of the data loaded from memory. The new opcode was obtained from the config memory , which contains a description of the program that is being executed. The next step to be performed is the multiplication.
One might expect PE0 to perform the multiplication, but in the figure above the AM is routed to PE1, which performs the multiplication. A reason why you would want to do this is in a situation where there are many AMs queued to access the data memory associated with PE0, but few AMs queued to access the data memory associated with PE1. In this situation, it is better to let PE0 perform loads for other AMs (because PE0 is the only PE that can fulfill that task) and find a PE that is currently idle to perform the multiplication (any PE can perform the multiplication). Now the question you should be asking is: what real-world applications exhibit load imbalances between PEs like this? If a data structure were split between all PEs evenly, you would think that load would be spread nicely across the PEs. The answer is: irregular workloads like sparse matrix-vector multiplication. Fig. 6 shows how a source matrix, source vector, and result vector could be partitioned across 4 PEs. You can imagine how the sparsity of the tensors being operated on would cause load imbalance between the PEs. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 Fig. 11 compares the Nexus Machine against other architectures (each design has the same number of ALUs). Fig. 12 shows performance-per-watt. Source: https://dl.acm.org/doi/10.1145/3725843.3756091 Dangling Pointers I imagine that AM architectures work best for algorithms that are insensitive to the order in which AMs are executed. That would be the case for matrix/vector multiplication (assuming addition is associative). It seems like there is a large design space here related to PE capabilities. Data structures could be replicated across PEs to enable memory access AMs to be serviced by multiple PEs, or the ALUs inside of each PE could be heterogeneous (e.g., some PEs can do division, others cannot).
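The load-then-multiply chain from the figure can be sketched in a few lines. This is my own toy model of the idea, not the paper's hardware: a load AM must execute on the PE that owns the data, while the successor ALU op could be routed to any idle PE.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class ActiveMessage:
    """Looks like an instruction, but travels to the PE holding its data."""
    opcode: str            # "load" or "mul" in this toy
    operands: List[float]  # values carried along for future use
    addr: str = ""         # data-memory key, used by loads


@dataclass
class PE:
    pe_id: int
    data_mem: Dict[str, float] = field(default_factory=dict)


def execute(am: ActiveMessage, pes: List[PE]) -> Union[ActiveMessage, float]:
    """Run one AM. A load runs on the PE owning am.addr and spawns a
    successor AM (next opcode would come from the config memory); an
    ALU op like mul could run on whichever PE is currently idle."""
    if am.opcode == "load":
        owner = next(pe for pe in pes if am.addr in pe.data_mem)
        loaded = owner.data_mem[am.addr]
        # successor AM: forwarded operand plus the freshly loaded value
        return ActiveMessage("mul", am.operands + [loaded])
    if am.opcode == "mul":
        a, b = am.operands
        return a * b
    raise ValueError(f"unknown opcode: {am.opcode}")


pes = [PE(0, {"x": 4.0}), PE(1)]
am1 = ActiveMessage("load", [2.5], addr="x")  # carries 2.5 for later use
am2 = execute(am1, pes)                       # load must run on PE0 (owns "x")
result = execute(am2, pes)                    # mul can run anywhere, e.g. PE1
```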

0 views