Posts in Performance (20 found)

Binary Compatible Critical Section Delegation

Junyao Zhang, Zhuo Wang, and Zhe Zhou PPoPP'26 The futex design works great when contention is low but leaves much to be desired when contention is high. I generally think that algorithms should be crafted to avoid high lock contention, but this paper offers a contrarian approach that improves performance without code changes. Acquiring a futex involves atomic operations on the cache lines that contain the futex state. Under high contention, these cache lines violently bounce between cores. Also, user space code will eventually give up trying to acquire a lock the easy way and will call into the kernel, which has its own synchronization to protect the shared data structures that manage the queue of threads waiting to acquire the lock. The problems don't end when a lock is finally acquired. A typical futex guards some specific application data. The cache lines containing that data will also uncomfortably bounce between cores. The idea behind delegation is to replace the queue of pending threads with a queue of pending operations. An operation comprises the code that will be executed under the lock and the associated data. Here is how I think of delegation: in the uncontended case, the critical-section function executes directly. In the contended case, the function is placed into a queue to be executed later. After any thread finishes executing such a function, it checks the queue; if the queue is not empty, that thread goes ahead and executes all of the functions it contains. If a particular thread calls 10 functions from the queue, then the data guarded by the lock remains local to the core that thread is running on, and the system avoids moving that data between cores 10 times. The magic of this paper is that it shows how to change the OS kernel to automatically implement delegation for any application that uses futexes. 
When the futex code gives up trying to acquire a futex in user space, it calls the OS to wait on the futex. The implementation of this system call is changed to implement automatic delegation. Automatic delegation can fail (as illustrated by Fig. 2), in which case the traditional futex waiting algorithm is used. Source: https://dl.acm.org/doi/10.1145/3774934.3786439 This paper makes heavy use of the Userspace Bypass library (a.k.a. UB; paper here). This library allows the kernel to safely execute user-mode code. It was originally designed to optimize syscall-heavy applications by allowing the kernel to execute the small tidbits of user space code in between system calls. UB uses binary translation to translate instructions that were meant to run in user space into instructions that can securely be executed by the kernel. Binary compatible critical section delegation uses UB to translate the code inside the critical section (i.e., the code between the futex lock and unlock calls) into code that can be safely executed by the kernel. A pointer to this translated code is placed into a queue of delegated calls (the vw queue). The set of threads which are trying to acquire a lock cooperatively execute the functions in the vw queue. At any one time, at most one thread is elected to be the delegate thread. It drains the vw queue by executing (in kernel space) all the delegated functions in the queue. This works great in cases where the code inside the critical section accesses a lot of shared state, because that shared state can happily reside in the cache of the core that is running the delegate thread, rather than bouncing between cores. The paper has impressive results from microbenchmarks, but I think real applications are more relevant. Table 2 shows performance results for a few applications and a few locking strategies. BCD is the work in this paper. TCS and TCB are prior work with the drawback of not being compatible with existing binaries. 
Source: https://dl.acm.org/doi/10.1145/3774934.3786439 Dangling Pointers There is a hint here at another advantage of pipeline parallelism over data parallelism: allowing persistent data structures to remain local to a core.


Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds

I learned a lot about the LLC configuration and monitoring capabilities of modern CPUs from this paper; I bet you will too. The problem this paper addresses is: how to avoid performance variability in cloud applications due to cross-VM contention for the last-level cache (e.g., the L3 cache on a Xeon)? In a typical CPU, the L1 and L2 caches are private to a core, but the L3 is shared. In a cloud environment, the L3 is shared by multiple tenants and is an avenue for a "noisy neighbor" to annoy its neighbors. The work described by this paper builds upon Intel CMT and CAT. Cache Monitoring Technology allows the hypervisor to track how much of the L3 cache is occupied by each VM. Cache Allocation Technology allows the hypervisor to restrict a VM to only use a subset of the L3. CAT allows a VM to be assigned to a cache class of service (CLOS), which defines the set of L3 ways accessible to the VM (this page defines the term "ways" if you are unfamiliar). A typical CPU used by a cloud service provider has more CPU cores than L3 ways. If a cloud server hosts many small VMs, then L3 ways must be shared amongst VMs. The key problem solved by this paper is how to reduce performance variability given this constraint. Fig. 1 illustrates the assignments of CLOS levels to LLC ways advocated by this paper. Each row is a class of service, and each column is a way of the LLC cache. CLOS[0] can access all ways, CLOS[1] can access all LLC ways except for one. CLOS[7] can only access a single way of the LLC. Source: https://dl.acm.org/doi/10.1145/3774934.3786415 The hypervisor uses Intel CMT to monitor how much of the LLC is occupied by each VM. Every 6 seconds, the hypervisor uses this information to change the CLOS that each VM is assigned to. The hypervisor computes a target LLC occupancy for each VM based on the number of cores assigned to the VM. 
This target is compared against the measured LLC occupancy to classify each VM into one of three categories:

Poor (the VM is starved for space)
Adequate (the VM is using just the right amount of cache)
Excess (the VM is hogging too much)

VMs in the poor category are de-suppressed (i.e., assigned to a CLOS with access to more LLC ways). Additionally, VMs in the excess category are suppressed (i.e., assigned to a CLOS with access to fewer ways), but this suppression only occurs when there are VMs in the poor category. This policy means that cache-hungry VMs can use more than their fair share of the L3 during periods of low server utilization. This can lead to higher mean performance, at the cost of a wider standard deviation. The paper describes a 4th state (overflow), which is only applied to VMs that wish to be held back even if there is plenty of L3 space available. These VMs are suppressed when they are found to be using too much L3, even if all other VMs on the system are getting enough cache space. Fig. 5 shows a case where this strategy works well compared to static allocation. The server in question is running 5 VMs, each running a different application:

VM1 - 32 cores
VM2 - 16 cores (but doesn't fully utilize those cores)
VM3 - 8 cores
VM4 - 4 cores
VM5 - 4 cores

The top of figure 5 shows a simple static partitioning of LLC ways. VM1 is assigned to 6 ways, VM2 is assigned to 3 ways, VM3 is assigned to 2 ways, and VMs 4 and 5 must share 1 way. They have to share because sharing based on the number of ways in the LLC is inherently coarse-grained. The two charts show measured LLC utilization over 10 minutes. Notice the Y-axis. The technique described in this paper (Cacheman) allows VM4 and VM5 to use far more aggregate LLC capacity than the static partitioning. Also notice that in the static partitioning, VM5 always uses more LLC than VM4 (because they are running different applications), whereas Cacheman allows for a more even balance between them. 
Source: https://dl.acm.org/doi/10.1145/3774934.3786415 Dangling Pointers While the L3 cache is logically a monolithic shared resource, it is physically partitioned across the chip (with a separate slice near each core). It seems like it could be more efficient if VMs could be assigned to nearby L3 slices rather than L3 ways.


Hapax Locks: Scalable Value-Based Mutual Exclusion

Dave Dice and Alex Kogan PPoPP'26 This paper describes a locking algorithm intended for cases where spinning is acceptable (e.g., one-thread-per-core systems). It is similar to a ticket lock but generates less coherence traffic. Each lock/unlock operation causes a constant number of cache lines to move between cores, regardless of the number of cores involved or how long they spin. As we've seen in a previous paper, polling a value in memory is cheap if the cache line is already local to the core which is polling. A Hapax lock comprises two 64-bit fields; call them Arrive and Depart. Additionally, there is a global (shared among all Hapax locks) 64-bit sequence number. Each time a thread attempts to lock a Hapax lock, it generates a Hapax value which uniquely identifies the locking episode. A locking episode is a single lock/unlock sequence performed by a specific thread. A Hapax value is generated by atomically incrementing the sequence number. It is assumed that the 64-bit counter will never overflow. Next, the locking thread atomically exchanges the value of Arrive with the Hapax value it just generated. This exchange operation generates a total ordering among Hapax values. It is a way for threads to cooperatively decide the order in which they will acquire the lock. Say thread A generates a Hapax value (call it V_A) and stores it into Arrive (via an atomic exchange operation). Next, thread B generates Hapax value V_B and atomically exchanges the value of Arrive with V_B. The result of the exchange operation performed by B will be V_A. At this point, thread B knows that it is directly behind thread A in the queue and must wait for thread A to release the lock. To finish acquiring the lock, a thread continually polls Depart, waiting for it to equal the Hapax value of the preceding locking episode. In the example above, thread B polls until it sees the value V_A. At this point, the lock has been acquired. Unlocking is implemented by storing the Hapax value used by the unlocking thread into Depart. 
In the running example, the second thread would unlock the lock by storing its own Hapax value into Depart. This algorithm generates a lot of coherence traffic. In particular, the cache line which holds the sequence number moves between cores each time a new Hapax value is generated. Also, each store to Depart sends coherence traffic to each core which had recently polled Depart. The paper has two techniques to address these issues. While the sequence number monotonically increases, the values stored in the lock's two fields do not. There are two reasons for this. First, a single sequence number is shared among all Hapax locks. The second reason is that multiple threads can generate Hapax values and then race to perform the atomic exchange operation. For example, one thread could generate a smaller Hapax value while another thread generates a larger one, and the second thread could win the race to perform the exchange. In that case, the field first takes on the larger value and only later the smaller one. Once you realize that the values of the two fields are not monotonically increasing, it is straightforward to see how the generation of Hapax values can be made cheap. A thread can hoard a batch of Hapax values with a single atomic add operation. For example, a thread could atomically increase the value of the sequence number by 1024. At this point, the thread has allocated 1024 Hapax values for itself that it can use in the future without accessing the cache line which holds the shared sequence number. The paper proposes allocating Hapax values in blocks of 64K. The paper proposes adding an additional array which serves a similar role as Depart. The number of elements in the array should be greater than the number of cores (the paper uses an array of 4096 values). Like the sequence number, this array is shared among all Hapax locks. When a thread unlocks (storing its Hapax value into Depart), the thread also stores its Hapax value into one of the 4096 elements. 
The array index is determined by the Hapax value. Many potential hash functions could be used. The paper proposes hashing bits [27:16] of the Hapax value. The 16 is related to the allocator block size. In the locking sequence, a thread loads the value of Depart once. If the value of Depart does not match the expected Hapax value, then the locking thread polls the appropriate element of the shared array. The thread polls this element until its value changes. If the new value is the expected Hapax value, then the lock has been acquired. If not, then a hash collision has occurred (e.g., a locking episode associated with a different Hapax lock caused the value to be updated). In this case, the thread starts over by checking Depart and then polling the array element if necessary. This scheme minimizes coherence traffic associated with polling. When an unlocking core stores a value into an array element, the associated cache line will typically be present only in the cache of the next core in line. Coherence traffic is generated only between the locking and unlocking cores. Other threads (which are further back in the line) will be polling other array elements, loading from other cache lines, so the cores those threads are running on won't see the coherence messages. Fig. 3 has results from a microbenchmark. Hapax locks scale much better than ticket locks and go head-to-head with other state-of-the-art locking algorithms. The Hapax implementation is so concise (about 100 lines) that the authors included C++ source code in the paper. Source: https://dl.acm.org/doi/10.1145/3774934.3786443 Dangling Pointers The big downside of spinning is that it wastes cycles in the case where there are other threads that the OS could schedule. I wonder if there is a lightweight coordination mechanism available. For example, the OS could write scheduling information into memory that is mapped read-only into user space. 
This could be used to communicate to the spinning code whether or not there are other threads ready to run.

Evan Schwartz 1 week ago

Scour - February Update

Hi friends, In February, Scour scoured 647,139 posts from 17,766 feeds (1,211 were newly added). Also, 917 new users signed up, so welcome everyone who just joined! Here's what's new in the product: If you subscribe to specific feeds (as opposed to scouring all of them), Scour can now infer topics you might be interested in from them. You can click the link that says "Suggest from my feeds" on the Interests page. Thank you to the anonymous user who requested this! The onboarding experience is simpler. Instead of typing out three interests, you can now describe yourself and your interests in free-form text. Scour extracts a set of interests from what you write. Thank you to everyone who let me know that they were a little confused by the onboarding process. I made two subtle changes to the ranking algorithm. First, the scoring algorithm ranks posts by how well they match your closest interest and gives a slight boost if the post matches multiple interests. That was the intended design from earlier, but I realized that multiple weaker matches were pulling down the scores rather than boosting them. The second change was that I finally retired the machine learning text quality classifier model that Scour had been using. The final straw was when a blog post I had written (and worked hard on!) wasn't showing up on Scour. The model had classified it as low quality 😤. I knew for a while that what the model was optimizing for was somewhat orthogonal to my idea of text quality, but that was it. For the moment, Scour relies on a large domain blocklist (of just under 1 million domains) to prevent low-quality content and spam from getting into your feed. I'm also investigating other ways of assessing quality without relying on social signals, but more on that to come in the future. I've always been striving to make Scour fast and it got much faster this past month. My feed, which compares about 35,000 posts against 575 interests, now loads in around 50 milliseconds. 
Even comparing all the 600,000+ posts from the last month across all feeds takes only 180 milliseconds. This graph shows the 99th percentile latency (the slowest requests) dropping from the occasional 10 seconds down to under 400 milliseconds (lower is better): For those interested in the technical details, this speedup came from two changes: First, I switched from scanning through post embeddings streamed from SQLite, which was already quite fast because the data is local, to keeping all the relevant details in memory. The in-memory snapshot is rebuilt every 15 minutes when the scraper finishes polling all of the feeds for new content. This change resulted in the very nice combination of much higher performance and lower memory usage, because SQLite connections have independent caches. The second change came from another round of optimization on the library I use to compute the Hamming distance between each post's embedding and the embeddings of each of your interests. You can read more about this in the upcoming blog post, but I was able to speed up the comparisons by around another 40x, so Scour can now do around 1.6 billion comparisons per second. Together, these changes make loading the feed feel instantaneous, even though your whole feed is ranked on the fly when you load the page. Here were some of my favorite posts that I found on Scour in February: Happy Scouring! Scour is built on vector embeddings, so I'm especially excited when someone releases a new and promising-sounding embedding model. I get particularly excited by those that are explicitly trained to support binary quantization like this one from Perplexity: pplx-embed: State-of-the-Art Embedding Models for Web-Scale Retrieval. I also spend a fair amount of time thinking about optimizing Rust code, especially using SIMD, so this was an interesting write-up from TurboPuffer: Rust zero-cost abstractions vs. SIMD. 
This was an interesting write-up comparing what different coding agents do under the hood: I Intercepted 3,177 API Calls Across 4 AI Coding Tools. Here's What's Actually Filling Your Context Window. And finally, this one is on a very different topic but has some nice animations that demonstrate why boarding airplanes is slow and shows The Fastest Way to Board an Airplane.


Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures

Reza Ghanbari, Henry Kao, João P. L. De Carvalho, Ehsan Amiri, and J. Nelson Amaral CGO'25 This paper serves as a warning: don't go overboard with vector instructions. There is a non-trivial amount of performance to be had by balancing compute between scalar and vector instructions. Even if you fear that automatic vectorization is fragile, this paper has some interesting lessons. Listing 1 contains a vectorizable loop and listing 2 shows a vectorized implementation: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Source: https://dl.acm.org/doi/10.1145/3696443.3708950 After achieving this result, one may be tempted to pat oneself on the back and call it a day. If you were a workaholic, you might profile the optimized code. If you did, you would see something like the data in table 1: Source: https://dl.acm.org/doi/10.1145/3696443.3708950 And you could conclude that this algorithm is compute-bound. But what do we really mean by "compute-bound"? A processor contains many execution ports, each with a unique set of capabilities. In the running example, the execution ports capable of vector multiplication and addition are fully booked, but the other ports are sitting mostly idle! Listing 3 shows a modified loop which tries to balance the load between the vector and scalar execution ports. Each loop iteration processes 9 elements (8 via vector instructions, and 1 via scalar instructions). This assumes that the processor supports fast unaligned vector loads and stores. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Section 3 has details on how to change LLVM to get it to do this transformation. Fig. 3 shows benchmark results. By my calculations, the geometric mean of the speedups is 8%. Source: https://dl.acm.org/doi/10.1145/3696443.3708950 Dangling Pointers This paper builds on top of automatic vectorization. 
In other words, the input source code is scalar and the compiler vectorizes loops while balancing the workload. An alternative would be to have the source code in a vectorized form and then let the compiler "devectorize" where it makes sense.

Rik Huijzer 1 week ago

More Accurate Speech Recognition with whisper.cpp

I have been using OpenAI's whisper for a while to convert audio files to text. For example, to generate subtitles for a file, I used

```bash
whisper "$INPUT_FILE" -f srt --model turbo --language en
```

Especially on long files, this would sometimes change its behavior over time, leading to either extremely long or extremely short sentences (runaway). Also, `whisper` took a long time to run. Luckily, there is whisper-cpp. On my system with an M2 Pro chip, this can now run speech recognition on a 40-minute audio file in a few minutes instead of half an hour. Also, thanks to a tip from whisp...

Binary Igor 2 weeks ago

JSON Documents Performance, Storage and Search: MongoDB vs PostgreSQL

Does MongoDB still have an edge as a document-oriented database for JSON in particular? Or is Postgres better? Or at least good enough to stick with it, since it is a more universal database, offering a richer feature set and wider applicability?


Flexible I/O for Database Management Systems with xNVMe

Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün CIDR'26 This paper describes xNVMe, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The "x" prefix serves a similar role to the "X" in DirectX. It is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like "Mantle for NVMe": Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available. The point of this paper is that xNVMe is easy to drop into an existing application. The paper describes an xNVMe-based implementation of the DuckDB filesystem interface. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes how xNVMe supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance). Most of the results in the paper show comparable performance for xNVMe vs the baseline DuckDB filesystem. Fig. 5 shows one benchmark where xNVMe yields a significant improvement: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf Dangling Pointers I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung's potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?) 
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee and an API which is dominated by one vendor.


A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation

Christian Lanius, Florian Freye, and Tobias Gemmeke GLVLSI'25 This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but it also describes two orthogonal techniques for optimizing NFA evaluation which are applicable to more than just this ASIC. Any regular expression can be converted to a non-deterministic finite automaton (NFA). Think of an NFA as a state machine where some inputs can trigger multiple transitions. The state machine is defined by a set of transitions. A transition is a (current state, input symbol, next state) tuple. The "non-deterministic" naming comes from the fact that multiple tuples may exist with identical (current state, input symbol) values; they differ only in their next state. This means that an NFA can be in multiple states at once. One way to evaluate an NFA is to use a bitmap to track the set of active states. For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply. Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol. The hardware described in this paper uses a compute-in-memory (CIM) microarchitecture. A set of columns stores the state machine, with each column storing one transition. This assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible). During initialization, the transitions are written into the CIM hardware. An input symbol is processed by broadcasting it and the current state bitmap to all columns. All columns evaluate whether their transition should be activated. The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol. The left side of Fig. 
5 illustrates the hardware in each column, which compares the input symbol and current state against the stored tuple: Source: https://dl.acm.org/doi/10.1145/3716368.3735157 The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions). The paper contains two tricks for overcoming this limitation. Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle. For example, rather than considering two adjacent input symbols separately, put them together into one mega-symbol. This is feasible as long as your NFA implementation isn't too sensitive to the number of bits per symbol. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Cool Trick #2 - Bloom Filter The target application for this hardware is monitoring network traffic for threats (e.g., Snort). A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions. If that assumption holds, then a bloom filter can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware. The bloom filter is rebuilt whenever the NFA transition function changes. To build the bloom filter, iterate over each transition whose current state is the initial state. For each such transition, compute a hash of the input symbol, decompose the hashed value into indices, and set the corresponding bits in the bloom filter. To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into indices, and check whether all of the corresponding bits are set in the bloom filter. If any bit is not set, then the input symbol does not trigger a transition from the initial state. When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state. 
Table 1 compares PPA results against other published NFA accelerators. It is a bit apples-to-oranges as the various designs target different technology nodes. The metric that stands out is the low power consumption of this design. Source: https://dl.acm.org/doi/10.1145/3716368.3735157 Dangling Pointers I wonder if the bloom filter trick can be extended. For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states.


Fixing qBittorrent in Docker, swallowing RAM and locking up the host

Intro

I've had qBittorrent running happily in Docker for over a year now, using the linuxserver/qbittorrent image. I run the docker container and others on an Ubuntu Server host. Recently, the host regularly becomes unresponsive. SSH doesn't work, and the only way to recover is to power cycle the server (running on a small NUC). The symptoms were: RAM usage would climb and climb, the server would become sluggish, and then it would completely lock up.

Martin Alderson 2 weeks ago

Which web frameworks are most token-efficient for AI agents?

I benchmarked 19 web frameworks on how efficiently an AI coding agent can build and extend the same app. Minimal frameworks cost up to 2.9x fewer tokens than full-featured ones.

Den Odell 2 weeks ago

Constraints and the Lost Art of Optimization

In 1984, Steve Jobs walked over to a bag standing on stage and pulled out a computer that would change the world. The Macintosh had an operating system, a graphical user interface, window manager, font renderer, and a complete graphics engine called QuickDraw, one of the most elegant pieces of software ever written. The whole thing fit inside the machine’s 64KB ROM. Sixty . Four . Kilobytes . A typical webpage hero image is larger. A "Hello World" React application is larger. The entire intellectual and creative output of a team that reinvented personal computing fits in a space that, today, we wouldn’t think twice about wasting on a single font file. They did it because there just wasn’t any other choice. A larger ROM would have been too expensive for a mass-market consumer device, so efficiency and hyper-optimization was the only route. Somewhere in the years that followed we’ve lost the creative solutions, the art of optimization, that being constrained in that way produces. The Atari 2600 game console had 4KB of ROM and 128 bytes of RAM. Not kilobytes… bytes . And because there was no display buffer, programmers had to manually synchronize their code with the connected TV’s electron beam as it swept across the screen, line by line, sixty times a second. If your code ran too slow, the image tore. If it ran too fast, you’d corrupt the next line. They called it " racing the beam ." It was brutal, unforgiving, and it forced some of the most inventive programming in computing history. Super Mario Bros shipped on a 40KB NES cartridge. Tetris for the Nintendo Game Boy, the most successful handheld game ever made at the time, was 32KB . These weren’t compromised experiences, they were masterpieces. The constraints didn’t diminish the work, they actually defined it. The programmers who built these things weren’t just efficient, they thought differently. They knew their medium the way a sculptor knows stone. 
They understood every byte, every clock cycle, and every trade-off. The machine had no secrets from them because it couldn’t afford to. Modern development is defined by abundance. Cheap storage, fast networks, and powerful hardware. The practical consequences of inefficiency have largely disappeared. A few extra megabytes, a few wasted milliseconds, an unnecessary UI re-render. For the most part, nobody notices and nothing breaks. And so we’ve stopped noticing ourselves. This isn’t out of laziness, it’s just how rational people work. When the hard limits disappear, the thinking they demanded tends to disappear with them. There’s no forcing function, no electron beam to race, no 128 bytes standing between your idea and disaster. We can afford not to understand, and so increasingly, we simply don’t. Here’s what I think is worth recovering: not the constraints themselves, but the relationship with the medium that having those constraints produced. The engineers who wrote the Mac ROM didn’t just know how to be efficient, they understood their problem at a level that made elegance possible. Bill Atkinson’s QuickDraw wasn’t just small, it was a beautiful piece of code. The 64KB forced him to find the right solution, not just a working one. That instinct, to understand deeply before you build, to ask whether this is the right structure and not just a functional one, to treat your medium as something to be understood rather than merely used, that’s the transferable thing. Not the bit-twiddling, the thinking . The best engineers I’ve worked with carry this instinct even when others might think it crazy. They impose their own constraints. They ask what this would look like if it had to be half the size, or run twice as fast, or use a tenth of the memory. Not because anyone demanded it, but because just by thinking there could be a better, more efficient solution, one often emerges. 
If you want to start developing this instinct today, the most valuable thing you can do is learn how your runtime actually works. Not at the API level, but internally. How your platform parses, allocates, renders, and executes. For web developers that means understanding the browser pipeline: parsing, style resolution, layout, paint, and compositing. For mobile developers it means understanding how iOS or Android manages memory, handles drawing, and schedules work. Understanding your platform changes what you notice, what makes you wince, and what you reach for instinctively. The engineers who built the Mac knew their domain completely, and you can know yours too. There’s a principle I keep coming back to in engineering apps for performance: fast by default. Not fast because you optimized after the fact, but fast because the thinking that produces fast software is simply better thinking. It catches unnecessary complexity early, and it produces systems that are easier to understand, easier to change, and easier to reason about under pressure. The Atari programmers were fast by default; they had no choice. But the discipline they practiced, that intimate, demanding relationship with their constraints, that’s a choice we can still make. The 64KB Mac ROM isn’t just a remarkable footnote, it’s a provocation. It asks: if they could do that with that, then what’s our excuse? Not to shame us, but to remind us that constraints aren’t the enemy of great work. They’re often the source of it.


Better Memory Tiering, Right from the First Placement

Better Memory Tiering, Right from the First Placement João Póvoas, João Barreto, Bartosz Chomiński, André Gonçalves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk ICPE'25 This paper addresses the first placement problem in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL). The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers. What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place. The key insight is: statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future. Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently. The figure shows results from one migration policy, applied to three benchmarks. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Allocation Contexts This paper proposes gathering statistics for each allocation context . An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation. If two allocations match on these attributes, then they are considered part of the same context. The system hooks heap allocation functions (e.g., , ) to track all outstanding allocations associated with each allocation context. The x86 PMU event is used to determine how frequently each allocation context is accessed. A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events. For example, randomly samples load operations and emits the accessed (virtual) address. Given the accessed address, it is straightforward to map back to the associated allocation context. 
The hotness of an allocation context is the frequency of these access events divided by the total size of all allocations in the context. Time is divided into epochs. During an epoch, the hotness of each allocation context is recalculated. When a new allocation occurs, the hotness of the allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into. The paper only tracks large allocations (at least 64 bytes). For smaller allocations, the juice is not worth the squeeze. These allocations are assumed to be short-lived and frequently accessed. This paper also describes a kernel component which complements the user space policy described so far. Whereas the user space code deals with allocations, the kernel code deals with pages. This is useful for allocations which do not access all pages uniformly. It is also useful for detecting and correcting suboptimal initial placements. All PTEs associated with all allocations are continually scanned. The accessed bit determines if a page has been read since the last scan. The dirty bit determines if a page has been written since the last scan. After 10 scans, the system has a pretty good idea of how frequently a page is accessed. These statistics are used to migrate pages between fast and slow tiers. Fig. 8 shows execution time for three benchmarks. represents the user and kernel solutions described by this paper. Source: https://dl.acm.org/doi/10.1145/3676151.3719378 Dangling Pointers I wasn’t able to find details in the paper about how PTE scanning works without interfering with other parts of the OS. For example, doesn’t the OS use the dirty bit to determine if it needs to write pages back to disk? I assume the PTE scanning described in this paper must reset the dirty bit on each scan. The definition of an allocation context seems ripe for optimization. I suspect that allowing some variability in call stack or allocation size would allow for better statistics. 
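As a simplified sketch of the epoch mechanism described above — the class names and the placement threshold here are my own invention, not the paper's — the hotness bookkeeping might look like:

```python
class AllocationContext:
    """One context, identified in the paper by (source location, call stack,
    allocation size). Tracks sampled accesses and live bytes per epoch."""

    def __init__(self):
        self.total_bytes = 0      # sum of live allocation sizes in this context
        self.access_samples = 0   # PMU-sampled load events observed this epoch
        self.hotness = 0.0        # hotness frozen at the previous epoch boundary

    def end_epoch(self):
        # Hotness = sampled access frequency divided by total allocated size.
        if self.total_bytes:
            self.hotness = self.access_samples / self.total_bytes
        else:
            self.hotness = 0.0
        self.access_samples = 0   # start counting afresh for the next epoch


def choose_tier(ctx, threshold=0.01):
    """First placement for a new allocation: use the context's hotness from
    the previous epoch. The threshold is an invented knob for illustration."""
    return "fast" if ctx.hotness >= threshold else "slow"
```

At each epoch boundary the context's hotness is frozen, and new allocations matching that context are placed using the frozen value rather than migrated after the fact.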
Maybe this is a good use case for machine learning?

Farid Zakaria 3 weeks ago

Linker Pessimization

In a previous post, I wrote about linker relaxation: the linker’s ability to replace a slower, larger instruction with a faster, smaller one when it has enough information at link time. For instance, an indirect through the GOT can be relaxed into a direct plus a . This is a well-known technique to optimize the instructions for performance. Does it ever make sense to go the other direction? 🤔 We’ve been working on linking some massive binaries that include Intel’s Math Kernel Library (MKL), a prebuilt static archive. MKL ships as object files compiled with the small code-model ( ), meaning its instructions assume everything is reachable within ±2 GiB. The included object files also have some odd relocations where the addend is a very large negative number (>1GiB). The calculation for the relocation value is S + A - P: the symbol address plus the addend minus the instruction address. With a sufficiently large negative addend, the relocation value can easily exceed the 2 GiB limit and the linker fails with relocation overflows. We can’t recompile MKL (it’s a prebuilt proprietary archive), and we can’t simply switch everything to the large code model. What can we do? 🤔 I am calling this technique linker pessimization: the reverse of relaxation. Instead of shrinking an instruction, we expand one to tolerate a larger address space. 😈 The specific instructions that overflow in our case are (Load Effective Address) instructions. In x86_64, performs pure arithmetic: it computes and stores the result in without accessing memory. The is a 32-bit signed integer embedded directly into the instruction encoding, and the linker fills it in via a relocation. The relocation formula is S + A - P. Let’s look at an example with a large addend. A 32-bit signed integer can only represent ±2,048 MB (±2 GiB).
Our value of −2,062 MB exceeds that range and the linker rightfully complains 💥: Note These instructions appear in MKL because the library uses them as a way to compute an address of a data table relative to the instruction pointer. The large negative addend ( ) is intentional; it’s an offset within a large lookup table. The core idea is delightful because, as engineers, we are so often trained to optimize systems, but in this case we want the opposite. We swap the for a that reads through a nearby pointer. Recall from the relaxation post: relaxation shrinks instructions (e.g. indirect -> direct). Here we do the opposite: we make the instruction do more work (pure arithmetic -> memory load) in exchange for a reachable displacement. That’s why I consider it a pessimization or reverse-relaxation. Both instructions use the same encoding length (7 bytes with a REX prefix), so the patch is a single-byte change in the opcode. 🤓 The difference in behavior is critical: Original — the must reach across the entire binary: Pessimized — the reads a nearby pointer that holds the full address: We’ve traded one direct computation for an indirect through a pointer, and we make sure the displacement is now tiny. The 64-bit pointer slot can reach any address in the virtual address space. 👌 For each problematic relocation, three changes are needed in the object file: 1. Opcode Patch — In , change byte to (1 byte). This converts the (compute address) into a (load from address). The rest of the instruction encoding (ModR/M byte, REX prefix) stays identical because both instructions use the same operand format. 2. New Pointer Slot — Create a new section ( ) containing 8 zero bytes per patch site, plus a new relocation pointing to the original symbol with the original addend. is a 64-bit absolute relocation. Its formula is simply , no subtraction of . There is no 32-bit range limitation; it can address the entire 64-bit address space. This is the key insight that makes the fix work. 3.
Retarget the Original Relocation — In the entry for the patched instruction, change the symbol to point at the new pointer slot in and update the type to . The addend becomes a small offset (the distance from the instruction to the fixup slot), which is guaranteed to fit. Note Because both and with a operand are exactly the same length (7 bytes with a REX prefix), we don’t shift any code, don’t invalidate any other relocations, and don’t need to rewrite any other parts of the object file. It’s truly a surgical patch. The pessimized now performs a memory load where the original did pure register arithmetic. That’s an extra cache line fetch and a data dependency. If this instruction is in a tight loop, it could be a performance hit. If premature optimization is the root of all evil, what does that make pessimization? 🧌 LEA: (arithmetic, no memory access). must encode the entire distance to the far-away data. This overflows. MOV: (memory load). points to a nearby 8-byte pointer slot. The pointer slot holds the full 64-bit address. This never overflows.
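To make the mechanics concrete, here is a small Python sketch — my own illustration, not the actual tooling, which would operate on real ELF relocation entries. It checks whether an S + A - P value fits the signed 32-bit field, and flips the one opcode byte that turns `lea r64, [rip+disp32]` (48 8D /r) into `mov r64, [rip+disp32]` (48 8B /r):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def pcrel32_fits(S, A, P):
    """A PC-relative 32-bit relocation computes S + A - P; that value must
    fit in the instruction's signed 32-bit displacement field."""
    return INT32_MIN <= S + A - P <= INT32_MAX

def pessimize_lea(insn: bytes) -> bytes:
    """Patch `lea r64, [rip+disp32]` (REX.W 8D /r) into
    `mov r64, [rip+disp32]` (REX.W 8B /r).

    Both encodings are 7 bytes, so the only change is the opcode byte at
    offset 1; the REX prefix, ModR/M byte, and disp32 stay untouched."""
    assert len(insn) == 7 and insn[0] == 0x48 and insn[1] == 0x8D
    return insn[:1] + bytes([0x8B]) + insn[2:]
```

For example, `lea rax, [rip+0]` encodes as `48 8D 05 00 00 00 00`; after the patch it is `48 8B 05 00 00 00 00`, the same length, so no other offsets in the object file move.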

Evan Schwartz 3 weeks ago

PSA: Your SQLite Connection Pool Might Be Ruining Your Write Performance

Update (Feb 18, 2026): After a productive discussion on Reddit and additional benchmarking, I found that the solutions I originally proposed (batched writes or using a synchronous connection) don't actually help. The real issue is simpler and more fundamental than I described: SQLite is single-writer, so any amount of contention at the SQLite level will severely hurt write performance. The fix is to use a single writer connection with writes queued at the application level, and a separate connection pool for concurrent reads. The original blog post text is preserved below, with retractions and updates marked accordingly. My apologies to the SQLx maintainers for suggesting that this behavior was unique to SQLx. Write transactions can lead to lock starvation and serious performance degradation when using SQLite with SQLx, the popular async Rust SQL library. In retrospect, I feel like this should have been obvious, but it took a little more staring at suspiciously consistent "slow statement" logs than I'd like to admit, so I'm writing it up in case it helps others avoid this footgun. SQLite is single-writer. In WAL mode, it can support concurrent reads and writes (or, technically "write" singular), but no matter the mode there is only ever one writer at a time. Before writing, a process needs to obtain an EXCLUSIVE lock on the database. If you start a read transaction with a SELECT and then perform a write in the same transaction, the transaction will need to be upgraded to a write transaction with an exclusive lock: A read transaction is used for reading only. A write transaction allows both reading and writing. A read transaction is started by a SELECT statement, and a write transaction is started by statements like CREATE, DELETE, DROP, INSERT, or UPDATE (collectively "write statements"). If a write statement occurs while a read transaction is active, then the read transaction is upgraded to a write transaction if possible.
( source ) Transactions started with or also take the exclusive write lock as soon as they are started. Transactions in SQLx look like this: This type of transaction where you read and then write is completely fine. The transaction starts as a read transaction and then is upgraded to a write transaction for the . Update: This section incorrectly attributes the performance degradation to the interaction between async Rust and SQLite. The problem is actually that any contention for the EXCLUSIVE lock at the SQLite level, whether from single statements or batches, will hurt write performance. The problem arises when you call within a write transaction. For example, this could happen if you call multiple write statements within a transaction: This code will cause serious performance degradation if you have multiple concurrent tasks that might be trying this operation, or any other write, at the same time. When the program reaches the first statement, the transaction is upgraded to a write transaction with an exclusive lock. However, when you call , the task yields control back to the async runtime. The runtime may schedule another task before returning to this one. The problem is that this task is now holding an exclusive lock on the database. All other writers must wait for this one to finish. If the newly scheduled task tries to write, it will simply wait until it hits the and returns a busy timeout error. The original task might be able to make progress if no other concurrent writers are scheduled before it, but under higher load you might continuously have new tasks that block the original writer from progressing. Starting a transaction with will also cause this problem, because you will immediately take the exclusive lock and then yield control with . In practice, you can spot this issue in your production logs if you see a lot of SQLx warnings that say where the time is very close to your (which is 5 seconds by default). 
This is the result of other tasks being scheduled by the runtime and then trying and failing to obtain the exclusive lock they need to write to the database while being blocked by a parked task. SQLite's concurrency model (in WAL mode) is many concurrent readers with exactly one writer. Mirroring this architecture at the application level provides the best performance. Instead of a single connection pool, where connections may be upgraded to write at any time, use two separate pools: With this setup, write transactions serialize within the application. Tasks will queue waiting for the single writer connection, rather than all trying to obtain SQLite's EXCLUSIVE lock. In my benchmarks, this approach was ~20x faster than using a single pool with multiple connections: An alternative to separate pools is wrapping writes in a Mutex, which achieves similar performance (95ms in the benchmarks). However, separate pools make the intent clearer and, if the reader pool is configured as read-only, prevent accidentally issuing a write on a reader connection. Having separate pools works when reads and writes are independent, but sometimes you need to atomically read and then write based on the result: Sending this transaction to the single write connection is fine if the read is extremely fast, such as a single lookup by primary key. However, if your application requires expensive reads that must precede writes in a single atomic transaction, the shared connection pool with moderate concurrency might outperform a single writer. Retraction: Benchmarking showed that batched writes perform no better than the naive loop under concurrency, because 50 connections still contend for the write lock regardless of whether each connection issues 100 small s or one large . QueryBuilder is still useful for reducing per-statement overhead, but it does not fix the contention problem.
We could safely replace the example code above with this snippet that uses a bulk insert to avoid the lock starvation problem: Note that if you do this with different numbers of values, you should call . By default, SQLx caches prepared statements. However, each version of the query with a different number of arguments will be cached separately, which may thrash the cache. Retraction: Benchmarking showed that this did not actually improve performance. Unfortunately, the fix for atomic writes to multiple tables is uglier and potentially very dangerous. To avoid holding an exclusive lock across an , you need to use the interface to execute a transaction in one shot: However, this can lead to catastrophic SQL injection attacks if you use this for user input, because does not support binding and sanitizing query parameters. Note that you can technically run a transaction with multiple statements in a call but the docs say: The query string may only contain a single DML statement: SELECT, INSERT, UPDATE, DELETE and variants. The SQLite driver does not currently follow this restriction, but that behavior is deprecated. If you find yourself needing atomic writes to multiple tables with SQLite and Rust, you might be better off rethinking your schema to combine those tables or switching to a synchronous library like with a single writer started with . Update: the most useful change would actually be making a distinction between a and a . Libraries like SQLx could enforce the distinction at compile time or runtime by inspecting the queries for the presence of write statements, or the could be configured as read-only. Maybe, but it probably won't. If SQLx offered both a sync and async API (definitely out of scope) and differentiated between read and write statements, a write could be like , which would prevent it from being held across an point. 
However, SQLx is not an ORM and it probably isn't worth it for the library to have different methods for read versus write statements. Without that, there isn't a way to prevent write transaction locks from being held across s while allowing safe read transactions to be used across s. So, in lieu of type safety to prevent this footgun, I wrote up this blog post and this pull request to include a warning about this in the docs. Discuss on r/rust and Hacker News .
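For readers outside the Rust ecosystem, the architecture the update recommends — one dedicated writer connection with writes queued at the application level, separate connections for reads — can be sketched with Python's built-in sqlite3 module. The SingleWriter class and all of its details are my own illustration (error handling omitted), not SQLx's API:

```python
import queue
import sqlite3
import threading

class SingleWriter:
    """One dedicated writer connection owned by one thread. Writes queue at
    the application level, so they never contend for SQLite's EXCLUSIVE
    lock; reads can use separate (ideally read-only) connections."""

    def __init__(self, path):
        self.jobs = queue.Queue()
        self.thread = threading.Thread(target=self._run, args=(path,), daemon=True)
        self.thread.start()

    def _run(self, path):
        # The connection is created and used only on this thread.
        conn = sqlite3.connect(path)
        conn.execute("PRAGMA journal_mode=WAL")
        while True:
            job = self.jobs.get()
            if job is None:
                conn.close()
                return
            fn, done = job
            fn(conn)        # run the whole write transaction on this one connection
            conn.commit()
            done.set()      # wake the caller

    def write(self, fn):
        """Queue a write and block until the writer thread has committed it."""
        done = threading.Event()
        self.jobs.put((fn, done))
        done.wait()

    def close(self):
        self.jobs.put(None)
        self.thread.join()
```

Concurrent callers of `write` queue in the application rather than spinning on SQLite's lock, which is exactly the serialization the update argues for.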

David Bushell 3 weeks ago

Web font choice and loading strategy

When I rebuilt my website I took great care to optimise fonts for both performance and aesthetics. Fonts account for around 50% of my website (bytes downloaded on an empty cache). I designed and set a performance budget around my font usage. I use three distinct font families and three different methods to load them. Web fonts are usually defined by the CSS rule. The property allows us some control over how fonts are loaded. The value has become somewhat of a best practice — at least the most common default. The CSS spec says: Gives the font face an extremely small block period (100ms or less is recommended in most cases) and an infinite swap period. In other words, the browser draws the text immediately with a fallback if the font face isn’t loaded, but swaps the font face in as soon as it loads. CSS Fonts Module Level 4 - W3C That small “block period”, if implemented by the browser, renders an invisible font temporarily to minimise FOUC. Personally I default to and don’t change unless there are noticeable or measurable issues. Most of the time you’ll use swap. If you don’t know which option to use, go with swap. It allows you to use custom fonts and tip your hand to accessibility. font-display for the Masses - Jeremy Wagner Google Fonts defaults to the same value, which has performance gains. In effect, this makes the font files themselves asynchronous—the browser immediately displays our fallback text before swapping to the web font whenever it arrives. This means we’re not going to leave users looking at any invisible text (FOIT), which makes for both a faster and more pleasant experience. Speed Up Google Fonts - Harry Roberts Harry further notes that a suitable fallback is important, as I’ll discover below. My three fonts in order of importance are: Ahkio for headings. Its soft brush stroke style has a unique hand-drawn quality that remains open and legible. As of writing, I load three Ahkio weights at a combined 150 KB. That is outright greed!
Ahkio is core to my brand so it takes priority in my performance budget (and financial budget, for that matter!) Testing revealed the 100ms † block period was not enough to avoid FOUC, despite optimisation techniques like preload. Ahkio’s design is more condensed so any fallback can wrap headings over additional lines. This adds significant layout shift. † The Chrome blog mentions a zero-second block period. Firefox has a config preference default of 100ms. My solution was to use instead of , which extends the block period from a recommended 0–100ms up to a much longer 3000ms. Gives the font face a short block period (3s is recommended in most cases) and an infinite swap period. In other words, the browser draws “invisible” text at first if it’s not loaded, but swaps the font face in as soon as it loads. CSS Fonts Module Level 4 - W3C This change was enough to avoid ugly FOUC under most conditions. The worst-case scenario is three seconds of invisible headings. With my website’s core web vitals a “slow 4G” network can beat that by half. For my audience an extended block period is an acceptable trade-off. Hosting on an edge CDN with good cache headers helps minimise the cost. Update: Richard Rutter suggested which gives more fallback control than I knew. I shall experiment and report back! Atkinson Hyperlegible Next for body copy. It’s classed as a grotesque sans-serif with interesting quirks such as a serif on the lowercase ‘i’. I chose this font for both its accessible design and technical implementation as a variable font. One file at 78 KB provides both weight and italic variable axes. This allows me to give links a subtle weight boost. For italics I just go full-lean. I currently load Atkinson Hyperlegible with out of habit but I’m strongly considering why I don’t use . Gives the font face an extremely small block period (100ms or less is recommended in most cases) and a short swap period (3s is recommended in most cases).
In other words, the font face is rendered with a fallback at first if it’s not loaded, but it’s swapped in as soon as it loads. However, if too much time passes, the fallback will be used for the rest of the page’s lifetime instead. CSS Fonts Module Level 4 - W3C The browser can give up and presumably stop downloading the font. The spec actually says that and “[must/should] only be used for small pieces of text.” Although it notes that most browsers implement the default with similar strategies to . 0xProto for code snippets. If my use of Ahkio was greedy, this is gluttonous! A default would be acceptable. My justification is that controlling presentation of code on a web development site is reasonable. 0xProto is designed for legibility with a personality that complements my design. I don’t specify 0xProto with the CSS rule. Instead I use the JavaScript font loading API to conditionally load when a element is present. Note the name change because some browsers aren’t happy with a numeric first character. Not shown is the event wrapper around this code. I also load the script with both and attributes. This tells the browser the script is non-critical and avoids render blocking. I could probably defer loading even later without readers noticing the font pop in. Update: for clarity, browsers will conditionally load but JavaScript can purposefully delay the loading further to avoid fighting for bandwidth. When JavaScript is not available the system default is fine. There we have it, three fonts, three strategies, and a few open questions and decisions to make. Those may be answered when CrUX data catches up. My new website is a little chunkier than before but it’s well within reasonable limits. I’ll monitor performance and keep turning the dials. Web performance is about priorities. In isolation it’s impossible to say exactly how an individual asset should be loaded. There are upper limits, of course. How do you load a one megabyte font? You don’t.
Unless you’re a font studio providing a complete type specimen. But even then you could split the font and progressively load different Unicode ranges. I wonder if anyone does that? Anyway I’m rambling now, bye. Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.


Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos ISCA'23 This paper has a lot of great statistics from the Meta fleet. Memory capacity per server is growing over time, but TLB size is not, thus TLB coverage (the fraction of main memory that can be referenced by the TLB at any one time) is trending downward: Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage, however large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, such as pages pinned for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault (here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. Segmentation The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions.
One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Results Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here.
For example, allocations used to send packets cycle through three states: CPU writing to it NIC reading from it It seems like it would be fine for the OS to migrate a packet when it is not in state #3. Subscribe now Source: https://dl.acm.org/doi/10.1145/3579371.3589079 As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases. With 4KiB pages, TLB misses can account for almost 20% of CPU cycles! Source: https://dl.acm.org/doi/10.1145/3579371.3589079 A larger page size could increase TLB coverage, however large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory. This is shockingly difficult in the real world: We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page. The biggest culprit is unmovable (i.e., pinned) pages, for the NIC to access. The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault ( here is a paper that describes some problems associated with PCIe devices causing page faults). The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages. Segmentation The first solution only requires software changes. The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations. A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction. A background resizing algorithm runs periodically to move the boundary between these two regions. One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking. The paper doesn’t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary. 
The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Hardware Page Migration The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page. The OS can use this support to defragment memory by moving “unmovable” pages. The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration. The page copy occurs one cache line at a time. During the copy operation, accesses to the old page work fine. When the access arrives at the LLC, the HW determines if the accessed cache line has been copied yet. If the cache line has been copied, then the access is serviced from the destination page. Otherwise, the access is serviced by the source page. Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU). Once those TLB invalidations have completed, the OS can reuse the old page. Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation. Results Fig. 10 has results for memory segmentation. “Partial” represents a case where physical memory is partially fragmented, whereas “full” represents full fragmentation. Source: https://dl.acm.org/doi/10.1145/3579371.3589079 Dangling Pointers If the NIC is the primary culprit, some vertical integration might be called for here. For example, allocations used to send packets cycle through three states: Empty CPU writing to it NIC reading from it
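The per-cache-line redirect can be modeled in a few lines of Python (a purely illustrative software sketch; in the paper this logic lives in the LLC hardware):

```python
# Toy software model of the LLC's redirect during a page migration:
# the page is copied one cache line at a time, and each access is
# steered to whichever copy holds the up-to-date line.
LINE_SIZE = 64
PAGE_SIZE = 4096

class Migration:
    def __init__(self, src, dst):
        self.src = src       # source physical page (modeled as a bytearray)
        self.dst = dst       # destination physical page
        self.copied = 0      # cache lines [0, copied) have been moved

    def copy_next_line(self):
        lo = self.copied * LINE_SIZE
        self.dst[lo:lo + LINE_SIZE] = self.src[lo:lo + LINE_SIZE]
        self.copied += 1

    def _page_for(self, offset):
        # Serve from the destination iff this line was already copied.
        return self.dst if offset // LINE_SIZE < self.copied else self.src

    def read(self, offset):
        return self._page_for(offset)[offset]

    def write(self, offset, value):
        self._page_for(offset)[offset] = value
```

Once `copied` reaches `PAGE_SIZE // LINE_SIZE`, every access lands on the destination page and the OS can begin its asynchronous TLB invalidations.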

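The three-state packet lifecycle suggests a simple migration-safety check; a hypothetical sketch (none of these names come from the paper):

```python
from enum import Enum, auto

class PacketState(Enum):
    EMPTY = auto()        # state #1: buffer unused
    CPU_WRITING = auto()  # state #2: CPU filling in the packet
    NIC_READING = auto()  # state #3: NIC holds a live pointer to it

def safe_to_migrate(state: PacketState) -> bool:
    # The OS may move the buffer whenever the NIC is not reading it,
    # since only state #3 implies a dangling device-side pointer.
    return state is not PacketState.NIC_READING
```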
Max Bernstein 3 weeks ago

Type-based alias analysis in the Toy Optimizer

Another entry in the Toy Optimizer series. Last time, we did load-store forwarding in the context of our Toy Optimizer. We managed to cache the results of both reads from and writes to the heap, at compile-time! We were careful to mind object aliasing: we separated our heap information into alias classes based on what offset the reads/writes referenced. This way, even if we didn't know whether two objects aliased, we could at least know that accesses at different offsets would never alias (assuming our objects don't overlap and memory accesses are on word-sized slots). This is a coarse-grained heuristic. Fortunately, we often have much more information available at compile-time than just the offset, so we should use it. I mentioned in a footnote that we could use type information, for example, to improve our alias analysis. We'll add a lightweight form of type-based alias analysis (TBAA) (PDF) in this post. We return once again to Fil Pizlo land, specifically How I implement SSA form. We're going to be using the hierarchical heap effect representation from that post in our implementation, but you can use your own type representation if you have one already. This representation divides the heap into disjoint regions by type. Consider, for example, that objects of two unrelated types do not overlap: a pointer to one type is never going to alias a pointer to the other. They can therefore be reasoned about separately. But sometimes you don't have perfect type information available. If your language has a base class of all objects, then the base class's heap overlaps with, say, every subclass's heap. So you need some way to represent that too; just having a flat enum doesn't work cleanly. Consider an example simplified type hierarchy, where one subtree might represent different parts of the runtime's data structures and another could be further segmented into more specific object types. Fil's idea is that we can represent each node in that hierarchy with a tuple of integers (inclusive, exclusive) that represent the pre- and post-order traversals of the tree.
Or, if tree traversals are not engraved into your bones, they represent the range of all the nested objects within them. Then the "does this write interfere with this read" check, the aliasing check, is a range overlap query. The post's version is a perhaps over-engineered Python implementation of the range and heap hierarchy, based on the Ruby generator and C++ runtime code from JavaScriptCore, with a single top-level call that kicks off the tree-numbering scheme. Fil's implementation also covers a bunch of abstract heaps such as SSAState and Control, because his implementation is also used for code motion and whatnot. That can be added on later, but we will not do so in this post. So there you have it: a type representation. Now we need to use it in our load-store forwarding. Recall the shape of our load-store optimization pass: at its core, it iterates over the instructions, keeping a representation of the heap at compile-time. Reads get cached, writes get cached, and writes also invalidate the state of compile-time information about fields that may alias. So far, our may-alias check asks only whether the offsets overlap. That means a unit test which writes to the same offset in two objects of provably different types will fail: the test expects the cached write to the first object to survive a write to the same offset in the second, because the two objects are annotated with disjoint types. If we account for type information in our alias analysis, we can get this test to pass. After doing a bunch of fussing around with the load-store forwarding (many rewrites), I eventually got it down to a very short diff: if we don't have any type/alias information, we default to "I know nothing" for each object; then we check range overlap. The boolean logic looks a little weird, maybe. But we can also rewrite it (via De Morgan's law) to read as: keep all the cached field state for fields that are known, by offset and by type, not to alias. Maybe that is clearer (but not as nice a diff). Note that the type representation is not so important here!
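For concreteness, here is a minimal Python sketch of the range-numbering idea (illustrative names; the post's actual implementation, modeled on JavaScriptCore's, is more elaborate):

```python
# Assign each heap a half-open [begin, end) interval via a depth-first
# traversal, so a parent's interval contains all of its descendants'
# intervals. Aliasing then reduces to an interval-intersection test.
class AbstractHeap:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.begin = self.end = None
        if parent is not None:
            parent.children.append(self)

    def number(self, start=0):
        # Pre-order: take an index for this node, then number children.
        self.begin = start
        next_index = start + 1
        for child in self.children:
            next_index = child.number(next_index)
        # Post-order: the interval ends after the last descendant.
        self.end = next_index
        return next_index

    def overlaps(self, other):
        # Two heaps may alias iff their intervals intersect; for nested
        # intervals that means one is an ancestor of the other (or they
        # are the same heap).
        return self.begin < other.end and other.begin < self.end
```

With a hierarchy like World > Heap > {Int, Str}, the `Int` and `Str` heaps get disjoint intervals (so they never alias), while `Heap` and `World` overlap both.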
You could use a bitset version of the type information if you want. The important things are that you can cheaply construct types and check overlap between them. Nice, now our test passes! We can differentiate between memory accesses on objects of different types.

But what if we knew more? Sometimes we know where an object came from. For example, we may have seen it get allocated in the trace. If we saw an object's allocation, we know that it does not alias (for example) any object that was passed in via a parameter. We can use this kind of information to our advantage: in a made-up IR snippet, we would know that (among other facts) a freshly allocated object doesn't alias any of the incoming parameters, because we have seen its allocation site. I saw this in the old V8 IR Hydrogen's lightweight alias analysis 1 . There is plenty of other useful information, such as:

- If we know at compile-time that object A has 5 at offset 0 and object B has 7 at offset 0, then A and B don't alias (thanks, CF)
- In the RPython JIT in PyPy, this is used to determine that two user (Python) objects don't alias because we know the contents of the user (Python) class field
- Object size (though perhaps that is a special case of the above bullet)
- Field size/type
- Deferring alias checks to run-time (have a branch)

If you have other fun ones, please write in.

We only handle loads and stores in our optimizer. Unfortunately, this means we may accidentally cache stale information. Consider: what happens if a function call (or any other opaque instruction) writes into an object we are tracking? The conservative approach is to invalidate all cached information on a function call. This is definitely correct, but it's a bummer for the optimizer. Can we do anything? Well, perhaps we are calling a well-known function or a specific IR instruction. In that case, we can annotate it with effects in the same abstract heap model: if the instruction does not write, or only writes to some heaps, we can at least only partially invalidate our heap. However, if the function is unknown or otherwise opaque, we need at least more advanced alias information and perhaps even (partial) escape analysis. Consider: even if an instruction takes no operands, we have no idea what state it has access to. If it writes to any object A, we cannot safely cache information about any other object B unless we know for sure that A and B do not alias. And we don't know what the instruction writes to. So we may only know we can cache information about B because it was allocated locally and has not escaped.

Some runtimes such as ART pre-compute all of their alias information in a bit matrix. This makes more sense if you are using alias information in a full control-flow graph, where you might need to iterate over the graph a few times. In a trace context, you can do a lot in one single pass, so there is no need to make a matrix.

As usual, this is a toy IR and a toy optimizer, so it's hard to say how much faster it makes its toy programs. In general, though, there is a dial for analysis and optimization that goes between precision and speed. This is a happy point on that dial: only a tiny incremental analysis cost above offset-only invalidation, but higher precision. I like that tradeoff. Also, it is very useful in JIT compilers, where the managed language is generally a little better-behaved than a C-like language. Somewhere in your IR there will be a lot of duplicate loads and stores from a strength reduction pass, and this can clean up the mess.

Thanks for joining as I work through a small use of type-based alias analysis for myself. I hope you enjoyed. Thank you to Chris Gregory for helpful feedback.

1. I made a fork of V8 to go spelunk around the Hydrogen IR. I reset the V8 repo to the last commit before they deleted it in favor of their new Sea of Nodes based IR called TurboFan. ↩

Sean Goedecke 4 weeks ago

Two different tricks for fast LLM inference

Anthropic and OpenAI both recently announced "fast mode": a way to interact with their best coding model at significantly higher speeds. These two versions of fast mode are very different. Anthropic's offers up to 2.5x the tokens per second (so around 170, up from Opus 4.6's 65). OpenAI's offers more than 1000 tokens per second (up from GPT-5.3-Codex's 65 tokens per second, so 15x). So OpenAI's fast mode is six times faster than Anthropic's 1 . However, Anthropic's big advantage is that they're serving their actual model. When you use their fast mode, you get real Opus 4.6, while when you use OpenAI's fast mode you get GPT-5.3-Codex-Spark, not the real GPT-5.3-Codex. Spark is indeed much faster, but is a notably less capable model: good enough for many tasks, but it gets confused and messes up tool calls in ways that vanilla GPT-5.3-Codex would never do. Why the differences? The AI labs aren't advertising the details of how their fast modes work, but I'm pretty confident it's something like this: Anthropic's fast mode is backed by low-batch-size inference, while OpenAI's fast mode is backed by special monster Cerebras chips. Let me unpack that a bit.

The tradeoff at the heart of AI inference economics is batching, because the main bottleneck is memory. GPUs are very fast, but moving data onto a GPU is not. Every inference operation requires copying all the tokens of the user's prompt 2 onto the GPU before inference can start. Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full. A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus. But obviously overall throughput would be much lower, because people would be waiting at the bus stop for hours until they managed to actually get on one.
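The bus analogy can be put in rough numbers with a toy model (every constant below is made up for illustration):

```python
# Toy model of the batching tradeoff. Loading a batch onto the GPU has
# a fixed cost, so larger batches amortize it, but each rider has to
# wait for the bus to fill before it departs.
LOAD_MS = 50.0      # fixed per-batch cost of moving data onto the GPU
COMPUTE_MS = 5.0    # marginal compute per request once data is resident
ARRIVAL_MS = 20.0   # one new request arrives this often

def stats(batch_size):
    fill_wait = ARRIVAL_MS * (batch_size - 1)       # first rider waits longest
    service = LOAD_MS + COMPUTE_MS * batch_size     # one load, shared by all
    latency = fill_wait + service                   # worst-case request latency
    throughput = batch_size / service               # requests per ms of GPU time
    return latency, throughput
```

Under these made-up numbers, `stats(1)` is "fast mode" (the bus departs immediately, but the fixed load cost is paid per request), while `stats(16)` makes the worst-case request wait roughly 8x longer in exchange for almost 7x the requests per unit of GPU time.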
Anthropic's fast mode offering is basically a bus pass that guarantees that the bus immediately leaves as soon as you get on. It's six times the cost, because you're effectively paying for all the other people who could have got on the bus with you, but it's way faster 3 because you spend zero time waiting for the bus to leave. Obviously I can't be fully certain this is right. Maybe they have access to some new ultra-fast compute that they're running this on, or they're doing some algorithmic trick nobody else has thought of. But I'm pretty sure this is it. Brand new compute or algorithmic tricks would likely require changes to the model (see below for OpenAI's system), and "six times more expensive for 2.5x faster" is right in the ballpark for the kind of improvement you'd expect when switching to a low-batch-size regime.

OpenAI's fast mode does not work anything like this. You can tell that simply because they're introducing a new, worse model for it. There would be absolutely no reason to do that if they were simply tweaking batch sizes. Also, they told us in the announcement blog post exactly what's backing their fast mode: Cerebras. OpenAI announced their Cerebras partnership a month ago in January. What's Cerebras? They build "ultra low-latency compute". What this means in practice is that they build giant chips. An H100 chip (fairly close to the frontier of inference chips) is just over a square inch in size. A Cerebras chip is 70 square inches. You can see from pictures that the Cerebras chip has a grid-and-holes pattern all over it. That's because silicon wafers this big are supposed to be broken into dozens of chips. Instead, Cerebras etches a giant chip over the entire thing. The larger the chip, the more internal memory it can have. The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. Typically GPU SRAM is measured in the tens of megabytes.
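Why weight streaming bounds token rate can be sketched with back-of-the-envelope arithmetic (the bandwidth figures below are illustrative, not vendor specs):

```python
# During decode, a dense model reads every parameter once per generated
# token, so single-stream token rate is bounded by memory bandwidth
# divided by the size of the weights:
#   tokens/sec <= bandwidth_bytes_per_sec / bytes_of_weights
def max_tokens_per_sec(n_params, bytes_per_param, bandwidth_gb_per_s):
    bytes_of_weights = n_params * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / bytes_of_weights

# A hypothetical 40B-parameter model at int8 (1 byte per parameter):
hbm_rate = max_tokens_per_sec(40e9, 1, 3_350)    # HBM3-class bandwidth
sram_rate = max_tokens_per_sec(40e9, 1, 50_000)  # much faster on-chip SRAM
```

The absolute numbers are illustrative; the point is that the achievable token rate scales directly with the bandwidth gap between off-chip memory and on-chip SRAM.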
That means that a lot of inference time is spent streaming portions of the model weights from outside of SRAM into the GPU compute 4 . If you could stream all of that from the (much faster) SRAM, inference would get a big speedup: fifteen times faster, as it turns out! So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That's why they're offering a brand new model, and why the Spark model has a bit of "small model smell" to it: it's a smaller distil of the much larger GPT-5.3-Codex model 5 .

It's interesting that the two major labs have two very different approaches to building fast AI inference. If I had to guess at a conspiracy theory, it would go something like this:

1. OpenAI partner with Cerebras in mid-January, obviously to work on putting an OpenAI model on a fast Cerebras chip
2. Anthropic have no similar play available, but they know OpenAI will announce some kind of blazing-fast inference in February, and they want to have something in the news cycle to compete with that
3. Anthropic thus hustles to put together the kind of fast inference they can provide: simply lowering the batch size on their existing inference stack
4. Anthropic (probably) waits until a few days before OpenAI are done with their much more complex Cerebras implementation to announce it, so it looks like OpenAI copied them

Obviously OpenAI's achievement here is more technically impressive. Getting a model running on Cerebras chips is not trivial, because they're so weird. Training a 20B or 40B param distil of GPT-5.3-Codex that is still kind-of-good-enough is not trivial. But I commend Anthropic for finding a sneaky way to get ahead of the announcement that will be largely opaque to non-technical people. It reminds me of OpenAI's mid-2025 sneaky introduction of the Responses API to help them conceal their reasoning tokens.

Seeing the two major labs put out this feature might make you think that fast AI inference is the new major goal they're chasing. I don't think it is. If my theory above is right, Anthropic don't care that much about fast inference, they just didn't want to appear behind OpenAI. And OpenAI are mainly just exploring the capabilities of their new Cerebras partnership. It's still largely an open question what kind of models can fit on these giant chips, how useful those models will be, and whether the economics will make any sense.

I personally don't find "fast, less-capable inference" particularly useful. I've been playing around with it in Codex and I don't like it. The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user's time is spent handling mistakes instead of waiting for the model 6 . However, it's certainly possible that fast, less-capable inference becomes a core lower-level primitive in AI systems. Claude Code already uses Haiku for some operations. Maybe OpenAI will end up using Spark in a similar way.

1. This isn't even factoring in latency. Anthropic explicitly warns that time to first token might still be slow (or even slower), while OpenAI thinks the Spark latency is fast enough to warrant switching to a persistent websocket (i.e. they think the 50-200ms round trip time for the handshake is a significant chunk of time to first token). ↩
2. Either in the form of the KV-cache for previous tokens, or as some big tensor of intermediate activations if inference is being pipelined through multiple GPUs. I write a lot more about this in Why DeepSeek is cheap at scale but expensive to run locally, since it explains why DeepSeek can be offered at such cheap prices (massive batches allow an economy of scale on giant expensive GPUs, but individual consumers can't access that at all). ↩
3. Is it a contradiction that low-batch-size means low throughput, but this fast pass system gives users much greater throughput? No. The overall throughput of the GPU is much lower when some users are using "fast mode", but those users' throughput is much higher. ↩
4. Remember, GPUs are fast, but copying data onto them is not. Each "copy these weights to GPU" step is a meaningful part of the overall inference time. ↩
5. Or a smaller distil of whatever more powerful base model GPT-5.3-Codex was itself distilled from. I don't know how AI labs do it exactly, and they keep it very secret. More on that here . ↩
6. On this note, it's interesting to point out that Cursor's hype dropped away basically at the same time they released their own "much faster, a little less-capable" agent model. Of course, much of this is due to Claude Code sucking up all the oxygen in the room, but having a very fast model certainly didn't help. ↩
