Posts in Go (20 found)
iDiallo Today

Why all the PRs?

It's a signal. That's why we get AI-generated PRs.

We told everyone that to get your resume taken seriously, you need to show your work. When I was getting started in my career, that meant having your own website that you contribute to regularly. So I did that. I built websites, I maintained them. I kept maintaining them even after I got the jobs, because that's how I actually honed my web programming skills. Where else was I going to try new frameworks, a new JavaScript paradigm, or Ruby on Rails? I got the job, and I advised other developers to follow the same path.

But then GitHub became mainstream. Rather than just show a finished website, you could actually share the code that runs your project. Share a link to your GitHub project and companies can review your code and directly gauge your experience. But even better, you can show your contributions to open source projects. Not just any projects. Popular projects. GitHub stars became a metric people look for. A signal that can be used to quickly assign a value to a candidate.

But that’s the story told from the outside. I don’t think the GitHub profile link was ever important, unless it was significantly good. Employees focused on their work rarely have the time to maintain healthy GitHub activity. Their experience comes from their day-to-day job. So for the most part, not much attention was paid to GitHub links beyond skimming those surface-level details.

When stacks of resumes landed on my desk, the best candidates stood out because they had work experience. The good candidates had projects that they could link to, on GitHub or elsewhere. The worst candidates had long, padded resumes with elements of every job-application tips-and-tricks article. They had a website, but it was built in a day for the purpose of getting a job, with nothing interesting to say. They had GitHub links, but those often pointed to school projects, homework, or boilerplate code.
That’s the vast majority of GitHub links I used to get. People with active and well-maintained GitHub profiles were rare. Rare because it actually requires time, effort, and experience.

But then we have AI. There was a golang auth issue that I've contributed to on GitHub. It was already a few years old when I proposed a solution that worked for my case. It wasn't universal, so it wasn't accepted. The discussion is revived every couple of years, each person bringing one more piece to the puzzle. But then recently, someone exploded the thread with comments, and even created a PR to go with it. This was a user who went from a dormant account to 4000 contributions in a year. It was all AI-assisted code.

This isn’t a comment on the quality of his code, but he was clearly trying to optimize the metric. Looking at his LinkedIn profile, he doesn’t work in a software engineering role, and it’s hard to decide if he would be a good contributor if hired. If we were to judge his resume by looking at the GitHub profile, it might catch our attention.

But then, there is a problem. There are hundreds, even thousands of people all doing the same thing. They are cranking up their contributions to GitHub projects using AI, so they can have a better chance at getting hired as developers. I understand the job market is rough right now, especially for Gen Z, and anything to differentiate yourself is a plus.

The problem is this is being done at the expense of open source projects. The contributors are not submitting PRs to your project because they are personally invested in it. Instead, they are trying to get their name on the contributors list so that they can use it as a signal on their resume. When we are out here debating if there is any merit in AI-generated PRs, or if we should just judge the code, we tend to miss that their gesture is completely hollow. The PR author's intentions are completely misaligned with those of the project's maintainers. They are playing a different game.
We call it slop, or a waste of time. We ban them, and they get really vocal about expressing their First Amendment rights. We are directly interfering with their goal of padding their resume.

I often ask, why don’t the people who create those PRs just start their own project? One answer I’m starting to believe is that nobody cares about a GitHub profile with a handful of stars. You need to contribute to a popular project. Most if not all AI-generated websites look the same, no matter how well you customize the prompt. Most greenfield projects from new programmers look the same, because the prompter lacks the experience to do anything different.

Contributing to open source is a scary thing when you are new. Even when you have experience, it’s a deliberate act. You have to be invested in the work. Just like questions asked on Stack Overflow, issues you raise will often get closed. And when they do, you have to learn from it.

The value of an open source contributor is not in the volume of work they can perform. If you skim any important project, you’ll see that the best contributors spend more time discussing the problem than writing code. Their value is in solving problems and contributing to the collective memory of the group.

But when you are doing a drive-by PR that may or may not be correct, and you are just trying to get your name on a list, you are providing zero value to the maintainer. Just more work. This is the signal every slop PR generator is after.

Xe Iaso Yesterday

IPv6 zones in URLs are a mistake

IPv6 is weird. One of the stranger parts of the standard is that every interface's link-local addresses are in fe80::/10. If you have a machine with two network interfaces, both of them will be in fe80::/10, so if you have a packet destined to a link-local address, how do you disambiguate it?

The answer is you use IPv6 scopes/zones. The exact format of what goes into a zone is OS dependent, but on Linux it's the interface name and on Windows it's the interface ID. This lets the kernel's routing table know how to handle an address range conflict. On my tower, the zone is the name of the tower's ethernet device, appended to the address after a percent sign.

When you create a host:port bindhost, you normally separate the hostname and port with a colon. IPv6 uses colons to separate hex groups. In order to disambiguate what's the host and what's the port, you typically format the IPv6 address in square brackets, with the port after the closing bracket. A zone, when there is one, goes inside the brackets along with the address.

Now let's get URL encoding into the mix. From high orbit, you can imagine a URL's format as a scheme, a host with an optional port, and a path. An IPv6 zone would then be part of the hostname, just like in a bracketed host:port pair. So you'd think you could put the bracketed, zoned address straight into the URL. But if you try to parse that as a URL in Go, you get an error.

This happens because URLs can't represent all Unicode values, so any values that don't fit into the grammar of a URL become percent-encoded. This is why you'll sometimes see a %20 in URLs in the wild; that's encoding the ASCII space character, which is invalid in URLs. The percent sign that introduces the zone gets read as the start of one of these escapes. In order to work around this, you need to percent-encode the percent sign in the IPv6 zone, writing it as %25.

In theory, there is guidance for how to properly handle IPv6 zones in user interfaces in RFC 9844, but there's no such guidance for URLs. Go also does not seem to follow this RFC in net/url.

EDIT: It seems that this behaviour is compliant with RFC 6874 and that this is in fact how it is meant to be done. Our industry confounds me.
So in the meantime, in order for Anubis to point to IPv6 zoned addresses, you need to encode the % with percent encoding. This is horrible, but it seems that this is an edge case that applies to other frameworks, programming languages, and libraries:

https://trac.nginx.org/nginx/ticket/623
https://github.com/psf/requests/issues/6808
https://datatracker.ietf.org/doc/html/draft-schinazi-httpbis-link-local-uri-bcp-03 (Browsers don't currently support IPv6 zones because it breaks the concept of an "origin", which is used for many subtle things. This RFC draft attempts to define a zone origin in IPv6 so that browsers have a leg to stand on.)

Maybe some day in the future there will be a better option here. In the meantime my policy of not forking the Go standard library means that this somewhat terrible UX for an edge case is acceptable. I hate it, but what can you do?

TL;DR: computers were a mistake.

Sean Goedecke 2 days ago

Anti-AI nostalgia and the cult of the past

Programmers were better back in the day, weren’t they? Back when we had real programmers. Not just people who got paid to write code, but people who lived it, who were obsessed with their craft, and whose code was a lively expression of themselves. Hackers were hackers in those days before money took over the industry. Don’t even get me started on LLMs. Could there be a better example of today’s degenerate spirit? A machine to mass-produce software (not good software, just barely good enough), so that the weak minds that dominate the industry can indulge their obsession with quantity : of slop code, of features, and ultimately of money, which is the only way they can understand value. If they weren’t destroying our way of life, they would be pitiable. All of them together don’t have a fraction of the spiritual integrity of someone like Mel . But as it is, we must band together to crush them and drive them from our industry like the parasites they are. Okay, that’s not actually what I believe. But there sure are a lot of posts 1 and comments on the internet that sound a bit like the paragraph above. Here are some older quotes that might sound similar: …the third collapse, in which power tends to pass into the hands of the lowest of the traditional castes, the caste of the beasts of burden and the standardized individuals. The result of this transfer of power was a reduction of horizon and value to the plane of matter, the machine, and the reign of quantity. 2 Usura rusteth the chisel \ It rusteth the craft and the craftsman \ It gnaweth the thread in the loom 3 The actual accomplishments of the past will nevertheless remain accomplishments, while the artistic stammerings of the painting, music, sculpture, and architecture produced by these types of charlatans will one day be nothing but proof of the magnitude of a nation’s downfall. 4 These are all from the writings (or speeches) of famous fascists: Julius Evola, Ezra Pound, and Hitler himself. 
Mussolini’s Doctrine of Fascism begins by defining fascism as a “spiritual attitude”, which the fascist man adopts in order to regain the mysterious qualities that were lost by the transition to modern life. In his classic Ur-Fascism , Umberto Eco’s first two defining features of fascism are the “cult of tradition” and the “rejection of modernism”. So when someone tells me that the industry has lost its way and we must deny the corrupting influence of modern technology in order to retvrn to the time of virile real programmers (who understood and appreciated the spiritual dimension of programming), I get suspicious. It’s strange to describe anti-AI sentiment as potentially fascist, since a very popular argument is that LLMs themselves are an inherently fascist tool. Surely both sides of the debate can’t be fascist? I do think that the structure of fascist arguments is generally persuasive , and that many avowedly anti-fascist groups do sometimes fall into this trap: describing the world as a struggle between the spiritual power of the macho, traditional man and the corrupting influence of degenerate (often foreign) capital. For instance, I am a big fan of Lord of the Rings. I’ve read the series and watched the films multiple times, and even made a failed attempt to learn Elvish as a kid. But it’s hard to deny that fascists absolutely love Lord of the Rings. “Marble statue of a Roman emperor” might be the most popular avatar for fascists on the internet, but Aragorn is the second most popular. Neo-fascist movements in Italy explicitly take up Lord of the Rings as a foundational text. Why? Because the core conflict in the text is between the traditional, nostalgic heroism of the Shire and Gondor, and the corrupting modern industrial (partly foreign ) influence of Saruman and Sauron 5 . I don’t think Lord of the Rings (or anti-AI rhetoric) is intrinsically fascist. 
In fact, the surface-level reading of the text is anti-fascist: the plucky people of the West banding together to fight Sauron’s command-and-control totalitarian society. But I can see why fascists love it. One common historical touch-point for anti-AI folks is the Luddites, who were a violent conservative labor movement in early 1800s England. Anti-AI blogs adopt Luddite language like “smashing frames”, and positively cite the Luddites as “the go-to enemies of fascism since its inception”. I’ve written at length about what we can learn from the Luddites in Luddites and burning down AI datacenters , but one point I think is under-emphasized by the (generally pro-Luddite) books is that the Luddites were a little bit fascist themselves . Brian Merchant’s Blood in the Machine is the most popular recent book on the Luddites. I enjoyed it, but Merchant’s attempts to paint the Luddites as a friendly, left-wing, proto-feminist movement 6 seemed really unconvincing to me. From the writings of the Luddites, it’s clear that they were interested in protecting the rights of their all-male elite guild fraternity. Here’s one Luddite threat to a workshop that explicitly includes a threat against the female workers 7 : We think it quite inconsistent with our duty as men, as husbands and as fathers to suffer ourselves to be ruined any longer by a set of vagabond strumpets and those gibbet-deserving rascals that are looking over them. We will lead them to their satisfaction. We sincerely hope, gentlemen, that you will discharge the bitches and take men into your employ again, or they must take what they get. These were fundamentally conservative people who felt (correctly) that modernity had deprived them of their elite status, handing it instead to lower-paid inferiors: women, vagabonds, and foreigners. The Luddites were obviously not fascists 8 . 
However, the basic ingredients were there: wounded pride, a masculine elite identity, hatred of modern economics, and violence aimed at restoring their previous position in society. The currents that produced Luddism are the same currents that guided so many unhappy people towards fascism. When things are looking grim for an elite group, they often turn towards any movement that promises a return to an idealized past. If my blog has themes, one of them is surely that many software engineers labor under a delusion that their job is to be excellent at their craft. Of course, wanting to be an excellent programmer is not a delusion; it is a completely legitimate value to hold, and a legitimate purpose to pursue. It’s just not what you’re paid to do at work. Your job , unfortunately, is producing shareholder value . This delusion has been punctured by the end of ZIRP , and again more recently by the rise of AI coding. In this environment, I worry that some software engineers will form exactly the kind of disillusioned elite that was the audience for Ezra Pound’s poems about “usury” or the Luddites’ campaign against unapprenticed (often female) textile workers. I worry that AI, and the companies that build AI, are becoming an enemy against which anything is permitted: an enemy which in Umberto Eco’s words is “at the same time too strong and too weak”, unable to reason and yet powerful enough to drastically reshape the global labor market for the worse. The enemy of fascism is nuance. Fascism presents a good, clean, rousing story about a spiritual conflict between right and wrong. It is anathema to fascism to stop and muddy the waters a bit: in this case, to explore the ways in which LLMs, like any transformative technology, can both support and endanger traditional values. 
In The left-wing case for AI I wrote about how AI is being used right now as a disability aid, and many disabled readers wrote in to share their positive experiences with LLMs, and often how alienated they feel by the anti-AI mainstream on the left. I recently got an email describing how there’s a sudden flood of accessibility software for blind people 9 that’s actually built by blind people , who can now iterate with a LLM to get a product that meets their needs. Framing AI as an ontological evil erases experiences like these. Being anti-AI is not inherently fascist. Many of the anti-AI posts I’ve quoted are thoughtful, sensitive pieces exploring how the author thinks about one of the biggest changes to our industry. I still think the world needs more articles like that, not less, but the more of them I read, the more I recognize the tropes: spiritually pure lovers of the craft, degenerate peddlers of corrupt modernism, a need to return to the traditional ways of the hacker, and a lament for the (potentially) waning power of an elite fraternity of programmers. I know I’m tiptoeing around the worst argument in the world . It isn’t a refutation of anti-LLM arguments to say that they are structurally similar in some ways to fascist arguments, any more than it’s a devastating critique to say the same thing about Lord of the Rings. Sometimes it is good to try and halt the march of progress! Some of our past traditions really were purer and more spiritually robust! It just bothers me, that’s all. I used to read The Story of Mel with unalloyed pleasure. Now it makes me nervous. If you believe you’re fighting the embodiment of fascism , or for the idea of value itself , what tactics are off-limits? What positions might you eventually come to accept? It feels wrong to directly associate my caricature with any actual posts, but it also feels wrong to make a blanket assertion without examples. 
Just so you know what I’m talking about, here are some posts that have elements of this attitude. I like some of these posts and dislike others.
Page 329 of my copy of Julius Evola’s Revolt Against the Modern World.
Ezra Pound, Canto XLV. “Usura” should be read as “usury”, or today we could gloss it as “capitalism”: all Pound’s examples of great art were from the pre-capitalist patronage era of art.
Adolf Hitler, from his speech at the 1933 Party Congress in Nuremberg.
Of course, there’s also historically been a strong pro-technology current in fascist thinking (even specifically Italian fascist thinking).
Page 134 of Blood in the Machine has a brief argument that Luddism was feminist because the (exclusively male) artisans’ wives would provide food for their meetings. No, really.
From Kevin Binfield’s Writings of the Luddites, page 40. I’ve taken the liberty of re-rendering it in modern spelling and grammar.
Aside from being too early, they didn’t have any connection to the state apparatus of power (in fact, they were ultimately crushed by it) and they famously lacked a singular leader.
The example cited was BlindRSS.

neilzone 3 days ago

Why are there no good tablets at the moment?

A friend was looking for a new tablet, and they asked me for a recommendation. And… I just don’t have one.

The only good tablet, because Android can be replaced with GrapheneOS, was the Google Pixel Tablet, and that is no longer available. Secondhand prices are sky high. That was my go-to recommendation for a while. But it looks like Google has abandoned this project too.

Amazon’s range of FireOS tablets is, IMHO, bloated with crapware which one cannot easily remove. Even the Fire-Tools scripts only get one so far. I can’t recommend one.

There are some fun-looking “tablet computers”, but they are all expensive.

A secondhand Surface Go, if one wants a Linux-based tablet, is readily available and pretty cheap, but honestly not what most people will want. And, while I like it as a cheap, touchscreen Linux machine, it is not particularly powerful, which can be frustrating. And getting the camera working is a nuisance.

I guess that there are some iPads, if one is accepting of Apple / iOS. Again, that wouldn’t be my choice, but I can see why some people like them.

Why is there no good (non-Apple) tablet at the moment?


How Other Link Checkers Do Recursion

After I published Five Years of Trying to Add Recursion to lychee, one reply I got was a very fair question:

If recursion is so hard, how do other link checkers do it? Plenty of them already crawl websites!

This sent me down a rabbit hole of reading the code of other link checkers. The key takeaway is: they didn’t find a clever trick we missed. They were built as crawlers from the very first commit, and I initially built lychee as a stream.

I went and read the source of the recursive checkers we list in lychee’s README: muffet (Go), LinkChecker (Python), linkinator (TypeScript), and broken-link-checker (JavaScript). This post is a teardown of how each one actually handles recursion, what it costs them, and what it means for lychee.

If you haven’t read the first post, the summary is that lychee was architected as a one-shot, unidirectional pipeline. Recursion needs a cycle (responses create new inputs), and cycles in an async, channel-based pipeline are where the dragons live. 🐲 Five years and four attempts later, the pieces we’ll need to do it properly only just landed.

DAGs vs. cycles

Every recursive checker I looked at is built from the same three parts: a frontier of pending URLs, a visited set for deduplication, and a signal that says when the crawl is done. Diagrammatically, lychee is different from the others: crawlers have a back-edge baked in. Our pipeline doesn’t, and every one of my failed attempts was an effort to bend that back-edge into a graph that was never designed for it.

Looking at the crawler graph more closely, note that the visited check happens in the enqueue step, atomically with the mark, before the worker ever touches the network. That ordering is the entire fix to the deduplication race that haunted lychee’s attempts 1–4, where the cache was written after checking. Each tool uses a variation on it.

muffet (Go): a WaitGroup and a Set

muffet is closest in spirit to lychee: a fast, single-binary, concurrent website checker. The dedup and scheduling decision lives in one method, backed by a mutex-guarded set. Adding a URL to that set returns whether the URL was already present, so a page is only scheduled the first time it’s seen. Dedup happens at enqueue, synchronized by the set’s mutex. This is basically a line-by-line translation of the crawler diagram. Checking a page fetches all of its links concurrently, and feeds qualifying ones back into the same method, the back-edge.

How muffet knows it’s done

muffet’s answer to termination is a little daemon manager built around a WaitGroup. Every scheduled page increments the group; every completed page decrements it; the wait returns when the count hits zero. The whole crawl bootstraps with a single increment before the wait, so the counter is positive before anyone waits on it.

This is the same counter I tried (and failed with) in Attempt 1 and Attempt 4. The difference is the invariant: the increment is only ever called from inside an already-running daemon that holds the count above zero (or from the bootstrap). There is no window where the counter briefly reads zero while work is still pending. Go’s WaitGroup enforces this invariant so naturally that it doesn’t feel like distributed termination detection at all, but that’s exactly what it is. It’s the moral equivalent of the primitive Kait contributed to lychee in 2026.

Where the tradeoffs are

Concurrency isn’t bounded by the daemon manager. It starts a goroutine for every task, spawning unbounded goroutines. The actual limiting happens downstream in a buffered-channel counting semaphore and a per-host throttler pool. muffet separates “the frontier” from “the rate limiter,” which is exactly the separation lychee lacked when it tried to use one bounded channel as both in the past.

Cheap goroutines do a lot of heavy lifting. Spawning a goroutine per link is “fine” in Go. The equivalent in Rust (a task per link, each needing its own state) is what pushed me toward the ownership pain I wrote about.

On extensibility, muffet is a focused CLI, not a library. There’s no plugin surface; you get what the flags give you.
lychee deliberately ships as a reusable crate, which raises the bar, since every architectural choice has to uphold the standards of a public API.

On scalability, unbounded goroutines plus an in-memory visited set scale comfortably to large sites, but there’s no disk-backed frontier, so a truly enormous crawl is bounded by RAM. Same as lychee.

LinkChecker (Python): a joinable unbounded queue

LinkChecker has existed since the year 2000. It’s a synchronous, thread-pool crawler. Its frontier is a hand-written queue, a clone of Python’s Queue with join() / task_done(). The very first design comment is explicit about the exact deadlock that bit me. That comment is our Attempt 4 backpressure deadlock, called out and designed around. lychee tried to push discovered URLs into a bounded channel; when it filled, the response handler blocked, no responses drained, no slots freed. Deadlock. 💥

LinkChecker’s answer is brutalist in nature: the frontier is unbounded. Backpressure is enforced elsewhere (a fixed thread count and per-host throttling), never by blocking a producer that is also a consumer.

Termination by counter, done right

Joining the queue blocks until the count of unfinished tasks hits zero. Again: a counter. But the increment on put and the decrement on task-done are both inside the queue’s lock, and a worker marks an item done only after fully processing it, including enqueuing its children. So children are counted before the parent is marked done, with no premature zero. It’s the same counter semantics, implemented with a mutex and a condition variable.

Deduplication, before the request

LinkChecker writes the URL into its result cache at enqueue time. That sentinel is a “fix” that’s missing in lychee’s attempts. By the time any worker thread checks the URL, the cache already says “mine,” so concurrent discovery from another page is a no-op.

Per-host politeness and termination guards

Requests are throttled per host, and there is a guard so a stuck crawl can’t hang forever.

Where the tradeoffs are

Blocking threads instead of async.
Each of the threads (default 10–100) does blocking I/O. Simple and battle-tested, but the concurrency ceiling is the thread count, and each thread carries a full stack. lychee’s Tokio model reaches thousands of concurrent in-flight requests on a handful of OS threads; LinkChecker can’t, and doesn’t try.

The unbounded frontier trades a deadlock for unbounded memory. The explicit “no max size” decision means RAM growth on huge sites. There’s a cap, among other mitigations.

Extensibility is excellent. LinkChecker has a real plugin system (anchor checks, SSL, virus scanning, and more) and many output loggers. This is the most extensible of the bunch, and it pays for that with a large, mature, somewhat old-fashioned codebase.

On scalability, it’s GIL-bound and thread-limited, so raw throughput is the lowest here, but correctness and feature coverage are high.

linkinator (TypeScript): Single-Threaded

linkinator is a Node.js checker, and it benefits from something neither Go nor Rust provides: a single-threaded event loop. Check-and-insert into the visited set is atomic for free, because no two callbacks run simultaneously.

The frontier is a concurrency-limited queue (a p-queue-style structure). Termination is one line: waiting for the queue to become idle. That is the library’s termination detection: it resolves when the queue is empty and no task is in flight. Same idea as muffet’s WaitGroup and LinkChecker’s queue join, just expressed as a promise and backed by a single-threaded runtime, so no mutex is needed to protect the visited set.

The back-edge and the race-free dedup

When crawling, linkinator GETs the page, extracts links, and for each new URL re-enters the queue. Because JavaScript is single-threaded, the entire check-and-insert executes without interruption. In Rust or Go, that’s a critical section you must guard with a mutex (and get the ordering right); in Node it’s just three statements. This is the single biggest reason recursion is easier in Node than in Rust. It’s just a language feature.
linkinator also keeps a set of keys and a map of pending checks, so it can wait on an in-flight check and still report a duplicate broken link against every parent that references it. Those reuse-operations are themselves pushed onto the same queue, so the idle wait correctly covers them too.

HEAD vs GET

linkinator uses HEAD for leaf links but GET when it needs to crawl, because recursion needs the response body to find more links. This is precisely lychee’s remaining open problem: you can only recurse into pages you fetched with a body. linkinator just always GETs when crawling; lychee plans to reuse the body it already has in cache from the check it just performed.

Where the tradeoffs are

Single-threaded is both a blessing and a ceiling. No data races, trivially correct dedup, but HTML parsing is CPU work that blocks the one event loop. For thousands of pages, you’re bound by a single core. lychee’s multi-threaded runtime parses and checks in parallel.

It suffers from in-memory result inflation. The source explicitly comments on “massive result inflation for heavily interlinked sites”: the results array and the related bookkeeping all grow with the crawl. Fine for a docs site, heavy for a giant one.

Rate limiting is reactive, not proactive. There’s a delay cache that backs off per host once the host pushes back, but no general per-host concurrency cap like lychee’s. linkinator can hammer a host until it complains; lychee now paces before the complaint.

For extensibility, it exposes a programmatic API, so it’s embeddable and scriptable, which is nice. It’s a library first, like lychee.

broken-link-checker (JavaScript): event-driven, using two queues

broken-link-checker (BLC) takes the event-driven model furthest. It’s built on a request queue with a concurrency limit, and it nests two of them: a site-level queue feeding a page-level one.

The frontier and dedup live at the site level.
Visited pages are tracked in a cache, written at enqueue time. Recursion is governed by a filter that decides whether a discovered link becomes a crawled page.

Termination by event cascade

BLC has no counter and no join. It rides the queue’s drain events. When the page-level queue empties it fires an event, which makes the page checker emit its own completion event and call the site queue’s callback; when the site queue drains, it fires the final event. That’s the public end-of-crawl signal.

That’s their termination detection, expressed as “the request queue reported empty.” And in classic Node.js fashion, the callback is what actually tells the site queue to free up a slot for another site. So the termination of one site is what allows another to start, and the termination of the whole crawl is what allows the process to exit. It’s a cascade of events that propagates from the page queue to the site queue to the process.

Where the tradeoffs are

It’s the best web citizen of the bunch. robots.txt is honored, and per-host limits are first-class. This is a crawler that’s polite by default.

Event cascades are powerful but fiddly. Termination is spread across half a dozen event handlers and two nested queues. It works, but the control flow is much harder to follow than a single wait call. This is the JS cousin of the “leaky abstraction” problem I described, where recursion-awareness ends up sprinkled across many handlers.

It’s single-threaded, the same ceiling as linkinator, plus the in-memory cache per site.

On maturity versus momentum, it’s very widely used (it powers a lot of tooling), but development has slowed. The architecture is still sound and worth studying.

A note on markdown-link-check and the “industrial” crawlers

Our README marks markdown-link-check as supporting recursion, but there’s some nuance there: it recurses over Markdown files, not by spidering a live website. There’s no HTTP frontier and no termination problem in the sense above. Worth a mention so the comparison is honest, not worth a teardown.
If you want to see the pattern at full industrial scale, look at Scrapy (Python/Twisted) or Colly (Go). Both use the same approach: a scheduler (frontier) with a pluggable, optionally disk-backed queue, a dupefilter (often a Bloom filter rather than a ), a bounded downloader pool, and explicit “engine idle → close spider” termination. They solve exactly the problems lychee struggled with ( distributed termination detection , backpressure, dedup), just with years of dedicated crawler engineering behind them. The takeaway isn’t “lychee should be Scrapy”: it’s that crawling is a well-trodden architecture, and lychee is simply standing on a different one right now. Side-by-side Tool Lang / runtime Concurrency model Frontier “Done?” signal Dedup point Per-host limiting muffet Go, goroutines goroutine pool + semaphore + host throttler mutex-guarded set + daemon channel visited set at enqueue host throttler pool LinkChecker Python, threads fixed blocking thread pool unbounded joinable-queue counter ( ) result cache at (req/s) linkinator Node, event loop single-thread + p-queue ( ) p-queue at enqueue (race-free) reactive broken-link-checker Node, event loop ( ) nested request queues queue-drain events at enqueue + lychee (2026) Rust, Tokio tasks + channels + per-host pool lychee in 2026 finally has a column-for-column match. The is muffet’s and LinkChecker’s . The is BLC’s / and LinkChecker’s . The per-URI mutex is everyone’s enqueue-time dedup. So Why Couldn’t We Just Copy Them? Three reasons, in increasing order of how much they’re actually lychee’s fault. They started as crawlers; lychee started as a stream. Every tool above has a back-edge in its core data structure. lychee’s core was a DAG optimized for the 99% case (a list of files/URLs, checked once, fast). Retrofitting a cycle onto a pipeline is much harder than having one from the start. The problem is architectural in nature. The frontier and the rate-limiter must be different objects. 
muffet (set + semaphore), LinkChecker (unbounded queue + thread count), linkinator (p-queue + delayCache), BLC (request queue + maxSockets) all keep “what to do next” separate from “how fast to go.” lychee’s early attempts tried to make one bounded channel serve both roles, and a cycle through a bounded channel deadlocks. The fix (lychee’s plus a over an unbounded work source) is the same separation we’re aiming for now. Single-threaded runtimes get dedup for free. Both Node tools dedup with a plain and zero locking, because the event loop serializes access. Go and Python pay a mutex. Rust pays a mutex and fights the borrow checker about who owns the shared state across . That’s the ~30% “Rust tax” I estimated last time : not the algorithm, but the friction of expressing shared mutable frontier state under . None of this is a knock on lychee’s design. A unidirectional stream is the right call for the common, non-recursive case: it’s why lychee is fast and why the 30% channel regression from Attempt 2 was a dealbreaker. The other tools pay for their back-edge on every run, recursive or not. lychee refused to, and that principle is exactly why recursion took five years and why, when it lands, it won’t slow down the path everyone actually uses. I believe that we can have our cake and eat it too: a crawler architecture that supports recursion without sacrificing the speed of a one-shot pipeline. But it’s a harder problem than just “copy what they do,” because most link checkers didn’t start with uncompromising performance as their top goal. 
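The separation between "what to do next" and "how fast to go" can be sketched in Go. This is only an illustration under stated assumptions, not code from any of the tools discussed: the `frontier` type, `run`, and the in-memory link graph are hypothetical. The frontier is an unbounded queue whose `push` never blocks, so a worker that discovers links can always hand them back, even in a cycle. The fixed worker count is the only thing that limits speed, and a pending counter (in the spirit of a joinable queue) answers "is everything done?".

```go
package main

import (
	"fmt"
	"sync"
)

// frontier is a hypothetical unbounded work queue with enqueue-time dedup
// and a pending counter used for termination detection.
type frontier struct {
	mu      sync.Mutex
	cond    *sync.Cond
	queue   []string        // unbounded: push never blocks
	seen    map[string]bool // URLs claimed at enqueue time
	pending int             // items queued or currently being processed
}

func newFrontier() *frontier {
	f := &frontier{seen: map[string]bool{}}
	f.cond = sync.NewCond(&f.mu)
	return f
}

// push adds a URL unless it was already claimed. It never waits for space.
func (f *frontier) push(url string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.seen[url] {
		return
	}
	f.seen[url] = true
	f.queue = append(f.queue, url)
	f.pending++
	f.cond.Signal()
}

// pop blocks until work is available. It returns ok == false once the queue
// is empty and nothing is in flight, which means the crawl is finished.
func (f *frontier) pop() (url string, ok bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for len(f.queue) == 0 && f.pending > 0 {
		f.cond.Wait()
	}
	if len(f.queue) == 0 {
		return "", false
	}
	url = f.queue[0]
	f.queue = f.queue[1:]
	return url, true
}

// done marks one popped item as fully processed.
func (f *frontier) done() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.pending--
	if f.pending == 0 {
		f.cond.Broadcast() // wake idle workers so they can exit
	}
}

// run crawls from start with a fixed number of workers and returns how many
// distinct URLs were seen. fetch stands in for "check the page, return links".
func run(start string, workers int, fetch func(string) []string) int {
	f := newFrontier()
	f.push(start)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				url, ok := f.pop()
				if !ok {
					return
				}
				for _, link := range fetch(url) {
					f.push(link) // children are counted before done()
				}
				f.done()
			}
		}()
	}
	wg.Wait()
	return len(f.seen)
}

func main() {
	// A tiny link graph with cycles: a <-> b, and both link to c.
	graph := map[string][]string{"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a"}}
	fmt.Println(run("a", 3, func(u string) []string { return graph[u] })) // 3
}
```

Because children are pushed before the parent calls `done`, the pending counter cannot reach zero while more work is still about to arrive. The price is the one named above: the queue can grow without bound.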
Key takeaways So when someone asks “how do other link checkers do recursion?”, the real answer is: they made it a part of the architecture from the beginning, and they leaned on a runtime (providing conveniences like a , a joinable queue, an idle promise) that solved termination without solving “distributed termination detection.” Thanks to the maintainers of muffet, LinkChecker, linkinator, and broken-link-checker: reading your source is the clearest way to learn about crawler architecture out there and we’re all in this together, just with a different set of tradeoffs. A mutable work queue (let’s call it “frontier”), not a fixed input stream. Discovered URLs go back into the same queue they came from. A visited set that’s updated at enqueue time (before the request completes), so two pages discovering the same link can’t both submit it. A primitive that answers “is everything done?”: a , a joinable-queue counter, an promise, or a queue-drain event. Concurrency isn’t bounded by the daemon manager. does for every task, spawning unbounded goroutines. The actual limiting happens downstream in a (a buffered-channel counting semaphore) and a per-host throttler pool. muffet separates “the frontier” from “the rate limiter,” which is exactly the separation lychee lacked when it tried to use one bounded channel as both in the past. Cheap goroutines do a lot of heavy lifting. Spawning a goroutine per link is “fine” in Go. The equivalent in Rust ( per link, each needing state) is what pushed me toward and the ownership pain I wrote about . On extensibility, muffet is a focused CLI, not a library. There’s no plugin surface; you get what the flags give you. lychee deliberately ships as a reusable crate, which raises the bar, since every architectural choice has to uphold the standards of a public API. 
On scalability, unbounded goroutines plus an in-memory visited set scale comfortably to large sites, but there’s no disk-backed frontier, so a truly enormous crawl is bounded by RAM. Same as lychee. muffet’s termination is a , full stop. It’s the design lychee converged on after five years; muffet got it for free from Go’s standard library on day one. The frontier and the concurrency limiter are separate things. A mutex-guarded set is the frontier; a semaphore plus host throttler bounds concurrency. Conflating them is what deadlocked lychee. Goroutines hide the cost that Rust makes you pay explicitly. The same per-task model that’s trivial in Go is where Rust’s /ownership friction shows up. Blocking threads instead of async. Each of the (default 10–100) threads does blocking I/O via . Simple and battle-tested, but the concurrency ceiling is the thread count, and each thread carries a full stack. lychee’s Tokio model reaches thousands of concurrent in-flight requests on a handful of OS threads; LinkChecker can’t, and doesn’t try. The unbounded frontier trades a deadlock for unbounded memory. The explicit “no max size” decision means RAM growth on huge sites. There’s a cap and a periodic to mitigate it. Extensibility is excellent. LinkChecker has a real plugin system ( : anchor checks, SSL, virus scanning, and more) and many output loggers. This is the most extensible of the bunch, and it pays for that with a large, mature, somewhat old-fashioned codebase. On scalability, it’s GIL-bound and thread-limited, so raw throughput is the lowest here, but correctness and feature coverage are high. The unbounded frontier is a deliberate anti-deadlock choice, documented in a one-line comment. It describes the exact problem we hit in lychee in attempt 4. Dedup at time (a placeholder in the cache) is their synchronization mechanism. The cache must claim the URL before the request, not after. Threads buy simplicity at the cost of throughput. 
A blocking thread pool is the easiest correct model… and the slowest one. is the termination mechanism. Simple and provided by the JS runtime. A single-threaded event loop makes request deduplication pretty much free. This is the biggest structural reason recursion is easier in that case. Reactive 429 backoff is not the same as proactive per-host pacing. lychee’s aims higher, at the cost of more machinery. 
Termination is a cascade of queue-drain events, not a counter. Same idea, different syntax. Politeness is built in. robots.txt, , and make it the most server-friendly recursive checker by default. Event-driven control flow is the cost. Distributing recursion logic across many handlers is exactly the kind of spread-out complexity that makes the feature hard to reason about. There is no secret sauce. Every recursive checker is a worklist plus a visited set plus a quiescence detector. The “trick” is being shaped like a crawler from commit one. Termination is always the same idea wearing different clothes: (muffet), joinable-queue counter (LinkChecker), (linkinator), queue-drain events (BLC), (lychee 2026). All of them are distributed termination detection. Dedup belongs at enqueue, before the request. Marking a URL visited after checking it (what lychee did for four attempts) is the bug. Everyone else claims the URL the moment it enters the frontier. Separate the frontier from the rate limiter. A bounded channel that is both your queue and your backpressure will deadlock the instant you add a cycle. There is no free lunch. Node’s single thread makes dedup trivial at the cost of performance; Go’s goroutines and make termination trivial at the cost of a runtime; Rust gives you neither for free but hands you a compiler that refuses to let the races compile and you can get the network card to glow if you know exactly what you are doing.
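The "worklist plus visited set plus quiescence detector" shape can be sketched in a few lines of Go. This is a minimal illustration in the goroutine-per-link style, not muffet's or lychee's actual code: the `crawler` type, `crawl`, and the `fetch` callback are hypothetical, and an in-memory link graph stands in for real HTTP requests. The visited set is claimed at enqueue time, a `sync.WaitGroup` detects quiescence, and a buffered-channel semaphore bounds concurrency separately from the frontier.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawler bundles the three pieces every recursive checker needs.
type crawler struct {
	mu      sync.Mutex
	visited map[string]bool           // dedup, claimed at enqueue time
	wg      sync.WaitGroup            // quiescence detector
	sem     chan struct{}             // concurrency limit, separate from the frontier
	fetch   func(url string) []string // hypothetical: check a page, return its links
}

// enqueue claims the URL before any request is made, so two pages that
// discover the same link cannot both submit it.
func (c *crawler) enqueue(url string) {
	c.mu.Lock()
	if c.visited[url] {
		c.mu.Unlock()
		return
	}
	c.visited[url] = true
	c.mu.Unlock()

	c.wg.Add(1) // counted before the goroutine starts
	go func() {
		defer c.wg.Done()
		c.sem <- struct{}{} // acquire a slot
		links := c.fetch(url)
		<-c.sem // release the slot before handing work back
		for _, link := range links {
			c.enqueue(link) // the back-edge: discovered URLs re-enter the frontier
		}
	}()
}

// crawl returns every URL reachable from start, sorted for stable output.
func crawl(start string, limit int, fetch func(string) []string) []string {
	c := &crawler{
		visited: map[string]bool{},
		sem:     make(chan struct{}, limit),
		fetch:   fetch,
	}
	c.enqueue(start)
	c.wg.Wait() // returns once no task is queued or in flight

	urls := make([]string, 0, len(c.visited))
	for u := range c.visited {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls
}

func main() {
	// A tiny link graph with cycles.
	graph := map[string][]string{"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a"}}
	fmt.Println(crawl("a", 2, func(u string) []string { return graph[u] })) // [a b c]
}
```

A child is always enqueued while its parent's goroutine is still counted, so the `WaitGroup` cannot drop to zero before the last link has been processed. Releasing the semaphore before enqueueing keeps the limiter from ever blocking on the frontier.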

Unsung 6 days ago

“Nemo? That’s a nice name.”

Do you know those “Are you still here” screens? In some cases – like banking – they are ostensibly there to improve security. In public transit ticket or similar machines, on the other hand, they exist just so the machine can easily reset itself ahead of a future customer. Resetting to default state happens on your phone, too. I’ve been thinking about it this past week and found a few examples. The Google Search app comes back how you left it, except if you abandon it for longer than 45 minutes. If that’s the case, it returns to a pristine, deterministic homepage. (You can always come back to the previous session, though.) When you pause a podcast or music, at least in my setup, it will be on the home screen for 5 subsequent minutes – you can then resume it by simply tapping on your AirPods. But leave it dormant for longer than that, and the home screen forgets about it and resuming is impossible. My favourite: if you swipe through the apps back and forth on the iPhone in a touch UI equivalent of command-tabbing, there needs to be a specific moment where the app you switch to becomes the “current” app. On desktop, it’s easy – you can reset the state at the next invocation of ⌘⇥. But there is no such obvious moment on mobile. When there is no obvious moment, timeout can be a great answer. And so it is here, and it seems to be set at about 5–6 seconds. I understand the Google Search and the app switching examples. But I am not sure I know why a podcast resets so soon. A challenge when talking about this without seeing the code – as it is with many things on Unsung – is that I don’t know if this is carelessness, a technical limitation, a design consideration I’m unaware of, or just something that’s intentional, but I happen to disagree with. 
Even testing this has been hard if the delays are longer than a few seconds. The extra challenge for Google Search, as it is for many other apps, is that they also reset when iOS itself purges them to make room for other apps. This isn’t great, and can be avoided if you care; we talked before about Bear and how it remembers its precise state even after the system evicts it from its memory. Whether you want your app to remember you forever, or whether you want some deliberate forgetfulness, it could be important to take ownership of that. My go-to example of a miss in this category is Google Maps, which completely throws away its current trip-in-progress status whenever iOS kicks it to the metaphorical curb – and returning to that status later is a really complicated sequence of steps including rewinding back the time, on top of travel already being stressful. By the way, I can think of fewer examples on the desktop, but that makes sense given desktop apps are less tactical, and given that ⌘Q exists. 
But Spotlight does freshen itself up after about 7 or 8 minutes, and Raycast actually offers an option to set its short-term memory to anywhere between 0 seconds and three minutes, with the default being 90 seconds. #details #google #interface design #raycast


How my minimal, memory-safe Go rsync steers clear of vulnerabilities

Back in January 2025, multiple different security researchers published a total of 6 security vulnerabilities in rsync , some of which allow arbitrary code execution and file leaks, so naturally I was wondering whether/how my gokrazy/rsync implementation was affected. Did implementing my own (compatible, but minimal) rsync in Go, a modern and memory-safe programming language, really rule out entire classes of security vulnerabilities? This deep dive article was in the making since January 2025, but was delayed because we uncovered more unpublished vulnerabilities in the process! The “Security Vulnerabilities” section now covers all 12 vulnerabilities from the January 2025 batch and the May 2026 batch. If you are running (upstream, samba) rsync in production, upgrade to version 3.4.3 or newer. If you are running gokrazy/rsync in production, upgrade to version v0.3.3 or newer. Feel free to skip over the nitty-gritty security issue details and jump directly to: For context, I blogged about rsync, how I use it, and how it works back in June 2022. See also all posts tagged “rsync” . The original motivation for writing my own rsync (back then only a server, today all directions are supported) was to provide the software packages of distri, my Linux distribution research project for fast package management , which I wanted to host on router7 , my small home Linux+Go internet router, which in turn is built on gokrazy , my Go appliance platform. I am still running multiple gokrazy/rsync servers for this original purpose, and also many others! Having rsync available as a primitive (that you can link into your Go programs!) is really nice. This article covers the following security vulnerabilities: The first batch of the vulnerabilities above was announced on the oss-security mailing list , but note that the original report has more detail compared to the oss-security summaries! The later vulnerabilities were announced via GitHub Security Advisories on the rsync project . 
When the checksums are read by the daemon, two different checksums are read: Most importantly, note that field is filled with bytes. always has a size of 16: rsync.h is an attacker-controlled value and can have a value up to bytes, as the next snippet shows: The problem here is that can be larger than 16 bytes, depending on the digest support the binary was compiled with: md-defines.h support is common and sets the value to 64. As a result, an attacker can write up to 48 bytes past the buffer limit. Upstream fix: The upstream fix for CVE-2024-12084 changes the field to a dynamically-allocated field, which is allocated with length, and fixes the bounds check to check against the (checksum length for this transfer’s algorithm). Can Go help prevent this? Yes: Missing or incorrect bounds checks will not result in a heap buffer overflow in Go! Instead, attempting to write out of bounds will result in a panic because the Go runtime performs bounds checks. How does gokrazy/rsync fare? gokrazy/rsync also had insufficient validation! Our issue was different, though: It wasn’t size confusion, we just were not doing any validation of the sum header at all. Oops! We can confirm that the Go runtime’s bounds check triggers on an attempt to write out of bounds by changing the code like so and running the tests: As expected, the Go runtime panics with the following message: Of course, crashing the entire server is not the best failure mode, so I added the missing bounds checking to turn the panic into an error. Because of the same lack of validation as in the previous CVE-2024-12084 vulnerability, an attacker could select a checksum algorithm with short checksums (e.g. with 8 byte checksums), but then claim they were sending longer checksums (e.g. 9 bytes), making the victim leak one byte of uninitialized stack content in the response. 
Leaking one byte of stack content may seem benign, but as the Google Security report puts it: The first pair of vulnerabilities are a Heap Buffer Overflow and an Info Leak. When combined, they allow a client to execute arbitrary code on the machine a Rsync server is running on. The client only requires anonymous read-access to the server. The daemon matches checksums of chunks the client sent to the server against the local file contents in . Part of the function prologue is to allocate a buffer on the stack of bytes: The daemon then iterates over the checksums the client sent and generates a digest for each of the chunks and compares them to the remote digest: Notably, the number of bytes that are compared again are bytes. In this case, the comparison does not go out of bounds since can be a maximum of . However, the local buffer, not to be confused with the attacker-controlled , is a buffer on the stack that is not cleared and thus contains uninitialized stack contents. A malicious client can send a (known) checksum for a given chunk of a file, which leads to the daemon writing 8 bytes to the stack buffer . The attacker can then set to 9 bytes. The result of such a setup would be that the first 8 bytes match and an attacker-controlled 9th byte is compared with an unknown value of uninitialized stack data. An attacker can divide a file into 255 chunks and as a result leak one byte per file download. An attacker can incrementally repeat the process, either in the same connection or by resetting the connection. As a result, they can leak bytes of uninitialized stack data, which can contain pointers to Heap objects, Stack cookies, local variables and pointers to global variables and return pointers. With those pointers they can defeat ASLR. Upstream fix: There are two relevant upstream fixes: Can Go help prevent this? Yes: By design, Go initializes all variables to the zero value. Go programmers do not need to remember to explicitly initialize variables. 
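To make the zero-value guarantee concrete, here is a minimal sketch that mirrors the shape of the scenario above: an 8-byte digest written into a 16-byte buffer. The names `digest8` and `localSum` are hypothetical and the digest is a toy placeholder, not an rsync checksum and not gokrazy/rsync code. The point is only that the bytes the digest never touches are guaranteed to be zero, so a 9th-byte comparison has nothing from a previous stack frame to reveal.

```go
package main

import (
	"bytes"
	"fmt"
)

// digest8 stands in for a checksum algorithm that produces only 8 bytes.
// It is a toy placeholder, not a real rsync digest.
func digest8(chunk []byte) [8]byte {
	var d [8]byte
	for i, b := range chunk {
		d[i%8] ^= b
	}
	return d
}

// localSum writes an 8-byte digest into a 16-byte buffer.
func localSum(chunk []byte) [16]byte {
	var sum [16]byte // zero value: all 16 bytes start as 0
	d := digest8(chunk)
	copy(sum[:], d[:]) // only the first 8 bytes are overwritten
	return sum
}

func main() {
	sum := localSum([]byte("some file chunk"))
	// The tail was never written, yet it is deterministic: all zeros.
	fmt.Println(bytes.Equal(sum[8:], make([]byte, 8))) // true
}
```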
How does gokrazy/rsync fare? gokrazy/rsync is not affected by this vulnerability: Variables are always initialized in Go. Additionally, selecting checksums other than MD4 was only introduced in protocol version 30 (gokrazy/rsync implements protocol version 27). Description: (quoting the Google Security report ) When the syncing of symbolic links is enabled, either through the or ( ) flags, a malicious server can make the client write arbitrary files outside of the destination directory. A malicious server can send the client a file list such as: Symbolic links, by default, can be absolute or contain characters such as . In practice, the client validates the file list and when it sees the entry, it will look for a directory called , otherwise it will error out. If the server sends as [both, a directory and a symbolic link], [the client] will only keep the directory entry, thus the attack requires some more details to work. In mode, which the server can enable for the client, the server sends the client multiple file lists. The deduplication of the entries happens on a per-file-list basis. As a result, a malicious server can send a client multiple file lists, where: As a result, the directory is created first and is considered a valid entry in the file list. Then, the attacker changes the type of to a symbolic link. When the server then instructs the client to create the file, it will follow the symbolic link and thus files can be created outside of the destination directory. Can Go help prevent this? No. This vulnerability is caused by a logic error: when multiple file lists are used, the merged file list needs to be re-verified. But see Defense in depth: Go’s Upstream fix: The upstream fix for CVE-2024-12087 adds the missing validation. How does gokrazy/rsync fare? gokrazy/rsync is not affected by this vulnerability: gokrazy/rsync does not implement the incremental recursion mode ( ). The trade-off here is implementation complexity vs. 
resource usage: the incremental recursion mode allows working with the file set in a “windowed” way, as opposed to having to scan the entire file set before any transfer can begin. See also my How does rsync work? blog post. Description: (quoting the Google Security report ) The CLI flag makes the client validate any symbolic links it receives from the server. The desired behavior is that symbolic links target can only be 1) relative to the destination directory and 2) never point outside of the destination directory. The function is responsible for validating these symbolic links. The function calculates the traversal depth of a symbolic link target, relative to its position within the destination directory. As an example, the following symbolic link is considered unsafe: As it points outside the destination directory. On the other hand, the following symbolic link is considered safe as it still points within the destination directory: This function can be bypassed as it does not consider if the destination of a symbolic link contains other symbolic links in the path. For example, take the following two symbolic links: In this case, foo would actually point outside the destination directory. However, the function assumes that is a directory and that the symbolic link is safe. Upstream fix: The upstream fix for CVE-2024-12088 makes stricter by not allowing anywhere within the path, except at the very beginning. Can Go help prevent this? No. This vulnerability is caused by a logic error: the validation function was incorrect. We could have implemented that same bug. But see Defense in depth: Go’s How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable: The feature is not yet implemented in gokrazy/rsync. The rsync receiver (in client mode) did not sanitize file names provided by the rsync sender, or otherwise prevent opening files outside the destination tree. 
A malicious sender could instruct a receiver to compare checksums of arbitrary files outside the destination tree. By observing the receiver’s reaction to a provided one-byte checksum, a malicious sender can leak arbitrary files. When a client connects to a malicious server the server is able to leak the contents of an arbitrary file on the client’s machine. In the client will read type as well as the from the server if the server sets the appropriate flags. The flag will not be set for the client. The caller ( ) then uses the server provided values to determine a file to compare the incoming data with. In the contents of the file specified by are copied into the destination file. This can be achieved by the server sending a negative token. The server sends a checksum to compare. If they don’t match, a 0 is returned. When the return value is 0 the receiver will then send a to the generator. The generator will then write a message to the server. The server can use this as a signal to determine if the checksum they sent was correct. By starting off with a of 1 a malicious server is able to determine the contents of the target file byte by byte. Upstream fix: The upstream fix for CVE-2024-12086 prevents opening files outside the destination tree by verifying the sender-provided path. Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable: the fuzzy matching feature was introduced with rsync protocol version 29, but gokrazy/rsync implements protocol version 27. Description: (quoting the Red Hat Security Advisory ) A flaw was found in rsync. This vulnerability arises from a race condition during rsync’s handling of symbolic links. Rsync’s default behavior when encountering symbolic links is to skip them. If an attacker replaced a regular file with a symbolic link at the right time, it was possible to bypass the default behavior and traverse symbolic links. 
Depending on the privileges of the rsync process, an attacker could leak sensitive information, potentially leading to privilege escalation. Upstream fix: The upstream fix for CVE-2024-12747 changes calls in the rsync sender to use the option. The paths are not expected to be symlinks at that point in the algorithm (symlinks would be handled with ). Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync was vulnerable before commit , which introduces the same mitigation that upstream rsync uses. To reproduce the issue, use the following steps: Check out gokrazy/rsync v0.2.7: Patch the code as follows to undo the fix and execute the attack: Running the test now shows that the server traversed the symlink: A surprising discovery When I shared a draft of this article with Damien Neil, member of the Go Security Team and the author of the traversal-resistant API , he pointed out: I believe the gokrazy fix for CVE-2024-12747 is insufficient. You’re calling with , but only prevents symlink traversal in the last path component. This is probably still vulnerable to replacing an earlier path component so can be redirected by symlinking to . We reported this to the rsync security contact address in April 2025. In December 2025 I learned that someone else had also independently discovered and reported this issue. Ultimately, this resulted in CVE-2026-29518, published on 2026-05-20. Description: (quoting the rsync 3.4.3 NEWS entry ) TOCTOU symlink race condition allowing local privilege escalation in daemon mode without chroot. An rsync daemon configured with is exposed to a time-of-check / time-of-use race on parent path components. A local attacker with write access to a module can replace a parent directory component with a symlink between the receiver’s check and its open(), redirecting reads (basis-file disclosure) and writes (file overwrite) outside the module. 
Under elevated daemon privilege this allows privilege escalation. Default is not exposed. Reach: local attacker on the daemon host, write access to a module path, daemon configured with . Upstream fix: The upstream fix for CVE-2026-29518 uses , which is similar to Go’s API. Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync was vulnerable until I switched the sender and the receiver to the traversal-resistant API . Description: (quoting the GitHub Security Advisory ) Description: The receiver’s compressed-token decoder accumulated a 32-bit signed counter without overflow checking. A malicious sender can trigger an overflow that, with careful manipulation, leaks process memory contents to the attacker – environment variables, passwords, heap and library pointers – significantly weakening ASLR and facilitating further exploitation. Reach: authenticated daemon connection with compression enabled (the default for protocols >= 30 when both peers advertise it). Disabling compression on the daemon (“refuse options = compress” in rsyncd.conf) is the available workaround. Upstream fix: The upstream fix for CVE-2026-43618 introduces the missing checks. How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable because it does not implement compression. See gokrazy/rsync issue #35 for details on why compression support sounds simple, but is non-trivial. Description: (quoting the GitHub Security Advisory ) The 2025 fix that added a guard in was not applied to the visually-identical block in . A malicious rsync server can drive any connecting client into a deterministic by setting in the compatibility flags, sending a flist whose first sorted entry is not a leading “.” directory (which causes to set ), then sending a transfer record with and a non- iflag word. The receiver reads and dereferences the result. 
On glibc x86-64 the dereferenced pointer is mmap chunk metadata that lands at an unmapped address, hence a clean ; non-glibc allocators have not been audited. Reach: any rsync client doing a normal pull from an attacker-controlled URL. Works for both rsync:// URLs and remote-shell pulls. is the protocol-30+ default; no special options are required on the victim. Workaround: on the client. Upstream fix: The upstream fix for CVE-2026-43620 adds the guard to as well. How does gokrazy/rsync fare? Just like for CVE-2024-12087 , gokrazy/rsync is not affected by this vulnerability: gokrazy/rsync does not implement the incremental recursion mode ( ). Description: (quoting the GitHub Security Advisory ) Description: Earlier fixes for symlink races on the receiver’s open() call (CVE-2026-29518) missed the same race class on every other path-based system call: chmod, lchown, utimes, rename, unlink, mkdir, symlink, mknod, link, rmdir, lstat. On rsync daemons with “use chroot = no” a local attacker with filesystem access on the daemon host can swap a symlink into a parent directory component between the receiver’s check and one of these syscalls, redirecting it outside the exported module. The fix routes each affected path-based syscall through a parent dirfd opened under RESOLVE_BENEATH-equivalent kernel-enforced confinement (openat2 on Linux 5.6+, O_RESOLVE_BENEATH on FreeBSD 13+ and macOS 15+, per-component O_NOFOLLOW walk elsewhere). Default “use chroot = yes” is not exposed. Reach: local attacker on the daemon host, write access to a module path, daemon configured with use chroot = no. Upstream fix: The upstream fix for CVE-2026-43619 uses the family of syscalls, just like Go’s . Can Go help prevent this? Yes, Go offers an API to prevent this, see Defense in depth: Go’s . How does gokrazy/rsync fare? gokrazy/rsync is not affected, because it uses Go’s API throughout. 
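As an illustration of what a traversal-resistant open looks like, here is a small sketch. It assumes that the API referred to throughout is Go's `os.Root` (available since Go 1.24), and it is not taken from gokrazy/rsync: `openInModule`, `demo`, and the directory names are hypothetical. A parent path component inside the module is a symlink pointing outside, which is the layout the race above tries to create, and the open through the root is refused.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openInModule opens name relative to moduleDir through os.Root, which
// refuses paths that leave the root, including via symlinks in any component.
func openInModule(moduleDir, name string) (*os.File, error) {
	root, err := os.OpenRoot(moduleDir)
	if err != nil {
		return nil, err
	}
	defer root.Close() // files that were already opened stay usable
	return root.Open(name)
}

// demo builds a throwaway layout and reports the error for a file inside the
// module and for a file reached through an escaping symlink.
func demo() (insideErr, escapeErr error) {
	tmp, err := os.MkdirTemp("", "rootdemo")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(tmp)

	module := filepath.Join(tmp, "module")
	outside := filepath.Join(tmp, "outside")
	for _, dir := range []string{module, outside} {
		if err := os.Mkdir(dir, 0o755); err != nil {
			return err, err
		}
	}
	if err := os.WriteFile(filepath.Join(module, "ok.txt"), []byte("ok"), 0o644); err != nil {
		return err, err
	}
	if err := os.WriteFile(filepath.Join(outside, "secret.txt"), []byte("secret"), 0o644); err != nil {
		return err, err
	}
	// A parent path component that is a symlink pointing out of the module.
	if err := os.Symlink(outside, filepath.Join(module, "escape")); err != nil {
		return err, err
	}

	if f, err := openInModule(module, "ok.txt"); err != nil {
		insideErr = err
	} else {
		f.Close()
	}
	if f, err := openInModule(module, "escape/secret.txt"); err != nil {
		escapeErr = err
	} else {
		f.Close()
	}
	return insideErr, escapeErr
}

func main() {
	insideErr, escapeErr := demo()
	fmt.Println("inside module:", insideErr)
	fmt.Println("through symlink:", escapeErr)
}
```

The check happens while the path is resolved rather than in a separate validation step beforehand, so there is no window between check and use for an attacker to swap a directory for a symlink.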
Description: (quoting the GitHub Security Advisory ) On an rsync daemon configured with the global rsyncd.conf setting, the reverse-DNS lookup of the connecting client was performed after the daemon had chrooted into . If did not contain the files glibc needs for resolution ( , , , NSS service modules), the lookup failed and the connecting hostname was set to “UNKNOWN”. Hostname-based deny rules (“hosts deny = *.evil.example”) therefore could not match, and an attacker controlling their PTR record could connect from a hostname the administrator had intended to deny. IP-based ACLs are unaffected. The per-module setting is unrelated to this issue. Reach: rsync daemon configured with AND hostname-based ACLs AND does not include the libc resolver fixtures. Upstream fix: The upstream fix for CVE-2026-43617 moves the DNS lookup to an earlier point in the protocol. How does gokrazy/rsync fare? gokrazy/rsync is not vulnerable because we only implement IP-based allow/deny lists, not hostname-based allow/deny lists. Description: (quoting the GitHub Security Advisory ) The rsync client’s HTTP proxy support contains an off-by-one out-of-bounds stack write in ( ). After issuing the request, rsync reads the proxy’s first response line one byte at a time into a 1024-byte stack buffer with the bound , so the loop only ever writes . If the proxy (or a man-in-the-middle in front of it) returns 1023+ bytes on the first response line without a terminator, the loop exits with — a slot the loop never wrote, so holds stale stack bytes left there by the earlier that formatted the outgoing request. The post-loop code then does: The lands one byte past the end of the on-stack , corrupting whatever lives in the adjacent stack slot. AddressSanitizer reports at in the frame. Upstream fix: The upstream fix for CVE-2026-45232 validates the attacker-supplied data. How does gokrazy/rsync fare? gokrazy/rsync does not implement such proxy support, so it is not vulnerable. 
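To make the contrast with the off-by-one stack write above concrete, here is a minimal sketch of how the same class of mistake behaves in Go. The `terminate` helper is hypothetical (it is not code from rsync or gokrazy/rsync); it only shows that an index one past the end of a buffer hits the runtime's bounds check instead of overwriting a neighbouring slot.

```go
package main

import "fmt"

// terminate is a hypothetical helper, not code from rsync or gokrazy/rsync.
// It writes a NUL terminator at index n and converts the runtime's
// bounds-check panic into an ordinary error.
func terminate(buf []byte, n int) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("write at index %d rejected: %v", n, r)
		}
	}()
	buf[n] = 0 // panics if n is outside [0, len(buf))
	return nil
}

func main() {
	buf := make([]byte, 1024)
	fmt.Println(terminate(buf, 1023)) // last valid slot: no error
	fmt.Println(terminate(buf, 1024)) // one past the end: an error, no memory corruption
}
```

Without the `recover`, the second call would crash the process, which is still a denial-of-service risk, but the adjacent memory is never touched.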
Let’s summarize how Go fares: Aside from being written in Go, another key difference between gokrazy/rsync and the official upstream rsync is that the gokrazy implementation is minimal : Let’s have a look at whether gokrazy/rsync was affected by each CVE at the time of publishing: To be clear: all known vulnerabilities are fixed in gokrazy/rsync! The table above documents what the state was at the time when each CVE was published. In other words: When the January 2025 vulnerabilities were published, gokrazy/rsync panicked (CVE-2024-12084) and was vulnerable to a TOCTOU race (CVE-2024-12747). In the process of fixing the TOCTOU issue, we discovered CVE-2026-29518, which was fixed in gokrazy/rsync before the CVE was published. CVE-2026-43619 was discovered even later, but was also already fixed in gokrazy/rsync with the same fix: using Go’s everywhere. As I was reading the vulnerability reports, I noticed that the reports were slightly misleading by their choice of words: most reports just spoke of “server” and “client”. However, in an rsync transfer, both sides, the rsync client and the rsync server can assume either role: sender (upload files) or receiver (download files)! Some setups come with further restrictions that make certain attacks harder or impossible to pull off. For example, when running in daemon mode, file system access can be restricted to the pre-configured module paths (but not in command mode!). Here is a diagram to give you an overview of the 4 different setups and role/protocol layering: In the context of our vulnerability reports, I would say that the Arbitrary File Leak vulnerability (CVE-2024-12086)’s original title “Server leaks arbitrary client files” can easily be misunderstood. Instead, I would say: The rsync receiver will leak arbitrary files to a malicious sender . I have verified that a malicious client sender can make an unpatched remote rsync open files outside the destination tree (e.g. 
the system password database) when running in command mode, for example over SSH. (But, when running in daemon mode, the server enables additional path sanitization, which prevents this attack.) Similarly, the Symlink Path Traversal vulnerability (CVE-2024-12087) speaks about a “malicious server”, but again, it should be “malicious sender”, which can be either the client or the server. The OpenBSD project is known for its security focus, so how does openrsync compare? openrsync is not affected by the Heap Buffer Overflow (CVE-2024-12084) and Stack Info Leak (CVE-2024-12085) vulnerabilities because it validates the checksum length and only supports one checksum size/algorithm (MD4). openrsync is not affected by CVE-2024-12086, CVE-2024-12087 and CVE-2024-12088 because it does not implement the relevant features (like gokrazy/rsync). Even if it was vulnerable, openrsync’s defense-in-depth measures like using OpenBSD’s and to restrict file system access would have prevented successful exploitation — at least when running on OpenBSD. openrsync is not affected by CVE-2024-12747 because it used from the very moment they implemented symlink support . But, because is not a sufficient fix for this issue, openrsync is affected by CVE-2026-29518! The above covers the January 2025 batch of vulnerabilities; the May 2026 batch is similar in that most features just are not implemented. Overall, I say: Well done, Kristaps and contributors! By diligently implementing validation, restricting the attack surface and employing defense-in-depth measures, openrsync manages to not be affected by almost all of the reported vulnerabilities. Which APIs and environments can we use on Linux for defense-in-depth measures? I’ll go through the ones supports, ordered by traditional to modern. Within a few weeks after starting the project, I added support for dropping privileges and using mount/pid namespaces on Linux to restrict the file system objects that my rsync server could work with. 
This approach works very well to mitigate path traversal attacks, but requires privileges, meaning we need to run as or in a Linux user namespace (if enabled on your distribution / system). That limitation makes mount namespaces well-suited for server setups, but usually unavailable for interactive one-off transfers that are typically running under a human’s user account. In the same commit that introduced Linux mount/pid namespace support, I also included a systemd service file that restricted file system access to home directories and encouraged folks in the README to further restrict file system access, depending on what their use-case allows. These file system restrictions, if set up correctly, mitigate the File Leak (CVE-2024-12086) and Path Traversal (CVE-2024-12087) vulnerabilities. The Symlink Race Condition (CVE-2024-12747) relies on privilege escalation through the rsync process, but thanks to the DynamicUser feature, our process has fewer privileges than other users. Similarly to mount namespaces, these measures are great for server setups, but too cumbersome to set up for interactive one-off usages. I stumbled upon Justine’s blog post Porting OpenBSD pledge() to Linux (2022) and was reminded that Linux offers the Landlock API for unprivileged, per-process access control, similar to OpenBSD’s system call, which openrsync uses. The basic idea is that once your program knows the directory it works with, it makes a call like and no longer has access to other file system locations. I had previously heard of Landlock at a Go Meetup, so I knew there was Go support for Landlock. Back in 2022, I enabled Landlock support in the gokrazy kernel images. So I gave it a shot in March 2025 and implemented Landlock support to restrict file system access . It took me a few hours, which seems a little longer than one might expect at first. 
Making Landlock work (and/or skipping it) in our test environment ran into a couple of road blocks: Our tests had defined many functions that get run in the same process, but when repeatedly adding rulesets, we would exceed the limit of 16 (!) policy layers per process. Once I had it set up just right, it is a beautiful solution. Now we can restrict rsync transfers to their sources (read-only) or destination directories (read-write), even for unprivileged invocations of ! 🎉 The downside to Landlock is that Landlock operates at the process level. This means that Landlock policies must include the files that your program needs, e.g. needs to be able to read for user id lookup, so if the attacker is after the file, Landlock does not help. In February 2025, the Go 1.24 release introduced the API, which is resistant against path traversal, see The Go Blog: Traversal-resistant file APIs (by Damien Neil, March 2025). This API allows more fine-grained control (per file system operation) compared to Landlock. Go 1.25 (released in August 2025) added more methods to , making it a convenient choice for most file system usage. I have converted all of ’s file system usage to use , which is a great fit: users configure input/output directories, but the filenames received over the network are untrusted. That’s exactly what was designed for! When I first looked into using , I thought that some system calls could inherently not be made with this API, like for example to create device node files. Damien explained: It won’t support mknod, though. However, you should be able to use it to enable a safe mknod: If you’re curious how that looks in practice, check out ’s usage in , line 15-29 . Another stumbling block was when I realized that unlike with , Linux only implements , but no (as of Linux 7.0)! Luckily, Lennart Poettering pointed out that there’s a trick to skip path resolution without : you can probably bind to in the meantime… And indeed, this works! 
Path resolution is skipped because we only specify a basename (last component of a path) after the known-safe , not a path (see line 49-56 ). With these two tips, v0.3.1 and newer are fully using , meaning all file system access is traversal-safe! 🥳 Lacking validation causes vulnerabilities It is interesting to note that aside from the TOCTOU vulnerabilities (CVE-2024-12747, CVE-2026-29518 and CVE-2026-43619), all other vulnerabilities were caused by missing or incorrect input validation. In three cases, there was just no validation to begin with. In another case (CVE-2024-12088), the subject matter of file system path resolution is tricky enough that the existing validation did not cover all edge cases. As the Go verdict section explains in more detail, the most valuable structural fixes are to provide bounds checking (= always-on validation) and safe-by-default APIs like Go’s . Too much complexity A few of the vulnerabilities came from evolution of the rsync protocol: The code used to correctly perform sufficient validation, but then new features were added. For example, when checksum algorithm negotiation was added (protocol version 30), the validation was not correctly updated. When incremental recursion was added (also protocol version 30), the validation that made sense for individual file lists was not updated for the new processing approach of merging incremental file lists. Avoiding complexity avoids vulnerabilities! Both gokrazy/rsync and also openrsync were not vulnerable to 8 out of the 12 security vulnerabilities simply because they do not implement the feature with the vulnerability. Of course, these features were added to rsync because they were valuable to someone at some point, and of course I am not saying that we should just… not develop software any further, ever. But, I consider it ideal to use an implementation whose complexity is appropriate for and proportional to the complexity of the use-case . 
In other words: for simple use-cases, reach for a simple implementation. Only reach for the fully-featured implementation where needed.

The verdict on whether using Go has helped.
The verdict on whether a minimal re-implementation like gokrazy/rsync helps.
My comparison with OpenBSD’s (written in C).
Defense in depth mechanisms one can use on Linux.
The conclusion.

CVE-2024-12084 to 12088 (original report)
CVE-2024-12747 (discovered separately by Aleksei Gorban “loqpa”)
CVE-2026-29518 (discovered by Damien Neil and myself! and independently by Nullx3D)
CVE-2026-43617 to 43620
CVE-2026-45232

rsync performed insufficient validation: It read the (attacker-controlled) checksum length from the network and compared the length against . However, rsync’s data structures always declared a 16 byte buffer: is always 16 (bytes), which is sufficient to hold an MD4 or MD5 checksum. used to be 16 (bytes), but can be larger when rsync is compiled with SHA256 or SHA512 checksum support. Hence, the bounds check was ineffective! An attacker could write out of bounds.

This issue was introduced with commit in September 2022, which added SHA256/SHA512 checksum support.

A 32-bit Adler-CRC32 Checksum
A digest of the file chunk. The digest algorithm is determined at the beginning of the protocol negotiation.

The corresponding code can be seen below: sender.c :

The “Some checksum buffer fixes” commit prevents this attack because the attacker-controlled can no longer be larger than the transfer’s checksum length.
The “prevent information leak off the stack” commit initializes the memory to zero, thereby making any stack leak through impossible.

Check out gokrazy/rsync v0.2.7:
Patch the code as follows to undo the fix and execute the attack:

The Go runtime’s bounds checks turn more serious security issues into a panic. A panic is still a denial-of-service risk, but that’s much preferable.
Go initializes memory to zero, making info leaks like CVE-2024-12085 impossible.
Go’s API prevents most of the remaining vulnerabilities.
Only one out of twelve vulnerabilities (CVE-2026-43617) is a proper bug in the application logic that using Go could not have prevented.

gokrazy/rsync is unaffected by many vulnerabilities because it does not implement the feature in question, for example .
Like all other wire protocol-compatible rsync implementations, gokrazy/rsync targets protocol version 27, because later protocol versions introduce significant complexity.
In some cases, features that would be good to implement come with significant blockers, e.g. compression is tricky, see gokrazy/rsync issue #35 for details.

os.Root.OpenFile the parent directory of the target,
File.Fd to get the file descriptor for that directory,
https://pkg.go.dev/golang.org/x/sys/unix#Mknodat to create the file.
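The traversal resistance this post relies on can be sketched in a few lines. The following is a minimal illustration, assuming Go 1.24 or newer and a Unix-like system; `readInside`, `setup` and the scratch file names are hypothetical and not taken from gokrazy/rsync. It opens a directory with `os.OpenRoot` and then treats every file name as untrusted input.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// readInside (hypothetical helper) opens name relative to dir through os.Root,
// so ".." components and symlinks cannot resolve to a location outside dir.
func readInside(dir, name string) (string, error) {
	root, err := os.OpenRoot(dir)
	if err != nil {
		return "", err
	}
	defer root.Close()
	f, err := root.Open(name)
	if err != nil {
		return "", err
	}
	defer f.Close()
	b, err := io.ReadAll(f)
	return string(b), err
}

// setup creates <tmp>/module/ok.txt (allowed) and <tmp>/secret.txt (off limits).
func setup() (string, func(), error) {
	tmp, err := os.MkdirTemp("", "root-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(tmp) }
	module := filepath.Join(tmp, "module")
	steps := []func() error{
		func() error { return os.Mkdir(module, 0o755) },
		func() error { return os.WriteFile(filepath.Join(module, "ok.txt"), []byte("inside"), 0o644) },
		func() error { return os.WriteFile(filepath.Join(tmp, "secret.txt"), []byte("outside"), 0o644) },
	}
	for _, step := range steps {
		if err := step(); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return module, cleanup, nil
}

func main() {
	module, cleanup, err := setup()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	// A symlink inside the module that points outside of it (best effort).
	_ = os.Symlink("../secret.txt", filepath.Join(module, "link"))
	for _, name := range []string{"ok.txt", "../secret.txt", "link"} {
		content, err := readInside(module, name)
		fmt.Printf("%q: content=%q err=%v\n", name, content, err)
	}
}
```

Only `ok.txt` should be readable; the `..` path and the escaping symlink are expected to fail with an error rather than returning the contents of `secret.txt`.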

(think) 2 weeks ago

nREPL Forever

Last week I announced Port , a small prepl client for Emacs. That post focused on Port itself, but writing it left me with the itch to do a follow-up on the bigger picture, because the socket REPL / prepl story is one I’ve been meaning to write up for years. If you’ve been around Clojure long enough, you remember the chatter. Socket REPL landed in Clojure 1.8 (January 2016), prepl in Clojure 1.10 (December 2018), and for a couple of years there was a steady stream of posts, tweets, and Slack threads to the effect of “this is what we should be building tools on. nREPL is on the way out.” Some serious people put their weight behind that idea, and some of them went and built tools to prove it. Now it’s 2026 and we can take stock. The pitch was good. Socket REPL is just the Clojure REPL exposed on a TCP port. prepl wraps it with a structured printer so the bytes coming back are EDN-tagged maps ( , , , ) instead of a human-readable prompt. Both ship with Clojure itself. No external server library, no middleware, no third-party namespaces. You start a JVM, you bind a port, you’re done. The intellectual case for moving off nREPL had been made by Rich Hickey himself, most clearly in a March 2015 clojure-dev post that’s worth reading in full. Rich didn’t actually attack nREPL by name in that message. What he did was argue carefully for what a REPL is : a thing that reads characters, evaluates forms, prints results, and loops, with those streams available to user code so that things like nested REPLs and debuggers compose naturally. The money line: While framing and RPC orientation might make things easier for someone who just wants to implement an eval window, it makes the resulting service strictly less powerful than a REPL. His proposal, in the same post, was that tools should open multiple connections to the running program: one for the human-facing stream, and dedicated channels for IDE operations. 
The socket REPL (which landed in 1.8 the following January) and prepl (which arrived in 1.10) were the official implementation of that worldview. A handful of editor projects took the cue and built clients: It was real momentum. If you were following Clojure tooling in 2018-2020, it genuinely felt like nREPL might be the past, and the future would be some combination of socket REPL plus a thin self-installing protocol on top of it. You can find a fair number of “RIP nREPL” hot takes from that period if you go looking. I went and surveyed each of those projects recently while working on Port. The pattern is depressingly consistent: Tutkain started on prepl. In November 2021, its v0.11 release explicitly stopped using prepl message framing and switched to a hand-rolled EDN-RPC protocol that Tutkain boots onto the raw socket REPL by sending it a base64-encoded blob. The new protocol has request ids, op dispatch ( , , , , , , …), and server-managed thread bindings. In other words: Tutkain grew into nREPL, just spelled differently. Chlorine never used prepl directly. It used socket REPL plus an -style upgrade blob. Its author’s successor project, Lazuli , abandoned the whole approach in favor of nREPL. The post-mortem is worth reading and is fairly blunt: tools that attempted prepl went back to nREPL because, honestly, it’s simply better. Conjure had a prepl client in its early Rust days. The current Lua/Fennel rewrite ships only an nREPL client. The author’s reasoning in the release notes was that nREPL “has complete ecosystem adoption and brilliant ClojureScript support.” Clojure-Sublimed technically still talks to a raw socket REPL, but only after sending it an EDN-printing prelude that upgrades the REPL to a structured protocol of tonsky’s own design. 
His post on the topic is one of the most thoughtful pieces I’ve read on Clojure REPL design, and his conclusion is roughly: the bare socket REPL is more useful than prepl because you can install your own protocol on top of it. Which is true. But notice that everyone who reached that conclusion ended up reinventing the same wheel: ids, ops, request/response correlation, completion support, lookup, interrupts. You know, the things nREPL has had since 2010. So the trajectory looks roughly like this: Pure prepl clients are nearly extinct in the wild. The one I found that qualifies is propel by Oliver Caldwell (of Conjure fame), which is delightful, about 70 lines of Clojure, and explicitly synchronous (one outstanding eval at a time). That works! But it’s not a foundation for the kind of feature set people expect from an editor. Here’s where I land. Rich isn’t wrong that prepl is closer to a “real” REPL in the strict sense. prepl genuinely is a more faithful encoding of read-eval-print: each form goes in, each result comes out, and the semantics match what you’d get at the standard REPL prompt. The thing is, “real REPL” is not the property you optimize for when you’re building editor tooling. The properties editor tooling actually needs are: nREPL was explicitly designed for those properties. The ops, middleware, and transport abstractions exist precisely because the people building it knew the consumers are not humans typing at a prompt, they’re programs negotiating a session. Calling nREPL “not a real REPL” is technically defensible and practically beside the point. Nobody on the consuming end is confused about what nREPL is for . I wrote about nREPL’s revival in 2018 . At that point I had just finished migrating the project out of Clojure Contrib, and the goal was to give it a real home and a working development process. It was a lot of work, but in hindsight things played out pretty well. 
Looking at where things ended up:

nREPL itself is healthier than it has ever been. Active maintainers, a proper manual, a steady release cadence, an actual ecosystem organization on GitHub.
Most popular Clojure editors support it. CIDER, Calva, Cursive (via its own client), Conjure, vim-iced, you name it.
babashka ships with nREPL built in. You boot a and you get an nREPL server, no extra dependencies. That’s how a lot of people use nREPL in scripting contexts today, and it’s been a hit.
basilisp (the Clojure dialect on Python) has nREPL support. nREPL running on Python, talking to Emacs, evaluating Clojure. Nice.
ClojureCLR has a working nREPL story now, and jank (the C++ Clojure) has nREPL on its roadmap too.
The middleware ecosystem ( , , , , , …) is alive, well, and continues to add features.

Meanwhile prepl is, as best as I can tell, mostly a curiosity. It got me a side project I had fun with. It did not displace nREPL.

The history of tooling protocols is full of cases where “purer”, “simpler”, or “more elegant” lost to “shipped, documented, and battle-tested.” LSP beat fifteen ad-hoc language protocols. DAP beat the same fifteen debuggers. nREPL beat prepl in the (Clojure) editor space.

It’s not that the simpler thing is bad. prepl is a fine, elegant little protocol, and there’s a real case for embedding it in CI scripts, ops automation, deployment pipelines, or anywhere you want to drive a Clojure VM programmatically without pulling in a server library. Use it there.

But for editor tooling? The Clojure community made an enormous, multi-year, multi-tool investment in nREPL. We have the protocol, the middleware, the manual, the books, the conference talks. nREPL works, it’s actively maintained, it’s increasingly portable across Clojure dialects, and the design decisions that Rich called out as un-REPL-like are the exact ones that make it a good substrate for editors.

So I’ll say what I felt awkward saying back in 2018: nREPL forever. It’s the right abstraction for the job, and it’s not going anywhere.

One more thing. After finishing Port I got curious what a minimal nREPL client would look like by comparison, so I went and built one. As you can imagine, it turned out to be significantly simpler. If that sounds interesting, take a look at neat, a small, language-agnostic nREPL client for Emacs.

Keep hacking!

Tutkain for Sublime Text
Chlorine for Atom
Conjure for Neovim (in its early Rust incarnation)
Clojure-Sublimed by Nikita Tonsky
a steady drip of smaller experiments around , , and friends

Editor decides nREPL is too heavy or an undesirable external dependency and starts on prepl.
Editor discovers prepl has no ids, no ops, no interrupts, no server-side completion, no namespace tracking, no test runner integration, etc.
Editor rolls a custom protocol on top of socket REPL, or…
Editor gives up and goes to nREPL.

A way to correlate a request with its response when output and results are interleaved.
A way to multiplex – one connection, several logical conversations.
Server-side hooks for the operations every IDE expects: completion, lookup, go-to-definition, find-references, test running, stacktrace structuring, interrupt.
A protocol stable enough that ten different editors can target it without each one inventing its own dialect.
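The first two properties in that list, correlating requests with responses and multiplexing several conversations over one connection, come down to tagging every message with an id. Here is a minimal sketch in Go, since this feed is Go-centric. The `Mux` and `Response` types are hypothetical and do not reflect nREPL's actual wire format (its messages are maps, bencode-encoded by default, and carry more fields than this).

```go
package main

import (
	"fmt"
	"sync"
)

// Response is a hypothetical message shape used only for this sketch.
type Response struct {
	ID    int
	Value string
}

// Mux hands out request ids and routes each response to the caller
// that is waiting on that id, regardless of arrival order.
type Mux struct {
	mu      sync.Mutex
	nextID  int
	pending map[int]chan Response
}

func NewMux() *Mux { return &Mux{pending: make(map[int]chan Response)} }

// Register allocates a fresh id and a channel on which its response arrives.
func (m *Mux) Register() (int, <-chan Response) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nextID++
	ch := make(chan Response, 1) // buffered so Dispatch never blocks
	m.pending[m.nextID] = ch
	return m.nextID, ch
}

// Dispatch delivers r to whoever registered r.ID; it reports false for unknown ids.
func (m *Mux) Dispatch(r Response) bool {
	m.mu.Lock()
	ch, ok := m.pending[r.ID]
	delete(m.pending, r.ID)
	m.mu.Unlock()
	if ok {
		ch <- r
	}
	return ok
}

func main() {
	m := NewMux()
	a, chA := m.Register()
	b, chB := m.Register()
	// Responses arrive in the opposite order of the requests.
	m.Dispatch(Response{ID: b, Value: "second"})
	m.Dispatch(Response{ID: a, Value: "first"})
	fmt.Println((<-chA).Value, (<-chB).Value) // prints "first second"
}
```

A raw REPL stream has no place to put such an id, which is why each client in the survey above ended up layering its own protocol on top.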

Corrode 2 weeks ago

Migrating from Go to Rust

Out of all the migrations I help teams with, Go to Rust is a bit of an outlier. It’s not a question of “is Rust faster?” or “does Rust have types?”, Go already gets you most of the way there. The discussion is mostly about correctness guarantees , runtime tradeoffs , and developer ergonomics . A quick disclaimer before we start: this guide is heavily backend-focused . Backend services are where Go is strongest, small static binaries, a standard library focused on networking, and an ecosystem of libraries for HTTP servers, gRPC, databases, etc. That’s also where most teams considering Rust are coming from (at least the ones who reach out to me), so I think that’s the comparison that’s actually useful in practice. If you’re writing CLI tools, embedded firmware, or game engines, some of this still applies, but to be honest, I’m afraid this is not the best resource for you. For context, I’ve written about Go and Rust before: “Go vs Rust? Choose Go.” back in 2017, and later the “Rust vs Go: A Hands-On Comparison” with the Shuttle team, which walks through a small backend service in both languages. What you will learn in this article I’ll be upfront: I’m not a fan of Go. I think it’s a badly designed language, even if a very successful one. It confuses easiness with simplicity , and several of its core design tradeoffs ( everywhere, error handling as a discipline rule rather than a type, the long absence of generics) point in a direction I disagree with. That said, success matters! Go has captured a real and persistent share of working developers, hovering around 17–19% in the JetBrains Developer Ecosystem Survey. Rust is growing steadily but is still a smaller slice: Go is clearly working for a lot of people, and a guide that pretends otherwise isn’t helpful. So I’ll do my very best to be objective in this guide rather than relitigate old arguments. But you should know my priors so you can calibrate. 
The other prior worth disclosing: I run a Rust consultancy; of course I’m biased! More people using Rust is good for my business. But I’ve also worked in both languages professionally and shipped Go services to production. This guide is for Go developers who want an honest, side-by-side look at what changes when you move to Rust. For a deliberately opposite take, I recommend reading “Just Fucking Use Go” by Blain Smith. Holding both views in your head at once is more useful than either one alone. If you prefer to watch rather than read, here’s a video from the Shuttle article above, read and commented by the Primeagen: Go developers already have one of the cleanest toolchains in the industry. Back in the day, it started off a trend of “batteries included” toolchains that give you a single, consistent interface for building, testing, formatting, linting, and managing dependencies. I’m glad that Rust followed suit, because it’s a great model. It’s one of my favorite parts about both ecosystems. has even more built-in: The big difference is that in Go you typically reach for third-party tools ( , , , ) to fill gaps. In Rust, the first-party ecosystem covers more out of the box. Things that do require external crates (e.g. , ) install with one command and feel native, e.g. gives you right away. Both communities have converged on the same insight about formatters: a single canonical style, even an imperfect one, is worth more than the bikeshedding it eliminates. Gofmt’s style is no one’s favorite, yet gofmt is everyone’s favorite. — Rob Pike, Go Proverbs The same is true of : not everyone likes every detail, but the absence of style debates in code review is worth far more than the occasional formatting preference you’d have made differently. The headline is that Go and Rust are both compiled, statically typed, single-binary-deploy languages with strong concurrency stories. 
The differences are about what guarantees you get from the compiler and how much control you have over runtime behaviour . Go developers don’t usually come to Rust because Go is “too slow.” For most backend workloads, Go is plenty fast. People are generally a bit frustrated with Go’s verbose error handling, the danger of segmentation faults from pointers, and the lack of generics (for a long time) or any sophisticated type system features, such as enums or traits. Interfaces are not a worthy replacement for traits, and the Go standard library has some weird gaps, such as the lack of a type. I call it my billion-dollar mistake. It was the invention of the null reference in 1965 … This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years. — Tony Hoare, inventor of , QCon London 2009 This is the one I hear most often. You ship a Go service, it runs fine for months, and then one Tuesday at 3 a.m. a code path runs where someone forgot to check whether a pointer was , and the goroutine panics. Go’s compiler does not force you to consider the absence case. Rust’s does: You literally cannot dereference an without acknowledging the case. Whole categories of pager-duty incidents disappear. is a great tool, but it’s a runtime detector, it only finds races that actually execute during your tests. Mutating a map from two goroutines without a lock compiles fine in Go and only blows up in production under load. In Rust, sharing mutable state across threads requires types that implement and . Try to share a plain between threads and the program does not compile . You’re forced to wrap it in an , an , or use a channel. That race condition becomes a type error. 1 is fine for a while. 
After a few years, you notice three things: It’s worth being honest about the counter-argument here, since it came up in the Lobste.rs thread on my Shuttle article: experienced Go developers point out that and catch most of the “forgot to handle the error” cases in practice, and that explicit is easier to read than dense chains. Both points are fair, and the explicit style is a deliberate cultural value, not an accident: I think that error handling should be explicit, this should be a core value of the language. — Peter Bourgon, GoTime #91 , quoted in Dave Cheney’s Zen of Go My take is that lints are an opt-in safety net you have to remember to set up, while Rust’s is the type signature itself, there’s no way to forget. The boilerplate-vs-readability tradeoff is more genuinely subjective. The operator handles propagation; handles wrapping; and a on is exhaustively checked . Add a new variant tomorrow and the compiler shows you every place that needs updating. Go got generics in 1.18, and they’re useful, but the implementation has constraints (no methods with type parameters, GC shape stenciling, occasional surprising performance characteristics). Rust generics monomorphize, each instantiation produces specialized code with zero runtime cost. Combined with traits, this gives you real zero-cost abstractions. This matters less in handler code and more in shared infrastructure (middleware, generic repositories, decoders, parsers), where Go often pushes you back to / plus type assertions. Go’s GC is excellent, concurrent, low-pause, well-tuned for typical service workloads. But “low-pause” is not “no-pause.” Under heavy allocation, P99 latency tails are noticeably worse than a Rust equivalent that simply doesn’t allocate on the hot path. I won’t oversell this, for the vast majority of services, Go’s GC is a non-issue. 
But for latency-sensitive systems (trading, real-time bidding, network proxies, high-throughput ingestion), the lack of GC pauses is a genuine selling point. Go is death by a thousand paper cuts. It is a very pragmatic language and if you are willing to glance over the above issues, you can be very productive in it. But at a certain codebase size, the problems start to compound. There is no single moment when Go loses its appeal, but teams find themselves wishing for more (more safety, more control, more expressiveness) and that’s when they start looking around for alternatives. The fastest way to feel comfortable in Rust is to map patterns you already know. For a longer, fully-worked example of building the same backend service in both languages, see the Shuttle comparison , the section below focuses on the patterns that come up most often. The operator does the dance for you, including type conversion if is implemented (idiomatic with ’s ). There is no in safe Rust. References can’t be null. Pointers can be, but you almost never use raw pointers in application code. Go’s interfaces are structural, a type satisfies an interface implicitly: Rust’s traits are nominal, you implement them explicitly: The Go style is great for ad-hoc duck typing. The Rust style is great for refactoring and discoverability, you can grep for every implementer of a trait. The closest equivalent of / in Rust is , but you almost never want it. The Go community knows the cost of reaching for too: interface{} says nothing. — Rob Pike, Go Proverbs Generic functions with trait bounds ( ) cover the vast majority of cases and give you monomorphization with no runtime dispatch. Where Go pre-1.18 would have forced you back to plus a type assertion, Rust’s traits + generics let you stay specific. When you do want runtime dispatch (e.g. heterogeneous storage of different implementers), reach for or . That’s the direct Rust analog of holding an value in Go. 
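For readers who want the Go side of this comparison spelled out, here is a minimal sketch. The `Greeter`, `English` and `French` names are hypothetical. It shows implicit (structural) interface satisfaction, the `any` plus type assertion pattern, and a generic function whose constraint is checked at compile time.

```go
package main

import (
	"fmt"
	"strings"
)

// Greeter is satisfied by any type that has a Greet method;
// implementers never mention the interface by name.
type Greeter interface{ Greet() string }

type English struct{}

func (English) Greet() string { return "hello" }

type French struct{}

func (French) Greet() string { return "bonjour" }

// describeAny accepts anything and recovers behaviour with a
// type assertion, so a mismatch is only discovered at run time.
func describeAny(v any) string {
	if g, ok := v.(Greeter); ok {
		return g.Greet()
	}
	return "not a Greeter"
}

// shout is generic over Greeter: passing a type without a
// Greet method is rejected by the compiler instead.
func shout[T Greeter](v T) string { return strings.ToUpper(v.Greet()) }

func main() {
	fmt.Println(describeAny(English{})) // hello
	fmt.Println(describeAny(42))        // not a Greeter
	fmt.Println(shout(French{}))        // BONJOUR
}
```

Calling `shout(42)` does not compile, which is the Go counterpart of staying specific with a trait bound rather than falling back to runtime checks.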
Go’s concurrency model is famously simple. Goroutines are cheap, the runtime schedules them across OS threads, and channels (`chan`) are the primary coordination primitive. The Go proverb captures the philosophy: “Don’t communicate by sharing memory; share memory by communicating.” (Rob Pike, Go Proverbs)

This is the area where Go genuinely shines. Several commenters in the Lobste.rs discussion made the point that goroutines “just disappear” into normal-looking blocking code, and that’s worth giving Go credit for. Rust async is more powerful, but it’s also more visible in your code.

Rust uses `async`/`await` on top of an executor (almost always `tokio` for backend services). The shape is similar, with a few differences: futures are lazy, the compiler tracks thread-safety across await points, there is no goroutine-style preemption, and channels live in libraries rather than the language. For most backend code, the day-to-day feel is similar: spawn a task, communicate via channels, use timeouts liberally.

In Go, you plumb a `context.Context` through every blocking call. Rust has no built-in `context.Context`. The closest equivalent for cancellation is `CancellationToken`. For timeouts, `tokio::time::timeout` wraps any future. For deadlines/values, you typically pass them as explicit arguments or via `tracing` spans rather than a single context object.

Some Go developers miss the implicit feel of `context.Context`. In practice, the explicit Rust style is easier to reason about: you always know exactly what’s cancellable and what isn’t. The deeper point is that neither language gives you cancellation for free; the discipline just shows up at different layers: “Go doesn’t have a way to tell a goroutine to exit. There is no stop or kill function, for good reason. If we cannot command a goroutine to stop, we must instead ask it, politely.” (Dave Cheney, The Zen of Go)

In Go that “asking politely” is a `context.Context` plumbed through every call site by convention. In Rust it’s a `CancellationToken` (or a channel) plumbed through every call site, but the compiler can actually tell you when you forgot.

Both languages have channels. The translation is direct. Rust’s channels distinguish sender and receiver as separate types, which makes ownership and `Send`-ness explicit at the type level.
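The sender/receiver split can be sketched with the standard library’s `std::sync::mpsc` and OS threads. Note the swap: this uses std channels and threads instead of the tokio tasks and channels a real async service would likely use, purely so the example has no dependencies. `sum_of_squares` is a hypothetical helper written for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical helper: spawn `workers` producer threads, each sending one
// value, and sum everything the single receiver gets.
fn sum_of_squares(workers: u64) -> u64 {
    // `tx` is a Sender<u64>, `rx` is a Receiver<u64>: two distinct types.
    let (tx, rx) = mpsc::channel::<u64>();

    let mut handles = Vec::new();
    for id in 1..=workers {
        // Senders can be cloned; each thread takes ownership of its own copy.
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            tx.send(id * id).expect("receiver was dropped");
        }));
    }

    // Drop the original sender so the channel closes once all workers finish.
    // Without this, the receiving loop below would wait forever.
    drop(tx);

    // `iter()` yields values until every Sender has been dropped.
    let total: u64 = rx.iter().sum();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    total
}
```

The explicit `drop(tx)` plays the role of `close(ch)` in Go: the channel closes when the last sender goes away, so closing falls out of ownership rather than being a separate call you have to remember.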
Rust’s `&self` is the equivalent of a Go value receiver; `&mut self` is a pointer receiver with mutation. Owned `self` (consuming the value) has no Go analog and is occasionally very useful (typestate, builders).

Go’s `string` is an immutable byte sequence, UTF-8 by convention, with cheap assignment (the header is copied, the underlying bytes are shared). Rust splits this into two types: the owned, growable `String` and the borrowed view `&str`. As a rule of thumb, take `&str` in arguments and return `String` when you produce new data. This is mostly painless once you internalize it. The `String` vs `&str` split is a microcosm of Rust’s broader “borrow vs own” model.

Go got generics in 1.18 (March 2022), thirteen years after the language shipped. They are useful, but they feel tacked on, and in practice they have most of the downsides of a generic type system without delivering the upsides you’d expect coming from Rust, Haskell, or even modern C++. This is a strong claim, so let me back it up.

The most telling signal is that three years after generics landed, Go’s own standard library still mostly avoids them. Helpers such as `sort.Slice` still take a closure instead of a constraint. Older containers such as `container/list` are still typed as `interface{}`/`any`. The generic helpers that do exist live in a small handful of packages, such as `slices`, `maps`, and `cmp`. Compare that to Rust, where generics have permeated the standard library from day one: `Option<T>`, `Result<T, E>`, `Vec<T>`, `HashMap<K, V>`, every collection, every smart pointer. You cannot write idiomatic Rust without using generics, because the standard library is generic. In Go, generics are an opt-in feature for library authors who really need them. In Rust, they’re the substrate everything else is built on.

Rust’s generics are tied to traits, which double as the language’s mechanism for ad-hoc polymorphism, supertraits, associated types, blanket impls, and coherence. Go’s constraints are just interfaces with an extra operator for type-set membership.
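Three of the trait features named above (supertraits, blanket impls, and associated types in bounds) fit in one small sketch. All names here (`Named`, `Shout`, `Service`, `describe`, `join_all`) are hypothetical and exist only for this illustration; only `Display` and `Iterator` come from the standard library.

```rust
use std::fmt::{self, Display};

// Supertrait: every `Named` type must also implement `Display`.
trait Named: Display {
    fn name(&self) -> String;
}

// Blanket impl: every type that implements `Display` gets `shout()` for free,
// including types defined in other crates.
trait Shout {
    fn shout(&self) -> String;
}

impl<T: Display> Shout for T {
    fn shout(&self) -> String {
        self.to_string().to_uppercase()
    }
}

struct Service(&'static str);

impl Display for Service {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "svc:{}", self.0)
    }
}

impl Named for Service {
    fn name(&self) -> String {
        self.0.to_string()
    }
}

// The `Named` bound alone is enough to format `item`, because the supertrait
// relationship guarantees `Display`.
fn describe<T: Named>(item: &T) -> String {
    format!("{} ({})", item.name(), item)
}

// Associated type in a bound: `I::Item` is named directly, with no need for
// a second type parameter.
fn join_all<I>(items: I) -> String
where
    I: Iterator,
    I::Item: Display,
{
    items.map(|x| x.to_string()).collect::<Vec<_>>().join(",")
}
```

`describe(&Service("auth"))` can print the value even though its bound never mentions `Display`, and `Service("auth").shout()` works without `Service` ever mentioning `Shout`. Those two effects are the ones that are hard to reproduce with Go constraints.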
Go has no supertraits or constraint hierarchies, no associated types, no blanket impls, and no methods with their own type parameters. The practical consequence is that the moment your abstraction needs more than “a function that works for any `T` with these few operations,” Go pushes you back to `interface{}` plus type assertions, code generation, or runtime reflection.

Rust uses a Hindley-Milner-style inference engine that propagates type information through entire expressions, including across closures, iterator chains, and `?` operators. You routinely write an iterator chain over a range that ends in `collect()`, and the compiler figures out the element type from the range and the collection type from the target. Go’s inference is much shallower. It can usually infer type parameters from function arguments, but it cannot infer from return-position context, cannot chain inference through generic builders the way Rust does, and frequently forces explicit type arguments at call sites. In Rust this is the exception; in Go it’s still common.

Rust monomorphizes: every generic instantiation produces specialized machine code with zero runtime dispatch. Go uses GCShape stenciling with dictionaries, where types that share a “GC shape” share the same compiled function and dispatch through a dictionary at runtime. The result is a compile-time/runtime tradeoff that often surprises people: generic Go code can be measurably slower than the equivalent hand-written non-generic version, because method calls on a type parameter can go through an extra indirection. There’s a well-known PlanetScale post showing exactly this. In Rust, generic code is the fast path. Reaching for `dyn Trait` (the equivalent of Go’s interface dispatch) is a deliberate choice you make when you want runtime polymorphism.

This is the part that bothers me most. A good generics system removes reasons to fall back to escape hatches. In Rust, generics + traits eliminate most of what you’d otherwise need dynamic typing or runtime reflection for. The type system gets stronger. In Go, generics did not remove `interface{}`, did not remove `reflect`, and did not remove code generation as the dominant pattern for things like ORMs, decoders, and mocks. `encoding/json` still uses reflection. `database/sql` still uses `interface{}`. Mocking tools still generate code.
The places where a real generics system would shine are the same places Go reaches for runtime mechanisms it had before 1.18. Generics in Go feel additive: a new tool in the box that’s useful in narrow cases. Generics in Rust feel foundational; remove them and the language collapses. That’s the difference, and it’s why generic Go code, in my experience, doesn’t read better than the `interface{}`-based code it replaced; it just reads differently, with more punctuation.

If you’re already opinionated in Go, the Rust ecosystem has converged to a similar level of “default picks.” For a typical backend service, a small set of well-known crates covers 90% of what you need.

I want to be straightforward here. Coming from Go, you will hit a wall. The wall has a name: the borrow checker. Go’s runtime handles memory and aliasing for you. Rust pushes that decision into the type system. The first few weeks you’ll write code that “should obviously work” and the compiler will refuse it. The patterns that bite Go developers most often are long-lived references, self-referential structs, shared mutable state, and returning references from functions.

With all of these rules, the borrow checker sounds like a “gatekeeper” of sorts, which keeps getting in the way and is just frustrating to deal with. That is not the mindset you should have when learning Rust. The borrow checker uncovers real, existing bugs in your code, and if you don’t address them, your program will have safety issues. So whenever you get a compiler error from the borrow checker, take a step back and think about how your code could break. A few questions you can ask yourself: what happens if a moved value is used again, if shared data is modified by another thread, if a dereferenced pointer is null or dangling, or if a value is still in use after it goes out of scope? That is the mindset you need to understand the borrow checker.

Humans are genuinely bad at reasoning about memory. We forget that pointers can be null, that old references can outlive the data they point to, and that multiple threads can touch the same data at the same time. We tend to have a “linear” mental model of how data flows through a program, but in reality it’s closer to a complex graph with many paths and interactions. Every condition forces you to consider what happens in both branches.
Every loop forces you to consider what happens on every iteration. That is exactly the kind of reasoning the borrow checker is designed to do for you! It enforces best practices at compile time, and it can feel annoying when your own mental model disagrees with the borrow checker’s (which is the more accurate one 99% of the time). There are cases where the borrow checker is genuinely too strict, but they are rare, and as a beginner you’ll almost never run into them. I got memory management wrong plenty of times in my early days, but I approached it with a learner’s mindset, which helped me ask “what’s wrong with my code?” instead of “what’s wrong with the compiler?”, a reaction I see a lot in trainings.

The good news is that once you internalize borrowing, it stops fighting you. Most experienced Rust developers will tell you the borrow checker became an ally somewhere between weeks 4 and 12. The first month is the hardest.

Be honest with your team: Rust compile times are a real downgrade from Go’s. A clean release build of a medium service can take minutes, compared to Go’s near-instantaneous compiles. Incremental builds and `cargo check` are reasonable, and compile times have gotten much better over the years, but you’ll feel the difference. To mitigate, use `cargo check` in your edit loop, split into a workspace once it pays off, and keep proc-macro-heavy code in its own crate so it only recompiles when it changes. See tips for faster Rust compile times for a deeper dive.

Go’s “one type of function, sync everywhere, the runtime handles concurrency” is genuinely simpler than Rust’s split between sync and async. You’ll need to think about which of your functions are async, where you `.await`, and how that interacts with traits. `async fn` in traits (stable since Rust 1.75) helps a lot, but there are still rough edges (especially around `dyn Trait` with async methods).
Rust’s crate ecosystem is growing and libraries are high-quality across the board, but Go has a head start in some backend-adjacent domains: Kubernetes operators, cloud-provider SDKs, database drivers for certain niche stores. Before you commit, spend a day checking that the libraries you depend on have Rust equivalents you’re willing to use. Teams I help often have to hand-roll at least one or two core libraries themselves. For example, they might have to update an abandoned crate for XML schema validation, or write their own client for a lesser-known protocol. You don’t have to rewrite everything in one go. The strategies that work best, in order of how I usually recommend them: If one specific service in your fleet is the perpetual problem child (high CPU, latency-sensitive, or constantly hit with reliability issues), rewrite just that one in Rust, behind the same API contract. This is the lowest-risk migration. Other Go services keep talking to it via HTTP/gRPC, oblivious to the underlying language. Background workers, queue consumers, ingestion pipelines, and CPU-bound batch jobs are excellent first targets. They typically have a clear input/output boundary (a queue, a topic) and no shared in-process state with the rest of the system. You can call Rust from Go via cgo, and there are good guides on how to do it . (Reach out if you’d be interested in a guide on this from me.) In practice, I rarely recommend it for backend services. The build complexity and FFI overhead usually outweigh the benefits compared to “just stand up a Rust service and put it behind a network call.” For libraries and CLI tools, it’s more viable. If you have an API gateway or reverse proxy, you can route specific endpoints to a new Rust service while the rest stays in Go. This works particularly well when one bounded context (auth, search, billing) is the right unit to migrate. 
The pattern is often called “strangler fig,” because the new service grows around the old one until it eventually replaces it entirely.

Start with a service that has a clear boundary. Don’t pick the most central, most-deployed service in your fleet. Pick the one where the contract with the rest of the system is well-defined and the blast radius is small.

Keep the same API contract. If your Go service exposes a REST API, your Rust service should too: same paths, same JSON shapes, same error envelope. The migration is invisible to clients, and you can swap traffic incrementally with a gateway.

Don’t translate idioms verbatim. Resist the urge to write Go-flavoured Rust. `if err != nil` becomes `?`. Goroutine-per-request becomes `tokio::spawn` only when you actually need it (axum already handles requests concurrently). Interfaces with one method usually become trait bounds on a generic, not `Box<dyn Trait>`.

Use the compiler as a pair programmer. Rust’s compiler errors are usually pretty good. Read them slowly. They almost always tell you the right answer. The team members who struggle longest are the ones who fight the compiler instead of treating it as a collaborator.

Invest in training early. I’ve seen teams try to do a Rust migration “on the side,” learning as they go. It rarely ends well. It’s a bit like signing up for a marathon and then trying to run it without any prior training. You can do it, but it’s going to be painful and you might not finish. Block off real time for learning: a workshop, an online course, paired sessions on real code. The upfront investment pays back many times over once the team is fluent. (Hey, if you want to talk about training options, I’m happy to chat.)

Not everything should be migrated. Go remains excellent for Kubernetes-native tooling, CLI utilities and dev tooling, thin glue services, and anywhere team velocity matters more than absolute correctness guarantees. A hybrid strategy is fine and common. Many of the teams I work with end up with a polyglot backend: Go for the “boring” services, Rust for the ones where reliability and performance pay back the extra effort.
Numbers vary wildly by workload, so take these as rough guidance. Not promises! But here are some ballpark numbers, based on Go-to-Rust migrations I’ve helped with: Honestly, you’re unlikely to get a 10x throughput improvement going from Go to Rust the way you might from Python. What you get is fewer “silly errors” and flatter latency tails, plus the ability to expand into other domains like embedded development or systems programming while still using the same language. That’s often the most surprising side-effect of a migration: there’s a lot of opportunity for code-sharing across teams that previously had to use different stacks. You can use Rust for everything. Going from Go to Rust is a different kind of migration than coming from Python or TypeScript . Coming from Go, you know the benefits of a statically-typed, compiled language. So you’re not trading away dynamic typing or a slow runtime, you’re trading away in exchange for a more robust codebase with fewer footguns, and a stricter compiler that catches more mistakes at compile time. There is a steeper learning curve, however. For foundational services (services that your organization relies on, that have high uptime requirements, that are critical to your business), that trade is obviously worth it. For others, Go remains the right answer. The point of a migration is to put each problem in the language that solves it best. Ready to Make the Move to Rust? I help backend teams evaluate, plan, and execute Go-to-Rust migrations. Whether you need an architecture review, training, or hands-on help porting a critical service, let’s talk about your needs . Rust’s type system doesn’t catch all data races, but types that truly can’t be shared between threads without synchronization won’t compile. You can still have logic bugs in your synchronization, but you won’t have the kind of “oh no, I forgot to lock this” that often leads to silent data corruption. ↩ Where Go and Rust overlap, and where they diverge. 
How Go patterns map to Rust.
What you gain from the borrow checker.
Where I tell people to keep Go and where Rust is worth the migration cost.
How to migrate Go services incrementally.

The three things you notice about Go error handling after a few years:
The boilerplate dilutes the actual logic of your function.
Wrapping with `fmt.Errorf` and `%w` is a discipline rule, not a compiler rule. It’s easy to drop context on the floor.
Sentinel errors via `errors.Is`/`errors.As` work, but the compiler doesn’t tell you when you forgot to handle a new variant.

The differences between goroutines and Rust async:
Rust async functions return `Future`s. They don’t run until awaited or spawned.
The compiler tracks `Send`/`Sync` across `.await` points. If you hold a non-`Send` value across an await, you get a compile error explaining exactly why.
There’s no built-in goroutine-style preemption. Long CPU-bound work in an async task starves the executor; you offload it to `spawn_blocking` or a dedicated thread pool instead.
Channels (`mpsc`, `oneshot`, `broadcast`) are first-class but live in libraries, not the language.

The two Rust string types:
`String`: owned, heap-allocated, growable. Equivalent to string data you intend to mutate.
`&str`: a borrowed view into someone else’s string data. Equivalent to a Go `string` parameter most of the time.

What Go’s constraints lack:
Supertraits / constraint hierarchies. In Rust you write `trait A: B + C`, and any `A` automatically satisfies `B` and `C`. Go has no equivalent; you stack interface embeddings, but the constraint solver doesn’t reason about hierarchies the way Rust’s trait system does.
Associated types. Rust’s `Iterator` has `type Item`, so `I::Item` is a first-class thing you can name in bounds. Go’s closest equivalent is a second type parameter, which leaks into every signature.
Blanket impls. In Rust, `impl<T: Display> ToString for T` automatically gives every `Display` type a `to_string()` method. Go has no way to add methods to a type from outside its defining package, generic or not.
Methods with their own type parameters. This is an explicit, documented non-feature in Go. You cannot write a method that declares its own type parameter. In Rust, generic methods on generic types are routine.

The patterns that bite Go developers most often:
Long-lived references. In Go, you’d happily hold a pointer from a map for as long as you want. In Rust, that borrow blocks mutation of the map for its whole lifetime. The fix is usually to clone, or to scope the borrow tighter.
Self-referential structs. Common in Go (a struct holding both data and an iterator over it). In Rust, this requires `Pin`, `unsafe`, or a redesign. Almost always: redesign.
Sharing mutable state across goroutines. What you’d write as a struct guarded by a `sync.Mutex` becomes `Arc<Mutex<T>>`. Slightly more verbose, much more checked.
Returning references from functions. Lifetime annotations show up. They’re not as bad as their reputation, but they’re new.

Questions to ask yourself when the borrow checker complains:
If a value got moved from one place to another, what would happen if the original place tried to use it again?
If a value is shared across threads, what would happen if one thread modified it while another thread is using it?
If a pointer is dereferenced, what would happen if it was null or dangling?
When a value goes out of scope, what would happen if it was still being used somewhere else?

Go is excellent for:
Kubernetes-native tooling: operators, controllers, CRDs. The ecosystem is overwhelmingly in Go.
CLI utilities and dev tooling: fast compiles, easy cross-compilation, simple deployment.
Glue services: thin API layers, proxies, format converters. The boilerplate ratio in Rust isn’t worth it here.
Anywhere your team velocity matters more than absolute correctness guarantees.

Ballpark numbers from Go-to-Rust migrations:
CPU usage: 20–60% reduction. Less dramatic than Python-to-Rust, because Go is already efficient. The wins come from no GC and tighter loops.
Memory: 30–50% reduction, mostly from the absence of GC overhead and a smaller runtime.
P99 latency: significantly more consistent. Rust services tend to flatline where Go services have visible GC-induced jitter. (This has gotten much better on the Go side ever since they introduced their low-latency GC, but the difference is still there under heavy load.)
Production incidents: this is the one teams report most enthusiastically. The classes of bugs that survive and reach production (data races, nil dereferences, missed error paths) just don’t compile in Rust. On-call rotations are typically very boring after a Rust migration.

0 views
Brain Baking 2 weeks ago

The Death of the Brick & Mortar Toy Store

It doesn’t take a genius to figure out why more and more local stores are going defunct. A short trip downtown makes the destructive nature of Amazon et al. apparent: the city centre is littered with for-sale or for-rent signs, stuck on dirty windows of almost every third building. In 2024, I already wrote about the challenges of buying games locally , but now that we have two kids, I think about this more often. Yes, it’s annoying for myself, but no, it’s not a big deal: physical editions of rarer Nintendo Switch games or retro video games aren’t available locally anyway. But what about buying the kids a simple box of LEGO? Even that’s not possible anymore. And to me, that’s very sad. When my wife was little, her parents would take her out to the centre on Christmas eve where she could choose a little present for herself. None of the toy shops she used to frequent with her folks back in the day are still in business. None of them. So we can’t offer the same thing to our kids: we’d have to drive further—to a bigger supermarket with a toy region, or to a chain store. And to me, that’s very sad. The evolution of types of stores in our local city centre from small, independent, and varied to big names and nothing but shoes, boutique clothing, or counterfeit made-in-China watches is a curious phenomena. That got me thinking: In which stores was I a (regular) customer, what kind of toy did I buy there, and which of these businesses are still selling stuff today? The only photo I could find of a Christiaensen store (in Brussels) by Jeugdsentiment. The local Game Mania store in May 2009, a year before it closed down. That Chinese restaurant? It got replaced by an Indian one before being replaced by... a for-sale sign. There are two remarkable exceptions to this bleakness: comic book store Wonderland and board game specialist Oberonn . Both stores are not a part of some bigger holding and both stores stem from my youth and are still alive and kicking. 
In fact, they used to compete: in high school I used to buy new Magic: The Gathering (MtG) booster packs from the opened box at the counter top in Wonderland while Oberonn even sold singles in binders. The last time I visited Wonderland I learned they stopped selling MtG as not to clash with Oberonn . Local Christian youth association shop De Banier not only sells outfits but also creative trinkets for crafting and has a small board game selection. Strange, as that’s only 30 metres away from Oberonn —and usually a bit less expensive. They still exist but they recently meddled with their opening hours, shortening the time span. Hopefully that’s not a bad sign… We bought many of our favourite games there and my wife always finds some kind of jewellery making toolkit in there as well. I hope one day a Pipoos store finds its way to Hasselt as well. We thought the one in Maastricht was gone but it seems that they simply moved instead. The photo was taken from a Google Maps history in time save point: I didn’t know it was possible to go back in time using Street View!  ↩︎ Related topics: / hasselt / By Wouter Groeneveld on 18 May 2026.  Reply via email . Christiaensen : a Belgian toy store chain from the seventies and eighties that got bought out by the Dutch Blokker: see the Jeugdsentiment nostalgia: Christiaensen post. I bought Stratego Legends there when I was 16. Bart Smit : a Dutch toy store chain that got bankrupt and bought by Intertoys/Maxitoys. The Christiaensen store got converted into a Bart Smit that now is yet another empty building. I bought too many Nintendo GB(A)/(3)DS games there and was a regular for over a decade. Every time we went shopping, I just had to drop in and see what’s on sale: they would regularly slash prices so you had to be quick. At one time, there were three Bart Smit stores in Hasselt. I even remember being gifted the MegaDrive cart Toejam & Earl in Panic on Funkotron by my grandparents somewhere in the nineties. 
Whether you fancied a video game or a LEGO box, Bart Smit was the go-to solution for almost every Flemish/Dutch kid. That building now is yet another boring clothes store. DreamLand : another toy store chain with venerable Belgian roots owned by Colruyt group that briefly had a fancy underground store near a new parking lot not even five years ago. Of course it had to go. I bought The Quest for El Dorado and other board games there, and I think we also bought baby toys for our daughter there. The bigger store about away from us recently also closed down. The store chain is still alive as is their webshop, but for how long… There’s still a DreamLand nearby but no longer in the centre. Free Record Shop : a Dutch retailer that primarily sold music CDs and boomed during the nineties. The one in Sint-Truiden also had a second hand selection that included GBA/DS games. Good times… Free Record Shop was declared bankrupt in 2013. I bought several albums and every good handheld game I could there. Fnac : a French retail chain with a long history that never made it to our city: we used to drop by when visiting Leuven. They usually are more expensive than the above alternatives. In 2020 they finally opened a shop in Hasselt. Since a month, it’s for rent. Yup. I bought a few puzzle games, picture books, and audio CDs there. Broux : a renowned local model building specialist my late father in law loved. I think by now you can guess its fate. I’m not big into the hobby but tagged along once and got myself some kind of fighter jet. I never finished it. Game Mania : the local Game Stop that used to have more than 30 stores across Belgium. I loved its early location at the outskirts of our village, conveniently placed close to a road I passed when cycling home from high school. I convinced my sister to help finance the silver GameCube plus Wind Waker and Super Mario Sunshine . Best purchase ever. 
They were usually pricier than supermarket/online competitors but I didn’t care and just wanted to support them. This is also where I got my original Paper Mario 2 edition for the painful full price (that was even more painful in 2004). I guess that didn’t work out: yet another bankruptcy. The local Game Mania store moved buildings twice before being gone in 2024 [1].

0 views
Aran Wilkinson 3 weeks ago

Introducing jjw: a workspace manager for jj

I released jjw, a Go CLI for managing jj workspaces with bookmarks and lifecycle hooks. This post explains why I built it, how it works, and how to get started.

0 views
Stratechery 3 weeks ago

The Deployment Company, Back to the 70s, Apple and Intel

Listen to this post: Good morning, President Trump is on the way to China, and Sharp China is your go-to podcast for understanding what happens next. Add it to your podcast player now in anticipation of the next few episodes breaking down the trip. On to the Update: From Reuters : OpenAI said on Monday it is setting up a new company with more than $4 billion in initial investment to help organizations build and deploy artificial intelligence systems, and will acquire an AI consulting firm, Tomoro, to quickly scale up the unit. After its early models saw strong resonance with consumers, OpenAI has been working aggressively to sign corporate contracts and establish a large presence in the business world where its AI will see large-scale deployment. The venture, which will be majority owned and controlled by OpenAI, also comes as rival Anthropic enjoys strong success in its enterprise AI push with its Claude family of models seeing rapid adoption among businesses. The new firm, called OpenAI Deployment Company, will help the ChatGPT maker embed engineers specializing in frontier AI deployment into organizations that will then work closely with various teams to identify where AI can make the biggest impact, OpenAI said. Its acquisition of Tomoro, a consulting firm that helps enterprises deploy AI, will bring around 150 experienced AI engineers and “deployment specialists” to the new unit from day one. Tomoro was formed in 2023 in alliance with OpenAI, and counts companies such as Mattel, Red Bull, Tesco and Virgin Atlantic as its clients, according to its website. That was on Monday; on Tuesday, from The Information : Google plans to hire hundreds of engineers to help customers start using its business-focused AI products, according to a person familiar with the situation. Google’s new “forward deployed engineers” will form a new team within Google Cloud, the unit’s chief, Thomas Kurian, said on LinkedIn on Tuesday, without disclosing the size of the effort. 
Matt Renner, Google Cloud’s chief revenue officer, said in a separate post that the move would help Google “show up for our customers with more technical resources (vs just an ocean of salespeople).” The announcement is one of several in the industry in recent weeks as tech companies are deploying armies of humans—often described as “forward deployed engineers”—and partnerships with consulting companies to get customers using AI-driven technology intended to automate work. On Monday, OpenAI launched the “OpenAI Deployment Company” in partnership with consulting and investment firms. Last week, Anthropic announced the creation of a joint venture with private equity firms to sell its AI to the PE firms’ customers. It is, needless to say, tempting to drop some snark about AGI apparently not being good enough to deploy AI, but instead I’m going to go with “as predicted”. In 2024’s Enterprise Philosophy and the First Wave of AI , I made the case that the proper analogy for AI in the enterprise was not SaaS, but rather the first wave of computing in the 1970s. Agents aren’t copilots; they are replacements. They do work in place of humans — think call centers and the like, to start — and they have all of the advantages of software: always available, and scalable up-and-down with demand…Benioff isn’t talking about making employees more productive, but rather companies; the verb that applies to employees is “augmented”, which sounds much nicer than “replaced”; the ultimate goal is stated as well: business results. That right there is tech’s third philosophy: improving the bottom line for large enterprises. Notice how well this framing applies to the mainframe wave of computing: accounting and ERP software made companies more productive and drove positive business results; the employees that were “augmented” were managers who got far more accurate reports much more quickly, while the employees who used to do that work were replaced. 
Critically, the decision about whether or not to make this change did not depend on rank-and-file employees changing how they worked, but for executives to decide to take the plunge. Specifically, I don’t think that the Deployment Company is going in to help employees use chatbots; that’s even more clearly the case with the PE firms that both OpenAI and Anthropic are doing deals with. I expect there to be an ever-increasing number of deals where PE buys software firms with reliable cash flows and conducts significant layoffs, forcing AI to pick up the slack, solving stock-based compensation issues in the process. I don’t know if the mandate for the Deployment Company is going to be quite so harsh, but I assume this is a company that is hired by the executive suite to fundamentally rethink business processes in a way that hasn’t been done since the mainframe: Most historically-driven AI analogies usually come from the Internet, and understandably so: that was both an epochal change and also much fresher in our collective memories. My core contention here, however, is that AI truly is a new way of computing, and that means the better analogies are to computing itself. Transformers are the transistor, and mainframes are today’s models. The GUI is, arguably, still TBD. To the extent that is right, then, the biggest opportunity is in top-down enterprise implementations. The enterprise philosophy is older than the two consumer philosophies I wrote about previously: its motivation is not the user, but the buyer, who wants to increase revenue and cut costs, and will be brutally rational about how to achieve that (including running expected value calculations on agents making mistakes). That will be the only way to justify the compute necessary to scale out agentic capabilities, and to do the years of work necessary to get data in a state where humans can be replaced. The bottom line benefits — the essence of enterprise philosophy — will compel just that. 
What I wonder is how much of the work ends up reworking data; that, as I noted in that article, is why I was bullish on Palantir: That leaves the data piece, and while Benioff bragged about all of the data that Salesforce had, it doesn’t have everything, and what it does have is scattered across the phalanx of applications and storage layers that make up the Salesforce Platform. Indeed, Microsoft faces the same problem: while their Copilot vision includes APIs for 3rd-party “agents” — in this case, data from other companies — the reality is that an effective Agent — i.e. a worker replacement — needs access to everything in a way that it can reason over. The ability of large language models to handle unstructured data is revolutionary, but the fact remains that better data still results in better output; explicit step-by-step reasoning data, for example, is a big part of how o1 works. To that end, the company I am most intrigued by, for what I think will be the first wave of AI, is Palantir… That integration looks like this illustration from the company’s webpage for Foundry, what they call “The Ontology-Powered Operating System for the Modern Enterprise”: What is notable about this illustration is just how deeply Palantir needs to get into an enterprise’s operations to achieve its goals. This isn’t a consumery-SaaS application that your team leader puts on their credit card; it is SOFTWARE of the sort that Salesforce sought to move beyond. Google’s Kurian, by the way, did dismiss any sort of Palantir comparison in a Stratechery Interview last month: This all makes perfect sense, particularly this bit about the Knowledge Catalog definitely fits how I’ve been thinking. I wrote about this a few years ago about this importance of this whole layer and understanding it, it’s a bit of a big lift to get this in place. You have some sort of analog, say, with like a Palantir that’s putting in like their ontology thing. 
They have FDEs out on the site, multi-month projects doing this. You have OpenAI talking about Frontier, their agent layer, and they’re partnering with all the tech consultancies to build this out. Is this going to entail a lot of boots on the ground to get this graph working and functional in a way that your agents can operate effectively across it? TK: We’re not competing with Palantir, we’re not building a semantic dictionary or an ontology. What we’re doing is, today I’ll give you the closest analogy. TK: Today when you use a model, let’s say you use Gemini, and you ask a question, Gemini goes through reasoning, and then it shows you a citation. A citation is, “How did I answer the question and what’s the source I derived from?” Now imagine that citation was a query that needed to go to a folder in, for example, a storage system because there’s some documents there and a database because, for example, in a part number, just think about there’s a part number document that lists all the part numbers and sits in a drive and then that part number you need to fetch out to say it’s the modem that the guy is coming to repair, and that’s mapped to a table in a database. 
So what the graph does, we use Gemini, so we don’t need humans, we use Gemini to say, “Hey, go and read all these documents in these drives and extract the information from it and then match that to the database table that has the reference to the part number”, and so then when Gemini turns around and says, “I got this query about how much inventory of modems they are”, the first thing it does is it says, “Okay, go to the Knowledge Catalog and it says modem is part number one, two, three, four, five”, and then it says, “By the way the table in the database that has the inventory information about this part number is this table, here’s a SQL”, it then makes the quality of what we generate higher and then when it answers the question it shows back — back to your, “Trust my data”, it shows a grounding citation saying, “That’s where we got it from.” Well, so much for not needing humans! I joke, mostly — Kurian was referring to not needing a Palantir-like ontology, not necessarily dismissing the need for FDEs — but it sure is interesting how AI is creating the need for new kinds of jobs. It’s almost as if the world is more dynamic, and pure intelligence, unadulterated by what already exists and the burden of reflexivity, is more static, than the most pessimistic prognosticators may have anticipated. More prosaically, OpenAI and Anthropic need the revenue, enterprises need the imagination, and Google needs to stay in the game. From the Wall Street Journal : Apple and Intel have reached a preliminary agreement for Intel to manufacture some of the chips that power Apple devices, according to people familiar with the matter. Intensive talks between the two companies have been ongoing for more than a year, and they hammered out a formal deal in recent months, these people said. Bloomberg News previously reported the talks. It’s still unclear which Apple products Intel would make chips for, these people said. 
Apple ships more than 200 million iPhones a year as well as millions of iPads and Mac computers. Ming-Chi Kuo reported on X late last year that Intel would make Apple’s most basic M processor on its 18A process; he didn’t specify which generation. Regardless, while the Wall Street Journal cites Trump administration pressure, and an earlier Bloomberg article Apple’s concentration risk on TSMC and Taiwan, the most obvious reason for a deal — assuming it exists — is economic. Specifically, Apple has for two quarters running said it can’t satisfy demand because it can’t get enough capacity at TSMC. CEO Tim Cook referenced this point multiple times on the last earnings call , but I think this was the most important articulation: The constraint in the March quarter and the June quarter, the primary constraint is the availability of the advanced nodes our SoCs are produced on, not memory. And so I don’t want to predict for supply and demand to match because if I look at it realistically, I think on the Mac mini and the Mac Studio, I believe it will take several months to reach supply-demand balance. And so we’re not at the point where we’re saying this is going to end anytime soon. And it’s not because of a problem per se other than we just undercalled the demand. And there are lead times to this, as you well understand, and it takes a while to correct that. And the primary constraint from a product point of view, or the majority of it for this quarter, for the June quarter will be on the Mac. And it’s Mac mini, Mac Studio and the MacBook Neo. It’s all of those. Cook talked about lead times last quarter as well, and the important thing to note is that while it does take five months or so to make new chips, assuming Apple realized it needed more iPhone 17 Pro chips right away, those new A19 Pro lines only started producing chips partway through last quarter (which is why iPhone 17 Pro sales weren’t as high as they could be). 
Critically, however, what seems likely is that Apple took capacity away from the Mac to make more iPhone chips, and now doesn’t have enough chips for the Mini and Studio either. The long-and-short of it is this: Apple doesn’t have flexible access to TSMC capacity anymore, because so much of that capacity is going to AI in particular, and it’s costing Apple meaningful money across multiple product lines. This was always the thing that would bring companies to Intel; I wrote in TSMC Risk : Becoming a meaningful customer of Samsung or Intel is very risky: it takes years to get a chip working on a new process, which hardly seems worth it if that process might not be as good, and if the company offering the process definitely isn’t as customer service-centric as TSMC. I understand why everyone sticks with TSMC. The reality that hyperscalers and fabless chip companies need to wake up to, however, is that avoiding the risk of working with someone other than TSMC incurs new risks that are both harder to see and also much more substantial. Except again, we can see the harms already: foregone revenue today as demand outstrips supply. Today’s shortages, however, may prove to be peanuts: if AI has the potential these companies claim it does, future foregone revenue at the end of the decade is going to cost exponentially more — surely a lot more than whatever expense is necessary to make Samsung and/or Intel into viable competitors for TSMC. This, incidentally, is how the geographic risk issue will be fixed, if it ever is. It’s hard to get companies to pay for insurance for geopolitical risks that may never materialize. What is much more likely is that TSMC’s customers realize that their biggest risk isn’t that TSMC gets blown up by China, but that TSMC’s monopoly and reasonable reluctance to risk a rate of investment that matches the rest of the industry means that the rest of the industry fails to fully capture the value of AI. We’re already here (reportedly). 
TSMC’s failure to invest aggressively enough over the last several years will, in the end, give Intel the single most important thing it needs to become a viable competitor: the customer who did more than any other to make TSMC into the leader in the first place.

Anton Zhiyanov 1 months ago

Solod v0.1: Go ergonomics, practical stdlib, native C interop

Solod ( So ) is a system-level language with Go syntax and zero runtime. It's designed for two main audiences: The initial version (let's call it v0) was focused on picking a subset of Go and translating it to C. The next logical step was to port Go's standard library and make it easier to interop with C. That's what the v0.1 release I'm presenting today is all about. Standard library • SQLite bindings • Persistent map • Store and retrieve • Command-line interface • Performance • Wrapping up Solod v0.1 ships with the following stdlib packages ported from Go: And a couple of its own packages: Stdlib documentation In the following sections, I'll demonstrate some of the v0.1 features using a simple example: a persistent key-value store backed by SQLite. Since So doesn't provide yet, we'll call SQLite directly through its C API. To do this, let's import the necessary headers with the directive and generate extern declarations using the sobind tool: The directive is required for constants ( ) and types ( ). As for functions ( ), we can just declare them without a body — the transpiler will treat them as extern declarations even without . With the SQLite API in place, let's implement a key-value type that wraps the database connection: Add a constructor that connects to an SQLite database and creates a table to store the items: As you can see, this So code looks a lot like regular Go code. However, there are some key differences: First, let's implement the method: No surprises here, just a bunch of SQLite API calls. The method is more interesting: The pointer returned by is managed by SQLite. It becomes invalid after calling (which does before returning). Because of this, we need to allocate a copy of the returned value, using in this case. So's approach to memory allocation is similar to Zig's — all heap allocations must be done explicitly by providing a specific instance of the interface. 
The caller, of course, must free the allocated string: Here, is a specific allocator that uses libc's and . Alternatively, we could use or any other implementation of the interface: With the type in place, let's create a simple CLI using the package: Then add command routing: Again, no surprises here — the package works just as it does in Go. Solod isn't trying to outperform hand-tuned C. Still, performance matters: the code is benchmarked and optimized to run reasonably fast. Since So compiles to plain C and then to native code with full optimizations, the results are sometimes better than Go's. Here are some highlights from the benchmarks: There're no GC pauses and no Cgo bridge cost when calling C libraries. The tradeoff is that you have to handle memory yourself, but as the SQLite example above shows, So's allocator interface makes that pretty manageable. Solod vs. Go benchmarks Solod is still in its early days, but with the v0.1 release, it's ready for hobby projects. The already-ported parts of the Go standard library make it easy to write command-line tools (check out the , , , and examples ). Plus, with native C interop, you can build just about anything else you need. The next release (v0.2) will likely focus on networking, concurrency, or both — along with more stdlib packages. If you're interested, take a look at So's readme — it has all the information you need to get started. Or try So online without installing anything. Go developers who want low-level control and zero-cost C interop, without having to learn a new language or standard library. C developers who like Go's style. , , and — Abstractions and types for general-purpose I/O. , , , and — Common byte and text operations. and — Generic heap-allocated data structures. and — Generating random data. , , and — Working with the command line and files. — Structured logging. — Measuring and displaying time. — Memory allocation with a pluggable allocator interface. — Low-level C interop helpers. 
When compiled, the code is first translated to plain C, then compiled into a native binary using GCC or Clang. Unlike Go, there is no runtime (no automatic heap memory allocation, no garbage collection, no goroutine scheduler). There is no overhead when calling C functions, unlike Go's Cgo. The interop syntax is a bit cleaner. For example, Go's ( in the call) automatically decays to C's . Buffered I/O is 3x faster than Go. String and byte operations are up to 2.5x faster. Maps are 1.5x faster for modifications. Integer formatting is 2x faster.

Manuel Moreale 1 months ago

Hyde Stevenson

This week on the People and Blogs series we have an interview with Hyde Stevenson, whose blog can be found at lazybea.rs . Tired of RSS? Read this in your browser or sign up for the newsletter . People and Blogs is supported by the "One a Month" club members. If you enjoy P&B, consider becoming one for as little as 1 dollar a month. Hyde Stevenson is a nickname I've been using online for years. It's a mix from Dr Jekyll and Mr Hyde, and its author Robert Louis Stevenson. Privacy is important to me, so I generally avoid using my real name. My parents are from Serbia, but I was born in Paris. I lived in London, and, now, I live in southern Europe. More vitamin D was needed in my life. I had two passions as a kid: sport, and computers. Sport has always been a big part of my life. When I was a kid, all my friends played football, but I was always more into basketball. I don't mind watching a good football game, but that's where it ends. But, basketball is another thing. I'm a big Nikola Jokic fan, and I haven't missed a Denver game for the last four years. When we were kids, we all dreamt about the NBA. There weren't many games available to watch. We had one guy who ordered games on tape direct from the US. Then, we shared, and copied them. Basketball was our life. We played at school, after the school, the weekends. We were chasing the best playgrounds to compete with other players. It was great. It was the end of the 80s. Bird, Magic, Jordan, the Pistons Bad Boys, and also Yugoslavian players like Vlade Divac and Dražen Petrovic. The Dream Team too, the real one. I'll always wonder what might've happened if the war in the Balkans hadn't happened and the USA and Yugoslavia had played each other in the Olympics final. That love for the game made me play at a semi-pro level. But, a bad coach put me off the courts. I was young and didn't understand why I couldn't play more when I knew I had the level. 
I remember one shooting training where I got 46/50 on 3pts, and the guy behind me got 36/50. Did the coach say something to me? Nope. That was enough, and I took a break from the game for a few years to pursue another passion: boxing. My love of boxing probably stems from those nights when my father would wake me up at 4am to watch Mike Tyson's fights. I've always loved boxing. My father's mate's nephew was a boxer. He invited me to train at his gym. And I got hooked. Sad story about this young man. He went pro, but after a bar fight, I heard he was murdered out for revenge by someone involved in that brawl. I also had a great group of friends, and we trained grappling, and MMA for four or five years. A good friend trained us grappling. Today, he trains fighters who fought in the UFC, and got lucky to meet many MMA fighters like Jon Jones . Another one, Guillaume Kerner trained us Thai boxing. Guillaume was one of the first western European Thai boxer who won a World Title in Thailand. You can check some highlights of his career . That was before I moved to London. When I got back in France, I was training exclusively in boxing until 2021, when I moved abroad. Since I relocated, I've really missed the camaraderie of the boxing club. I'm lucky enough to have a garage where I've hung a punching bag and can keep training. For those interested, I started last year a #50kPushUps challenge . The goal is to make 50,000 push-ups in one year. I could write many anecdotes about people I met, but I want also to share my other passion: computers. When I meet people, the first thing they say to me is that I don't look like a computer guy. Stereotypes... 🤷 My passion probably started when one night my father brought home the VCS, the Video Computer System, later renamed the Atari 2600. It's not a computer, but that's where it all started. Later, I asked if I could have a computer, and they offered me the Amstrad CPC464 with its 64Kb RAM, and cassette deck. 
Later, my grandmother offered me the updated version the CPC6128 with the same RAM, but with a 3-inch floppy disk. After that I had many other ones. I started to build them. I tried my first Linux distro in 1995. It was a Debian. Today, my main distribution is still Debian, even if I tried, and used many others. I've tried probably many window managers over the years. But, for the last 15 years more or less, I've been using only awesomewm , a tiling window manager, light, and customizable if you know Lua a bit. I could write a lot about Linux, but I don't think it'd be of much interest to our readers. What I can say is that my love for computers is what got me to where I am today in my career. My first blog was about Debian, the GNU/Linux distribution. It was in 2001, and it was called debianworld.org. I used to write how-tos, and articles about Linux. I used the blog to post English to French translation of the Debian Weekly News, but also the Securing Debian Manual , and some part of the Advanced Bash scripting guide . Then in 2014, after a long summer, I found out I got cyber squatted. And, just like this it was gone. Then, for five years, I didn't set up anything online until 2019. I met a colleague that asked me if I participated in any conferences, or if I had a blog. That's when I wanted to have a personal place online again. I love bears, that's why I chose that domain name. And, lazy, because I am sometimes. About the theme, it took me some time to create it, and be happy with the final result. But, then, it didn't really change. It depends. First, I need a topic, or an idea. Sometimes a blog post, a news, a new tool, or basically anything can inspire me to write directly a post. But, often, I like to go through my Zettelkasten. Every morning, I use this keybinding -0. That opens a random note. If it doesn't sparkle anything, I hit the same keys again. A "new" note appears, and, sometimes, a discussion starts. 
I will add more content, or argue with previous thoughts. That's how some drafts start. English not being my mother tongue, I read the different parts multiple times to be sure they make sense. My goal is to write simple sentences that connect with everyone. Once done, I check whether my LSP has missed any grammar mistakes. Then, a script syncs the content to my blog, and also posts it on Mastodon. I don't. I just need my laptop, a terminal, and a coffee. That's all. Maybe the physical space could help some people. Maybe if I had a seaside view, it could impact my creativity 😅. Previously, for other projects, I used Drupal, then Wordpress. But, for this one, I wanted something easy to maintain. No database, no plugin updates. Something simple. That's why I went for an SSG, a Static Site Generator. I chose Hugo , and I've been happy with it for years. There is some JavaScript from Carl Schwan's post to add Mastodon comments to the blog. So far it works well. Everything is hosted on a dedicated server. All posts have been written in Neovim, my go-to editor, on a Tuxedo laptop. My local repository has a backup on a Synology DS1812+ NAS, which also has a remote backup. That repository is pushed to a private Codeberg repository too. The domain name was purchased at Unlimited.rs , a registrar in Serbia. Originally, the name of the blog was lazybear.io, but after the announcement that it will disappear in the future, I switched to a Serbian one. For other projects, I also use Porkbun, which I love. I don't think so. A few of my friends suggested that I should specialize and monetize it, but that was never its goal. It's my little corner on the web where I can do whatever I want. I can tweak it as I want, try new things, post photos the way I want, without having to follow a specific format. It was always meant to be my place to experiment. I don't track visitors, I don't care about numbers. 
Now, and then, I get some emails, and I like the discussions I get there. Keep them coming 🙌 The domain name is around €24 per year. The dedicated server around €30 per month, but I use it for other things too. It doesn't generate any money. I could add a Ko-fi account, and maybe I will... just in case. 😇 If people want to monetize it, I don't see any issue with that. Everyone is free to do whatever they want. Ok, I have a couple of them! And, two French photographers: I also have a list of blogs I enjoy, and follow . Yeah start a blog, value your privacy, and send an email to Manuel so we can find more about you. Now that you're done reading the interview, go check the blog and subscribe to the RSS feed . If you're looking for more content, go read one of the previous 139 interviews . People and Blogs is possible because kind people support it. Rldane.space Zerokspot.com Joelchrono.xyz Benjaminhollon.com Christiantietze.de Jeremyjanin.com GregoryMignard.com

qouteall notes 1 months ago

Rust Async Traps

In Rust, if you call an async function, it returns a future. But the future is just data by default. If you don't await it or spawn it, its async code won't run. The word "future" has a very different meaning in Java. In Java, when you obtain one, the task should already be running. An async runtime schedules async tasks on threads. When an async task suspends, the thread can run other async tasks. But this requires the async task to suspend cooperatively at an await point. An async task can keep running for a long time without suspending, and the async runtime cannot force-suspend it. A scheduler thread then stays occupied. This is called blocking the scheduler thread . A blocked scheduler thread reduces overall concurrency and performance, and it may cause a deadlock. Normal sleep and normal locking block the thread using OS functionality. When a thread is blocked by the OS, the async runtime doesn't know about it. In Tokio, use Tokio's own async mutex and sleep. They pause cooperatively and avoid that issue. The issue is not limited to locking and sleep. It also involves networking and all kinds of IO. So Tokio provides its own set of IO functionality, and you have to use it for maximum performance with Tokio. Heavy computation without an await point is also blocking: the async runtime cannot force-suspend it if it doesn't suspend cooperatively. Tokio also supports an "escape hatch". A task spawned through it runs in another thread pool and won't block the normal scheduler threads. Code that does non-async blocking or heavy compute work should be run there. See also: How to deadlock Tokio application in Rust with just a single mutex , Why do I get a deadlock when using Tokio with a std::sync::Mutex? In Rust, a future can be dropped. When it's dropped, its async code stops executing at an await point. This is called cancellation. It's an implicit exit mechanism: its control flow is not obvious in code. Note that it cancels the future, not the IO. 
Cancelling a future just stops the async code from running (and drops the related data). IO operations that are already done won't be cancelled. (The written files won't be magically rolled back. The sent packets won't be magically withdrawn.) Cancellation is not the only implicit exit mechanism. Panic is another. And in languages that have exceptions (Java, JS, Python, etc.), an exception is another implicit exit mechanism. However, exceptions and panics are often logged, but future cancellation is often not logged . Although a panic is implicit control flow, it's usually explicit in logs, so it's easy to debug. A future cancellation logs nothing by default. Debugging a cancellation issue is much harder than debugging a panic. The cancellation "catch": normally when the parent future is cancelled, the inner futures are also cancelled. It propagates from outside to inside. Spawning a task can stop that propagation: dropping the handle of a spawned task won't cancel the task. So if you want to avoid cancellation, wrap the work in a spawned task (and don't explicitly cancel it). In Golang, there is panic, but there is no implicit cancellation. All cancellation needs to be explicit. (However, managing context cancellation in Golang still has traps, just different ones from async Rust.) Two examples of cancellation issues: Alan tries to cache requests, which doesn't always happen , Barbara gets burned by select See also: Dealing with cancel safety in async Rust , Cancelling async Rust There is another kind of "cancel": the future is not dropped, but it is no longer polled. This is also dangerous. Elaborated below. Tokio documentation about cancellation safety: 1 , 2 Note again that "cancel" just drops the Rust future (and un-tracks it in the async runtime). It doesn't cancel the IO operation. With epoll, the buffer can be put directly inside the future, with no extra allocation. If the Rust future is dropped, it just doesn't do the IO after being notified. 
With io_uring, dropping the future doesn't cancel the kernel's IO process. So putting the buffer into the future with io_uring is not memory-safe on cancellation (the kernel will write into freed memory). Two solutions: See also: Notes on io-uring As previously mentioned, dropping a future cancels it. There is another kind of "cancellation": just not polling the future, without dropping it. It's also dangerous. It may cause deadlocks or weird delays. In a select, you can pass ownership of a future, but you can also pass a borrow of a future. When a borrow is passed, one dangerous case can happen. If the select goes into one branch, the futures of the other branches are dropped. If you pass a future borrow to it, the borrow itself is dropped, but the borrowed future is not. However, the borrowed future will not be polled again by the select (you can explicitly await it afterwards, but it makes no progress until then). This creates a temporarily un-polled future. This is dangerous when an async lock is involved. After the lock is acquired, the returned future holds it. If the future holding the lock is dropped, the lock is released. But if the future holds the lock and is neither dropped nor polled, it's likely to deadlock. This is the mechanism behind futurelock . When using a buffered stream, some futures in the buffer may be temporarily un-polled. This can cause weird delays or deadlocks. https://tmandry.gitlab.io/blog/posts/for-await-buffered-streams/ https://without.boats/blog/poll-progress/ Rust currently has no in-place initialization. Heap-allocating a value requires first creating it on the stack and then moving it to the heap. In release mode, this can be optimized into direct initialization on the heap. But in debug mode it still involves creating the value on the stack. Some futures may be very large. Creating a large future on the stack can cause a stack overflow. Sometimes it overflows the stack in debug mode but not in release mode, because in release mode it writes directly to the heap. 
On Windows the default stack size is smaller, so a stack overflow is more likely. There is currently some inefficiency in future size. See Async Future Memory Optimisation How to reduce future size: It will print All of them execute on the main thread. There is no parallelism. The parallelism can be enabled by using . But without it has no parallelism by default. This is different in Golang. In Golang, goroutines are parallel. Async-sync-async sandwich: an async function calls a sync function that blocks on another async function. The async-to-sync call blocks the scheduler thread. It's very prone to deadlock. Tokio does multi-threaded work-stealing scheduling. Its purpose is very similar to OS scheduling, and an async task's purpose is very similar to an OS thread's. The duality of the two: as long as the data is owned by a thread, it's data-race free. The correspondence: as long as the data is owned by an async task, it's data-race free. Tokio requires the future to be . This can create some trouble. It requires because Tokio does work stealing. An async task running on one thread could later be scheduled onto another thread. However, if an async task is analogous to a thread, then ensuring that the data is owned by the async task would also make it data-race free, even if the data is not . However, Rust doesn't check the "async task boundary". An async task can pass data out. Then the data is no longer owned by the async task. There is no language mechanism that ensures the data stays within an async task. So you still have to satisfy even for data that's only used within one async task. The constraint can be avoided for thread-per-core async runtimes. Using multiple async runtimes together is possible but hard and error-prone. And there are many async-runtime-specific types. So async runtimes are naturally mutually exclusive. That's why Tokio has a monopoly. In Golang you can only use the one official goroutine scheduler. 
In Rust, although Tokio has a monopoly, you can still choose other async runtimes. This trap is not Rust-specific. A thread pool usually has a thread count limit, which limits concurrency. But in async, there is no concurrency limit by default. This is good for a high-performance web server. But it has downsides: One solution is to add a semaphore to limit concurrency. Structured concurrency forces all concurrent tasks to be scoped. Then the tasks form a tree-shaped structure. With structured concurrency, tasks can borrow data from the parent. There is no need to make the future . There is no need to wrap things in . The tree shape is free of cycles, so awaiting child tasks alone cannot deadlock (but it can deadlock if other kinds of waits are involved). But there are cases that structured concurrency cannot handle. One is background tasks. For example, a web server provides a RESTful API that launches a background task. The background task keeps running after the request that launched it finishes. The bane of my existence: Supporting both async and sync code in Rust ; Why async Rust? ; Async Rust can be a pleasure to work with (without ) ; Making Async Rust Reliable - Tyler Mandry ; FuturesUnordered and the order of futures The "fully owned" here means not just ownership in Rust semantics. The has internal data structures. The "fully owned" applies to these internal data structures. One async task fully owning the means the internal data structure (which contains the reference count) is only accessible from one async task. ↩ . When one branch is selected, the futures of other branches are cancelled. . Explicitly cancel a task. . When the timeout is reached but the future hasn't finished, it's cancelled. In epoll, the OS notifies the app that an IO can be done, then the app makes another system call to do the IO. It involves context switching from kernel to app (receive notification), then to kernel (do the IO syscall), then to app (finishing IO). 
The app can choose not to do the IO after receiving the notification. This works well with Rust future cancellation. In io_uring, the OS finishes the IO directly (writes to the buffer) and then tells the app. It's just one context switch from kernel to app (faster than epoll's kernel-to-app-to-kernel-to-app). The IO is fully done by the kernel. The app cannot choose to "receive notification but not do IO". When the app receives the notification, the IO has already been done. This doesn't work well with Rust async cancellation. Make the future non-cancellable. Rust doesn't yet have linear types (must-move types), so this cannot be guaranteed by the language. Make the buffer heap-allocated. When the future is dropped, the buffer can still exist, and the kernel can write to it without violating memory safety. Avoid creating an in-place buffer like . The buffer will be directly in the future. When calling another async function, first box that future, then await it. If not boxed, the sub-future will be put directly inside the parent future. Making async code call sync code is easy, but risks blocking the scheduler thread, as mentioned previously. Making sync code call async code is not easy. It requires using the async runtime's API. But it's less risky. For a scraper, if concurrency is too high, it may use too much memory and then OOM. If it sends too many concurrent requests to a remote server, it may trigger a rate limit, and then most requests fail.
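The semaphore idea mentioned earlier (capping how many requests are in flight at once) can be sketched with the standard library alone. This is a thread-based analogue rather than async code: the Semaphore type, run_limited, and the 5 ms sleep standing in for a request are all hypothetical names chosen for illustration. In Tokio, tokio::sync::Semaphore plays the same role. The sketch also uses std::thread::scope, so the workers borrow the semaphore and counters from the parent without reference counting, which is the thread-level version of the structured-concurrency point above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// A small counting semaphore built from std primitives (hypothetical helper).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `jobs` simulated requests with at most `limit` in flight at once.
/// Returns (completed jobs, peak number observed in flight).
fn run_limited(jobs: usize, limit: usize) -> (usize, usize) {
    let sem = Semaphore::new(limit);
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    let done = AtomicUsize::new(0);

    // Scoped threads may borrow `sem` and the counters from this stack frame;
    // the scope joins every worker before returning.
    thread::scope(|s| {
        for _ in 0..jobs {
            s.spawn(|| {
                sem.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for a request
                in_flight.fetch_sub(1, Ordering::SeqCst);
                done.fetch_add(1, Ordering::SeqCst);
                sem.release();
            });
        }
    });

    (done.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}

fn main() {
    let (done, peak) = run_limited(8, 3);
    println!("completed {done} jobs, peak in flight {peak}");
}
```

All eight jobs complete, but the observed peak never exceeds the limit of three. The same shape applies to a scraper: acquire a permit before each request and release it afterwards, which bounds both memory use and the load on the remote server.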

0 views
Langur Monkey 1 month ago

Local TTS is getting very capable and accessible

Around 2007 I spent half a year in the University of Aberdeen working on my final year project involving NLP . The project consisted of an interactive game that was controlled by language input. It also had to produce speech. At that time, we managed to partner with a group at La Salle University that were working on a TTS system for Catalan. It was a closed system that was accessible via a web API, but it was far too slow for real time use. I ended up preprocessing the audio of all dialog in the project. At that time, I was amazed that a computer could so easily convert text to an understandable audio file. The voice was very robotic, and the results were hit or miss, but it worked . Fast forward to today, TTS systems are everywhere. Several groups have released low-parameter TTS models that run very well on consumer hardware. I have been using the lightweight Kitten TTS for a while with fantastic results. The models are so lightweight that some websites are heavier than entire Kitten TTS models: Projects like streamline and trivialize Kitten TTS inference. I have a shell script in one of my directories that does everything in a single command: This clones the project, pulls dependencies and models, and plays the audio. It is quite fast, especially when using cached data. Kitten TTS produces acceptable results, though the output usually lacks emotion and nuance. For simple use cases (reading notifications, generating voiceovers for scripts) it’s more than sufficient. Qwen3-TTS , which I’ve been recently testing, represents a step-up in quality. It’s extremely good, and local inference is practical even on modest hardware given the model sizes. It offers three interesting variants: The voice design models are particularly clever: you describe the voice you want alongside the text to convert. Want a deep, gravelly voice with a Scottish accent? Or an excited teenager talking about a video game? Just describe it. 
It’s remarkable that you can run this locally so easily. However, as far as I know there’s no off-the-shelf CLI tool that handles dependencies, downloads the model, and runs inference out of the box. That’s why I created QwenSay . With it, you can clone the repository and convert text to speech locally from your terminal without wrestling with dependencies or writing any code. Here’s how it works. First, set it up: Now, you are ready to convert your text to speech with Qwen3-TTS: This uses the default 1.7B voice design model. You can also specify the model with . There are many other CLI arguments that you can use to tune your output. Check out the repository documentation for more details. Whether you’re building accessibility features, creating voiceovers for projects, or just experimenting, this is worth a try. I’ve made QwenSay my go-to TTS tool because it produces high-quality results and is genuinely fast.

0 views
Stratechery 1 month ago

An Interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman About Bedrock Managed Agents

Good morning, As I noted yesterday, today’s Stratechery Interview is early in terms of my timing — Tuesday instead of Thursday — and late in terms of delivery — 1pm Eastern instead of 6am — because the topic was embargoed. That embargo created a bit of a weird situation for me over the last several days: So here we are. I think the Microsoft-OpenAI deal makes a lot of sense for both sides. Here are the bullet points of the new arrangement from Microsoft’s post : I think the most important point is the last one. Azure had a real competitive advantage thanks to being the only hyperscaler able to offer OpenAI models, but this also hindered OpenAI, particularly once it became clear that many enterprises cared first and foremost about accessing models on their current cloud of choice; I’ve been noting for a while that this was a real competitive advantage for Anthropic . In other words, Azure’s exclusivity was actively damaging Microsoft’s investment in OpenAI, and given Anthropic’s rapid growth this year, Microsoft needed to tend to their investment, even if it diminished Azure’s differentiation. OpenAI, meanwhile, clearly sees AWS as a massive opportunity — so much so that they are forgoing Azure-related revenue for the next few years (which, per the previous point, will help Azure management feel better about losing their exclusivity; their PnL is going to look a lot better without paying a revenue share to OpenAI). OpenAI is also releasing Microsoft from the AGI clause ; now the agreement between the two companies will run through 2032 no matter what. What does seem clear is that OpenAI’s focus is going to be on AWS, and the greatest evidence in that regard is the topic of this interview: Bedrock Managed Agents, powered by OpenAI. The easiest way to think about this offering is Codex in AWS; a lot of what makes Codex work is the fact that it is local, which gives you a lot of complexity, particularly in terms of security, for free. 
It’s another thing entirely to figure out how to make agents work across an organization, and the goal of this offering is to make these workflows much more accessible for organizations who already have most of their data in AWS. To that end, in this interview, we discuss how AWS created the entire cloud category, and the impact it had on startups, and how AI is both similar and different to that previous paradigm shift. Then we discuss Bedrock Managed Agents, what it is, and how it differs from Amazon’s existing AgentCore offering. We also touch on Trainium and why chips won’t matter to most AI users, and why partnering makes sense relative to Google’s focus on full integration. As a reminder, all Stratechery content, including interviews, is available as a podcast; click the link at the top of this email to add Stratechery to your podcast player. On to the Interview: This interview is lightly edited for clarity. Matt Garman and Sam Altman — well Matt, welcome to Stratechery — and Sam, welcome back [I previously interviewed Altman in October 2025 , March 2025 , and February 2023 ]. Sam Altman: Thank you. Matt Garman: Thank you, thanks for having me. So Matt, this is your first time on Stratechery. Alas, I think that Sam’s presence is going to preclude the usual getting to know you section. Besides, he doesn’t want to hear us reminisce about our times at Kellogg Business School, but it is good to have a fellow alumnus on the podcast. MG: Yeah, I’m happy to be here. I’ll come back another time and we can do a little deeper dive. That’d be great. You’ve been working on AWS since you were an intern, and you’re now in charge of the entire organization during this AI wave. What aspects of building the AI business are the same as building the original commodity compute business, for lack of a better term, and what aspects are really different? 
MG: I think that the parts that are the same are that I see that same excitement and builders out there being able to do things that they were never able to do before, and one of the cool things is when we first started AWS, is developers all of a sudden could get their hands on infrastructure that was only available to the largest companies who had millions of dollars to go build data centers. With a credit card and a couple of dollars, they could spin up applications and it really exploded what was possible for people building out there on the Internet. We kind of took the idea that people could build whatever they want and we weren’t going to presuppose what they should do and that the creativity of the world out there was, if we could put powerful tools in front of them, they’d build interesting and amazing things. I think this is as much, if not more, transformational to what it’s enabling builders out there to do. As you think about what’s possible, you don’t have to have gone to school and learned for 10 years to code in order to go build an application, you don’t have to have huge teams of hundreds of people and months and months and months of time to go build things. You can build things with small teams, you can build it fast and you can iterate quickly, and AI is unlocking all sorts of innovation across every different aspect of the world. I think in many ways that’s very similar, and it’s super exciting to see what it’s enabling from the customer base out there. There was a bit, though, when AWS came along, you were the only one , so you get all the upsides and downsides and everything sort of for free. Is there a bit where it felt like in the AWS era, there’s a lot about commodity compute, making it fungible, elastic, cheap — in AI, particularly in training, it feels like the winning abstraction was more about these really vertically integrated super clusters, really advanced networking, and really tight linkages between software and hardware. 
Was that sort of a surprise for you, where you’re coming at it now — instead of fresh, “We’re the only ones here, we had a particular way of looking at large-scale compute”, and at least for the first few years of AI, it maybe didn’t perfectly align? MG: I don’t know that it was different for us. I think for what was different though, is just the incredible rapid scale of adoption, and I think that that’s probably surprised everybody. Sam, you can weigh in different if you disagree, but just the speed of adoption and how fast people have grabbed onto the capabilities there, I think has surprised everyone. It’s different if you go to the, when we started cloud computing, it took us a really long time to explain why a bookseller would provide your compute power, that was a lot of explanation to explain what cloud computing was. There was a lot of hard work that people forget, but back in 2006, it wasn’t a given that that’s just how the world’s computing would move to and so there was a lot of kind of hard work there. Do you think you had to do a bit of explaining now though, because lots of people were anchoring on the training era and you’re like, “We’re thinking about the inference era “, and that’s going to be something different, maybe you still had to get those explanatory powers going again? MG: You do, but it’s just how quickly people understand what you’re talking about is just totally different. So I think yes, I think if you move from where people are saying, “That does seem kind of cool, and it’s really neat that I have this intelligent chatbot that I can talk to”, going to, “I can actually do work in your enterprise”, has been a little bit of an education, but it’s also been relatively quick in the scope of how fast technology moves. 
We’re going to get to the product that we’re here for very quickly, I promise, but Sam — from the startup ecosystem perspective, when you look back, obviously AWS, transformational , completely changed where the barrier was, now anyone can get started. You have seeds, you have angel investors, and it sort of moves back the barrier where the cutoff point, you don’t have to get servers on a PowerPoint, you can build an app and then go to your Series A or whatever it might be. What, though, is different or the same compared to what that enabled versus the world today from your perspective? SA: I think there have been four great moments for platform enablement of startups at mass scale: there was the Internet, there was cloud, there was mobile, and then there was AI. The first one of those that I was kind of like an adult for was the cloud and in the early days of YC [Combinator] — it’s like hard to overstate what a change this meant for startups. Before, you had these startups that were like renting colo[cation] space and putting together servers and putting stuff in there and it was this like massively complex thing, and you had to like raise all this money. Then all of a sudden, even though the cloud happened like right after YC got started, I guess it was the year after. I was just going to ask that — is it really at the end of the day, they’re really hand-in-hand more than you realized at the time? SA: They felt incredibly hand-in-hand at the time, it felt like YC was, you know, surfing this wave of the cloud from the very beginning because there were some early pre-AWS examples. You don’t need to put that much money into a startup to get something off the ground if AWS exists compared to what it might’ve been before. SA: It was this huge enabling change and it was part of why YC sounded so crazy at the time. 
People were like, “Well, there’s no way you can fund a startup with a few tens of thousands of dollars, it’s impossible, the server costs more than that”, so it was this complete change to what startups could do with small amounts of capital. Startups generally win when there is a big platform shift and you can do things with a faster cycle time and much less capital than before, that’s a classic way startups can beat big companies, and at the beginning of my career, I really witnessed that happen with the cloud, it actually feels quite directionally similar now watching what companies are doing building on AI, but as Matt was saying, the speed of it is crazy. Is there a bit where the incumbents, the large companies, are adopting this way faster than they were the cloud? SA: There’s definitely more of that, but I also mean just the rate that revenue is scaling in at startups — I spoke at YC recently and I kind of asked at the end, “What are the expectations for revenue for a good company at the end of YC?”, and they’re like, “Well it’s kind of changing every month, maybe we’d have a different answer at the beginning of the batch versus the end of the batch”, and this never used to happen before. Just the rate at which people are able to build scaled business on this new platform is unlike anything I’ve seen before. You were the cloud of choice for basically all startups, a huge advantage to that whole era, Matt. What makes you the cloud of choice today? Because you think about a lot of people building on the OpenAI API, or is that something you felt, “Actually we’re coming at this market from a very different perspective, we have a huge installed base who’s begging us to get AI things, and we have less visibility into this whole cohort that Sam’s talking about”? MG: I think there’s a couple of things. One is, we’re quite excited about our partnership, and I think it’s going to be really meaningful to a bunch of startups out there.
But today, even if you go and you talk to startups, the vast majority of scaling startups are still scaling on AWS today, and there’s a whole bunch of reasons for that. The scale is there, the availability is there, the security is there, the reliability is there, that kind of partner ecosystem of other ISVs are in AWS, the customers are in AWS. (laughing) Everyone’s used the AWS panel whether they wanted to or not, so they’re used to it. MG: And we help them. We spend a ton of time enabling startups, whether it’s with credits, but it’s not just with credits, it’s advice on how to set up your systems, how to think about go-to-market, a bunch of those things that are, I think, are really appreciated by a bunch of the startups, we invest a lot of time and effort to make sure because we really feel like the startups are the lifeblood of AWS. They were from the beginning, like when Sam was talking about it, but they remain today, and I still go once a quarter out to Silicon Valley or other places to meet directly with startups to hear what they’re doing, to make sure that what we’re building is landing with them. So there is more competition today than there was 20 years ago for that startup attention, and it’s just as important for us as it’s ever been and we spend a ton of time to make sure that we’re meeting the needs of those startups. Is it fair to say people building directly on the OpenAI API, as opposed to say the Azure version of it, are more likely to have a stack of AWS for regular compute and then OpenAI for their AI? MG: I think that’s a very common pattern that a lot of startups have today, absolutely. Well that brings us to today’s announcement: Bedrock Managed Agents, powered by OpenAI, I think I got that right.
The pitch, as I understand it, is not simply OpenAI models are available in AWS — I don’t think that’s allowed — it’s that OpenAI’s frontier models are being packaged inside an AWS-native agent runtime, identity, permission state, logging, governance, and deployment. Sam, is that the right way to articulate it? SA: Yeah, that was pretty good. Thank you. What is this? Now explain it in English. SA: I think the next phase of AI is going from you supply some text to an agent and get more text back, or even you supply a bunch of code and get more code back, to we are going to have these agents running inside of a company doing all different kinds of work. Virtual co-workers is kind of my least bad of the ways I’ve heard this described, but no one has quite figured out the right language for this, and we are packaging a new product that we’re working on together to help enable companies that want to build these sorts of stateful agents and make them available. Again, I think we don’t know exactly how the world’s going to talk about these, use these, but if you look at what’s happening [with Codex], I think there’s a great example of where we can see this all going. How important is the harness , the runtime around the model, the tools, state — to your point, a very important word to you — memory, permissions, evals, to making agents actually work? SA: Hard to overstate how critical it is. I no longer think of the harness and the model as these entirely separable things, like my experience of using these, I am very aware of the fact that I don’t always know when I fire something off in Codex and it does an amazing thing for me. I don’t know how much credit — Was it that the model is amazing or the harness was amazing? SA: Yeah, exactly. To what extent is the harness developed in conjunction with the model? Where does that integration happen? Is it in post-training? Is it in the prompt? What makes this integration work? SA: Both of those. 
It’s not really part of the pre-training process but I would say you can look at it — there’s a more interesting thing here which is the fact that we’ve seen examples of this many times in the past of where things that we thought were very separable get baked in more and more and more. Like the way we initially thought about tool-calling, which is now a critical part of how we use these models, was not something that we thought about deeply integrating into the training process and over time we’ve done more and more of that. I would also suspect that model and harness come together more over time and for that matter, I would expect that pre-training and post-training eventually come together more over time as well. It’s such a cliché to say, but I’ll do it anyway, because I think it’s very, very true — we’re so early in the paradigm of all of this, this is still like the Homebrew Computer Club days of how much this is like really matured as an industry. This is why I think it’s so interesting, I wrote about this a few weeks ago , in any value chain, ultimately a point of integration emerges and that’s where it’s really important, these two pieces have to go together to make it work. And over time, that’s obviously where a lot of value collects — my thesis then is that this harness-model integration is the key point. It’s to your interest, but it sounds like you agree. SA: It is to my interest, I do agree, but I also would say even more broadly, what you care about is that you go type into Codex what you want to happen and that it happens. You don’t care about the implementation details. SA: I don’t think you do. There have been so many examples as we’ve been figuring all of this out where we had to do something at the level of the system prompt, that later we didn’t.
The general observation here is as the models get smarter, you have more flexibility to get them to behave in the ways you want which sounds like an obvious statement, but it is— It’s easier to tell a 10-year-old what to do than a 5-year-old. SA: When I think back to what we had to do to get any drop of utility squeezed out of these models back in the GPT-3 days that now you never would have to, because of course the model just understands and does it well out of the box, that trend may keep going much further. MG: I was just going to add to that — I completely agree with that and I think when you talk to customers who have ideas exactly what they want these systems to do, previous to this kind of joint collaboration that we worked on together, is that customers were kind of forced to pull that together themselves, right? They wanted these models and agents to remember that they work together well and they wanted to integrate into their existing systems, and it’s not just third-party tools, it’s their own tools. They want them to learn about their own data, their own applications, and their own operating environment and all of that kind of integration today, at least, is left to every single customer to do on their own. So part of this joint collaboration that we were leaning into together is co-building a new type of product that actually brings those things much closer together so that customers can much more easily go accomplish these things that they want to do, where identity is already kind of built into that product, where the ability to go authenticate to your database all happens inside of your AWS VPC [ Virtual Private Cloud ]. You can do a bunch of these things that would be possible to do if we were kind of at the OpenAI APIs and AWS over here, but by building this thing together, we make it much easier for customers to much more rapidly get to value and go accomplish the thing they want to do inside of their enterprise environment. 
So you think that you can build a functional agent in a generic harness, it’s just way more difficult? You’re making it easier? Or is there a bit where actually there might not even be stuff you can do if you don’t have them tied together? SA: To go back to your earlier analogy, pre-AWS days, you could do a lot if you were willing to go stand in a cage and buy a bunch of servers and figure out how to connect them and hire your own network engineer, and you could make a lot of things happen and then all of a sudden as soon as you could just like log into an AWS control panel and click, “I need another S3 instance”, or whatever, you could make a lot more things happen because the activation energy, the amount of work that required for the basics, got way better so you can do a lot with the models today. Yet every time I watch someone use our models or try to set up some of this work Matt was saying, I am torn between being happy they’re so impressed and feel like this is a magical technology and pulling my hair out at how much pain and suffering they’re going through to get anything to work at all, and that’s not just true of developers building these products, even using ChatGPT and watching people copy and paste things from here to there and try to have this complicated set of prompts — I know that’s going to go away, and I’m thrilled. It’s still so early, and so bad. Just don’t take away your integration with BBEdit , that’s all I ask, my number one favorite feature of the ChatGPT app. (laughing) Thank you. 
SA: A) This stuff is just way too hard to do, and we think if we can make it way easier it’ll bring way more value to developers and businesses, but B) there are a lot of things that you just can’t reliably get to work at all and I think through our joint collaboration not only will it be a story of ease of use and not having to go build out your own colo or whatever, we are going to jointly figure out a lot of new things to build where people will be able to build products and services that just can’t be done even with a lot of pain and suffering. I actually want to come back to that point about things to be built. But just to go back to Codex real quick — Codex is a harness and model, it runs locally. Why is it easier to get agents to work locally right now? SA: Actually, we started with it running in the cloud, and I think eventually you do want it to run in the cloud. For sure. I’m walking through the transition to this offering, which is in the cloud. But why did you go back to local? SA: You have your whole environment there, your computer’s set up, your data is there, you don’t have to like think about — it was just easier to get to work, even though it’s not the end state. But getting to a world where agents do run in the cloud and when you — if you have a very intensive thing, or you need to close your computer or whatever, you can hand stuff off to working on the cloud, I think is clearly going to be great. But the ease of use that we were able to deliver clearly in the short term, it won out to have it using your local environment. 
There’s one way that I think about it, is like you have the old school security model, which is like the castle-and-moat sort of thing, and you’re moving to a new security model of zero trust and everything having the appropriate permission structure and authenticating and all those bits and pieces, and it feels like to me one way to frame running locally, it’s like your self-imposed castle-and-moat, everything’s on there, I just assume it’s all fine and easy to do. And a way I’m thinking about this, and Matt, let me know if that resonates with you, is to get all those pieces to actually function in a production environment you just can’t even have that all locally, you have to be operating this environment from the get-go, is that a right way to think about it? MG: I don’t know that there’s any computing environment that’s gotten rid of a client, there are just benefits of operating locally. There’s a reason that most of your iPhone apps also have a local component, whether it’s connectivity or latency or just local compute or access to files and applications. The local client does have a particular — as Sam said, it’s easy, it works really well, it’s constrained, though, there’s limits to it. You can’t scale out your local laptop, you have what you have and once you start getting in an enterprise contract, sharing between two people gets to be a little bit harder — thinking about permissions, thinking about security boundaries gets to be a little bit harder. So there’s a number of those pieces where I think that, I wouldn’t say that having the local environment is a bad thing, it’s just a different thing, and I think that you’re eventually going to want to have that bridge across both.
That’s my question, because you have in the cloud era, you had containers that helped you converge local and production environments, but it kind of feels like in this case if you have to deal with agents, to your point, say I was like a virtual co-worker or whatever it might be, if they have their own identity and they have their own permissions and all those sorts of things, to even build them you need to be in the right environment as you’re going to deploy it, it would seem that way to me. SA: I think there is so much to figure out here. Just to give one example, if you’re an employee at a company, do you want to have one account for when you use some service, and then should your agent just use your account, or should your agent use a different account so that the server can tell which is which? Or what if you want lots of agents? SA: Exactly. I suspect that what we actually want is something we haven’t figured out yet, and maybe it’s that when Ben’s agent is logging in as Ben, it uses Ben’s account but it notes that it’s an agent and not the real Ben. We don’t even have a primitive to think about that, but we may quickly need to figure that out and my sense is there are going to be 50 other things like that where as we have agents join the workforce and act with increasing levels of autonomy and complexity of tasks, a lot of the mental models that we have for how software works and how access control and permissions work inside of a company or on the broader Internet, those are all just going to have to evolve. How do you think about, Matt, in terms of security and access policies and whatnot for agents? MG: Yeah, I do think that that’s where when you move more of these workloads into the cloud that you can have as a central organization, more controls over some of the security pieces of it.
And I do think, when we talk to customers all of the time, it is what they worry about, which is, “I love the promise of what I can do with some of these really powerful models and agents, how do I make sure that I don’t have a company-ending event where I screw it up?”, and there’s the worry out there. I think we can help with that because these are solvable problems, they are, and I think, giving some customers confidence, “Well, it operates inside of this VPC”, and you can at least then control that boundary and know what it has access to, or it goes through this gateway, and you can give it permissions, much like you give it a role inside of the rest of your environment. These are constructs that over the last 20 years, we’ve built up a really rich set of capabilities, so that it’s not just Y Combinator startups, but it’s global banks and healthcare agencies and everybody in the world and government agencies that can use AWS and having built up all of that security structure around it, I think can help us further accelerate how they take advantage of this technology and kind of have these safeguards to run fast. I think a lot of times when you’re in a company, particularly companies that are in risk-averse environments, having those safety guardrails where they say, “If it operates inside of the sandbox, I am excited to go fast”, can actually help many of our customers start to use these technologies for a much broader set of things. A lot of these capabilities you’re talking about that you’ve developed over 20 years and you’re trying to put it in place for agents are exposed today through AgentCore . So what is the relationship between Bedrock Managed Agents powered by OpenAI and Bedrock AgentCore? MG: A lot of what we’ve built together is building on the building blocks of AgentCore in order to kind of pull some of these pieces together. So there’s like a super set that sits on top of that?
MG: The AWS team and the OpenAI team used AgentCore components together with the OpenAI models and a bunch of those pieces to go and co-build this product together. AgentCore is kind of our set of primitives that just like if with AWS, if you want to go and build our own agentic workflows, you can do that. You can have a memory component, you can have a safe execution environment, you can have a permissioning capability, and you can go and configure all of those and we have customers running those in production today that are doing really cool things. But not with OpenAI. MG: But not with OpenAI, they have to use different models today, that’s true. Actually, that’s not true, we have people doing it with OpenAI. Oh, just calling to another cloud or whatever. MG: They just call directly to the OpenAI model. So we actually absolutely have people doing it with OpenAI today, not natively inside of Bedrock, but they’re still using that. And it’s an open ecosystem where you can pull different capabilities to go build whatever you want and my bet is that people will continue to do that. We have builders out there that love to, to Sam’s analogy, love to continue to build computers at home today, even though you don’t have to do that, and even though people like to build and we think that people for a long time will build their own agents, but the vast majority of them are going to want an easier way to do it where they don’t want to have to go configure all of those pieces themselves and that’s part of what we’ve launched in this collaboration together. Just to be super clear, you talk about this managed experience with Bedrock Managed Agents, you can also use AgentCore and pull from a model, whether on AWS or somewhere else. And just to make clear, Sam, this is a question for you, this is the distinction between OpenAI on say, Azure, where that’s just you have direct access to the API, and that is distinct from this managed service on Amazon. Is that correct? 
SA: Correct, yep. And you feel very good about that, that’s scoped correctly in all terms, it’s not going to be an issue going forward? SA: Yeah, I think things will evolve over time, but I feel very good about this as a way to start. Is this going to be an exclusive offering for AWS? Or do you anticipate having this sort of managed agent service on other clouds? SA: Yeah, we’re doing this exclusively with Amazon, we’re excited about it. How much of the exclusive is, “Look, we’re using all Amazon’s APIs, of course it’s only on Amazon”, or is this the overall idea of a managed experience, it’s not just a “We’re using Amazon APIs”, it’s, “Right now this is going to be on Amazon”? SA: Spiritually, we want to do this as a joint effort between our companies. Got it. The PR does say something, and this goes back to the point you mentioned, Matt, earlier about you could call out to other APIs and glue this all together yourself. In this case, the customer data stays within AWS, so what exactly does OpenAI see, what does that mean? MG: That’s right. So the whole thing kind of stays within your VPC and so data is protected inside of the Bedrock environment. Got it. And this is going to be running on OpenAI models through Bedrock, and these are going to be on Trainium ? MG: They’ll be through a mix of different – some of it will be on Trainium, some of it will be on GPUs. Is that just a function of timing? Because I think as part of your announcement a couple of months ago — MG: Some of it’s timing and capabilities, I think we’ll kind of be mixing in the different components of building the system together, using the right infrastructure for the right parts of it. But over time, more and more of it will be on Trainium. SA: We are quite excited to get these models running on Trainium. I can imagine. One quick question, just a general question about Trainium, Matt. Trainium, is it fair to think, and this is the way I’m thinking about it, so I want to make sure I have it right. 
Trainium — very unfortunately named, because it’s really going to be about inference going forward — the number one manifestation will be through managed services like a Bedrock, where the customer doesn’t even necessarily know what compute they’re using, is that a fair way to think about it? MG: Number one, I take responsibility for bad naming across all AWS services. Look, I have a word-of-mouth site named Stratechery, so I have all sympathy for bad naming. SA: I think Trainium is a cool word. MG: It is a cool word. It is a cool word, it just feels like it’s an inference chip, not a training chip. MG: It is. But, yeah, naming aside, it is useful for both training and inference. And look, it’s a chip that we’re incredibly excited about, and both in the current generations as well as ongoing, we think that’s going to be a huge business and a real enabler for a lot of the things that we do together. I think just with GPUs, by the way, you’re going to interact with a lot of these accelerator chips through abstractions. So the vast majority of customers don’t interact with GPUs either, except through maybe like in their laptop or something like that, for graphics. But when you’re talking to OpenAI, even if they’re running on GPUs, you’re not talking to the GPUs, if you’re talking to Claude, you’re through GPUs or Trainium or TPUs, you’re not talking to any of those chips, you’re talking to the interface. And the vast majority of inference out there is being done on one of a handful of models. And so whether it’s 5, 10, 20, 100, it’s not millions of people that are programming to those things directly, and that’s gonna be true going forward just because these systems are so complex, they’re very large. If you’re going to go train a model, not that many people have enough money to go train a model, not that many people have the expertise to actually manage it. 
They’re very complicated systems, and the OpenAI team is incredible in their ability to squeeze value out of a very large compute cluster. But not that many people have the team that can do that, independent of what the chip happens to be, and so I think that that’s going to be true for all accelerator chips, honestly. SA: Ben, I increasingly think of what we have to do as a company is to be a token factory. But what the customer cares about is that we can deliver the best unit of intelligence at the lowest price and as much of it as they want, with as much capacity as they want. Do you think we stick with pricing as far as — pricing is based on tokens, does that make sense in the long run? SA: No. And in fact, like there was an interesting example of this with our model that just came out , 5.5. where the per-token cost is much higher than 5.4, but it requires a hugely fewer number of tokens to get the same answer, and you actually don’t care about how many tokens the answer takes, you just want the piece of work done, and you want again a price and an amount of capacity you can have for that. So maybe I was wrong to say “token factory”, but we’re like an intelligence factory or something. We just want as many units of intelligence for the lowest price and whether that is a bigger model running fewer tokens, a smaller model running lots of tokens, whether a GPU or Trainium or something else, whether we do any of the other kind of number of things we could do about that creatively, I don’t think customers care. In fact, they don’t really interact with that. When you go put something into Codex or when you go build a new kind of agent in the SRE [ Stateful Runtime Environment ], you should never have to think about that and you should just be astonished at how much you get for how little cost. Is the reduced token usage is that model, or is that harness? SA: That’s mostly model, it’s a little bit harness. Got it. 
Do you anticipate Matt, by the way, I asked Sam the exclusive question, do you anticipate offering a similar managed service for other models? MG: We’re focused on doing this with OpenAI right now. We’re very excited about what we’re doing together, and the fullness of time is a long time. The fullness of time is a long time, I’ll let you stick with that one. It’s fine, I had to ask the question. I do have a question as far as customers, Sam, to your point, both your input on this, I’m curious — when people are actually in production, where does OpenAI’s responsibility end and AWS’s begin? It sounds to me, if all the data is on AWS and it’s staying there, and they’re operating at a higher level, this is ultimately AWS’s responsibility? Is that the right way — am I thinking about that correctly from a consumer perspective? MG: Yeah, I think that’s right. When you’re going to call somebody, you’ll call AWS support to help you out, and it’s part of your AWS environment and you build it together and your AWS account reps are going to help you there. And we’ll bring in, when we’re building it, we’ll bring in our OpenAI colleagues to help you figure out how to best take advantage of this or whatever. At some point, if we run into a bug that we need their help with, we’ll escalate over to them, but AWS will be that frontline support that you kind of interact with. Where do you see the scale of this business, Sam, relative to your core API business? SA: I hope it’s going to be huge, we’re putting a lot of effort into this, we’re committing to buy a lot of compute, I believe there will be a lot of revenue there to support this. The increasing framework that I’ve had is that at a low enough price, demand for intelligence is essentially uncapped. So is it very elastic in that regard? You decrease price, demand goes up? 
SA: It’s certainly that, but again, you can decrease the price of water and maybe you’ll drink a little more water, maybe you’ll shower twice a day instead of once a day, there’s some elasticity there but at some point you’re like, “You know what, I have enough water”. Also you will buy water no matter how much it costs if you have to. SA: Other utilities, if electricity is cheaper you’ll certainly use more of it, but if you think about intelligence as a utility, there’s no other utility I know of that I’m just like, “I just want more, I’ll just use more as long as the price is low enough, I’ll just use more”. MG: I will say actually and interestingly it’s largely been true of compute power where if you think about the cost of a compute cycle today versus what it was 30 years ago, like I don’t even know how many orders of magnitude cheaper, and there’s more compute being sold today than ever. Right. People don’t really think about the cost of compute at least until they’re at extremely high levels it’s a material level, but by and large strategically speaking it’s just assumed you have compute. What’s the runway to getting there with AI where it’s not the number one thought process, “How much am I spending here?”. SA: I don’t think that is the number one thought process. Right now we have way more customers asking us, “No matter what the price is, can you give me more? I just need more capacity, I’ll pay you extra”, than we have arguing with us about the price. But I do think we are going to continue to bring the price down crazily dramatically, now maybe the more we do that the amount of wealth that wants to flow and just goes up more and more and more.
But I am confident we will continue to be able to reduce the cost of today’s level of intelligence quite dramatically — one thing that has somewhat surprised me is how much, and I don’t know if this is going to stay the case or not, but at least today how much of the total market demand is at the absolute frontier. Right, there’s a lot of questions about that. It’s very expensive to serve the front end, people can just get the previous one, but you’re saying people just want to be on the front end no matter what? SA: So far they do. MG: And I think that’s a good signal that you’re not anywhere close to where we want to be and that there’s so much more demand, and I really do think it’s like if you go 40 years ago to compute demand, a computer was crazy expensive, and now it’s dwarfed by the power that’s in everybody’s cell phone and we sell billions more of those things. I do think that that’s what’s going to happen to the AI world where today you’re pushing, everybody wants to use the frontier because that’s what you need in order to get a lot of useful work, and everyone’s so excited about the capabilities out there. I think over time, you will have a mix of models, by the way, where you will have some smaller models that are able to do stuff that even the latest OpenAI models aren’t able to do yet, but they will be smaller and cheaper and faster over time, and you’ll have the super big ones that are going to go try to cure cancer and other things like that. But I think we’re still at just the early stages of what’s possible and when you see this much demand and this much growth when you’re at the early stages of what’s possible, it’s exciting for what the future holds. Is there a bit of a cynical view here where, Sam, you had a bunch of customers that are like, “We’d love to use OpenAI models, but all our stuff’s in AWS, we’re not moving”.
And Matt, you’re like, “Look, all our stuff’s in AWS, can you please go get OpenAI models?”, and this is just satisfying that need — and it turns out, because AWS is the biggest, that was an astronomical amount of need. Is that just the easiest answer? Or is there a bit here, too, where you actually think you can deliver something highly differentiated that will also draw new customers for each of you? SA: We’re clearly thrilled to get access to AWS customers, and so many people love AWS. Yeah, that is a true statement. MG: That part is definitely true. (laughing) Right. MG: And vice-versa, our customers are very excited to get access to OpenAI technology. SA: But I do think there is something incredible and new to build together, and I am hopeful that when people look back on this in a year, the most important thing people will talk about is not like, “Oh, finally, you can get access to these models via AWS”, or whatever, but it’ll be like, “Wow, we didn’t realize how important this new product was”. I think we are close at a model and harness and capability level to just a completely new kind of computing and that will feel very different than the existing ways people have thought about, “I need an API to this model”, or whatever. MG: I couldn’t agree more, that’s exactly it. The first part is great and is nice and the second part is, I think, what we all get super excited about. To that point, I mentioned I want to come back to this earlier, but I have a theory, which may or may not be correct, I’m curious your guys’ point about this, about stuff to be built. Specifically, there may end up being this real middleware or middle layer of where you have all these different databases and SaaS apps and all these bits and pieces of data in an organization that can stretch across things, you have this agent layer/harness or with the harness, I guess, sitting on top, and there’s something to be built in the middle and OpenAI Frontier gets at this a little bit. 
Is this part of this? Or is this something to be built? Or am I totally off base and we don’t need that at all? SA: You are totally right that we need something there. When I’ve been talking to customers recently, like large enterprises, they’re like, “I want some sort of agent runtime environment, I want a management layer where I can connect my data to agents and also make sure that I understand where I’m spending on tokens and not and have some sort of oversight there, and I want some sort of workspace” — hopefully it’ll be Codex — “something like that for my employees”, and that package of what people are asking for is getting remarkably consistent, but there is work to go off and now go build all that offering. It feels like there’s like almost a double agent layer that’s necessary. There’s like the agent layer to maintain the middle layer that is constantly spelunking down in all these data sources and then there’s the actual user interface layer that is where people are actually interacting with. Does that sort of fit with where we’re going or is that off base? SA: On both of those, I agree that that’s a picture of how the world looks today. As the models get really smart, I don’t think we know exactly what the architecture of the future is going to look like. Right now people do, at this sort of call it user agent layer, want to interact with multiple agents and we make it so that you can build agents for this thing and that thing and they can talk together and whatever else and then at the company management layer, people have all these controls about how you help the AI go spelunk and files in file systems. And at some point you realize that you’re just holding on to the past for no reason at all, this should just be in the model. SA: That’s what I was going to say. At some point, you may say, “Actually, we have such incredible capabilities, let’s re-architect the whole thing”. MG: Yeah, I agree. 
And I think there’s something different, and I’m not sure we all know what it is yet, but that’s part of the beauty also, is you get customers using and building and you can learn from them and figure out how you can make that easier, faster, better for them. Sam, this is the second time we’ve done one of these product launch interviews, last time it was with Kevin Scott and New Bing — you were pretty confident about the threat you posed to Google then, how well do you think that worked out? SA: I think we have done better than I expected. ChatGPT is, I think, the first really large-scale new consumer product since Facebook. Is that actually the answer, you’ve done better than you expected, but it manifested mostly through ChatGPT as opposed to other areas? SA: No, I think we’ve also done quite well on the API, particularly on Codex, but that was not what I was thinking at the time. At the time, I was thinking maybe these new kinds of language interfaces are going to change the way people find information on the Internet and you know — Google, also just absolutely phenomenal company, I think in many ways Google is still underrated just in terms of the breadth and depth of what they do, but I am happy with how ChatGPT has performed relatively. I actually have a Google question for you Matt, in a similar way. Google was just up there this week, Thomas Kurian talking about their fully integrated stack, all the way up and down from model to chip to agent layer, all that sort of thing. You’re here with another company executive, definitionally not fully integrated within Amazon, but is there a bit where everyone was critical of you not having a frontier edge model — now that we’re in this sort of inference area, you’re used to serving a lot of companies. Did you maybe end up in a better spot by being neutral in a way? Was that on purpose or did you accidentally end up in a great place that you didn’t realize it was going to be? MG: A little bit on purpose.
We, since we started AWS, we have always embraced our partners as a key part of us supporting our end customers. Since the very beginning, it’s been an incredibly important part of our strategy is to lean in with partners and maybe different than some others, we view our success is if the partners are successful and they’re building on top of us or together with us, and if they’re successful, then we’re successful, that’s awesome. We view it as that’s growing the pie together, then that’s a win, and it’s not necessarily how others view the world. Sometimes they say, “I have to own everything”, and that’s okay, that’s a view that people have. But I think that choice is important, and that way the best products win. And by the way, you can have first-party products in that world, you can have lots of third-party products in that world, but our view is we want the customers to be able to pick the best thing for them. And if the best thing is your own stuff that you’re building, awesome. For us, if the best thing is what our partners are building, but it’s on top of us, we view that as a win as well, it’s because it’s the best thing for our customers. We’ve long thought that, and it’s actually how we built the Bedrock platform in the AI world. We want to support a broad set of models, we want to support a broad set of capabilities, and it’s true, it’s been true across from databases to compute platforms to other things like that. So I think it’s been an intentional strategy, I think it’s a strategy that customers appreciate because they like that, and we’re excited to continue to lean into it. Yeah, it’s interesting. There’s the balance between software, platform, infrastructure, and everyone says they’ll serve everyone. But it does feel like you go way back when AWS started, it’s like you start with the I [Infrastructure], and that gives you almost – that gives you the greatest flexibility, it feels like, from my perspective, to meet Sam in the middle. 
Sam’s got a great S [Software], you guys are building a P [Platform] together, I guess is the way to put it. MG: That’s right. It does make it hard where you say, “We have one S3”, there’s not other S3 offerings, that part is true. So some of those core components are, like you said, at the infrastructure layer, we do lean in pretty heavily on the stuff that we build. But as you move up that stack, I think there’s a broader set of capabilities and if you view the world that — in no world do I think any one company is going to own every application and as you get further down the stack, when you get to kind of the models and services layer, there’s fewer of those and you get down the infrastructure, there’s even fewer of those and our view is kind of embracing that whole set of partners is great for us end customers. Sam, any final words? SA: I think that was very well put. I really do think there’s a potential at a new generation of the kinds of products that developers can now build and given how steep we expect model capability progress to be over the next year, the fact that we’re going to go on this journey together and try to really build a platform to enable it, is coming at a good time, and I think people are going to love it. Very good. Matt, Sam, thanks for coming on Stratechery. MG: Awesome. Thanks for having us. SA: Thank you. This Daily Update Interview is also available as a podcast. To receive it in your podcast player, visit Stratechery . The Daily Update is intended for a single recipient, but occasional forwarding is totally fine! If you would like to order multiple subscriptions for your team with a group discount (minimum 5), please contact me directly. Thanks for being a supporter, and have a great day! 
Last Friday I conducted the following interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman about Bedrock Managed Agents, powered by OpenAI; naturally, one of my questions was about how this fit in with OpenAI’s deal with Microsoft giving Azure exclusive access to OpenAI models. Late Sunday I heard through the grapevine that Microsoft would announce something Monday morning; I wondered if it might be a preemptive lawsuit! On Monday Microsoft and OpenAI announced they had amended their agreement, allowing OpenAI to serve its products on other cloud providers, including AWS. Microsoft remains OpenAI’s primary cloud partner, and OpenAI products will ship first on Azure, unless Microsoft cannot and chooses not to support the necessary capabilities. OpenAI can now serve all its products to customers across any cloud provider. Microsoft will continue to have a license to OpenAI IP for models and products through 2032. Microsoft’s license will now be non-exclusive. Microsoft will no longer pay a revenue share to OpenAI. Revenue share payments from OpenAI to Microsoft continue through 2030, independent of OpenAI’s technology progress, at the same percentage but subject to a total cap. Microsoft continues to participate directly in OpenAI’s growth as a major shareholder.

Unsung 1 month ago

Abort, Retry, No, Thanks

If there was one go-to example of an impenetrable error message in the 1980s, it must have been this – popping up, for example, if your disk drive was dirty: “Abort, Retry, Ignore?” On some technical level, the options made sense: “Abort” would stop whatever you were doing, “Retry” would try to repeat the action, and “Ignore” would proceed as if there was no error. But in the heat of a moment, or seeing it for the first time, this was a puzzling choice to be asked to make. Not only were the words weighted improperly (the seemingly most innocuous action here, “Ignore,” was actually the only one that could do actual lasting damage); it also wasn’t entirely clear what’s the safe thing to do to get out of the situation. (The redesign of “Abort, Retry, Ignore” was “Abort, Retry, Fail,” and it wasn’t really a huge improvement.) Last night, I installed Google Photos on my iPhone, and the first message that greeted me was this: This is really a matryoshka doll of bad dialog presentation. First: any buttons in a dialog should be labeled with enough information to keep me going. Here, both have generic labels, so now I need to pay attention. Second: Even after reading, I have no idea what is the choice I’m making. I see the pathway marked “yes, keep it the way I had it” and, sure – this would be generally what I want from any given computer on any given Sunday. But what’s the actual alternative? But the third, and most important one, is this: this dialog has no safe escape hatch. By now, in UX design, we established quite a few canonical escape hatches: a Cancel button, a × close box, a “No, thanks” link, a press of an Escape key. But you can’t × this dialog out. The main button seems positive, but it also feels like I’m taking an action with consequences, and I don’t want to deal with that. There is a “No, thanks,” but it doesn’t feel like the other “No, thankses” I have seen – it’s juxtaposed with copy that makes it seem… a dangerous thing to choose. And this last bit makes it a pretty serious design offense, because you are now messing with foundational stuff.
You need to protect those escape hatches for the future; the moment you introduce hesitation into the mix and taint “No, thanks” as a concept, really bad things will start happening all across your product. In real life, fire doors have to open outwards when pushed with body weight, aircraft stick shakers are impossible to ignore, and anti-lock braking systems do smart things even after your brain turns off its smart parts. I know seeing a dialog like this would never happen in a moment of true panic, but sometimes I think of the user in their most absent-minded moment: trying to get their kids to hurry up for school, on hold with an annoying cable provider, with a cat looking like it’s about to jump up directly into a running toaster. A dialog on their phone pops up. If that dialog absolutely has to happen, what is the escape hatch it can offer so they can dismiss it safely if they cannot think about it at all? This Google Photos screen needs a lot more rethinking and rewriting, but in its current incarnation, it desperately needs a clear and trustworthy escape hatch I can tap absentmindedly, just so I can get to my photos. #errors #google #onboarding #writing

Stratechery 1 month ago

An Interview with Google Cloud CEO Thomas Kurian About the Agentic Moment

Good morning, This week’s Stratechery Interview is with Google Cloud CEO Thomas Kurian. Kurian joined Google to lead the company’s cloud division in 2018; prior to that he was President of Product Development at Oracle, where he worked for 22 years. I previously spoke to Kurian in March 2021, April 2024, and April 2025. The occasion for these interviews, at least for the last three years, is Kurian’s annual keynote at Google Cloud Next. You can watch the keynote here, and read the blog about Google’s announcements here. I spoke to Kurian a week ago, on April 15, and at that time only had access to the afore-linked blog post. With regards to the keynote, which I have since watched, I thought it was a powerful opening: Kurian returned to last year’s theme, about a unified architecture, but emphasized that the use cases were no longer theoretical or pilots but running at scale for real users. He also emphasized — in a foreshadowing of a point we discussed below — that Google itself was running on the same infrastructure as Google Cloud. Google CEO Sundar Pichai, meanwhile, talked about Google’s capex investment, and that (1) half of it was going towards Google Cloud, and (2) that Google Cloud was running the same stack as Google itself. I sense a theme! Pichai also emphasized security, a point that Kurian was also careful to raise in our talk, before discussing the shift to agents. To that end, in this interview — which again, was conducted before the keynote — we discuss agents. Specifically, I wanted to get Kurian’s take on the quality of Gemini’s harness (unsurprisingly, he thinks it’s great). Google has an integration advantage, but is it paying off in such a large company? I was also curious about how Google thinks about TPUs specifically and the cloud business generally in terms of balancing its internal needs with external customers like Anthropic.
We also talk about the software ecosystem, why Google still believes in partnerships, and why the company was ready to seize the AI moment (hint: it’s because of Kurian). As a reminder, all Stratechery content, including interviews, is available as a podcast; click the link at the top of this email to add Stratechery to your podcast player. On to the Interview: This interview is lightly edited for clarity. Thomas Kurian , welcome back to Stratechery. I promise I have recording turned on this year — in fact, I have two recordings turned on. TK: Thank you so much, Ben. Good to see you, thanks for taking the time. Well, I look forward to talking to you. It’s good to talk to you for multiple interviews, much better than talking to you multiple times in one interview, so we’re already doing better this year. But like last year, we are recording before your Google Next keynote . We’re actually quite a bit ahead, I think we’re several days ahead, but this podcast won’t be released until after the keynote. Therefore, I’m going to ask the exact same question I asked last year. Specifically, I like watching keynotes, not for the announcements, but for the framing that happens up front. Last year, that framing was infrastructure, [Google CEO] Sundar Pichai actually delivered that at the opening, then you came in and talked about that, and that was the context for everything that you talked about. What is the framing this year? TK: The framing this year is that as AI models have become more sophisticated, we see customers evolving the use of AI models from being used to answer questions in a chatbot-like fashion, to actually automating tasks on their behalf, and to automate process flows within the organization. By automating process flows, you both get efficiency improvements, productivity improvements, frankly, you can also change the way that you introduce new products and services to market, for example. 
In order to do that well, the technology, what you need is a world-class agent platform and to underpin the agent platform, you need world-class infrastructure. You need the way that the agents interact with your company’s data and your business — so you need capabilities to help an agent really understand the company’s business information and context. I think, as you’ve seen in the press, AI and cyber have become very contextual now, there’s a lot of concerns that AI will accelerate the speed of cyber attacks on people’s systems, and so we’re going to be talking about how we’re bringing AI and our cyber technology together to protect, including the integration of Wiz , and then we’re introducing Gemini Enterprise and our agent platform to customers. That’s sort of the theme of what we’re talking about. You mentioned agents last year, everyone was talking about them to a degree, what has really changed from last year to this year that makes this different? I read your whole blog post, it’s very long, and I think the word “agent” may appear in every single paragraph. TK: There’s three or four big things that have changed. The first is capabilities of models — Gemini is able to reason much more effectively as new versions of Gemini have come out. Second, they’re able to maintain long-running memory, which you require if you have an agent that’s automating tasks over many, many steps, it has to maintain a lot of state in memory. Third, their interaction with tools and the rest of the world, there have been good abstractions, skills, tools, MCPs [ Model Context Protocol ], as they’re called, they’re all abstractions for how an agent reasons and interacts with the rest of a company’s systems. 
All of them have advanced, and so the core capabilities of the models themselves have gotten a lot better, the ability to use tools and interact with the rest of the world has become a lot better, the abstractions that the world exposes to the model have improved, and so now you have models that have the capabilities to do these very complex tasks. That all makes sense and certainly tracks. A lot of these announcements, though, as I was going through them, a lot was about the infrastructure around agents, which makes sense — the orchestration, registry, identity, security, all these bits and pieces. All of this is clearly necessary for large enterprises, something they’re going to worry about and ask about. But the agents have to actually work; do Gemini agents actually work? Because there’s a lot of talk, you know, Gemini was the belle of the ball four months ago, but over the last little bit, it’s been mostly a lot about Anthropic and Claude, Codex, a lot of talk about that, and Gemini, not much talk. What’s your feeling about your actual capabilities, not just agents in general? TK: I’ve always said when people ask us about it, I always say, “Let our customers talk about it, rather than we talk about it”, I think you’re going to hear from 500 customers telling their stories at Next. Even people building agents, we have a whole range of them, from Citigroup to Bosch to eBay to Virgin Voyages to Walmart, there’s a whole range of them, Food and Drug Administration, etc., Comcast, Unilever, all of them are going to be talking about specific business problems they had. For example, for Citi, they’ll be talking about a new wealth advisor, Investment Management, where they’re using our agents to research a person’s investment priorities. 
So a person says, “Here’s my priorities for investment, my kids are going to school, I need this kind of cash flow in order to fund it”, and then it researches your financial portfolio and interacts with you to give you recommendations. If you look at Comcast, they’re using us for all of the work that they do for consumer services — this is repair, scheduling appointments, dispatching field technicians, there’s very complex flows that have many, many steps and interact with you and with a lot of complex systems. If you look at some of these flows, they require all of the capabilities I talked about. So as an example, I want the capability to call a set of tools, and those tools may be I want to book an appointment, so I need calendar, I need to look up, if I’m dispatching a technician, I need to look up spare parts so I need to pull up from my inventory that spare parts inventory, I need to schedule that to be available at the same time as the person who’s going out, I need to update my inventory to show that I have taken something out of it. I mean, these are very, very complex steps. What’s interesting about all these complex steps and going through all these bits and pieces, it sounds like you’re saying that almost the more constraints there are, the more things you’re bumping up into, is that actually a better environment for instituting these sort of flows just because what you need to do is clearly defined? TK: Just being perfectly frank, Ben, having constraints requires the model to be even more intelligent. Just as an example, the number of variants in a process flow that’s complicated, with many, many steps, the number of different idiosyncratic situations that you may encounter are large so you cannot a priori program every one of them. You need to teach the model, for example, to be able to spin up a virtual machine and use a tool in the virtual machine to generate code to deal with some of these situations. 
So the most sophisticated thing is where you can give the model a high level set of instructions and have it goal seek an outcome. So you say, “I need to schedule this appointment”, and it turns out there may be 19 different conditions that occur when you’re trying to schedule an appointment and as part of that, you can’t a priori tell the model every single possible condition deterministically. So you need to teach the model, “Okay, the user did not tell you what to do, but the goal was to schedule an appointment, so here is how you generate code to then create a collection of things that can interact with the model and understand what to do”. This is very interesting, you’re walking through this process, this makes a lot of sense. How do you have that conversation with DeepMind? You’re connecting the, “This is the workflow that is needing to happen, these are what we need the model to do, this is where it does well, where it doesn’t”, what’s the working relationship there? TK: We have a harness in which all these flows journeys, for example, as we see them with customers, we put them into the harness and they get into the reinforcement loop for Gemini. How tight is that process? TK: Very tight. We have people sitting next to [DeepMind CEO] Demis’ [Hassabis] team, in fact I just came from a meeting with them, that loop is what allows us — we are in a unique position in the market. We’re unique in three different ways, we’re unique because we have the whole stack of AI technology. In order to do agents well, you need to have a model that takes all these journeys and puts it into the harness that handles the improvement, as we call it, hill climbing, literally every hour of every day, and the complexity of the journeys we see are in some ways much more complicated because in companies, you have many different systems, different conditions, different flows, you may not see that in other domains, like in a pure consumer domain. 
In order to do these well, you also need, for example, models need to spin up compute, models need to now hold on to tokens for longer because they need to hold, for example, a KV cache that holds memory about what’s happening during the transaction flow. Having awesome infrastructure, both classical, what we call classical compute machines, and TPUs gives us real strength there. Third, as you walk through these, one of the things you find is a lot of the systems these models interact with are things like databases, enterprise applications. So understanding the context of these, like for example, “How much inventory do you have?”, defining “What is inventory?”, “What part are you talking about?”, “What part number are you talking about?”, those things require you to have technology that understands the business graph and the dictionary of all the objects and the sources of information in your company. Our strength in data processing gives us some technology that we’re going to be talking about next week around something we call Knowledge Catalog, think of it as your global dictionary for all information within the company, that’s a unique strength. And then obviously you don’t want information that’s critical to your company exposed on the Internet, you don’t want your model to get attacked because now it’s handling very complex process flows, you don’t want it hijacked, and so all the anxiety around cyber, we have very specific tools on, so our differentiation is all these pieces working together. That makes sense, the integration is a big part of your pitch. 
At the same time, you’re also a big, sprawling company and I think there’s maybe a perception, that I maybe hold, that some of the frontier labs are much more focused, they’re much more top-down about, “This is how our harness is going to work, the way it’s going to use tooling”, and all the things you’re talking about having this feedback flow back in sounds great unless there’s so many different takes on the way it should work and then you have your own internal customers as well. How do you balance having a point of view versus getting stuck in the muck? TK: Every product that Google has is on the same Gemini version, on the same day, on the same hour, every one of us is using the same harness. And you feel good that that harness is where it needs to be — it’s not getting pulled in 50 million directions thanks to all your customers and Google’s workloads? TK: Absolutely not, we are very focused on working with Demis and [DeepMind CTO] Koray [Kavukcuoglu] who lead our team to make sure they see the sophistication of these scenarios and we work literally side-by-side, hour-to-hour with them. There’s been a lot of speculation on are we distracted the company… I don’t think you’re distracted, I think it’s more just a matter of it’s a classic big company versus small company bit. Like a startup comes in and you have a very clear point of view and you don’t have all the enterprise stuff, you don’t have all this protecting the data, or permissions and all those structures, and yet that stuff sort of gets pulled along because there’s such demand to use your product that works really well and then over here it’s like, “Hey, we have everything protected and we have all these things around it”, but does the core product actually deliver? TK: The core product is being used by lots of people. The proof of that — we generate 16 billion tokens a minute, up from 10 just last December or January. Well, your financial results certainly showed that as well. 
There’s a bit where you’re doing so well, I have to be a little hard on you here. TK: A lot of people told us we were dead in 2023 — we’re still living. I think you’re doing more than living, you’re doing very well. TK: And so we never say anything negative about anybody else, our results speak for themselves. I always say, let our customers tell the story, they’re doing amazing things with Gemini in companies, enterprise, and they see the value of what we’re delivering for them. You mentioned that everyone in Google is on the same version of Gemini, using the same harness. Does that also apply to all this infrastructure around agents you’re doing, around sort of identity and security? TK: Yeah, in the enterprise, the way that all the infrastructure works is we have configurable mechanisms. Like for example, when you configure an agent, a very simple thing is you want to configure the agent with a different identity from a person, just a very simple example so that you can track, “Who did this transaction? Was it the human or the agent?”, because there’s issues like liability. You may want to revoke permissions for the agent at a certain point in time, you want to allow it to only do certain tasks and not everything that the human does so there are controls you want to put around an individual agent and a collection of things that’s separate from the person. As we bring agents to consumers as part of our Gemini app, very similar concepts want to be exposed, and so the architecture that we use allows us to have those things. The sources of that may be different. In the consumer world, they may use the Google login account, in the enterprise world, they may use a directory to store it, but that’s just an abstraction of our technology to the rest of the world. We’ve been talking a lot about Gemini agents and the whole Gemini platform, but you also have just the broader Google Cloud platform. 
One of your major tenants is a company I was just sort of referring obliquely to, which is Anthropic, they’re doing a lot of inference on TPUs in particular. If Anthropic wins deals at the expense of Gemini, is that still a win? TK: We sell different parts of our stack. One of the things people don’t realize is we monetize many different parts of the stack in different ways. Like Anthropic, there’s a lot of labs that use our stack — in fact, most of the large AI labs use our stack. So if somebody uses TPUs either to train their model or to use it for inference, we’re monetizing that part of the stack, that gives us resources to then fund our R&D and other investments. Some of the labs use our TPU and our Gemini model, others may use our TPU and then buy our cybersecurity protection for their models. So as a platform player, we have to allow our technology to be monetized in as many ways as possible and we don’t see it as a zero sum. Sometimes, though, if you have the SaaS layer and the platform layer and the infrastructure, is there one that is the most important? On one hand, SaaS has the highest margins, it kind of decreases going down. On the other hand, that infrastructure needs to be used, you’re spending a lot of money on it, you want full utilization. How do you think about that in terms of what’s the most important? I know they’re all important, but how do you think about that tradeoff? TK: If we were making TPUs just for ourselves, we would have lower volume than we do as a general purpose TPU supplier, which means there would be times of day that we would not be using those TPUs. Do you follow me? Like if you think how chat systems work, they’re very diurnal in nature, because you ask questions when you’re awake and we have a great search business and we have a great Gemini app business, but there would be a certain diurnality to it during the daytime, there’d be a lot of questions, what about in the evening? 
Because we sell TPUs in the market, we’re able to offer it at spot to the rest of the world because we have such a large business. We’re able to also get manufacturing, better terms with suppliers and other things because we’re a real volume player, and that in turn lowers our cost of goods sold. So there are many more dynamics. The company is very focused on ensuring we win every part of this, not just one part of it. Gemini is obviously a super important initiative for us, and you’ll see the big announcements are around— For sure, it’s almost all Gemini. TK: But I wouldn’t assume that if we do that, the only way to do that is to offer our chips along with our model. We see a strong business offering our chips to many other people and you’ll see all of this is what’s accelerating our differentiation, and you see it in our financial results. Your financials are incredible, your revenues up, margins are up hugely, I’ve been posting that chart of them for a long time, last quarter was amazing. I do have to ask about TPUs, though. You talk about selling our TPU chips, to date that has meant TPU instances on GCP, but now there’s talk about actually selling TPU chips, what’s the status of that? What’s the official word, can I go buy a TPU? TK: I’ll explain a little bit what we see. So let me talk briefly about the announcements we’re making, what the product is being used for, and then how we bring some of it to market. We’re introducing two big new TPUs next week. One is TPU 8t, where “t” stands for training, it’s more optimized for training, think of it as 9,600 TPU chips, a single pod, as we call it, it has three times better performance than the current generation, which is already the leading one in the market. Then there’s 8i, which is “i” for inference, it’s 1,152 chips, three times the SRAM, and it has a new thing called the Collectives Engine, which gives you super efficient calculation performance for inference. 
Now, along with that, we are introducing Nvidia VR200, we’re also introducing more ARM capability for classical compute, because people who use models increasingly need to spin up a VM in order to do tasks, and VMs are something we see interest in. We’re introducing not just new compute families, but also new storage, there are two new storage offerings. There’s one, the fastest Lustre solution in the market, it’s 10 terabits per second, that’s just to give you a sense, it’s like five times number two. We’re also introducing a new thing for ultra low latency — when you do inference, you want super low latency in accessing storage, we call it Rapid Storage, it can give you 15 terabits per second with ultra low latency, like microsecond latency. So why are we introducing all this stuff? TPUs, definitely a big market is the AI labs, but we’re seeing interest from new segments of the market. So a big new segment is financial services and when I say financial services, capital markets, and the reason is that today, if you’re a trading firm, a capital markets firm, you spend a lot of time running algorithmic trading and algorithmic trading is running numerical algorithms on traditional Intel type cores, x86 cores. Now what they find is that models can do inferencing and the inference performance is actually better than traditional numerical computing. So that’s one new segment, the second segment is high performance compute. We see a ton of people wanting to do energy modeling, computational fluid dynamics, solid state, there’s a whole bunch of parameters there too. What’s interesting about those is, you will see at our event, Citadel Securities for example, talk in the keynote about how they’re using TPU. Citadel, as you know, is a large capital markets firm. Department of Energy, they have a mission called Genesis, which is the new national lab mission on changing the energy infrastructure for the United States. 
There’s the largest utility in Brazil, Axia, all of them are examples of people who are part of just the keynote talking about how they use TPUs. When we look at that, there’s a couple of different things we see. Capital markets firms say, “Hey, if we’re going to replace our algorithmic trading solution, you have to bring TPU to where the venue is”. Right, because they care about the latency of going to a data center, that’s why they’re all in New Jersey. TK: Secondly, if you’re a national lab, you have so much data you’ve collected over the last X number of years with your experiments — saying you have to bring all that data to the cloud to reason on it doesn’t make sense, so you will see us putting TPU in other people’s venues, and when we do that, we’re introducing new ways of people also procuring it. When I say procuring it, you buy it as a system, you don’t have to buy it just as a cloud source. How does this new way of selling, which is almost like a third way, so you have in Google’s data centers, you have bringing TPUs to customers, but then you have a deal like last week where between Anthropic and Broadcom and Google, this is going in their data centers. There’s these sort of renegade data centers that have access to power, maybe they were doing Bitcoin or whatever it might be, there’s been a big push to get TPUs into those. Where does that fit into this? TK: I would not assume everything you read in the press is true. Well, the Anthropic announcement was definitely a big announcement. TK: Just to be honest with you, we have a flavor that runs in the cloud and a flavor that runs in a third-party data center. The technology, the machines are identical. My question here is, where is that coming from? Is that part of your TSMC allocation? Is that Broadcom’s? Because no one can get enough compute, so ultimately that goes all the way back to the root. TK: The chips are all part of our global — TPU is a Google chip, as you know. 
So it’s part of global allocation, Broadcom is a partner who manufactures the TPUs with us and so it’s just part of the overall business. The new thing we’re talking about is just that you can run TPU in other venues. Makes sense. Will we ever have enough compute? Last year you said, “I think we’re going to resolve it shortly”, it doesn’t seem very resolved, what’s the status there? TK: We’ve worked super hard as an organization, our team that’s done our compute infrastructure, our global data centers, machines, all that, they’ve done an amazing job, there’s always a shortage, there’s never enough. But it doesn’t mean that we’re not — we would not be growing at the rate we are if we didn’t have enough compute. And so there’s more that we want, but there’s also the reality of our teams have done an amazing job, and our customers who are using it will tell you they’re seeing the benefits of the hard work our teams have done. There’s potential customers in the market, maybe current customers, who may be willing to pay basically any price for compute at this point. How do you think about the short term, “Wow we can actually just make a lot of money right now”, versus, “We need to invest in our products” — you had Microsoft, who I’m not going to ask you to comment on, but last quarter they’re like, “Yeah, we allocated less to Azure because we had our own internal workloads”. These are real trade-offs that you need to think about, how do you think about that in terms of GCP? TK: We run a balanced portfolio, we want to grow different parts of our business, we sit down as an executive team and also with Sundar and work through how we’re going to balance the different parts of our portfolio. We see, broad brush, three to four buckets of things. 
One bucket of things is where we want to grow Gemini as a business, our core Gemini business is doing super well, 16 billion tokens a minute, up 40% since last quarter, even this product called Gemini Enterprise, which is our core agent platform, has grown 40% sequentially quarter-over-quarter. So that part of the business, we’re committed to making it super successful, it’s a priority for us. Second segment of the business is where Gemini is being used inside of some of our core products, so I’ll give you an example. We’ve introduced Gemini inside our threat intelligence tools. Why is that? Because we have real expertise at Google scanning the dark web to identify threats, the problem is there’s so many of them, an average organization doesn’t know which of those many threats apply to them. So we use Gemini to process and prioritize which threats might affect you, it’s 98% accurate and has processed 3.9 million threats in the last year, so that’s an example of Gemini being used as an embedded capability. Right. The whole SaaS, PaaS, IaaS — the SaaS bit is still important. TK: There’s that capability, there’s people who want to use Gemini to reason on data in our analytics infrastructure so there’s a second big set where Gemini is an embedded capability and that in turn depends on chips and TPUs and GPUs. And the third one is offering our compute platform to people. We balance across those because we want all of them to be successful. By bringing hardware or our machines to other people’s venues, we’re broadening our TAM, total addressable market, and in that part of the business we also see a different cash flow model than if you were putting CapEx, so there’s a lot of different parameters we have to balance. All those ones you listed for you to make trade-offs on, but then you also have to get in a meeting with Sundar and the other leaders of Google to make trade-offs with DeepMind and their R&D and with the consumer products. What are those meetings like? 
TK: We have a regular set of cadence of meetings and we balance the different priorities and we want to be successful on many different dimensions. I wouldn’t assume all of these dimensions are zero sum. Like, for example, when we offer our product in other venues, we drive cash flow in a different way than putting CapEx — so to some extent, that changes the boundary of how we offer our capital boundary as a company also. So I think there’s a general view of there’s a compute shortage, and if you give one, you will have to take from another, I think that’s an overly simplistic view of it, having been in this for long enough and having been, my team does both parts. We are responsible for delivering all the infrastructure for Alphabet, and they’ve done an amazing job doing that, and I’m also responsible for running the cloud business, and you can tell that our differentiation, I come back to this, it would be a different problem if you didn’t have demand. You can, and whenever I ask us to prove that you’ve got demand, I always say, “Look at our results”. Well that’s been the biggest change even since January where there was still some sort of latent skepticism about, “Is all this CapEx worth it?”, feels like those questions have been completely erased at this point. Speaking of markets in the last couple months, all these SaaS companies are getting killed in the market, you have a big SaaS business, you’re definitely not getting killed in the market, why are you escaping it? TK: I think we have transitioned. The core fundamentals is finding, and this is the way we approach our product portfolio, I’ll give you a very simple example — 2023, we said, “Hey, at 2022, we said, we’re not just going to build a secure cloud, we’re also going to start offering cybersecurity products”. When we entered the market and then we looked at what other things people — the value of cyber is driven by two dimensions. 
Dimension one, “What is it protecting?”, because it has to protect high value things, and the other element is, “How good is it at protecting?”, “What’s the technology that it’s going to use to protect?”. So we said, “There are only two valuable places to protect, there’s either the endpoint”, which is your desktop on which apps run, other people are doing a good job there, the rest of the world is moving all their applications and data to the cloud, let’s protect that. Second, we said AI is going to find vulnerabilities because at the end of the day, finding vulnerabilities is a question of a model really understanding code, and if you can find vulnerabilities at a much more accelerated rate, people need to fix vulnerabilities at an incredibly aggressive, fast rate, and so we started a set of work back then and we said to ensure that we have the leading product portfolio, let’s acquire Wiz. We’re now working on, you’ll see a number of announcements, there’s the Threat Intelligence Agent that allows us to, you know, understand the threat landscape and use Gemini to prioritize what you should pay attention to, where a lot of people are using Gemini to actually scan their code, and then we’re introducing three new Gemini-powered agents with Wiz, one called Red Agent — think of it as continuous red-teaming of your infrastructure, a Blue Agent that says, “Okay, I looked at what’s happening with the Red team and I know what you need to go fix”, and a Green Agent that says, “I’ll fix it for you”, and that’s going to cut the cycle time. Like our Threat Intelligence Agent, you will see reference customers from Chicago Mercantile Exchange, there’s a whole bunch of them talking next week, about how it takes an investigation that takes 30 minutes and does it in 30 seconds, that allows you to get response. 
Now, this is an example of when we started, people said, “Why would a hyperscaler want to become a cyber company?”, and we were like, “It’s not about being a hyperscaler, it’s about solving that problem at the intersection of — AI is going to accelerate cyber threats and you cannot do repair the old way”. Yep, it really answers the question that people had when you acquired Wiz, which is, “Why do you need to buy it, why can’t you just build it?”. It’s like, “Well, in two years, it’s going to be too late”. That’s, I think, also felt very tangibly right now. TK: Today, we are where we are because we made that bet. So when people ask, “Why are you guys growing even in sectors that may be struggling?”, it’s because we have differentiation and we made those decisions early. That makes sense. One of the interesting product announcements this year is this cross-cloud lakehouse which lets customers leave their data in AWS and Azure while still being query-able by your services instantly. Is this the final admission that even if enterprises love your AI and love Gemini, they’re not going to shift all their workloads if they’re already on other clouds? Lots of your products have been about that in the past — even Wiz is about that to a certain extent — but is that just the reality? There’s not going to be a huge amount of spillover as far as pulling things from other clouds to Google. TK: If you use BigQuery today, you don’t have to move your transactional applications to BigQuery. If you’re using Gemini today, you can keep your applications in another cloud and use Gemini to reason on it. The problem we were trying to solve is a very specific problem. Today, when people talk about lakehouses, they say, “We have a multi-cloud lakehouse”. What they really mean is their lakehouse can be run on any cloud, but when it’s running on a particular cloud, you can only access the data in that cloud. 
And then people say, “That’s crazy, because I’ve got data in a SaaS app like Salesforce”, “I’ve got data in an ERP system”, “I’ve got data in Azure and Amazon, and I’d like to use analysis across all this”, one choice to customers is copy all that data out, that’s expensive for them because of the egress tax that everybody imposes. So we said, “Keep your data there, we can still give you world-class analysis”, and so it’s solving that custody. The customer has a problem, they want to do analysis, there are four things we’re giving them. Keep your data where it is, no matter how many clouds. We’re not talking about a single cloud lakehouse, we’re talking about across all the clouds and across all your SaaS apps, we can do analysis, one. Two, people said, “How fast can you run?”, the proof that we’re going to show is we’re 2x better in price performance than the market leader, right out of the gate. The third one, people said, “I’m not an expert on writing Python and Spark, can you give me essentially vibe coding for Python and Spark?” — yes, you’ll see us introduce an agent manager to generate Python and Spark code using Gemini. And then the last one people said today, Ben, if you ask a question, I was using that example of field service, I’m running a query on, “How much inventory do I have in parts?”, before I send the technician — that information sits inside an application in a set of tables in a database, most organizations have thousands of databases, teaching the model which system has what information, and the notion of part is split across 10 different tables in this particular database, you need a system that builds that semantic graph of all the information in your company. Right, this is the Knowledge Catalog. TK: That’s the catalog, and that gives you super good accuracy when you’re researching information. So we put all this together and back to, we’ve always been super pragmatic. 
I always say enterprises have certain problems that they see independent of a cloud. For example, security — they don’t want to buy three different security tools from three different hyperscalers. Analytics — they don’t want to buy three different analytic tools from three different hyperscalers. Others have chosen to say, “My stuff only works with my cloud”, that’s why enterprises often choose us, because we work across all the clouds and all the security environments you have and you can keep stuff wherever you are and use Gemini to access and automate stuff for you, so all that is just part of listening to customers. This all makes perfect sense, particularly this bit about the Knowledge Catalog definitely fits how I’ve been thinking. I wrote about this a few years ago about this importance of this whole layer and understanding it, it’s a bit of a big lift to get this in place. You have some sort of analog, say, with like a Palantir that’s putting in like their ontology thing . They have FDEs out on the site, multi-month projects doing this. You have OpenAI talking about Frontier , their agent layer, and they’re partnering with all the tech consultancies to build this out. Is this going to entail a lot of boots on the ground to get this graph working and functional in a way that your agents can operate effectively across it? TK: We’re not competing with Palantir, we’re not building a semantic dictionary or an ontology. What we’re doing is, today I’ll give you the closest analogy. TK: Today when you use a model, let’s say you use Gemini, and you ask a question, Gemini goes through reasoning, and then it shows you a citation. 
A citation is, “How did I answer the question and what’s the source I derived from?” Now imagine that citation was a query that needed to go to a folder in, for example, a storage system because there’s some documents there and a database because, for example, in a part number, just think about there’s a part number document that lists all the part numbers and sits in a drive and then that part number you need to fetch out to say it’s the modem that the guy is coming to repair, and that’s mapped to a table in a database. So what the graph does, we use Gemini, so we don’t need humans, we use Gemini to say, “Hey, go and read all these documents in these drives and extract the information from it and then match that to the database table that has the reference to the part number”, and so then when Gemini turns around and says, “I got this query about how much inventory of modems there are”, the first thing it does is it says, “Okay, go to the Knowledge Catalog and it says modem is part number one, two, three, four, five”, and then it says, “By the way the table in the database that has the inventory information about this part number is this table, here’s a SQL”, it then makes the quality of what we generate higher and then when it answers the question it shows back, back to your “Trust my data”, it shows a grounding citation saying, “That’s where we got it from.”
What do you need from everyone in the ecosystem if this is going to work, all these SaaS applications and across all these entities, not just what’s in your databases, but what’s in a SAP database or whatever it might be? How do you get them on board so you can understand their data and build this Knowledge Catalog?
TK: Really easy. The first thing is, to use the lakehouse, we support a standard format, the industry is very standardized on it, it’s called Iceberg, so anybody who supports Iceberg we can talk to it and so that’s pretty much the whole world right now, so we don’t need them to do anything special to make it work. Second, all of these business systems have API specifications, and our Catalog can learn off of those API specifications, we just teach Gemini to process those, and so we can build a catalog pretty quickly.
There are reports that OpenAI on Amazon Bedrock has been massively popular. Are we going to get OpenAI on Vertex?
TK: We would love to have them. We are announcing a variety of third-party models on Vertex, including Anthropic, including open source, we’re open to any model provider on Vertex.
I believe you. That’s going to be great, when and if it happens. Just one last question. We’ve talked in this interview series previously about how I think, and this is before your time, it’s not your fault, that Google Cloud missed the boat in terms of being a point of integration for the Silicon Valley enterprise ecosystem. I think last year I asked you if AI represented a new opportunity to do that. However, is there a bit where the models, and you’re in this game because you have one of the leading models, is just going to eat everything and is going to gradually expand to do the jobs and everyone else is just going to be a system of record? It’s going to be all one interface, that the integration, such that it is, is all under the surface, it’s not necessarily tying things together in user space. Is Gemini going to be all the user needs in the long run?
TK: We don’t see it that way.
In fact, one announcement you’ll see us make next week is how many third-party SaaS and ISV [independent software vendors] vendors are embedding Gemini not just as a model, but as an agent platform, because they want to build agents and our agent platform, you can use to build agents, not just our own agents, but they can use it and there’s a lot of independent software vendors embedding those agents.
And do they see you as like, “Hey, you’re another established guy, let’s go with you because we don’t know what these other folks are up to, they want to eat all of us”?
TK: It’s also the capabilities. The differentiation, I would say, is just think about you’re a bank or an insurance company, and think about you’re a SaaS vendor selling to them or an independent software vendor, there’s a number of things around identity, policy management. For example, if you’re a bank and you have documentation about a person and their credit, you cannot have that egress the bank’s boundary, so we have a gateway that protects against that, that’s part of our agent platform. You want to have auditability on the agent to say which agent did what task on what system when, that’s built into the platform. You want to have a registry where you expose all your skills so that people are not duplicate building all these things, we have a registry that does that.
This is sort of the bit we started with at the beginning, it’s not just going to benefit your agents it’s going to benefit all agents, that’s sort of the pitch.
TK: So one of the things that people like is the fact that we built all that plumbing for them, and so they don’t have to invest in it, they can focus on the value add that they have on their agent side.
Additionally, for companies in this broader ecosystem, the cost of agents (and it becomes part of their bill of materials, if you will, the cost of goods sold), the fact that we have these super efficient chips that run inference with such efficiency eventually translates into cost efficiency for a third party that’s building on top of us. You can see that all of those benefits, we’re taking away all that complexity for these guys, so we definitely don’t see that all the ecosystem is going to die, we definitely don’t see that, we see us facilitating that ecosystem. You’ll see us announcing a number of things, including a substantial investment in dollars to accelerate the partner ecosystem around our platform.
Thomas Kurian, great to talk to you again.
TK: Thanks so much, Ben. And just in closing, the work that we announce every year at Next is a testament to all those customers and partners who gave us a shot to work with them. You’ll see them telling their story, and it’s a testament to all those people at our organization that made a bet to solve a technical problem a different way, or to bring our technology. We’ve hugely expanded our go-to-market organization, and doing all that with growing top line and operating income at the same time is a testament to the demand we see for our products and services. I mean, six, seven years ago, people used to tell us, “You have no shot in the market”, I think we are now truly uniquely positioned. Name one other player that has the stack of technology to do AI, when I look forward, I think there’s no question in people’s minds that the central problem that companies need to solve and technology providers need to solve is how good is the capability you offer for AI. We’re the only ones with chips, models, the context to feed the models from all of the data infrastructure, the cyber tools, and then a world-class agent platform.
I would also add, you’re actually an enterprise company now.
The things you talked about, pragmatism, listening to customers, all these pieces, GCP did not have at all a decade ago. There’s a bit where Wiz was ahead of its time, for sure, being forward-looking, but there’s a bit where the organization is ready for this moment in a way I don’t think it would have been previously. I find it very impressive.
TK: We are very proud of the team. Also for Alphabet, to do AI well, you have to do a couple of things. One, see the breadth of problems that we see, we see all of the consumer problems, we see the enterprise problems, we see the problems that search sees, we see the problems that YouTube needs, we see all those that we’re solving with AI, that gives us a breadth of capability that the model needs to solve, that over time is a real strength because the diversity of problems we’re solving. Second, in order to do AI well, you have to invest, and in order to invest, you need to monetize in as many different ways as possible. I think we are very confident that our team, we do not have any hubris, but we are confident in where we stand.
I think it’s very impressive. I look forward to your keynote.
TK: Thanks so much Ben, it’s a privilege to talk to you every year and it’s great that you took the time to speak with me.
And it’s all recorded, I can promise you that!

Ahmad Alfy 1 month ago

Stop Hardcoding Your Timeouts

A developer rant about tools built for one kind of internet
Recently, I’ve been losing my mind to hardcoded timeouts: silent, arbitrary, unconfigurable time limits baked into tools by developers who apparently have never had to wait more than 200ms for anything in their lives. Let me tell you about my week.
Now that coding agents are everywhere, everyone is using skills. The popular way to add them is through packages developed by vercel-labs, and the go-to collection is awesome-copilot, a curated set of skills sitting at 30K+ stars at the time of writing. Except I can’t use it. The repository is too big, and the installer just chokes and dies. There has been an open issue about this since February (#278 on the vercel-labs/skills repo), and no one has responded. I’d be happy to send a PR and fix it myself. I just need someone to acknowledge it exists. Is there a configuration option? A flag? An environment variable? No, there is nothing.
The workaround I found? Clone the repo manually first, then install from the local copy. It works, mostly. Except now points to a path on my machine. My colleagues cannot use it. I also have to update my copy every time I want to update my skills. One workaround creates a lot of other problems.
Then came Docker Gordon, the AI-powered debugging assistant baked into Docker. Useful concept. I was stepping through a container build issue, the kind that requires iteration: tweak, rebuild, inspect, repeat. I had never used Gordon, but when the error manifested itself, it came with a suggestion to try Gordon, and so I did. Except Gordon has a hard limit: if your container doesn’t finish building within two minutes, it gives up. The session dies. You start over.
A two-minute build might sound like plenty if you’re in a fast environment with warm caches and pulled base images. But if you’re pulling a fresh base image over a slower connection? Debugging a multi-stage build with several heavy layers? Forget it. Gordon has already moved on. There is no way to configure this. No env var. No flag. Nothing. The tool just assumes that two minutes is forever, and if you need more, that’s your problem.
Developers often work on fast machines, in offices or homes with gigabit connections, in cities with world-class infrastructure. They build tools with timeout defaults that reflect their own experience. And then they ship those tools to the whole world, with no knobs to turn.
The thing is, timeouts need to exist. Infinite waits are bad. Hanging processes are bad. I’m not arguing against timeouts. I’m arguing against unconfigurable timeouts. Against the implicit message that says: if you can’t do this in 60 seconds, your environment is wrong, not my assumption. A timeout should be:
- A safe default for the common case
- Clearly documented so users know it exists
- Overridable via a flag, an environment variable, a config file, something
This isn’t hard. It’s respect for your users.
I’m writing this from Cairo. My internet is decent, better than many places in the world. But it’s not 1 Gbps symmetric fiber. It’s not co-located next to an npm registry mirror. A of a large repo takes time. Pulling a Docker image takes time. These are not failures. They are physics. When your tool dies silently after 60 seconds without any way to change that limit, you haven’t built a tool for the world. You’ve built a tool for your office.
And this matters more than most developers acknowledge. The global developer community isn’t located only in San Francisco or Amsterdam or London. It’s in Lagos, in Karachi, in Cairo. It’s people on 4G connections, on shared broadband, on connections that have real latency because the nearest CDN edge is 50ms away instead of 5. When you assume a fast connection, you’re not making a neutral technical decision. You’re making a statement about whose experience matters.
I don’t think anyone is doing this maliciously. I think it’s a blind spot. Your internet is fast, so a 60-second timeout feels generous. Your machines are powerful, so a 2-minute build window seems like plenty. But please: before you ship a timeout, ask yourself:
- What if the user is on a slower connection?
- What if their repo is larger than mine?
- What if they’re debugging something slow, and that’s the whole point?
And then add a config option. One environment variable. One flag. That’s all it takes to go from “this tool doesn’t work for me” to “this tool works for me.”
As Bruce Lawson once said: it’s the World Wide Web, not the Wealthy Western Web. The web and the tools we build on top of it are for everyone. Let’s start acting like it.
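To make the three properties of a good timeout concrete (a safe default, documented, overridable), here is a minimal Python sketch of how a command-line tool might resolve its timeout. Everything named here is hypothetical and chosen only for illustration: the tool name `mytool`, the `--timeout` flag, the `MYTOOL_TIMEOUT` environment variable, and the 120-second default are assumptions, not the behavior of any tool mentioned above.

```python
import argparse
import os

# Hypothetical names for illustration; not taken from any real tool.
DEFAULT_TIMEOUT_SECONDS = 120.0  # a safe default for the common case
TIMEOUT_ENV_VAR = "MYTOOL_TIMEOUT"


def _positive_seconds(raw, source):
    """Parse a timeout value, failing loudly instead of silently."""
    try:
        value = float(raw)
    except ValueError:
        raise SystemExit(f"{source}: expected a number of seconds, got {raw!r}")
    if value <= 0:
        raise SystemExit(f"{source}: timeout must be positive, got {raw!r}")
    return value


def resolve_timeout(argv=None, environ=None):
    """Pick the timeout: --timeout flag, then environment variable, then default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="mytool")
    parser.add_argument(
        "--timeout",
        metavar="SECONDS",
        # The help text is the documentation: it names the default and the override.
        help=(
            f"seconds to wait before giving up (default: "
            f"{DEFAULT_TIMEOUT_SECONDS:g}; can also be set via {TIMEOUT_ENV_VAR})"
        ),
    )
    args = parser.parse_args(argv)
    if args.timeout is not None:
        return _positive_seconds(args.timeout, "--timeout")
    if environ.get(TIMEOUT_ENV_VAR):
        return _positive_seconds(environ[TIMEOUT_ENV_VAR], TIMEOUT_ENV_VAR)
    return DEFAULT_TIMEOUT_SECONDS


def timeout_message(seconds):
    """Error text that tells the user how to raise the limit."""
    return (
        f"Gave up after {seconds:g}s. If your connection or repository needs "
        f"more time, pass --timeout SECONDS or set {TIMEOUT_ENV_VAR}."
    )
```

Under these assumptions, someone on a slower connection could run `mytool --timeout 900` once, or set `MYTOOL_TIMEOUT=900` in their shell profile, and the failure message itself points to both options instead of dying silently.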
