Posts in AI (20 found)
Rik Huijzer Yesterday

Google Shenanigans

For years it has been a common theme among programmers that Google's search results have changed for the worse. It feels like the suggestions are becoming less and less applicable over time. Today, I spotted one of the worst cases that I have seen so far when searching for a documentary called "Flatten the Curve Flat Earth" (2022). This documentary is about flattening the curve of the earth and has nothing to do with medicine. However, Google automatically interprets it as a kind of medical statement: ![google.png](/files/af5746d53a67434a) Notice especially the second search result where "m...

0 views
Kix Panganiban 4 days ago

First make it fast, then make it smart

In the speed vs. intelligence spectrum, I've always figured the smarter AI models would outperform faster ones in all but the most niche cases. On paper and in benchmarks, that often holds true. But I've learned -- at least for me personally -- that faster models bring way more utility to the table.

Man, I hate the word "agentic." I find it to be trendy and pretentious, but "AI-assisted coding" is a mouthful with too many syllables, so I'll begrudgingly stick with "agentic" coding for now.

When it comes to agentic coding, the instinct is to pick a smart model that can make clever changes to your code. You know, the kind that thinks out loud, noodles on an idea for a while, then writes code and stares at it for a bit. On the surface, that seems like the way to go -- but for me, it's often a productivity killer.

I've got ADHD, and even on meds, my attention span is flaky at best. Waiting for a model to "think" while it performs a task makes me lose focus on what it's even doing. (This is also why I don't believe in running agents in parallel, or in huge plan-think-execute loops like what Google Antigravity does.) My brain wanders off, and by the time the model's done, I've got to context-switch back to review its work. Then I ask for changes, wait again, and repeat the cycle. It's slow, painfully boring, and doesn't even guarantee I'll get what I want. Those little dead-air moments are what ruin it for me.

I took a break from agentic coding for a while and went back to writing code by hand. No waiting, no boredom, and I always got exactly what I needed (usually). But then I realized something -- not all coding tasks need deep thought. My buddy Alex likes to call them "leaf node edits": small, trivial changes that are more mechanical than cerebral. Think splitting functions, renaming stuff (when doesn't cut it), or writing HTML and Markdown. These are perfect to delegate to AI because failing at tasks like these is rarely consequential, and mistakes are easier to spot.
I think the real trick here is to not rely on AI to think or make architectural or design decisions. Do the planning and heavy lifting yourself; just use the tool for fast, broad-sweeping autocomplete. It's less like hiring a programmer and more like extending your typing speed.

I once wrote about how Dumb Cursor is the best Cursor and that Cursor peaked with its first Composer release (I don't know why they chose to give two distinctly different features the same name, but who knows anything these days). I take it all back. Composer is back with a vengeance, and it's fast. Aggressively fine-tuned for parallel tool calling, it flies through making changes -- even if it's not that smart. It makes silly mistakes and sometimes spits out vibe-codey slop (think: too many inline comments, ignoring , or overusing where it's not needed) -- but because it's so dang quick, it's a joy to use. For those leaf node edits, speed beats smarts every time.

I've also tinkered with other fast models like Gemini Flash. It's cheap and decently smart, but it's just too unreliable for me. Google's API endpoints randomly conk out, I've found that it struggles with tool calling, and it'll hallucinate if you stuff too much into its context (which Google touts as obnoxiously large). I'm sure there are workarounds -- but I don't want to fuss with it. My goal with agentic coding is low-friction help, not a side project to debug the tool itself.

Then there are superfast inference providers like Cerebras, Sambanova, and Groq. They let you run open-weight, smart models (think Qwen or Kimi) at lightning speed. If I weren't already using Cursor, I'd probably go back to Roo or Crush with these. I just don't want to be managing multiple providers, API keys, and strict rate limits -- it feels like a hassle, kinda defeating the purpose of a fast model.

At the end of the day, my brain craves tools that keep up with me, not ones that make me wait.
Faster models might not be the smartest, but for leaf node edits and mechanical tasks, I find them to be much more palatable. I'd rather iterate quickly and fix small goofs than sit through a slow model's deep thoughts. Turns out, speed isn't just a feature for me -- it's a necessity.

0 views
yakko 4 days ago

So you wanna build a local RAG?

0 views

Tech predictions for 2026 and beyond

We’ve caught glimpses of a future that values autonomy, empathy, and individual expertise. Where interdisciplinary cooperation influences discovery and creation at an unrelenting pace. In the coming year, we will begin the transition into a new era of AI in the human loop, not the other way around. This cycle will create massive opportunities to solve problems that truly matter.

0 views
DHH 5 days ago

Local LLMs are how nerds now justify a big computer they don't need

It's pretty incredible that we're able to run all these awesome AI models on our own hardware now. From downscaled versions of DeepSeek to gpt-oss-20b, there are many options for many types of computers. But let's get real here: they're all vastly behind the frontier models available for rent, and thus for most developers a curiosity at best.

This doesn't take anything away from the technical accomplishment. It doesn't take anything away from the fact that small models are improving, and that maybe one day they'll indeed be good enough for developers to rely on them in their daily work. But that day is not today. Thus, I find it spurious to hear developers evaluate their next computer on the prospect of how well it's capable of running local models. Because they all suck! Whether one sucks a little less than the other doesn't really matter. And as soon as you discover this, you'll be back to using the rented models for the vast majority of the work you're doing.

This is actually great news! It means you really don't need a 128GB VRAM computer on your desk. Which should come as a relief now that RAM prices are skyrocketing, exactly because of AI's insatiable demand for more resources. Most developers these days can get by with very little, especially if they're running Linux.

So as an experiment, I've parked my lovely $2,000 Framework Desktop for a while. It's an incredible machine, but in the day-to-day, I've actually found I barely notice the difference compared to a $500 mini PC from Beelink (or Minisforum). I bet you need way less than you think, too.

1 view
Simon Willison 6 days ago

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

Anthropic released Claude Opus 4.5 this morning, which they call the "best model in the world for coding, agents, and computer use". This is their attempt to retake the crown for best coding model after significant challenges from OpenAI's GPT-5.1-Codex-Max and Google's Gemini 3, both released within the past week!

The core characteristics of Opus 4.5 are a 200,000 token context (same as Sonnet), a 64,000 token output limit (also the same as Sonnet), and a March 2025 "reliable knowledge cutoff" (Sonnet 4.5 is January, Haiku 4.5 is February). The pricing is a big relief: $5/million tokens for input and $25/million for output. This is a lot cheaper than the previous Opus at $15/$75 and keeps it a little more competitive with the GPT-5.1 family ($1.25/$10) and Gemini 3 Pro ($2/$12, or $4/$18 for >200,000 tokens). For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $1/$5. The Key improvements in Opus 4.5 over Opus 4.1 document has a few more interesting details; I've noted some of them at the end of this post.

I had access to a preview of Anthropic's new model over the weekend. I spent a bunch of time with it in Claude Code, resulting in a new alpha release of sqlite-utils that included several large-scale refactorings - Opus 4.5 was responsible for most of the work across 20 commits, 39 files changed, 2,022 additions and 1,173 deletions in a two day period. Here's the Claude Code transcript where I had it help implement one of the more complicated new features.

It's clearly an excellent new model, but I did run into a catch. My preview expired at 8pm on Sunday, when I still had a few remaining issues in the milestone for the alpha. I switched back to Claude Sonnet 4.5 and... kept on working at the same pace I'd been achieving with the new model. With hindsight, production coding like this is a less effective way of evaluating the strengths of a new model than I had expected.
I'm not saying the new model isn't an improvement on Sonnet 4.5 - but I can't say with confidence that the challenges I posed it surfaced a meaningful difference in capabilities between the two.

This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors. Google's Nano Banana Pro image generation model was notable in that its ability to render usable infographics really does represent a task at which previous models had been laughably incapable. The frontier LLMs are a lot harder to differentiate between. Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in the real-world problems I need to solve on a daily basis?

And honestly, this is mainly on me. I've fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these, but they've fallen one-by-one and now I'm embarrassingly lacking in suitable challenges to help evaluate new models. I frequently advise people to stash away tasks that models fail at in their notes so they can try them against newer models later on - a tip I picked up from Ethan Mollick. I need to double down on that advice myself!

I'd love to see AI labs like Anthropic help address this challenge directly: new model releases accompanied by concrete examples of tasks they can solve that the previous generation of models from the same provider were unable to handle.
"Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5" would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.

In the meantime, I'm just gonna have to keep on getting them to draw pelicans riding bicycles. Here's Opus 4.5 (on its default "high" effort level): It did significantly better on the new, more detailed prompt: Here's that same complex prompt against Gemini 3 Pro and against GPT-5.1-Codex-Max-xhigh.

From the safety section of Anthropic's announcement post:

With Opus 4.5, we’ve made substantial progress in robustness against prompt injection attacks, which smuggle in deceptive instructions to fool the model into harmful behavior. Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry:

On the one hand this looks great: it's a clear improvement over previous models and the competition. What does the chart actually tell us, though? It tells us that single attempts at prompt injection still work 1/20 times, and if an attacker can try ten different attacks, that success rate goes up to 1/3! I still don't think training models not to fall for prompt injection is the way forward here. We continue to need to design our applications under the assumption that a suitably motivated attacker will be able to find a way to trick the models.

A few more interesting details:

- Opus 4.5 has a new effort parameter, which defaults to high but can be set to medium or low for faster responses.
- The model supports enhanced computer use - specifically a tool you can provide to Opus 4.5 that lets it request a zoomed-in region of the screen to inspect.
- "Thinking blocks from previous assistant turns are preserved in model context by default" - apparently previous Anthropic models discarded those.
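For a sense of how the prompt-injection numbers above compound: if each attempt succeeds independently with probability p, the chance that at least one of k attempts lands is 1 − (1 − p)^k. A quick sketch, treating the chart's roughly 1-in-20 single-attempt rate as an approximate input (the 5% figure is eyeballed, not an official number):

```python
def at_least_one_success(p, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

p = 0.05  # ~1/20 single-attempt success rate, read off the chart
for k in (1, 10, 100):
    print(f"{k:>3} attempts: {at_least_one_success(p, k):.0%}")
```

Under strict independence, ten tries land about 40% of the time; the chart's roughly one-in-three figure suggests real attempts overlap somewhat, but the direction is the same: retries compound quickly.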

1 view

Premium: The Hater's Guide To NVIDIA

This piece has a generous 3000+ word introduction, because I want as many people to understand NVIDIA as possible. The (thousands of) words after the premium break get into arduous detail, but I’ve written this so that, ideally, most people can pick up the details early on and understand this clusterfuck. Please do subscribe to the premium! I really appreciate it.

I've reached a point with this whole era where there are many, many things that don't make sense, and I know I'm not alone. I've been sick since Friday last week, and thus I have had plenty of time to sit and think about stuff. And by "stuff" I mean the largest company on the stock market: NVIDIA.

Look, I'm not an accountant, nor am I a "finance expert." I learned all of this stuff myself. I learn a great deal by coming to things from the perspective of being a dumbass, a valuable intellectual framework of "I need to make sure I understand each bit and explain it as simply as possible." In this piece, I'm going to explain what this company is and how we got here, ask the questions that I, from the perspective of a dumbass, have about the company, and at least try and answer them.

Let's start with a very simple point: for a company of such remarkable size, very few people — myself included, at times! — seem to actually understand NVIDIA. NVIDIA is a company that sells all sorts of stuff, but the only reason you're hearing about it as a normal person is that NVIDIA's stock has become a load-bearing entity in the US stock market. This has happened because NVIDIA sells "GPUs" — graphics processing units — that power the large language model services behind the whole AI boom, either through "inference" (the process of creating an output from an AI model) or "training" (feeding data into the model to make its outputs better). NVIDIA also sells other things, which I’ll get to later, but they don’t really matter to the bigger picture.
Back in 2006, NVIDIA launched CUDA, a software layer that lets you run (some) software on (specifically) NVIDIA graphics cards, and over time this has grown into a massive advantage for the company. The thing is, GPUs are great for parallel processing - essentially spreading a task across multiple, by which I mean thousands of, processor cores at the same time - which means that certain tasks run faster than they would on, say, a CPU. While not every task benefits from parallel processing, or from having several thousand cores available at the same time, the kind of math that underpins LLMs is one such example.

CUDA is proprietary to NVIDIA, and while there are alternatives (both closed- and open-source), none of them have the same maturity and breadth. Pair that with the fact that NVIDIA’s been focused on the data center market for longer than, say, AMD, and it’s easy to understand why it makes so much money. There really isn’t anyone who can do the same thing as NVIDIA, both in terms of software and hardware, and certainly not at the scale necessary to feed the hungry tech firms that demand these GPUs.

Anyway, back in 2019 NVIDIA acquired a company called Mellanox for $6.9 billion, beating out other would-be suitors, including Microsoft and Intel. Mellanox was a manufacturer of high-performance networking gear, and this acquisition gave NVIDIA a stronger value proposition for data center customers. It wanted to sell GPUs — lots of them — to data center customers, and now it could also sell the high-speed networking technology required to make them work in tandem. This is relevant because it created the terms under which NVIDIA could start selling billions (and eventually tens of billions) of dollars of specialized GPUs for AI workloads.
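The caveat that not every task benefits from thousands of cores has a classic formalization, Amdahl's law: if only a fraction p of a job can run in parallel, n cores give at most a speedup of 1 / ((1 − p) + p/n). A quick sketch of why GPU-friendly math is special:

```python
def amdahl_speedup(p, n):
    """Speedup from n cores when a fraction p of the work parallelizes."""
    return 1 / ((1 - p) + p / n)

# The matrix math behind LLMs is almost entirely parallel; most everyday
# software is not, which is why thousands of cores don't help everything.
print(round(amdahl_speedup(0.99, 10_000)))  # ~99x, near the 1/(1-p) ceiling
print(round(amdahl_speedup(0.50, 10_000)))  # ~2x, despite 10,000 cores
```

Even with ten thousand cores, a half-parallel job only doubles in speed; LLM workloads sit near the p ≈ 1 end, which is exactly the niche GPUs fill.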
As pseudonymous finance account JustDario connected (both Dario and Kakashii have been immensely generous with their time explaining some of the underlying structures of NVIDIA, and are worth reading, though at times we diverge on a few points), mere months after the Mellanox acquisition, Microsoft announced its $1 billion investment in OpenAI to build "Azure AI supercomputing technologies." Though it took until November 2022 for ChatGPT to really start the fires, in March 2020, NVIDIA began the AI bubble with the launch of its "Ampere" architecture and the A100, which provided "the greatest generational performance leap of NVIDIA's eight generations of GPUs," built for "data analytics, scientific computing and cloud graphics." The most important part, however, was the launch of NVIDIA's "SuperPod." Per the press release:

One might be fooled into thinking this was Huang suggesting we could now build smaller, more efficient data centers, when he was actually saying we should build way bigger ones that had way more compute power and took up way more space. The "SuperPod" concept — groups of GPU servers networked together to work on specific operations — is the "thing" that is driving NVIDIA's sales. To "make AI happen," a company must buy thousands of these things and put them in data centers, and you'd be a god damn idiot to not do this, and yes, it requires so much more money than you used to spend.

At the time, a DGX A100 — a server that housed eight A100 GPUs (starting at around $10,000 per GPU at launch, increasing with the amount of on-board RAM, as is the case across the board) — started at $199,000. The next generation SuperPod, launched in 2022, was made up of eight H100 GPUs (starting at $25,000 per GPU, the next generation "Hopper" chips were apparently 30x more powerful than the A100) and retailed from $300,000. You'll be shocked to hear the next generation Blackwell SuperPods started at $500,000 when launched in 2024.
A single B200 GPU costs at least $30,000. Because nobody else has really caught up with CUDA, NVIDIA has a functional monopoly (edit: I wrote monopsony in a previous version, sorry), and yes, you can have a situation where a market has a monopoly even if there is, at least in theory, competition. Once a particular brand — and a particular way of writing software for a particular kind of hardware — takes hold, there's an implicit cost of changing to another, on top of the fact that AMD and others have yet to come up with something particularly competitive.

Anyway, the reason that I'm writing all of this out is because I want you to understand why everybody is paying NVIDIA such extremely large amounts of money. Every year, NVIDIA comes up with a new GPU, and that GPU is much, much more expensive, and NVIDIA makes so much more money, because everybody has to build out AI infrastructure full of whatever the latest NVIDIA GPUs are, and those GPUs are so much more expensive every single year.

With Blackwell — the third generation of AI-specialized GPUs — came a problem: these things were so much more power-hungry, and required entirely new ways of building data centers, along with different cooling and servers to put them in, much of which was sold by NVIDIA. While you could kind of build around your current data centers to put A100s and H100s into production, Blackwell was... less cooperative, and ran much hotter. To quote NVIDIA Employee Number 4 David Rosenthal:

In simple terms, Blackwell runs hot, so much hotter than Ampere (A100) or Hopper (H100) GPUs that it requires entirely different ways to cool it, meaning your current data center needs to be ripped apart to fit them. Huang has confirmed that Vera Rubin, the next generation of GPUs, will have the same architecture as Blackwell. I would bet money that it's also much more expensive. Anyway, all of this has been so good for NVIDIA.
As the single vendor for the most important component in the entire AI boom, it has set the terms for how much you pay and how you build any and all AI infrastructure. While there are companies like Supermicro and Dell who buy NVIDIA GPUs and ship them in servers to customers, that's just fine for NVIDIA CEO Jensen Huang, as that's somebody else selling his GPUs for him. NVIDIA has been printing money, quarter after quarter, going from a meager $7.192 billion in total revenue in the third (calendar year) quarter of 2023 to an astonishing $50 billion in just data center revenue (that's where the GPUs are) in its most recent quarter, for a total of $57 billion in revenue, and the company projects to make $63 billion to $67 billion in the next quarter.

Now, I'm going to stop you here, because this bit is really important, really simple, yet nobody thinks about it much: NVIDIA makes so much money, and it makes it from a much smaller customer base than most companies, because there are only so many entities that can buy thousands of chips that cost $50,000 or more each. $35 billion, $39 billion, $44 billion, $46 billion and $57 billion are very large amounts of money, and the entities pumping those numbers into the stratosphere are collectively having to spend hundreds of billions of dollars to make it happen.

So, let me give you a theoretical example. I swear I'm going somewhere with this. You, a genius, have decided you are about to join the vaunted ranks of "AI data center ownership." You decide to build a "small" AI data center — 25MW (megawatts, which in this example refers to the combined power draw of the tech inside the data center). That can't be that much, right? OpenAI is building a 1.2GW one out in Abilene, Texas. How much could this tiny little thing cost? Okay, well, let's start with those racks. You're gonna need to give Jensen Huang $600 million right away, as you need 200 GB200 racks.
You're also gonna need a way to make them network together, because otherwise they aren't going to be able to handle all those big IT loads, so that's gonna be another $80 million or more, and you're going to need storage and servers to sync all of this up, which is, let's say, another $35 million. So we're at $715 million. Should be fine, right? Everybody's cool and everybody's normal. This is just a small data center after all. Oops, forgot cooling and power delivery stuff — that's another $5 million. $720 million. Okay.

Anyway, sadly, data centers require something called a "building." Construction costs for a data center are somewhere from $8 million to $12 million per megawatt, so, crap, okay. That's $250 million, but probably more like $300 million. We're now up to $1.02 billion, and we haven't even got the power yet. Okay, sick. Do you have one billion dollars? You don't? No worries! Private credit — money loaned by non-banking entities — has been feeding more than $50 billion a quarter into the hungry mouths of anybody who desires to build a data center. You need $1.02 billion. You get $1.5 billion, because, you know, "stuff happens." Don't worry about those pesky high interest rates — you're about to be printing big money, AI style!

Now that you're done raising all that cash, it'll only take anywhere from 6 to 18 months for site selection, permitting, design, development, construction, and energy procurement. You're also going to need about 20 acres of land for that 100,000 square foot data center. You may wonder why 100,000 square feet needs that much space, and that's because all of the power and cooling equipment takes up an astonishing amount of room.

So, yeah, after two years and over a billion dollars, you too can own a data center with NVIDIA GPUs that turn on, and at that point, you will offer a service that is functionally identical to everybody else buying GPUs from NVIDIA.
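The running totals above can be tallied in a few lines. A sketch using only the ballpark figures already given in the text (every number is an estimate from the article, not a quote from NVIDIA):

```python
# Rough build-out estimate for the hypothetical 25MW data center, in $M.
costs_musd = {
    "GPUs (200 GB200 racks)": 600,
    "networking": 80,
    "storage and servers": 35,
    "cooling and power delivery": 5,
    "construction (~25MW at ~$12M/MW)": 300,
}
total_musd = sum(costs_musd.values())
print(f"total: ${total_musd / 1000:.2f}B")  # $1.02B before power

# Implied rack price, and NVIDIA's quarterly volume in rack-equivalents:
rack_price_musd = 600 / 200                   # ~$3M per GB200 rack
racks_per_quarter = 40_000 / rack_price_musd  # ~$40B of GPUs per quarter
print(f"~{racks_per_quarter:,.0f} GB200-rack equivalents per quarter")
```

The same arithmetic yields the figure quoted later in the piece: roughly 13,333 rack-equivalents of GPUs sold per quarter, before counting GB300 racks and singular GPUs.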
Your competitors are Amazon, Google and Microsoft, followed by the neoclouds (AI cloud companies selling the same thing as you, except directly backed by NVIDIA) and, frequently, the big hyperscalers with brands that most people have heard of, like AWS and Azure.

Oh, also, this stuff costs an indeterminately large amount of money to run. You may wonder why I can't tell you how much, and that's because nobody wants to actually discuss the cost of running GPUs, the thing that underpins our entire stock market. There are good reasons, too. One does not just run "a GPU" — it's a GPU in a server of other GPUs with associated hardware, all drawing power in varying amounts, all running in sync with networking gear that also draws power, with varying amounts of user demand and shifts in the costs of power from the power company. But what we can say is that the up-front cost of buying these GPUs and their associated crap is such that it's unclear if they will ever generate a profit, because these GPUs run hot, all the time, and that causes some amount of them to die.

Here are some thoughts I have had: The NVIDIA situation is one of the most insane things I've seen in my life. The single-largest, single-most-valuable, single-most-profitable company on the stock market has got there through selling ultra-expensive hardware that takes hundreds of millions or billions of dollars (and, in some cases, years of construction) to start using, at which point it... doesn't make much revenue and doesn't seem to make a profit. Said hardware is funded by a mixture of cashflow from healthy businesses (see: Microsoft) or massive amounts of debt (see: everybody who is not a hyperscaler, and, at this point, some hyperscalers). The response to the continued proof that generative AI is not making money is to buy more GPUs, and it doesn't appear anybody has ever worked out why. This problem has been obvious for a long time, too.
Today I'm going to explain to you — simply, but at length — why I am deeply concerned, and how deeply insane this situation has become.

A 25MW data center costs about $1 billion, with $600 million of that being GPUs — 200 GB200 racks, to be specific. It needs about 20 acres — roughly 100,000 square feet for the data center itself. NVIDIA sells about $50 billion of GPUs and associated hardware in a quarter, so let's say that $40 billion of that is just the GPUs and $10 billion is everything else (primarily networking gear), so around 13,333 GB200 racks. I realize that NVIDIA sells far more than that (GB300 racks, singular GPUs, and so on).

Deep-pocketed hyperscalers like Microsoft, Google, Meta and Amazon represented 41.32% of NVIDIA's revenue in the middle of 2025, funneling free cash flow directly into Jensen Huang's pockets... for now. Amazon ($15 billion), Google ($25 billion), Meta ($30 billion) and Oracle ($18 billion) have all had to raise massive amounts of debt to continue funding AI-focused capital expenditures, with more than half of that (per Rubenstein) spent on GPUs. Otherwise, basically anybody buying GPUs at any scale has to fund doing so with either venture capital (money raised in exchange for part of the company) or debt.

NVIDIA, at this point, is around 8% of the value of the S&P 500 (the 500 leading companies on the US stock market, meaning those that meet certain criteria of size, liquidity (cash availability) and profitability). Its continued health — and its representative value as a stock, which is not necessarily based on its actual numbers or health, but in this case kind of is? — has led the stock market to remarkable gains. It is not enough for NVIDIA to simply be a profitable company. It must continue beating the last quarter's revenue, again and again and again and again, forever. If that sounds dramatic, I assure you it is the truth.
NVIDIA's continued success — and its ability to continue delivering outsized beats of Wall Street's revenue estimates — depends on:

- The willingness of a few very large, cash-rich companies (Microsoft, Meta, Amazon and Google) to continue buying successive generations of NVIDIA GPUs forever.
- The ability of said companies to continue buying successive generations of GPUs forever.
- The ability of other, less-cash-rich companies like Oracle to continue being able to raise debt to buy massive amounts of GPUs forever — such as the $40 billion of GPUs that Oracle is buying for Stargate Abilene. This is becoming a problem.
- The ability of unprofitable, debt-ridden companies like CoreWeave — AI "neoclouds" that use the GPUs they purchase from NVIDIA as collateral for loans to buy more GPUs — to continue raising that debt to buy more GPUs.
- The ability of anybody who buys these GPUs to actually install them and use them, which requires massive amounts of construction... and more power than is currently available, even to the most well-funded and conspicuous projects.

In simple terms, its success depends on the debt markets continuing to prop up its revenues, because there is not really enough free cash in the world to continue pumping it into NVIDIA at this rate. And after all of this, large language models, the only way to make any real money on any of these GPUs, must prove they can actually produce a profit. Per my article from September, I can find no compelling evidence (outside of boosters speciously claiming otherwise) that it's profitable to sell access to GPUs. Based on my calculations, there's likely little more than $61 billion of actual AI revenue in 2025 across every single AI company and hyperscaler. Note that I said "revenue." Absolutely nobody is making a profit.

0 views

Manifesto: AI (as a term and field) should subsume CS

In French the term “informatique” feels slightly better, as a label to describe the field, than “Computer Science” feels in English. But this is a rare occurrence for French, because most of the other terms, like “technologie de l’information”, and “science des données”, feel awkward and far from their “real” cultural counterpart, the thing in itself that we do, when we do it.

0 views
Manuel Moreale 6 days ago

A moment in yet another memorial

There’s something unique about visiting WW memorials. I don’t even know how to explain it. It’s a strange mix of awe, sorrow, gratefulness, and many other feelings all bunched together.

0 views

Praise the Backup

Well, that was a fun weekend. I have spent half of my time reinstalling macOS Sequoia and trying to get it back the way it was, while trying to avoid losing important files.

You see, on that chilly Saturday afternoon, I wanted to take care of my ageing computer, and tried an app that was supposed to clean the old files and “residue” from previously uninstalled apps. As a reader of this blog, you may know that I tend to use very few apps, but I try a lot of them. Trying a lot of apps means doing a lot of installs, and then a lot of uninstalls. So, while experimenting with yet another app, it crashed in the middle of its cleaning work. And, because I was being dumb, I thought it would be a good idea to empty the bin at that moment.

5, 10, 15, 20, 25… These were the thousands of files being deleted permanently from the bin. Even with lots of old app files, the number still seemed rather high. I stopped the process only after losing thousands of files, and realised that this cleaning app had put in the bin a lot of files and folders that it shouldn't have. A lot of files and folders. My blog files with all my Eleventy settings, all of it. Most of my system preference files. The app had even deleted its own application files, which is why it crashed, I believe. None of my other apps or extensions could be launched; error messages everywhere. I was having a lot of fun.

I restarted my computer, hoping the powerful reboot spirits would once again act miraculously, but my dear old MacBook Air welcomed me as if it were a brand-new Mac, an almost fresh installation. Even my keyboard was set to the wrong layout (which made it truly fantastic to enter a password in such a moment of panic), my wallpaper was gone, the dock was featuring all the default apps, and I was logged out of my iCloud account. Thankfully, this last part turned out to be a good thing, because my personal and most important files, stored on iCloud Drive, were safe from whatever had happened on my machine.
I also had a two-month-old backup on an external SSD, mirrored on JottaCloud. The cherry on top was that I couldn’t use the “Put Back” right-click action on the files left in the bin, as they were not put there by the Finder, but by this third-party app. 1 There were 1,200 or so files and folders left, most of them obscure preference files. Needless to say, I didn’t really bother spending hours of my weekend putting them back where they belonged, even if I knew how. I scavenged what I could, everything that seemed important — including a folder called “BBEdit Backups” (more on that later) — and used this opportunity to start anew. Since my last backup was two months old, I had a decent amount of work to do putting everything back together, including the last four posts of the website you're reading — which had been vaporised from my computer. I had to reinstall all my apps, my preferences, my keyboard shortcuts, everything that I could, while I could still recall what they were in detail. I won’t blame the app that caused all of this, or my old computer, as much as I will blame myself. I should have been more careful about how to use it properly, I shouldn’t have decided to empty the bin at that moment, and I should have done better and more frequent backups: once every quarter is definitely not enough. The clean macOS install experience itself was not great: it was very slow, annoying, and during all this time I worried about not being able to connect to my site again or make Eleventy work the same way it did (sorry if I get a little PTSD). 2 Today, as I write this, my computer doesn’t really feel any faster; a clean install can only do so much on the last generation of Intel MacBook Airs. macOS was a pain, and I was reminded of my Windows user days more than I expected.
For example, I kept getting a message along the lines of “The widget blahblahblah is a different version from a previous version of that widget, do you want to open it?” and clicking “No” just brought back the pop-up window three or four more times before it eventually went away. The prompt even interrupted me while I was trying to type my complicated Wi-Fi password. Not once, not twice, but thrice. Now, everything seems fine. Eleventy works. Xmit works. BBEdit is just like it was. This whole experience made me realise three main things.

1. That BBEdit is, indeed, just too good. I’m not sure I could have brought everything back so quickly and confidently without this app. The BBEdit automated backup folder, the one I found in the bin, really saved me. Many of the most recent versions of the Jolly Teapot text files were still there, so I didn’t have to import the text from the live website. Just when I thought I couldn’t love this app more than I already did. I’m proud of myself for thinking of creating a backup of my BBEdit preferences too.

2. That I seriously needed to create a better backup system, so that in the event of something like this happening again, whether a human error or an app shitting the bed, I would only have a week or two of files to recover, and not a whole nine weeks of them. I just created an Automator workflow to help me automate my backups and include more files. I considered using Time Machine on my external SSD, or using an app like Hazel, but for my minimal needs, this Automator workflow should do just fine.

3. That I may have actually enjoyed all of this: the crash and this weird situation gave me an excuse to both perform a clean installation on my Mac and justify the purchase of a new one. I will probably wait until March for the next generation of MacBooks Air, but the regular M5 MacBook Pro has never looked so good.

Apologies if you see anything weird on this site: some little layout issues and typos that were fixed in the last two months may have returned. Please let me know if you see anything suspicious (or any of the usual typos). In the meantime, don’t be an idiot like me: take care of those backups.

I won’t name the app in this post because I’m not 100% sure if the app was the sole guilty party in this affair, if guilty at all. Maybe I didn’t set it up right, maybe it’s all my fault!  ↩︎

As the song goes.  ↩︎

1 view
xenodium 6 days ago

WhatsApp from you know where

While there are plenty of messaging alternatives out there, for better or worse, WhatsApp remains a necessity for some of us. With that in mind, I looked for ways to bring WhatsApp messaging to the comfort of my beloved text editor. As mentioned in my initial findings, WhatsApp on Emacs is totally doable with the help of wuzapi and whatsmeow, which offer a huge leg up. Today, I introduce a super early version of Wasabi, a native Emacs interface for WhatsApp messaging.

I wanted installation/setup to be as simple as possible. Ideally, you install a single Emacs package and off you go. While leveraging XMPP is rather appealing in reusing existing Emacs messaging packages, I felt that setting up a WhatsApp gateway or related infrastructure was somewhat at odds with Wasabi's simple installation goal. Having said that, wuzapi / whatsmeow offer a great middle ground. You install a single binary dependency, along with wasabi, and you're ready to go. This isn't too different from the git + magit combo. As of now, Wasabi's installation/setup boils down to two steps if you're on macOS. While you may try Homebrew on Linux, you're likely to prefer your native package manager. If that fails, building wuzapi from source is also an option.

While wuzapi runs as a RESTful API service + webhooks, I wanted to simplify the Emacs integration by using json-rpc over standard I/O, enabling us to leverage incoming notifications in place of webhooks. I floated the idea of adding json-rpc to wuzapi to its author Nicolas, and to my delight, he was keen on it. He's now merged my initial proof of concept, and I followed up with a handful of additional patches (all merged now):

- Add JSON-RPC 2.0 stdio mode (via -mode=stdio) for communication
- Expose more HTTP endpoints as JSON-RPCs
- Enable setting a custom data directory via -datadir=/path/to/data
- Add Homebrew recipe/installation

With the latest Wasabi Emacs package and wuzapi binary, you now get the initial WhatsApp experience I've been working towards. At present, you can send/receive messages to/from 1:1 or group chats. You can also download/view images as well as videos. Viewing reactions is also supported. Needless to say, you may find some initial rough edges in addition to missing features. Having said that, I'd love to hear your feedback and experience. As mentioned, Wasabi is currently available on GitHub.

I've now put in quite a bit of effort prototyping things, upstreaming changes to wuzapi, and building the first iteration of wasabi. I gotta say, it feels great to be able to quickly message and catch up with different chats from the comfort of Emacs. Having said that, it's taken a lot of work to get here and will require plenty more to get to a polished and featureful experience. Since going full-time indie dev, I have the flexibility to work on projects of my choice, but only to an extent. If I cannot make the project sustainable, I'll eventually move on to something that is. If you're keen on Wasabi's offering, please consider sponsoring the effort, and please reach out to voice your interest (Mastodon / Twitter / Reddit / Bluesky). Reckon a WhatsApp Emacs client would help you stay focused at work (less time on your phone)? Ask your employer to sponsor it too ;-)
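The json-rpc-over-stdio approach is simple to picture: the client writes one JSON-RPC 2.0 message per line to the child process's stdin and reads responses and server-push notifications back from stdout. A minimal sketch in Python — this is not wuzapi's actual message vocabulary, just an illustration of the framing:

```python
import json

def make_request(req_id, method, params):
    """Build a JSON-RPC 2.0 request; notifications simply omit "id"."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

def frame(msg):
    """Serialize one message per line - the simplest stdio framing."""
    return json.dumps(msg) + "\n"

def classify(line):
    """Incoming lines are either responses (they echo an "id") or
    server-push notifications (no "id") - the stdio-mode stand-in
    for what the HTTP mode delivers via webhooks."""
    msg = json.loads(line)
    return "response" if "id" in msg else "notification"
```

An Emacs client can then keep one long-lived subprocess and dispatch on the response/notification distinction, with no HTTP server or webhook endpoint to configure.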

0 views
Simon Willison 1 week ago

Olmo 3 is a fully open LLM

Olmo is the LLM series from Ai2 - the Allen Institute for AI. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along with those releases. The new Olmo 3 claims to be "the best fully open 32B-scale thinking model" and has a strong focus on interpretability: At its center is Olmo 3-Think (32B), the best fully open 32B-scale thinking model that for the first time lets you inspect intermediate reasoning traces and trace those behaviors back to the data and training decisions that produced them. They've released four 7B models - Olmo 3-Base, Olmo 3-Instruct, Olmo 3-Think and Olmo 3-RL Zero, plus 32B variants of the 3-Think and 3-Base models. Having full access to the training data is really useful. Here's how they describe that: Olmo 3 is pretrained on Dolma 3, a new ~9.3-trillion-token corpus drawn from web pages, science PDFs processed with olmOCR, codebases, math problems and solutions, and encyclopedic text. From this pool, we construct Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix with a higher proportion of coding and mathematical data than earlier Dolma releases, plus much stronger decontamination via extensive deduplication, quality filtering, and careful control over data mixing. We follow established web standards in collecting training data and don't collect from sites that explicitly disallow it, including paywalled content. They also highlight that they are training on fewer tokens than their competition: [...] it's the strongest fully open thinking model we're aware of, narrowing the gap to the best open-weight models of similar scale – such as Qwen 3 32B – while training on roughly 6x fewer tokens. If you're continuing to hold out hope for a model trained entirely on licensed data this one sadly won't fit the bill - a lot of that data still comes from a crawl of the web. I tried out the 32B Think model and the 7B Instruct model using LM Studio.
The 7B model is a 4.16GB download, the 32B one is 18.14GB. The 32B model is absolutely an over-thinker! I asked it to "Generate an SVG of a pelican riding a bicycle" and it thought for 14 minutes 43 seconds , outputting 8,437 tokens total most of which was this epic thinking trace . I don't usually quote the full SVG in these write-ups, but in this case it's short enough that I think it's worth sharing. The SVG comments give a great impression of what it was trying to do - it has a Bicycle, Bike frame, Pelican, Left and Right wings and even "Feet on pedals". Rendered it looks like this: I tested OLMo 2 32B 4bit back in March and got something that, while pleasingly abstract, didn't come close to resembling a pelican or a bicycle: To be fair 32B models generally don't do great with this. Here's Qwen 3 32B's attempt (I ran that just now using OpenRouter ): I was particularly keen on trying out the ability to "inspect intermediate reasoning traces". Here's how that's described later in the announcement: A core goal of Olmo 3 is not just to open the model flow, but to make it actionable for people who want to understand and improve model behavior. Olmo 3 integrates with OlmoTrace , our tool for tracing model outputs back to training data in real time. For example, in the Ai2 Playground, you can ask Olmo 3-Think (32B) to answer a general-knowledge question, then use OlmoTrace to inspect where and how the model may have learned to generate parts of its response. This closes the gap between training data and model behavior: you can see not only what the model is doing, but why---and adjust data or training decisions accordingly. You can access OlmoTrace via playground.allenai.org , by first running a prompt and then clicking the "Show OlmoTrace" button below the output. 
I tried that on "Generate a conference bio for Simon Willison" (an ego-prompt I use to see how much the models have picked up about me from their training data) and got back a result that looked like this: It thinks I co-founded co:here and work at Anthropic, both of which are incorrect - but that's not uncommon with LLMs, I frequently see them suggest that I'm the CTO of GitHub and other such inaccuracies. I found the OlmoTrace panel on the right disappointing. None of the training documents it highlighted looked relevant - it appears to be looking for phrase matches (powered by Ai2's infini-gram ) but the documents it found had nothing to do with me at all. Ai2 claim that Olmo 3 is "the best fully open 32B-scale thinking model", which I think holds up provided you define "fully open" as including open training data. There's not a great deal of competition in that space though - Ai2 compare themselves to Stanford's Marin and Swiss AI's Apertus , neither of which I'd heard about before. A big disadvantage of other open weight models is that it's impossible to audit their training data. Anthropic published a paper last month showing that a small number of samples can poison LLMs of any size - it can take just "250 poisoned documents" to add a backdoor to a large model that triggers undesired behavior based on a short carefully crafted prompt. This makes fully open training data an even bigger deal. Ai2 researcher Nathan Lambert included this note about the importance of transparent training data in his detailed post about the release : In particular, we're excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative). 
This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." arXiv preprint arXiv:2506.10947 (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." arXiv preprint arXiv:2507.10532 (2025).) I hope we see more competition in this space, including further models in the Olmo series. The improvements from Olmo 1 (in February 2024) and Olmo 2 (in March 2025) have been significant. I'm hoping that trend continues!
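The disappointing OlmoTrace results make more sense if you picture what exact phrase matching does. A toy sketch of the idea — this has nothing to do with infini-gram's actual suffix-array implementation, it just shows why shared n-grams carry no relevance signal:

```python
def ngrams(tokens, n):
    """All contiguous n-token spans of a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def phrase_matches(output, corpus, n=3):
    """Return corpus documents sharing at least one exact n-gram with
    the model output. Any document containing a common phrase matches,
    whether or not it has anything to do with the topic."""
    target = ngrams(output.split(), n)
    return [doc for doc in corpus if ngrams(doc.split(), n) & target]
```

A phrase like "is a software engineer" appears in countless training documents, so surfacing matches for it tells you little about where a biographical claim actually came from.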

0 views
Christian Jauvin 1 week ago

We didn't get the AI failure modes that philosophy anticipated

The original idea of AI, that we got mostly through science-fiction, and also a little from the philosophy of mind and logic, imagined an entity that would implement idealized and mechanical notions of thoughts, reasoning and logic. Such an entity would of course know everything there is to know about such topics, and its behavior would thus be rooted in them. Although this would mean that the entity would generally behave in impressive and powerful ways, it was also implicitly understood that sometimes this “perfection” would lead to paradoxical behaviors and “errors”: the robot stuck in a circle in the Asimov story is the quintessential example.

0 views
Armin Ronacher 1 week ago

LLM APIs are a Synchronization Problem

The more I work with large language models through provider-exposed APIs, the more I feel like we have built ourselves into quite an unfortunate API surface area. It might not actually be the right abstraction for what’s happening under the hood. The way I like to think about this problem now is that it’s actually a distributed state synchronization problem. At its core, a large language model takes text, tokenizes it into numbers, and feeds those tokens through a stack of matrix multiplications and attention layers on the GPU. Using a large set of fixed weights, it produces activations and predicts the next token. If it weren’t for temperature (randomization), you could think of it having the potential of being a much more deterministic system, at least in principle. As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template. You can look at the system prompt templates on Ollama for the different models to get an idea. Let’s ignore for a second which APIs already exist and just think about what usually happens in an agentic system. If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU—mainly the attention key/value cache built from those tokens. The weights themselves stay fixed; what changes per step are the activations and the KV cache. 
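The role-marker point is easy to make concrete. Here is a generic ChatML-style renderer — illustrative only, since each model family ships its own template (which is exactly what you can inspect on Ollama):

```python
def render(messages, add_generation_prompt=True):
    """Flatten role-tagged messages into the single token stream the
    model actually sees. Roles exist only as these special markers;
    below this level, everything is just tokens."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Cue the model to continue in the assistant role.
        out.append("<|im_start|>assistant\n")
    return "".join(out)
```

The JSON message list you send to an API is a convenience view over a string like this, which is then tokenized wholesale.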
From a mental-model perspective, caching means “remember the computation you already did for a given prefix so you don’t have to redo it.” Internally, that usually means storing the attention KV cache for those prefix tokens on the server and letting you reuse it, not literally handing you raw GPU state. There are probably some subtleties to this that I’m missing, but I think this is a pretty good model to think about it. The moment you’re working with completion-style APIs such as OpenAI’s or Anthropic’s, abstractions are put in place that make things a little different from this very simple system. The first difference is that you’re not actually sending raw tokens around. The way the GPU looks at the conversation history and the way you look at it are on fundamentally different levels of abstraction. While you could count and manipulate tokens on one side of the equation, extra tokens are being injected into the stream that you can’t see. Some of those tokens come from converting the JSON message representation into the underlying input tokens fed into the machine. But you also have things like tool definitions, which are injected into the conversation in proprietary ways. Then there’s out-of-band information such as cache points. And beyond that, there are tokens you will never see. For instance, with reasoning models you often don’t see any real reasoning tokens, because some LLM providers try to hide as much as possible so that you can’t retrain your own models with their reasoning state. On the other hand, they might give you some other informational text so that you have something to show to the user. Model providers also love to hide search results and how those results were injected into the token stream. Instead, you only get an encrypted blob back that you need to send back to continue the conversation. All of a sudden, you need to take some information on your side and funnel it back to the server so that state can be reconciled on either end. 
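A toy version of that prefix-reuse idea, with a plain dict standing in for the KV cache — real inference servers key on token-block hashes and store tensors, but the shape of the bookkeeping is the same:

```python
def longest_cached_prefix(cache, tokens):
    """How many leading tokens do we already have derived state for?"""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in cache:
            return n
    return 0

def step(cache, tokens, compute):
    """Recompute only the un-cached suffix; checkpoint every prefix.

    `compute` stands in for one attention step; the cached values are
    derived state that can always be rebuilt from the tokens alone."""
    start = longest_cached_prefix(cache, tokens)
    state = cache.get(tuple(tokens[:start]))
    for i in range(start, len(tokens)):
        state = compute(state, tokens[i])
        cache[tuple(tokens[:i + 1])] = state
    return state
```

Extending a conversation by one turn then only pays for the new tokens, which is the entire economic point of prompt caching.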
In completion-style APIs, each new turn requires resending the entire prompt history. The size of each individual request grows linearly with the number of turns, but the cumulative amount of data sent over a long conversation grows quadratically because each linear-sized history is retransmitted at every step. This is one of the reasons long chat sessions feel increasingly expensive. On the server, the model’s attention cost over that sequence also grows quadratically in sequence length, which is why caching starts to matter. One of the ways OpenAI tried to address this problem was to introduce the Responses API, which maintains the conversational history on the server (at least in the version with the saved state flag). But now you’re in a bizarre situation where you’re fully dealing with state synchronization: there’s hidden state on the server and state on your side, but the API gives you very limited synchronization capabilities. To this point, it remains unclear to me how long you can actually continue that conversation. It’s also unclear what happens if there is state divergence or corruption. I’ve seen the Responses API get stuck in ways where I couldn’t recover it. It’s also unclear what happens if there’s a network partition, or if one side got the state update but the other didn’t. The Responses API with saved state is quite a bit harder to use, at least as it’s currently exposed. Obviously, for OpenAI it’s great because it allows them to hide more behind-the-scenes state that would otherwise have to be funneled through with every conversation message. Regardless of whether you’re using a completion-style API or the Responses API, the provider always has to inject additional context behind the scenes—prompt templates, role markers, system/tool definitions, sometimes even provider-side tool outputs—that never appears in your visible message list. 
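The arithmetic above is worth spelling out. With roughly t tokens added per turn, turn k resends about k·t tokens, so n turns transmit t·n(n+1)/2 tokens in total — quadratic in the number of turns:

```python
def total_tokens_sent(turns, tokens_per_turn):
    """Completion-style APIs resend the whole history each turn,
    so cumulative traffic is the sum 1t + 2t + ... + nt."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))
```

At 500 tokens per turn, a 10-turn chat transmits 27,500 tokens in total, but a 100-turn chat transmits over 2.5 million — a 10x longer conversation costs roughly 100x the traffic.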
Different providers handle this hidden context in different ways, and there’s no common standard for how it’s represented or synchronized. The underlying reality is much simpler than the message-based abstractions make it look: if you run an open-weights model yourself, you can drive it directly with token sequences and design APIs that are far cleaner than the JSON-message interfaces we’ve standardized around. The complexity gets even worse when you go through intermediaries like OpenRouter or SDKs like the Vercel AI SDK, which try to mask provider-specific differences but can’t fully unify the hidden state each provider maintains. In practice, the hardest part of unifying LLM APIs isn’t the user-visible messages—it’s that each provider manages its own partially hidden state in incompatible ways. It really comes down to how you pass this hidden state around in one form or another. I understand that from a model provider’s perspective, it’s nice to be able to hide things from the user. But synchronizing hidden state is tricky, and none of these APIs have been built with that mindset, as far as I can tell. Maybe it’s time to start thinking about what a state synchronization API would look like, rather than a message-based API. The more I work with these agents, the more I feel like I don’t actually need a unified message API. The core idea of it being message-based in its current form is itself an abstraction that might not survive the passage of time. There’s a whole ecosystem that has dealt with this kind of mess before: the local-first movement. Those folks spent a decade figuring out how to synchronize distributed state across clients and servers that don’t trust each other, drop offline, fork, merge, and heal. Peer-to-peer sync, and conflict-free replicated storage engines all exist because “shared state but with gaps and divergence” is a hard problem that nobody could solve with naive message passing. 
Their architectures explicitly separate canonical state, derived state, and transport mechanics — exactly the kind of separation missing from most LLM APIs today. Some of those ideas map surprisingly well to models: KV caches resemble derived state that could be checkpointed and resumed; prompt history is effectively an append-only log that could be synced incrementally instead of resent wholesale; provider-side invisible context behaves like a replicated document with hidden fields. At the same time though, if the remote state gets wiped because the remote site doesn’t want to hold it for that long, we would want to be in a situation where we can replay it entirely from scratch—which for instance the Responses API today does not allow. There’s been plenty of talk about unifying message-based APIs, especially in the wake of MCP (Model Context Protocol). But if we ever standardize anything, it should start from how these models actually behave, not from the surface conventions we’ve inherited. A good standard would acknowledge hidden state, synchronization boundaries, replay semantics, and failure modes — because those are real issues. There is always the risk that we rush to formalize the current abstractions and lock in their weaknesses and faults. I don’t know what the right abstraction looks like, but I’m increasingly doubtful that the status-quo solutions are the right fit.
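What a synchronization-first API could look like in miniature: treat the history as an append-only log, ship only the suffix past the last acknowledged offset, and fall back to a full replay from zero if the remote side has forgotten its state. All names here are hypothetical — this is a sketch of the shape, not any existing API:

```python
class ConversationLog:
    """Append-only prompt history with offset-based incremental sync."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries)  # new length doubles as the sync offset

    def delta_since(self, offset):
        """Only the entries the other side hasn't acknowledged."""
        return self.entries[offset:]

def sync(client_log, server_known_offset):
    """If the server lost its state (offset None), replay from scratch;
    otherwise ship just the new suffix."""
    offset = server_known_offset or 0
    return client_log.delta_since(offset)
```

The key property is the one the Responses API lacks today: because the log is canonical and the server state is derived, a wiped remote can always be rebuilt by replaying from offset zero.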

0 views
Sean Goedecke 1 week ago

Why it takes months to tell if new AI models are good

Nobody knows how to tell if current-generation models are any good . When GPT-5 launched, the overall mood was very negative, and the consensus was that it wasn’t a strong model. But three months later it turns out that GPT-5 (and its derivative GPT-5-Codex) is a very strong model for agentic work 1 : enough to break Anthropic’s monopoly on agentic coding models. In fact, GPT-5-Codex is my preferred model for agentic coding. It’s slower than Claude Sonnet 4.5, but in my experience it gets more hard problems correct. Why did it take months for me to figure this out? The textbook solution for this problem is evals - datasets of test cases that models can be scored against - but evals are largely unreliable . Many models score very well on evals but turn out to be useless in practice. There are a couple of reasons for this. First, it’s just really hard to write useful evals for real-world problems , since real-world problems require an enormous amount of context. Can’t you take previous real-world problems and put them in your evals - for instance, by testing models on already-solved open-source issues? You can, but you run into two difficulties: Another problem is that evals are a target for AI companies . How well Anthropic or OpenAI’s new models perform on evals has a direct effect on the stock price of those companies. It’d be naive to think that they don’t make some kind of effort to do well on evals: if not by directly training on public eval data 2 , then by training on data that’s close enough to eval data to produce strong results. I’m fairly confident that big AI companies will not release a model unless they can point to a set of evals that their model does better than competitors. So you can’t trust that strong evals will mean a strong model, because every single new model is released with strong evals. If you can’t rely on evals to tell you if a new model is good, what can you rely on? 
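(As an aside, "eval" in the sense above means something quite mechanical: a scored loop over test cases — which is also why it is so gameable. A toy sketch, with a placeholder model function and made-up cases rather than any real benchmark:

```python
def run_eval(model, cases):
    """Score a model callable against (prompt, expected) pairs.

    `model` is any function from prompt string to answer string;
    the score is simply the fraction of exact matches."""
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)
```

A model trained on these exact pairs scores 1.0 here while having learned nothing general — the benchmaxxing worry in a nutshell.)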
For most people, the answer is the “vibe check”: interacting with the model themselves and making their own judgement. Often people use a set of their own pet questions, which are typically questions that other LLMs get wrong (say, word puzzles). Trick questions can be useful, but plenty of strong models struggle with specific trick questions for some reason. My sense is also that current models are too strong for obvious word puzzles. You used to be able to trip up models with straightforward questions like “If I put a ball in a box, then put the box in my pocket, where is the ball?” Now you have to be more devious, which gives less signal about how strong the model is. Sometimes people use artistic prompts. Simon Willison famously asks new models to produce a SVG of a pelican riding a bicycle. It’s now a common Twitter practice to post side-by-side “I asked two models to build an object in Minecraft” screenshots. This is cool - you can see at a glance that bigger models produce better images - but at some point it becomes difficult to draw conclusions from the images. If Claude Sonnet 4.5 puts the pelican’s feet on the pedals correctly, but GPT-5.1 adds spokes to the wheels, which model is better? Finally, many people rely on pure vibes: the intangible sense you get after using a model about whether it’s good or not. This is sometimes described as “big model smell”. I am fairly agnostic about people’s ability to determine model capability from vibes alone. It seems like something humans might be able to do, but also like something that would be very easy to fool yourself about. For instance, I would struggle to judge a model with the conversational style of GPT-4o as very smart, but there’s nothing in principle that would prevent that. Of course, for people who engage in intellectually challenging pursuits, there’s an easy (if slow) way to evaluate model capability: just give it the problems you’re grappling with and see how it does. 
I often ask a strong agentic coding model to do a task I’m working on in parallel with my own efforts. If the model fails, it doesn’t slow me down much; if it succeeds, it catches something I don’t, or at least gives me a useful second opinion. The problem with this approach is that it takes a fair amount of time and effort to judge if a new model is any good, because you have to actually do the work : if you’re not engaging with the problem yourself, you will have no idea if the model’s solution is any good or not. So testing out a new model can be risky. If it’s no good, you’ve wasted a fair amount of time and effort! I’m currently trying to decide whether to invest this effort into testing out Gemini 3 Pro or GPT-5.1-Codex - right now I’m still using GPT-5-Codex for most tasks, or Claude Sonnet 4.5 on some simpler problems. Each new model release reignites the debate over whether AI progress is stagnating. The most prominent example is Gary Marcus, who has written that GPT-4 , GPT-4o , Claude 3.5 Sonnet , GPT-5 and DeepSeek all prove that AI progress has hit a wall. But almost everyone who writes about AI seems to be interested in the topic. Each new model launch is watched to see if this is the end of the bubble, or if LLMs will continue to get more capable. The reason this debate never ends is that there’s no reliable way to tell if an AI model is good . Suppose that base AI models were getting linearly smarter (i.e. that GPT-5 really was as far above GPT-4 as GPT-4 was above GPT-3.5, and so on). Would we actually be able to tell? When you’re talking to someone who’s less smart than you 3 , it’s very clear. You can see them failing to follow points you’re making, or they just straight up spend time visibly confused and contradicting themselves. But when you’re talking to someone smarter than you, it’s far from clear (to you) what’s going on. You can sometimes feel that you’re confused by what they say, but that doesn’t necessarily mean they’re smarter. 
It could be that they’re just talking nonsense. And smarter people won’t confuse you all the time - only when they fail to pitch their communication at your level. Talking with AI models is like that. GPT-3.5 was very clearly less smart than most of the humans who talked to it. It was mainly impressive that it was able to carry on a conversation at all. GPT-4 was probably on par with the average human (or a little better) in its strongest domains. GPT-5 (at least in thinking mode) is smarter than the average human across most domains, I believe.

Suppose we had no objective way of measuring chess ability. Would I be able to tell if computer chess engines were continuing to get better? I’d certainly be impressed when the chess engines went from laughably bad to beating me every time. But I’m not particularly good at chess. I would lose to chess engines from the early 1980s. It would thus seem to me as if chess engine progress had stalled out, when in fact modern chess engines have double the rating of chess engines from the 1980s.

I acknowledge that “the model is now at least partly smarter than you” is an underwhelming explanation for why AI models don’t appear to be rapidly getting better. It’s easy to point to cases where even strong models fall over. But it’s worth pointing out that if models were getting consistently smarter, this is what it would look like: rapid subjective improvement as the models go from less intelligent than you to on par with you, and then an immediate plateau as the models surpass you and you become unable to tell how smart they are.

Nobody knows how good a model is when it’s launched. Even the AI lab who built it are only guessing and hoping it’ll turn out to be effective for real-world use cases. Evals are mostly marketing tools. It’s hard to figure out how good an eval is, or whether the model is being “taught to the test”. If you’re trying to judge models from their public evals, you’re fighting against the billions of dollars of effort going into gaming the system. Vibe checks don’t test the kind of skills that are useful for real work, but testing a model by using it to do real work takes a lot of time. You can’t figure out if a brand-new model is good that way.

Because of all this, it’s very hard to tell if AI progress is stagnating or not. Are the models getting better? Are they any good right now? Compounding that problem, it’s hard to judge between two models that are both smarter than you (in a particular domain). If the models do keep getting better, we might expect it to feel like they’re plateauing, because once they get better than us we’ll stop seeing evidence of improvement.

By “agentic work” I mean “LLM with tools that runs in a loop”, like Copilot Agent Mode, Claude Code, and Codex. ↩

I haven’t yet tried GPT-5.1-Codex enough to have a strong opinion. ↩

If you train a model on the actual eval dataset itself, it will get very good at answering those specific questions, even if it’s not good at answering those kinds of questions. This is often called “benchmaxxing”: prioritizing evals and benchmarks over actual capability. ↩

I want to bracket the question of whether “smart” is a broad category, or how exactly to define it. I’m talking specifically about the way GPT-4 is smarter than GPT-3.5 - even if we can’t define exactly how, we know that’s a real thing. ↩

Open-source coding is often meaningfully different from the majority of programming work. For more on this, see my comments in METR’s AI productivity study is really good, where I discuss an AI-productivity study that was done on open-source codebases. ↩

You’re still only covering a tiny subset of all programming work. For instance, the well-known SWE-Bench set of coding evals is just in Python. A model might be really good at Python but struggle with other languages. ↩
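The chess comparison can be made concrete with the standard Elo expected-score formula. The specific ratings below (roughly 1700 for an early-1980s engine, 3400 for a modern one) are illustrative round numbers chosen to match the "double the rating" claim, not measurements:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B:
    E_A = 1 / (1 + 10 ** ((R_b - R_a) / 400))."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Equal ratings: a coin flip.
expected_score(1500, 1500)  # 0.5

# An early-80s engine (assumed ~1700) vs a modern one (assumed ~3400):
# the expected score is vanishingly small, under one win per ten thousand games.
expected_score(1700, 3400)
```

The point: to a weak observer both engines look "unbeatable", but a 1700-point gap still corresponds to a near-certain win for the stronger side.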

0 views
Jeff Geerling 1 week ago

Air Lab is the Flipper Zero of air quality monitors

This air quality monitor costs $250. It's called the Air Lab , and I've been using it to measure the air in my car, home, studio, and a few events over the past few months. And in using it over the course of a road trip I learned to not run recirculate in my car quite as often—more on that later. Networked Artifacts built in some personality:

0 views
The Tymscar Blog 1 week ago

OpenAI Demo'd Fixing Issue #2472 Live. It's Still Open.

During OpenAI’s GPT-5 launch event, they demoed the model’s ability to fix real bugs in production code. Live on stage. In their own repository. The kind of demo that makes CTOs reach for their credit cards and engineers nervously update their resumes. There’s just one small problem: the fix they promised to merge “right after the show” is still sitting there, unmerged, three and a half months later. At exactly 1 hour and 7 minutes into their launch video, they started working on issue #2472 in their openai-python repository.

0 views
Kix Panganiban 1 week ago

Google Pay is finally available in the Philippines

During our Australia trip last month, I was really spoiled by the convenience of contactless payments. I never had to take out cash (in fact, I got weird looks from some shops when I did), and I rarely had to use my physical cards (except for some bigger purchases that required on-chip validation). So when we got back home to the Philippines -- the withdrawal I experienced was too real. Cash is still king, and credit cards involve the weird practice of handing over your card for the merchant to physically manhandle and tap themselves (why -- I don’t know). So when Google Pay finally went live in the Philippines a few days back, needless to say, I was super stoked. Google Pay released a list of supported cards , and I luckily had a few that were on the list. I had no trouble setting up my RCBC and EastWest credit cards, plus my GoTyme debit card. I already had my business Wise card set up -- so now it’s just a matter of testing if that works too. Unfortunately, I couldn’t get my UnionBank Visa card to work no matter what I tried (short of calling customer support, which I detest). Even though it’s a payment method on my Google account, importing it gave me an unsupported error, and neither manual entry nor a photo of the card worked. I eventually gave up, figuring my other cards were enough. I got to try out Google Pay at Rebel Bakehouse in Podium and Cafe Beam in Baliwag. The baristas were surprisingly up to date with Google Pay support -- they weren’t shocked or hesitant when I pulled out my phone to pay. Just a week ago, I bet they’d have asked questions or even declined it (I’ve heard a story of a merchant reacting as if they were being scammed!). My RCBC and EastWest credit cards worked flawlessly, but the GoTyme debit card didn’t. I suspect it’s because merchants here still need to manually select debit or credit on their terminals -- unlike in other countries where contactless systems just accept both by default. 
Processing was quick, and Google Pay instantly sent me a notification with the merchant’s name and amount. That’s super handy for keeping track of card expenses on the go! In summary, the experience was great. With the underlying NFC technology becoming more accepted (like at MRT stations) -- I’m hopeful contactless payments will spread even further and eventually become the norm. QR payments are fine, but they take extra steps and never feel as seamless as just tapping my phone. I’m a bit bummed that Apple Pay is lagging behind and won’t launch until 2026, so I’ll likely stick to Android as my main phone for now. On a brighter note, syncing my cards to my watch for Google Pay wrist payments was pretty smooth. I had to re-enter verification codes and OTPs, but the setup was quick. I can’t wait to try that out next time!

0 views

Automating agentic development

This week, I visited my friends at 2389 in Chicago. These are the folks who took my journal plugin for Claude Code and ran with the idea, creating botboard.biz , a social media platform for your team's coding agents. They also put together an actual research paper proving that both tools improve coding outcomes and reduce costs. Harper is one of the folks behind 2389...and the person who first suggested to me that maybe I could do something about our coding agents' propensity for saying things like: His initial suggestion was that maybe I could make a single-key keyboard that just sends Back in May, I made one of those . When I added keyboard support to the Easy button, I made sure not to disable the speaker. So it helpfully exclaimed "That was easy!" every time it sent: But...you still had to press the button. I'm pretty sure the button got used for at least a day before it was...retired. But the problem it was designed to solve is very, very real. And pretty frustrating. Yesterday morning, sitting in 2389's offices, we spent a bunch of time talking about automating ourselves out of a job. In that spirit, I finally dug enough into Claude Code hooks to build out the first version of Double Shot Latte , a Claude Code plugin that, hopefully, makes a thing of the past. DSL is implemented as a Claude Code "Stop" hook. Any time Claude thinks it should stop and ask for human interaction, it first runs this hook. The hook hands off the last couple of messages to another instance of Claude with a prompt asking it to judge whether Claude genuinely needs the human's help or whether it's just craving attention. It tries to err on the side of pushing Claude to keep working. To try to avoid situations where it misjudges Claude's ability to keep working without guidance, it bails out if Claude tries to stop three times in five minutes. Testing DSL was...a little bit tricky. 
I needed to find situations where Claude would work for a bit and then stop and ask for my approval to keep working. Naturally, I asked Claude for test scenarios. The first was "build a full ecommerce platform." Claude cranked for about 20 minutes before stopping. I thought the judge agent hadn't worked, but...Claude had actually fulfilled the entire spec and built out an ecommerce platform. (The actual implementation was nothing to write home about, but I'm genuinely not sure what it could have done next without a little more direction.)

The second attempt fared no better. On Claude's advice, I asked another Claude to build out an HTML widget toolkit. Once again, it cranked for a while. It built widgets. It wrote tests. It wrote a Storybook. And when it stopped for the first time...I couldn't actually fault it. Slightly unsure how to test things, I put this all aside for a bit to work on another project. I opened up Claude Code and typed Claude greeted me like it normally does. And instead of stopping there like it usually would, it noticed that there were uncommitted files in my working directory and started to dig through each of them trying to reverse engineer the current project. Success! (I hit to stop it so that I could tell it what I actually wanted.)

Double Shot Latte will absolutely burn more tokens than you're burning now. You might want to think twice about using it unsupervised. If you want to put Claude Code into turbo mode, DSL is available on the Superpowers marketplace. If you don't yet have the Superpowers marketplace set up, you'll need to do that before you can install Double Shot Latte. Once you do have the marketplace installed, run this command inside Claude Code: Then, restart Claude Code so it can pick up the new hook.

0 views
Simon Willison 1 week ago

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

Hot on the heels of Tuesday's Gemini 3 Pro release, today it's Nano Banana Pro, also known as Gemini 3 Pro Image. I've had a few days of preview access and this is an astonishingly capable image generation model. As is often the case, the most useful low-level details can be found in the API documentation:

Designed to tackle the most challenging workflows through advanced reasoning, it excels at complex, multi-turn creation and modification tasks. [...]

- High-resolution output: Built-in generation capabilities for 1K, 2K, and 4K visuals.
- Advanced text rendering: Capable of generating legible, stylized text for infographics, menus, diagrams, and marketing assets.
- Grounding with Google Search: The model can use Google Search as a tool to verify facts and generate imagery based on real-time data (e.g., current weather maps, stock charts, recent events).
- Thinking mode: The model utilizes a "thinking" process to reason through complex prompts. It generates interim "thought images" (visible in the backend but not charged) to refine the composition before producing the final high-quality output.
- Up to 14 reference images: You can now mix up to 14 reference images to produce the final image. These 14 images can include the following:
  - Up to 6 images of objects with high-fidelity to include in the final image
  - Up to 5 images of humans to maintain character consistency

Max Woolf published the definitive guide to prompting Nano Banana just a few days ago. I decided to try his example prompts against the new model, requesting results in 4K. Here's what I got for his first test prompt, using Google's AI Studio. The result came out as a 24.1MB, 5632 × 3072 pixel PNG file. I don't want to serve that on my own blog so here's a Google Drive link for the original. Then I ran his follow-up prompt. I'll note that it did put the plate-sized cookie on a regular plate. Here's the 24.9MB PNG.

The new model isn't cheap. Here's the API pricing: it's 24 cents for a 4K image and 13.4 cents for a 1K or 2K image. Image inputs are 0.11 cents (just over 1/10th of a cent) each - an earlier version of their pricing page incorrectly said 6.7 cents each but that's now been fixed. Unlike most of Google's other models it also isn't available for free via AI Studio: you have to configure an API key with billing in order to use the model there.

So this thing is great at following instructions. How about rendering text? I tried this prompt, this time using the Gemini consumer app in "thinking" mode (which now uses Nano Banana Pro for image generation). Here's a share link - my prompt was: Infographic explaining how the Datasette open source project works

This is a great opportunity to test its ability to run searches (aka "Grounding with Google Search"). Here's what it created based on that 9 word prompt. That's a pretty extraordinary result for such a simple prompt. The text is all spelled correctly and rendered without glitches. The content is solid too - it even included logos for the most popular publish platforms, and a tiny thumbnail of the Datasette UI which is close enough for an infographic. "Data Ingestion (Read-Only)" is a bit off. Everything else looks great to me.

Here's the thinking trace for that image generation process:

Initiating Datasette Infographic: I'm starting a detailed infographic about the Datasette project. I plan to use Google Search to gather specific information about its core processes, which I will then visualize in a multi-panel diagram.

Developing the Datasette workflow: I am visualizing the Datasette workflow as described in the search results. I'm focusing on the conversion of data from different formats into a SQLite database, which is then presented as a web application complete with a user interface and JSON output.

Visualizing Datasette Workflow: I've sketched out the initial structure for the infographic, focusing on a flowchart style. The key components include data input from diverse sources, conversion to SQLite, the Datasette core application, its main features (web UI, JSON API, SQL querying), ecosystem plugins, and finally, deployment options. I aim for clear icons and labels to represent each step. Now I will focus on refining the flow and clarifying the relationships between the parts.

Defining Core Datasette Stages: I've sketched out the initial infographic structure. I have identified the key components, focusing on visual representation of the data flow and the key benefits and features of the system. I have made updates to the flowchart with clear icons and labels. Now I'm working to expand the data input section and refine the visual representation of each step.

Constructing Datasette Infographic: I am translating the current Datasette workflow outline into visual elements. The input section will show icons for different data formats. I've sketched the data conversion stage with clear illustrations of the necessary tools. I will add a depiction of the Datasette core application, showcasing its interfaces and functions. The plugin ecosystem and deployment options will follow. Now, I am refining the visual representation.

Analyzing Datasette's Architecture: I'm currently working on representing the Datasette open-source project's architecture visually. I've broken it down into key components: raw data processing, SQLite conversion, core application functionalities, plugin integration, and deployment methods. I'm focusing on creating a clear, linear representation to show the workflow.

Mapping the Datasette Flow: I've crafted a comprehensive graphic to illustrate Datasette's operation. It spans the data ingestion, SQLite conversion, main application features, plugin interactions, and deployment stages, aiming for clarity and a logical flow.

This note from the announcement post caught my eye: We believe it's critical to know when an image is AI-generated. This is why all media generated by Google's tools are embedded with our imperceptible SynthID digital watermark. Today, we are putting a powerful verification tool directly in consumers' hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.

Last night I used Nano Banana Pro to generate a fake photograph of raccoons stealing our food delivery, then scrubbed out the little diamond icon using the Apple Photos "cleanup" tool. I uploaded that to the Gemini app and asked "Was this image created with AI?". It replied: Yes, it appears that all or part of this image was created with Google AI. SynthID detected a watermark in 25-50% of the image.

Presumably that 25-50% figure is because the rest of the photo was taken by me - it was just the raccoons that were added by Nano Banana Pro.
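The API pricing quoted above (24 cents per 4K image, 13.4 cents per 1K/2K image, 0.11 cents per input image) is easy to turn into a back-of-envelope cost estimator. This is just arithmetic over the published figures; the function name is made up:

```python
def nano_banana_cost(n_images: int, resolution: str = "4K",
                     input_images: int = 0) -> float:
    """Estimated cost in US dollars, using the per-image prices
    quoted in the post (assumed current as of writing)."""
    per_output = 0.24 if resolution == "4K" else 0.134  # 1K and 2K share a price
    per_input = 0.0011  # 0.11 cents per input image
    return n_images * per_output + input_images * per_input

# One 4K image with the full 14 reference images:
nano_banana_cost(1, "4K", input_images=14)  # 0.24 + 14 * 0.0011 = 0.2554
```

Even maxing out the 14 reference images adds only about a cent and a half per generation; the output resolution dominates the cost.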

0 views