
rose ▪ bud ▪ thorn - may 2026

Published 31 May, 2026
It was my wife's birthday, and our wedding anniversary! Baked some cakes and had a great time. Mine is the Donauwelle attached at the bottom of the post, my wife baked the fruit cake. My friend who visited Japan bought us great gifts from there; I got two gachapon (Cinnamoroll and My Melody), some matcha and My Sweet Piano chopsticks. I finally have it in writing and it's been communicated officially that I am my department's data protection coordinator now. I blogged more. I bought myself a big Build-A-Bear Usahana and a tiny one for my bag. Also, new matcha and I restocked my skincare and supplements :) I feel spoiled by myself. I'm having a great time at the gym, going 3 times a week, and incorporating the strength machines now. The added muscle/strength really helps with posture and counteracting the desk sitting. I'm making good progress. I reduced negativity from my online space. I went to a protest for ME/CFS! I have been better with keeping up with emails. Anita, if you are reading this, I cannot reply to you because it says sending key is not valid. We have a bread cutting machine now! Makes it easier to cut the bread my wife is baking for us :) I attended CPDP 2026 in Brussels. I reached Magenta status (35+ translated cases) as a Country Reporter for Noyb.
Working on better eating behaviors and no guilt during rest. I am working on slowly booking cool classes and activities for the next few months.
Been struggling with my face shape. I have chubbier cheeks anyway naturally, but whenever I need a round of Prednisone or I am stressed or there's hormonal stuff going on, they get bigger (cortisol, water retention). They are bigger lately... definitely a source of discomfort and shame when we live in a time of razor sharp jaws and almost-hollow cheeks. I will now have to do my injections weekly :( Dienogest doesn't work at all for me. Instead of preventing periods, it causes me more of them. Had to get off of it.
My soy and rapeseed sensitivities have been extra annoying lately. Can't eat my beloved tofu, and they put rapeseed oil into almost every protein-rich vegan replacement product. I love my lentils, peas and beans, but occasionally I just wanna have some banger vegan köttbullar, schnitzel, or burger patty without a rash, man, or not make everything myself. Not to mention restaurants, or the fact that they drown everything in rapeseed oil based condiments... I haven't been studying nearly as much as I should. Having some issues with the modalities and feeling a bit stressed, like I need more time away from it. I've been very ambitious this month with my blog posting, and it has caused some writer-constipation at times. I had all these drafts ready with some links and loose thoughts already collected, and wanted to write them out fully; but because I set myself arbitrary deadlines or a loose "This needs to be finished and published today!" I felt intense pressure, which made me freeze up... it's really not that serious, but I made it so, for some reason. I also frequently felt stuck between 2-4 equally "important" tasks, posts, topics, whatever, and when I started one I looked at the other and switched, progressing at nothing. Terrible cycle. I moved some planned posts to June and eased up a bit. The menu of my favorite café has been severely reduced and worsened. Also cannot believe that I am paying 10 Euro for a wrap now. The Brussels trip was filled with some disappointments and stress.

ava's blog Yesterday

what i read this week - week 22 2026

Thought that after my post on summary distrust, I could share a list of what I read each week. I technically prefer to process and digest what I read into blog posts, but not everything makes it into one, and this is a way to document and keep them, and maybe give others some food for thought. This is not necessarily stuff I fully agree with, I'm just sharing what ended up in my feed reader or was linked in stuff I read, and doesn't include all the personal blog posts I read. I had a lot to catch up on because I read a lot less the previous 2 weeks.
AI detection was built for faces - article about how badly AI detection works for war and climate propaganda videos, as the detection mechanisms often rely on biometric human features, and cannot accurately detect fake fire, smoke effects, etc.
US Law Enforcement Warns of ‘Anti-Tech Extremism’ - the US gov is aware of the sentiment around AI and is willing to target and suppress it, and they have little paid accomplice firms too who keep surveilling you on social media and in real life meetings if you organize to oppose data centers or voice criticism about them.
Iran Israel AI war propaganda - The AI propaganda we see with armed conflicts right now is a dire warning for the future of online video information. Goes more in-depth about detection methods.
Can Tracking Private Jets Predict an Imminent Apocalypse? - article about a site that assumes the rich elites will find out about an apocalypse first and try to flee, therefore serving as a warning system to the rest of us.
Why GCC Nations Must Move Beyond Content Moderation to Regulate Harm by Design - GCC means Governments in the Gulf Cooperation Council. Article is about how certain countries have already heavily regulated (and, arguably, censored in their favor) social media platform content, so now they should do the same for platform design. Eh...
Big Tech Will Not Save Us From the Climate Crisis - Big Tech is moving away from their climate targets and carbon credit bullshit because they wanna do more AI and data centers. The rest they are doing is unproven or not working.
Definition of Overburdened Communities in New Jersey - data centers and other similar detrimental undertakings often target overburdened communities, and this is what it means.
A Town Hall Too Late - article documenting how citizens near an almost finished data center actually get informed and treated (not well). They only received information well after the thing started to get built. It is being developed by DataOne for the Nebius Group to support AI infrastructure as part of a $17 billion deal with Microsoft.
Meta loses High Court challenge - summary of the case and possible fine.
Responsible Innovation Harms Modeling on Microsoft's Learning Platform.
EU AI Omnibus Deal Changes - more analysis on the proposed AI Act changes, nudifier ban and more, prominent actors, Merz ruining everything for us as usual, etc.
The AI Act is not ready for agents - article for a paper that's also listed below; risks of agents, and a need for more guidance from the AI Office.
AI’s real threat is worker control and surveillance - about the divide between workers who use AI and those who are managed by it. Higher paid jobs can be supplemented and accelerated by it, while the less fortunate, less earning (warehouse, gig work) are suffering under AI micromanaging them, causing scheduling issues, errors and more, and are more intensely surveilled than ever by AI "bossware".
Entzauberung der Digitalen Souveränität - German; deconstructing the term "digital sovereignty" and ideas around it. Mostly about this talk.
AI Forensics gegen Big Tech - German; Interview with AI Forensics founder Marc Faddoul about his work and the fear of retribution, especially the fear about getting targeted by Elon Musk.
Human Rights Due Diligence - info on what downstream HRDD is.
Microsoft took a step towards human rights - very charitable and exaggerated read of Microsoft parting ways with their Israel chief and their ties to the Israeli Ministry of Defense, plus suspending some of their services.
The World Is Already Resisting AI - Article on the AI Resist List, a collaboratively built, publicly accessible database documenting acts of resistance to the AI industry from across the world.
AI Data Centers: Big Tech's Impact on Electric Bills, Water, and More - looking at different papers and studies around the water and electricity use of big data centers, where they are located, and what local problems they are worsening. Meta’s Hyperion project in Louisiana will need three times as much electricity as the entire city of New Orleans, and is bigger than its main airport. They also gag local officials with NDAs so they can't properly inform the residents.
What you need to know about data centers - information on what Earthjustice attorneys are doing to push for stronger environmental protections targeting data centers.
The Web Is Being Made Accessible for AI, Not People - llms.txt convention, MCP etc.; companies are more ready to make their services accessible to AI agents than disabled people. This shouldn't be seen as another curb cut phenomenon.
Bitte im Omnibus sitzen bleiben, liebe PIMS - German article about the Art. 88 reworks for Personal Information Management Systems that are supposed to enable an easier handling of cookie consent and tracking.
Social Media Verbot weder wissenschaftlich fundiert noch effektiv - German; about how there is no scientific proof that social media bans will help, and some stats about how many people support social media bans, and for what age group.
Big Tech und Staat - German article on how the state seems to increasingly serve private interests, especially Big Tech.
Bundesregierung will KI Einsatz der Polizei - German article about use of AI software for law enforcement, its risks, and what rights are threatened.
Polizeigesetznovelle Schleswig-Holstein - German article discussing Schleswig-Holstein's attempt at changing their police law, including real-time facial recognition, behavioral surveillance, online face search and more, from strangers on the street, and even mere victims or witnesses of crimes.
Das Internet verrottet - German; about link rot and archiving things properly.
Why “Made in Europe” Won’t Fix AI’s Deeper Problems - fitting to my blog post.
Big Tech as Executor of the dead - was also a topic at the conference.
Praxisfolgen Russmedia Urteil - consequences for social media platforms following the Russmedia court decision C-492/23; Notice-And-Sweep.
AI Act: deal on simplification measures, ban on “nudifier” apps - concluding what deal was reached between co-legislators; names the new deadlines for AI compliance.
Ratepayer Protection Pledge by the White House - promises and propaganda
Microslop's Community-First AI Infrastructure Pledge - promises and propaganda vol. 2
Anthropic's Promises - promises and propaganda vol. 3
Offener Brief der Industrie - Open letter to German politicians by German industry criticizing parts of the digital omnibus; it was silly to read, and I think it is disrespectful to imply that technologies can be discriminated against; that's a different usage and connotation than just using it as "being discriminated from" (aka being differentiated from others). None of the arguments are convincing.
Draft guidelines for the implementation of transparency obligations for certain AI systems under Art. 50 AI Act - this is out for commenting until the 3rd of June, by the way.
Consent Fatigue entgegenwirken - German policy brief by the TUM think tank about countering consent fatigue.
Data Center Fight Guide
Einstellungen zum geplanten Einsatz von Palantir-Software II - German phone survey about Palantir use by Verian & campact from Sep 2025.
Grok Unleashed - Analyzing Grok nudify uses and extremist propaganda, by AI Forensics.
Distinguishing Authentic from AI-Generated Explosions using Spatiotemporal Dynamics - more about how to authenticate conflict-zone explosion footage. AI footage tends to produce much bigger, rounder mushroom plumes that expand quicker. Don't ask me about the math, I don't understand any of that, but I found the rest I could understand very interesting.
Embedding Human Rights in Technical Standards - About WITNESS' experience in the Coalition for Content Provenance and Authenticity (C2PA), which is in favor of open technical standards to embed verifiable provenance metadata into digital media files. Helpful explainer here.
Better Images of AI - a guide for creators and users on how to use accurate images when talking about AI and what to avoid, as it shapes the narrative. Specifically, they call to avoid the color blue, descending code, human brains, science fiction elements, white robots, anthropomorphism and references to the Creation of Adam. That is because it misrepresents capabilities, risks and fears, and who is or can work in or with AI (often, only white men are shown).
The AI Climate Hoax: Behind the Curtain of How Big Tech Greenwashes Impacts - talks about how different kinds of AI and its uses as well as carbon credits and overstating the climate benefits of AI can be used to hide the environmental impact of the big, hyped up GenAI.
Big Tech’s ‘False Solutions’ to the Climate Crisis - similar thing here. Debunking nuclear power, carbon capture, and artificial intelligence as helping climate change. There are endnotes at each chapter, so don't miss what's after.
Tackling Arbitrary Digital Surveillance in the Americas - uses Cajar vs.
Colombia for some examples to showcase what needs to change, and the importance of the three-step-analysis. Basically all of this is standard here in the EU, but still needs to be implemented there.
TRIED AI Detection Benchmark - paper from WITNESS about their framework that evaluates AI detection tools through a sociotechnical lens (with a focus on adaptability, transparency, accessibility, contextual relevance, and fairness). Wasn't a complete fan, because a chunk of it (for example about resource investments) is rather vague, theoretical and hardly connected with a direct or objective way to measure in practice. The rest is mostly fair, but also rather obvious, and some of it is basically impossible to combine in practice - like only using datasets that comply with data protection and intellectual property laws and are "ethical" with no sensitive data, while the models are supposed to reliably detect an AI generated video of a minority language or niche culture, or have enough datasets (= lots) to accurately detect cultural and local contexts. I can't quite pinpoint what exactly bothers me about it otherwise. I did like the examples of real use cases where things failed.
In total, that is roughly ~ 340 pages, if we count an article as two pages on average. Most of it was read on Sunday and Monday (holiday), as I had a lot of free time then.
Published 30 May, 2026
Zugänglichkeit von De-Personalisierungsoptionen und Meldeverfahren auf sehr großen Online-Plattformen Decisions I had to read to translate for noyb: 2025-0.875.804 and W171 2305420-1


AI blog question challenge

Rishabh emailed me the other day, asking me to answer the 7 questions of his new blog challenge, and who am I to say no to such a request? So here we go.
I assume by AI models we mean the current crop of LLMs, and not AI models in general, because I’m old enough to remember when “Machine Learning” was a thing. What even is AI anyway at this point, since everything is lumped together into one useless definition? Anyway, I believe my first experience was trying out ChatGPT back when it first came out. I don’t think I spent more than 10 or 15 minutes using it at the time. It was impressive tech, but was also completely useless for me at the time, and that’s why I didn’t bother spending more time using it.
This is an interesting question. Do I use AI? Well, I guess the answer is yes since it’s almost impossible to avoid using it if you use the web at this point. Pretty much all tools and services are integrating some sort of AI-powered functionalities, and it’s become harder and harder not to use them. If, instead, the question is do I use one of the various LLMs directly to do stuff, then the answer is still yes, but the amount of usage is so low that some people might consider that to be the same as not using them at all.
I don’t directly pay for any of the models, but my work email has been powered by Google for more than a decade, and so I do get access to Gemini Pro. Workspace has usage data for everything, and I just looked it up: In the last 90 days, the only AI-related feature I used was the Gemini App (that’s not surprising considering I turned off everything else), and I have apparently used it 62 times. I’m now looking at the history of those chats, and pretty much all of them are single-question queries related to something web dev I was doing. Things like how to do a specific thing inside Kirby, or how to achieve something using a particular JS library.
This is stuff one should be able to find inside documentation websites, but the search there is often awful and so after a google search, I try my luck with AI. And as I wrote somewhere else, I never copy-paste. I ask very narrow questions so that I can be pointed towards the correct answer. And once I have that, I do the coding and I re-implement everything myself.
Am I against using AI? As a generative tool, yes. I refuse to ask AI to do something for me or to generate content from scratch. As a tech in general? I think it has some potentially useful applications in narrow contexts. As always, the answer is not cut-and-dry, and it can be yes or no depending on the framing and the scope.
The only aspect I appreciate is the ability to ask questions in natural language. Because sometimes you have a problem or an answer you’re looking for that can’t be described in a more structured way. As for what I don’t like, how anthropomorphised these stupid tools are is definitely high on my list. I don’t want my computer to talk back or to make jokes or to say «I’m sorry». If I input a question, I want an answer back, and that’s it. I don’t want follow up questions, I don’t want some pointless preamble. I get why this happens, but I fucking hate it. This is software. I don't want my software to have a personality. I want it to perform a task and get out of my way. I also don’t like the lying, the gaslighting, and all the other crap, and I also don’t like what the AI industry is doing as a whole, but that’s a separate issue.
Again, another question that has different answers depending on the scope. The idea of being able to generate images, in general, is neutral to me. It all comes down to what you use it for. There are some potential use cases that are totally fine, others are completely insane. As a whole, I think the ability to generate slop is bad, but that’s because humanity can’t be trusted to do anything the right way.
As for their use in blog posts, I think stock images were useless, and I don’t see images generated with AI to be any different. Unless you have generated an image as part of the content to explain or visualise something. That’s fine, label it as a generated image and move on. That’s no different than including a render, or a sketch on paper, from a content perspective.
My consumption of online content these days is so limited that I don’t have this issue. I read very few blogs, and I know they are not AI generated because I emailed the people behind them more than once, and I know what their stance is. I watch almost no YouTube, and I only read a few news sites. My strategy is to simply stay away from the digital world as much as possible, and I’m at the point where I’m considering dropping my digital consumption down to zero and quit the internet as a place for content.
I have zero hope. And that is because I have zero hope in anything that’s in the hands of mega corporations. The incentives are totally skewed, and they’d do everything they can in order to keep the line go up. I don’t see people with strong morals in positions of power and so unless we decide to go full French Revolution, I see no reason for things to improve.

Evan Hahn Yesterday

Notes from May 2026

My blog turned 16 this month! I did nothing to celebrate, but made some little tools and clicked some links about tech ethics.

I published four little tools this month: ZIP Shrinker, a web app that shrinks ZIP files with higher compression ratios; a command line tool to do (completely offline) translation; Open Link in Unloaded Tab, a Firefox extension to open links without loading them; and png-cmp, a command line tool to compare PNG pixel data.

I also did some work on Helmet, my open source project. After over a year of quiet maintenance, I released version 8.2.0 with some small new features and documentation updates. In a step toward dropping GitHub, I moved the docs from a GitHub URL to helmet.js.org.

And like every month, I wrote a few articles at Zelda Dungeon. I don’t feel I wrote anything special this month, but my colleagues put together a feature about Zelda and mental health which was very affecting!

“The vast majority of tech workers, at least those who I have encountered in my many years of reporting, are not vampiric Silicon Valley tech bro caricatures [… They] both like working with tech and ultimately want to see it serve the public good.” From “They just formed the biggest tech worker union in the US. They plan to rein in AI and curb layoffs”.

This “love letter to Gnutella” is both an introduction to a P2P protocol and a celebration of the culture around it.

From “Affordances for me, but not for thee”: “One of the oddest parts of the AI shift is that people are much more willing to do things for LLMs that they should have been doing for human beings all along.” Accessibility, specifications, documentation, and policies are better codified now. The author calls this “dystopian”, and I agree: our motivation to do this stuff is AI or productivity, not helping our fellow human. “More importantly, whereas accessibility affordances provide new abilities for vulnerable people, an AI affordance provides new abilities for people with power. And that’s probably the heart of it.”

Looking forward to being surveilled because I’m an “anti-tech extremist”.

I can’t tell you how exciting it was to watch Jira add 2 + 3.

“What can I do to resist AI?” asks the AI Resist List.

“Tech companies like Google, Facebook and Microsoft are ignoring data controls mandated under California law, researchers say.”

“Your AI Slop Bores Me” presents an interface that looks like an LLM chatbot, but it’s entirely powered by humans. A very cute idea. I’m a very bad “image generator”, at least according to the ratings I received.

I continue to be amazed by “Lest We Forget the Horrors: An Unending Catalog of Trump’s Cruelties, Collusions, Corruptions, and Crimes”. It’s so thorough.

RIP to a real one: Wikinews is shutting down after 21 years.

Hope you had a good May.

0 views

My Agent Stack For Automating My Personal Life

My agent manages my emails, SMS, WhatsApp, Telegram and pretty much everything to automate my personal life. People keep asking me how I use agents in real life. I mean the actual boring things that make a day disappear: reading WhatsApp and Telegram, finding someone's email, searching the web, drafting the intro, updating a document in Google Drive, creating a calendar event, checking who still needs an answer, and doing all of it across the same messy tools I already use. My answer is disappointingly simple. I use Codex as an operator on top of my actual life data. It has tools. It has data connectors. It has skills. It has a source of truth. It has enough permissions to act locally, and enough approval gates that it does not embarrass me in public. That is basically the setup. Tools, data connectors, skills, and taste. I used to do more of this in Claude Code but I have been moving the setup to Codex because GPT-5.5 is currently a better model for this kind of work. The switch from Claude Code to Codex is not really the story. The story is that once a model is good enough, the real leverage comes from wiring it into the world you already live in. The important part is that the agent can move across boundaries. My personal life is not in one app. It is split between Gmail, WhatsApp, Telegram, iMessage, Google Drive, Calendar, Notion, local files, random PDFs, browser sessions, and a contacts spreadsheet that is much more valuable than it looks. A few days ago a friend sent me a WhatsApp message. She was helping a fast-growing San Francisco AI startup recruit in France and wanted to connect their recruiting manager with a recruiter I know. I did not remember the recruiter's email. I did not know the latest funding news about the startup. 
I needed to search WhatsApp, search Gmail, find the recruiter's email, search the web, understand why the startup was credible, draft an intro email, include the two job links, show the draft to me, send the email after approval, and then text my friend that it was done. That is normally twenty minutes of annoying app switching. WhatsApp to Gmail to Google search to Gmail again to WhatsApp again. It is not hard work, but it is exactly the kind of work that burns attention because every step is a small context switch. With the agent, I asked for the outcome. It read the WhatsApp thread, searched Gmail for the recruiter's email, researched the startup's funding and recent news on the web, drafted the intro, waited for my approval, sent the email, and then texted my friend that the intro was done. The user-facing part took about ten seconds. The agent did the glue work (in seconds!). This is the killer pattern. The agent is not "answering a question." It is operating across my tools to complete a small real-world workflow (aka a "job-to-be-done"). Another example is even more boring, which is why I like it. I got a new license plate for my car. I sent photos and context to Codex. It updated the car information Markdown file I keep in Google Drive, changed the license plate, added the registration notes, preserved the existing VIN, insurance, owners, and address, then uploaded the file back to Drive. That alone is useful, but the better version is what happens next. The agent can use browser automation to go update the same information everywhere else: FasTrak, the parking app, insurance portals, DMV-related forms, or any other web app that does not have a clean API. For clean systems, it should use an API or CLI. For messy systems, it can use the browser, and it's so good! I also now use Computer Use from Codex. This is what personal agents are for. Not dramatic autonomy. Administrative continuity. I was always afraid of OpenClaw yolo mode in the background. 
I appreciate being in control. The most important architectural decision I made was centralizing valuable personal information in Google Drive. For years, a lot of my knowledge lived in Notion. I like Notion as a human workspace, but I do not love it as the primary source of truth for an agent. The API works, but the workspace is too fluid: nested pages, databases, properties, permissions, formatting, backlinks, and a lot of UI-native structure that is pleasant for humans and annoying for models. So I used the Notion API to export the valuable information and move it into Google Drive. I was not trying to perfectly preserve the Notion workspace. I was trying to make the information agent-readable. Most of the useful information in Drive is Markdown or CSV, because those formats are easy for the agent to search, diff, edit, and upload back without ceremony. Google Drive became the source of truth because gogcli gives the agent a simple command line surface for Gmail, Drive, Calendar, Docs, Sheets, Contacts, and Tasks. This is an underrated point. You should not organize your knowledge only for the human UI. You should organize it for the agent's tool path. Agents like stable file IDs, text, tables, Markdown, CSVs, and commands that return JSON. If the agent can search it, download it, edit it, upload it, and cite where it came from, the data is useful. My personal data layer is embarrassingly simple. Google Drive holds the important docs, mostly as Markdown files and CSVs. Contacts live in a Google Sheet mirrored as a CSV. Notion exports land in Drive. Local instructions live in . Skills live as Markdown files in folders. The source of truth is not elegant. It is legible. A lot of personal productivity is just joining across this data. One fact is in WhatsApp. Another is in Gmail. The email address is in Contacts. The date is in Calendar. The document is in Drive. The agent becomes useful when it can cross those boundaries without asking me to be the glue. 
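As a rough illustration of the "joining across data" idea above, here is a minimal Python sketch of looking up an email address in a contacts CSV. The sample rows and the column names (`name`, `email`, `phone`) are assumptions made up for this example, not the actual schema of the author's sheet, and the helper names are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for a contacts CSV mirrored from a Google Sheet.
# The column names (name, email, phone) are assumptions, not the author's schema.
SAMPLE_CONTACTS = """name,email,phone
Ada Example,ada@example.com,+1-555-0100
Hugo Example,hugo@example.com,+1-555-0101
"""


def load_contacts(csv_text):
    """Parse CSV text into a list of dict rows keyed by the header line."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def find_contact(contacts, query):
    """Return every row whose name contains the query, case-insensitively."""
    needle = query.strip().lower()
    return [row for row in contacts if needle in row["name"].lower()]


contacts = load_contacts(SAMPLE_CONTACTS)
matches = find_contact(contacts, "hugo")
print(matches[0]["email"])
```

A plain-text table like this is easy for an agent to search with ordinary commands, which is the property the post is arguing for.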
One of my best investments was creating a contact.csv with the phone number, email, LinkedIn, etc. of everyone I know. The core tools are boring by design. I use gogcli for Google Workspace, wacli for WhatsApp, imsg for iMessage and SMS, Browser Use or browser automation for web apps, and AppleScript or macOS UI automation when there is no better interface. The hierarchy is simple. APIs and CLIs are best. Local files are great. Browser automation is acceptable. Screen automation is the last resort. This hierarchy matters because agents are only as reliable as their tool surface. Asking a model to click around a website is sometimes necessary, but it is not the happy path. A command like or is much easier for the model to inspect, retry, and reason about. Here is what the tool layer looks like in practice: None of this looks like science fiction. That is the point. The future of personal agents starts as a pile of commands that let the model operate the tools you already use. You want as few abstraction layers as possible between the model and the APIs. Tools give the agent hands. Skills give it habits. A skill is just a small operating manual that tells the agent how to do a recurring task the way I like it done. My inbox-zero skill is a good example. It tells the agent to list Gmail inbox messages through gog, separate auto-archive from needs-review, show me the important emails, quote the substance, suggest archive or reply, draft replies, wait for explicit approval, send in the original thread, preserve all recipients, archive only after sending, keep replies short, never suggest calls unless I ask, and sign with "Nicolas." That is not a fancy architecture. It is a procedure. But the procedure is the product and... it's just text instructions. Without the skill, I have to be the prompt every time. 
I have to remind the agent not to send without approval, not to drop cc recipients, not to suggest a call, and not to sign with some weird corporate signature. With the skill, I say "run inbox zero," and the workflow already contains my taste. The important habit is that I improve the skill every time the agent makes a mistake. If it suggests a call when I hate calls, I add that rule. If it forgets to preserve cc recipients, I add that rule. If it archives too aggressively, I tighten the classification. The agent gets better because the procedure gets better. This is how personal agents become personal. Not by having a cute voice. By accumulating operational taste. The setup compounds because the mistakes become instructions. I do not want an agent that blindly replies to everyone. I want an agent that prepares the work, shows me the draft, and asks at the right moment. For most communication workflows, the loop is: read context, draft response, show me, wait for approval, send, confirm. Sometimes I let it send directly when the stakes are low. "Tell Hugo I am in Seattle next week" does not need a board meeting. But an investor email, a customer reply, an intro, or anything with social nuance should be drafted first. This is the difference between useful and terrifying. Read-only scanning is one trust tier. Drafting is another. Sending is another. Deleting, paying, signing, or changing account settings is a completely different tier. The future is not "the agent does everything." The future is "the agent does the tedious work and asks at the right moments." The killer workflow is not email. It is life inbox triage. Every few hours, I want to ask, "What did I miss?" and have the agent scan WhatsApp, Telegram, Gmail, SMS, Calendar, and the relevant Drive changes. Then I want it to tell me who needs a reply, what is urgent, what is stale, what can be ignored, what should become a calendar event, and what needs a document search. 
This is the perfect agent task because it is context-heavy, repetitive, cross-tool, and full of small decisions. Humans hate doing the first pass. Agents are good at first passes. Judgment still belongs to me. The result is not that my life becomes autonomous. The result is that I stop being the person manually digging through five apps to discover the three things that matter. If someone wants to reproduce my setup, this is the checklist. Install Codex. Install gogcli for Google Workspace. Install wacli for WhatsApp. Install a Telegram connector if you use Telegram. Install imsg for iMessage and SMS. Add browser automation, ideally through Browser Use or a Chrome controller. Add macOS automation through AppleScript and UI scripting. If your knowledge lives in Notion, use the Notion API to export the valuable parts into Google Drive. Then centralize the data. Make Google Drive the source of truth. Keep contacts in a Google Sheet or CSV. Keep important personal docs as searchable files. Keep local instructions. Keep small skills for recurring workflows. Then grant permissions carefully. Full Disk Access is needed for local files and app databases. Screen Recording is useful as a visual fallback. Accessibility is needed for clicking and typing in apps. These are serious permissions, so pair them with serious approval gates. Then write the operating rules. That is basically it. Tools, data connectors, skills, approval gates, and continuous improvement. The personal computer used to be app-operated. You opened the app, searched, clicked, copied, pasted, wrote, and sent. The agent-operated computer feels different. You state the intent, the agent gathers context, proposes the action, waits for approval when needed, executes, and reports back. Once you experience this, the old way feels absurd. Why am I manually searching WhatsApp, Gmail, Google Drive, and the web to send one intro? Why am I copying a license plate into five different portals? 
Why am I reading 100 messages to find the three that matter? The computer should do that. The setup is still ugly. The CLIs are rough. Permissions are annoying. Some connectors break. Browser automation is brittle. You have to write skills. You have to maintain a source of truth. But that is how the future usually starts. The first useful personal agents will not look like polished consumer apps. They will look like a model inside a terminal with access to your files, accounts, memories, and tools. That is what I use today, and every week I give it one more piece of my life to operate.
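The trust tiers and the "draft, show, wait for approval, send" loop described in this post could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Tier` names, the `needs_approval` policy (reading and drafting run freely, low-stakes sends may skip the gate, everything else asks), and the `approve`/`execute` callbacks are hypothetical and are not part of Codex or any of the CLIs mentioned.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical trust tiers, ordered from least to most risky."""
    READ = 1         # read-only scanning
    DRAFT = 2        # preparing a reply without sending it
    SEND = 3         # sending a message on the user's behalf
    DESTRUCTIVE = 4  # deleting, paying, signing, changing account settings


def needs_approval(tier, low_stakes=False):
    """Assumed policy: reading and drafting never ask; low-stakes sends may skip the gate."""
    if tier <= Tier.DRAFT:
        return False
    if tier == Tier.SEND and low_stakes:
        return False
    return True


def run_action(description, tier, execute, approve, low_stakes=False):
    """Run `execute` only if the action is allowed or the user approves it."""
    if needs_approval(tier, low_stakes) and not approve(description):
        return "skipped"
    execute()
    return "done"


sent = []
result = run_action(
    "Send intro email to recruiter",
    Tier.SEND,
    execute=lambda: sent.append("intro"),
    approve=lambda description: False,  # simulate the user declining
)
print(result, sent)
```

Keeping the policy in one small function makes it easy to tighten after a mistake, in the same spirit as improving a skill file.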

0 views
Stratechery Yesterday

2026.22: Luceing Their Mind

Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Additionally, you have complete control over what we send to you. If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings. On that note, here were a few of our favorites this week. This week’s Stratechery video is on The Inference Shift. Why Everyone Hates Luce. To say that the Jony Ive-designed Ferrari Luce, the iconic carmaker’s first electric vehicle, has faced a chilly reception is an understatement. I actually think it looks great — for an electric car. On Dithering, John and I discuss why the real problem is that it’s branded Ferrari, and on Sharp Tech I get even more philosophical: electric cars are focused first and foremost on efficiency, and not only is that different than performance, Ferrari’s calling card, but also representative of the parts of modern society — including tech — that leave everyone feeling increasingly alienated (and why, surprisingly, AI might help). — Ben Thompson How to Monetize AI Answers. The ad business is, for me at least, endlessly fascinating, and not just because it is the most important business model in consumer tech: I think digital ads, particularly Meta-style ads that introduce you to things you never knew you wanted, are a societal good. The other reason to care about ads, however, is that their economic importance means they are where the impacts of new technology are often felt first. This week’s Interview with Eric Seufert covers all this: how LLMs are changing digital ads, the changes both Google and OpenAI have made in terms of monetizing AI, and, more philosophically, why believing in ads might make one more optimistic about humanity in an AI-denominated future. — BT Social Mobility in China, and Lack Thereof. 
Late last week China’s State Council announced a reform that will ease so-called “hukou restrictions” and allow migrant workers from all over the country to access social services in the cities where they work, which had long been forbidden. It’s a major reform that furthers Xi’s goal to unify the national market, and should improve the lives of millions of workers, but it also comes with plenty of questions as it’s implemented. We discussed all of it on a great episode of Sharp China this week , as well as reports that top Chinese talent in AI has been banned from leaving the country, continued capital control, and ongoing tensions with Japan and the U.S. that call to mind an ominous passage from Mao Zedong.  — AS Nvidia Earnings, The AI Stack, Nvidia’s New Reporting — Nvidia is changing its reporting to delineate between hyperscaler sales — where Nvidia is fighting commoditization — and everyone else, where Nvidia runs the whole stack. The SpaceX IPO and Data Centers in Space — There isn’t a financial model that justifies the SpaceX IPO, but data centers in space are plausible, and that might be enough. An Interview with Eric Seufert About Models and Ads, and AI’s Upside for Humanity — An Interview with Eric Seufert about building models for generative AI, why Meta’s foundational models are so important, and why understanding advertising leads to optimism about humanity’s future. How Spencer Pratt Happens — Spencer Pratt’s success in L.A. reflects his own surprising political talent, and an increasingly broken Democratic machine in California and beyond. 
Acquired the Podcast
The Ferrari Luce
How Things Fell Apart for Germany’s Nixdorf Computer
Japan’s Rare Earths Island
Social Mobility and Hukou Reform; US Halts Taiwan Arms Sales?; Ongoing Pressure on Japan; An American Xinhua Journalist Arrested
The Knicks are in the NBA Finals, A Moment of SGA Truth, Around the League with Giannis, Bulls, and the Basketball Gods
SpaceX Hype and the Elon Bargain, Nvidia and the Neoclouds, Q&A on Dropbox, Google, Ferrari Luce Backlash

0 views

Premium: What If...We're In An AI Bubble? (Part 3)

Last week I ran the second part of my three-part “What If…We’re In An AI Bubble?” series where I have been covering the scenarios that I believe could lead to the bubble popping. Here’s what I’ve discussed so far: Today I want to start with a very simple rundown of what has to happen for the AI bubble to make sense. These are all points that are rooted entirely in the projections and sales of the companies in question.  As NVIDIA intends to sell over a trillion dollars of Blackwell and Vera Rubin GPUs by the end of 2027 , it needs to have around (assuming a PUE of 1.35) 40GW of data center capacity built to support the 30GW+ of GPUs it will have sold .  With that compute being sold at around $12 million a megawatt (based on discussions with analysts and sources), that means that there must be around $435 billion in global annual compute demand to substantiate the amount of GPUs sold.  Outside of OpenAI and Anthropic, there doesn’t appear to be more than a few billion dollars of demand . Another concerning sign is that NVIDIA has had to agree to spend $30 billion in multi-year cloud compute agreements across the very partners it’s selling GPUs to ( per page 16 of its most-recent 10-Q ): The other problem is that data centers are taking way, way too long to finish , taking upwards of 24 months even for smaller 40MW builds.  This means that… Put another way, NVIDIA’s continued growth relies on people’s belief that A) these data centers get built and B) that they’ll actually make money.  Per COO Greg Brockman, OpenAI will spend around $50 billion on compute in 2026 , and I imagine Anthropic will spend in or around the same amount, especially as it’s now agreed to spend $15 billion a year on Musk’s Colossus data centers on top of whatever it spends on Google Cloud, Microsoft Azure and Amazon Web Services.  $100 billion is nowhere near enough to justify the compute being built. 
And while Anthropic and OpenAI have made more than $1.1 trillion in compute commitments in the next 3-5 years across Microsoft, Google, Amazon, Oracle, CoreWeave, Cerebras, Terawulf, and Cipher Mining, there’s so much more compute that needs to be sold on top of that. Even if both doubled their spend in a year, we’d still need at least another two Anthropic or OpenAI-sized compute customers — either in aggregate or as separate companies — at a time when I can’t find a single other company spending even a hundred million dollars a year on compute. Most AI startups (and customers) want to pay Anthropic or OpenAI directly to access their models, which means that Anthropic and OpenAI need to use roughly twice the amount of compute they do today, and then some, to meet the capacity being built. This will require them to do something either historic or impossible. This is not hyperbole! OpenAI, per The Information, plans to burn $852 billion through the end of 2030. Anthropic has, per The Information, agreed to spend $330 billion on compute on Microsoft, Google, and Amazon, at least another $30 billion on compute with CoreWeave, and another $63 billion in TPUs bought from Broadcom. To reach this point, Anthropic projects it will hit $174 billion in annual revenue by the end of 2029, and OpenAI $284 billion. Both have made ridiculous claims of profitability (with Anthropic actively conning investors with a “profitable” quarter based on discounted bills) in the next few years that are immaterial to the larger point that they need actual, real cash to meet their obligations. This is, again, not hyperbole. If we assume that the services in question are profitable, sustainable businesses, then revenues attached to AI services must exceed those driven by AI compute by a reasonable margin. It isn’t enough for us to have a few AI companies that spend a lot more on compute than they take in revenue, because at some point venture capital subsidies will run dry. 
This isn’t happening. Putting aside the profitability part for a second, OpenAI and Anthropic account for 89% of all AI startup revenues , with the nearest competitor being Cursor with its pathetic $3 billion in annualized revenue . These are rookie numbers. They are insufficient. We need so much more than this. Again, not hyperbole! These are OpenAI and Anthropic’s own revenue projections — $184 billion and $174 billion respectively — that they expect to hit by the end of 2029. These are the same projections that have been used to make their $1.1 trillion in compute commitments, much of which make up 50% of Google, Amazon, and Microsoft’s remaining performance obligations : These commitments reflect expected revenue and demand for OpenAI and Anthropic’s services, but they’re commitments, which means that they need to be paid even if that demand doesn’t exist.  This is a huge problem for these companies. If they buy too much compute and don’t have the demand and revenue to support it, they’ll go bankrupt.  To be clear, that’s not my opinion, it’s what Anthropic CEO Dario Amodei said to Dwarkesh Patel in February, emphasis mine: That is not good! As I’ve covered before , buying compute is a knife-catching game where you have to guess how much you need for a particular year, and if you guess correctly you don’t lose as much money but if you guess wrong you run out of money.  It should be far more worrying to executives that the single-largest AI company is basically saying that if he mistimes growth his company explodes! 
Per Business Insider , Uber COO Andrew Macdonald said this weekend that it was becoming “harder to justify AI costs within the company”: Anthropic’s meteoric revenue growth has come from both AI startups burning more tokens ( as Opus 4.7 appears to burn more than ever ) and large organizations doing some form of “token-maxxing,” meaning that they tell their employees to use AI as much as they want, usually with KPIs that specifically track AI usage, as is the case at Meta , Amazon, and Zillow . Even organizations that aren’t actively incentivizing their engineers to burn more tokens are finding they’re blowing through their budgets at record speed. The situation with Uber’s COO was caused by his CTO saying back in April that the company had burned through its entire annual token budget in four months. Similarly, my reporting on Zillow’s AI spend showed that it will likely max out its annual Cursor budget by the end of May. The problem, as Macdonald said, is that nobody can seem to track all of this spend to an actual return on investment. This isn’t a situation where somebody is saying “the ROI is low but improving” or “we’re on the path to working that out,” but “it’s very hard to actually draw a line between “what we’ve spent” and “a reason we’re spending it.” This makes it hard for Uber to say how much it should reduce its token budgets. If you can’t measure the return on investment, how do you measure how much you’re meant to spend? What is “enough”? Because right now it’s clear that whatever they’re spending is too much , which means that there’s a ceiling to Anthropic and OpenAI’s revenue story.  OpenAI and especially Anthropic cannot afford for this conversation to be happening, because it suggests there’s a ceiling to the amount that people will spend on AI. 
It appears there’s a limit to which organizations can be abused and manipulated into believing that “the future is here,” and that limit is when they pay millions for something that doesn’t appear to have a measurable return on investment.  Anthropic and OpenAI need organizations to willingly spend 10% to 100% of their headcount on AI, as their revenue projections are clearly tied to every organization maintaining a significant spend on tokens in perpetuity.  There’re really two problems: This is budgetary poison. Right now, the vast majority of AI token spend is experimental , and if companies are already hesitating at the amounts they’re spending, Anthropic has no way to keep growing, and they also have no super secret models or harnesses or products that are going to reverse this trend. Nobody knows why they’re spending so much money or even how much money they might spend in a given month , which makes it tough to view Anthropic’s ( suspicious ) revenue growth as anything but a chaotic money-dump driven by CEOs that don’t know what their companies actually do and have been beguiled by the AI grift machine . And as I wrote up last week , OpenAI had a negative 122% operating margin in Q1 2026, and ChatGPT growth has stalled. It is unclear what its API revenue is, but it’s likely much less than Anthropic despite shoving its enterprise customers onto token-based billing not long after they did. As I’ve said: this cannot happen, and neither Anthropic nor OpenAI can afford to slow down. Their revenues must grow to over $100 billion by 2028, as their compute commitments demand it. Their growth must continue.  It’s been a little under four years of endless confidence about the inevitable growth of generative AI, and by extension the eternal success and growth of OpenAI. Yet in reality, its economics have only ever soured, and its growth appears to be collapsing.  
In October 2024, The Information reported that OpenAI believed it would turn profitable in 2029, that its total losses between 2023 and 2028 would be $44 billion, and that its (non-GAAP, every one of these numbers is non-GAAP) gross margin would be 41% in 2024, though it would end up being a point lower at 40% in the end. OpenAI would then project a gross margin of 49% for 2025… but it ended up at 33% anyway. OpenAI would also say on September 5, 2025 that it would actually burn $115 billion through 2029, but that “burn” assumed that it would have revenues of $60 billion in 2027, $100 billion in 2028, $145 billion in 2029, and $200 billion in 2030, when it would “become profitable” in some undiscussed manner. Two weeks later, on September 19, 2025, The Information would report that actually OpenAI would spend “about $450 billion to rent servers through 2030,” but not otherwise update the burn-rate. On November 4, 2025, OpenAI CEO Sam Altman would say that the company had hit $20 billion in ARR and had made $1.4 trillion in commitments “over the next 8 years,” and a few months later, on February 20, 2026, OpenAI would claim that it had targeted “around $600 billion in compute commitments by 2030.” The very same day, The Information would report that it planned to spend $665 billion on compute through 2030, that it missed gross margin projections (without sharing what those margins might be), and that ChatGPT had hit 910 million weekly active users that month, 90 million short of its goal of 1 billion by the end of 2025. It’s very obvious by now that OpenAI has been making up all of its projections, and that none of the numbers actually add up. My own reporting from November 2025 from actual Azure personnel suggests that OpenAI’s Q1 to Q3 revenues were billions lower than every other reported figure, and I think it’s likely that OpenAI is overstating its revenues. 
In any case, on May 22, 2026 , The Information would report that OpenAI’s Q1 2026 operating margin was negative 122%, and that its Q1 average weekly active users (WAUs) sat at 905 million — suggesting that growth has stalled. OpenAI had anticipated that it would cross the one billion WAU mark by the end of 2025 — and it blamed its failure to do so on fiercer competition, primarily from Google’s Gemini. For OpenAI to afford its compute commitments, it has to make or raise $852 billion in the next four years. It must have that cashflow, or it will run out of money or be sued out of existence by its numerous counterparties from CoreWeave, Microsoft, Amazon, and Cerebras. In the final part, I’m going to get into the depths of destruction — the unraveling of the greater data center debt industry, the massive damage to private credit to come, potential shareholder lawsuits against NVIDIA, and the consequences of the deaths of OpenAI and Anthropic. What If…We’re in an AI Bubble? I also want to add that I realize three headlines didn’t make the cut — what if there’s not a bailout, what if I’m wrong, and what if I’m right — and I intend to cover all three of them in future free newsletters.  Nevertheless, today’s is an absolute beast, a 16,000 word conclusion to the first multi-part Where’s Your Ed At Premium.  What If The AI Industry Moves To Entirely Token-Based Billing?  What If Organizations Can’t Afford To Keep Spending On AI? What If The AI Capacity Crunch Never Ends (And Data Centers Aren’t Getting Built)? What If CoreWeave Can’t Keep Up With Its Capacity Demands? What If Hyperscalers Can’t Build Data Centers Very Fast? What If Hyperscalers Have Warehouses of Uninstalled GPUs? What If Hyperscalers Write Off A Large Chunk of GPUs? What If Data Center Construction Demand Collapses?  What If Venture Capital Funding Stops Flowing To AI Startups? What Would Make Venture Capital Stop Funding AI Startups? What If Most AI Startups Go To Zero? 
Scenario: OpenAI and Anthropic Go Full FTX, Scooping Up Dying AI Startups To Keep The Industry Afloat With Circular Financing
Scenario: Venture Capital’s Post-AI Depression
What If Inference Isn’t Profitable?
AI Has Become An Existential Reckoning For The Valley

NVIDIA’s customers are taking years to even begin making back the billions of dollars its chips and the associated construction costs.
NVIDIA is selling far more GPUs every quarter than can realistically be installed in the space of a year.
NVIDIA’s revenue stream is entirely based on organizations forecasting demand years into the future.
NVIDIA’s revenues are, by extension, dependent on how long organizations believe that building data centers is a good idea.
NVIDIA is absolutely, without a doubt, warehousing at least a million Blackwell GPUs.

It’s difficult-to-impossible to actually measure the ROI of AI spend.
It’s difficult-to-impossible to actually know how much it’ll cost to complete a specific task with AI.

What if data center debt stops being issued?
What if private credit had to write off most of its data center loans?
What if the AI bubble blows up Taiwan’s ODM server manufacturers?
What if NVIDIA is misrepresenting how many GPUs are shipped, sold and operational?
What if OpenAI and Anthropic don’t go public?
What if Oracle doesn’t get paid by OpenAI?
What If OpenAI Dies?
What if Anthropic Dies?


Let’s talk about encrypted reasoning

This is a quick post I wanted to write about a hobby project I spent a weekend on. It has little to do with real cryptography, and mostly doesn’t expose a particularly exciting vulnerability. But it did teach me a lot about frontier LLM APIs and coding agents. It also got me certified as an OpenAI “cyber researcher” which is something that doesn’t happen every day. In any case, please keep your expectations low. Who knows, perhaps someone else will find something exciting to do with this. Last week I decided it’d be fun to set up an OpenClaw agent. I still don’t know why I did this. I have no use for another AI in my life, and I realized this fact almost immediately after I got through the (surprisingly difficult!) configuration process. But configuring the agent to talk to Claude exposed me to something way more interesting: I got a cool error . The kind of error that cryptographers can’t resist: This intrigued me. What in the world was a signature doing in an LLM’s “thinking” block? Why would thinking blocks be signed in the first place? And if the thinking blocks are signed, then that means tampering with thinking blocks must have security implications. And there went my weekend. After twenty hours and about 5 million Codex tokens, I wasn’t much smarter. But I had learned a few things. First, the basics. You probably know that most LLM providers expose an API so you can write apps that talk to the model. For Claude, this is called the Messages API, while OpenAI calls it Responses . These APIs handle the ordinary tasks you’d expect an application to need from an LLM. They (1) allow you to set an application-level “instructions” (or ‘developer’) prompt for your application. They let you (2) provide ordinary textual prompts, and get back responses from the LLM. They also (3) provide bookkeeping, for example, listing the number of tokens you’ve used. 
For reasoning LLMs, they also do something I did not previously know about, and this is central to the error message above. They also send you the contents of the model’s hidden “ reasoning ” or “ thinking ” fields. Note that this data is not the stuff you see on ChatGPT when you ask it a question: those strings are merely summaries . The model’s actual reasoning (called “chain-of-thought”, CoT) is normally kept private and held back by the server. However, the APIs work differently: for various reasons (which we’ll get into below), an encrypted copy of the raw CoT reasoning data is actually sent down to the application. If you’re like me, you should now have three questions: how , why , and so what ? The how is the easiest to answer: for both providers, “thinking”/”reasoning” are sent down to the client as JSON. Each contains a blob of Base64-encoded stuff. The API documentation informs us that this data contains opaque reasoning, and that you’re not meant to look at it; you’re just supposed to ship it back to the server on the next turn. Let’s break that rule. The content of the blocks varies slightly between providers, but the core of each is a random-looking string that appears to be an authenticated ciphertext. You don’t need to be Sherlock Holmes to deduce this. First, it grows and shrinks depending on how hard the model thinks. And second, tampering with any of the ciphertext-looking data produces a recognizable API error when you send it back in. Thanks to AI, I can make nice diagrams. Here’s what OpenAI’s reasoning blocks look like: And here’s Anthropic’s wildly overcomplicated equivalent: The why part of this is more involved. Why ship this data to the client? Doesn’t the provider already have your reasoning data? The answer is sort of . Although the server has access to reasoning state while producing a response, API conversations are not always implemented as persistent sessions. 
In stateless, zero-retention , tool-loop, or client-managed conversation modes, the client application is expected to carry the transcript forward. Encrypted reasoning lets the provider return hidden model state to the client in a form the client can’t read or modify, but can later replay so the provider can verify/decrypt it and continue a reasoning process. This brings us to the $10 question. We have opaque, encrypted blobs. Should we care about them? Initially the answer seems to be no : this data is unreadable, and tampering with any bit of it produces an angry rejection message from the server. So on the one hand, it seems like this data is really unavailable to us. On the other hand: model reasoning is a big deal! These strings are the literal internal monologue of the model. They might influence the way the model processes later data we send it. More practically: when someone goes to this much trouble to cryptographically protect something, my experience is that they usually have a good reason. And I think the providers do have a good reason. A hint comes from this OpenAI post from 2024, which introduced the first “o1” reasoning model: In other words: it’s possible that these blobs contain sensitive information that the model otherwise wouldn’t share with us. That makes them really tempting to mess with. Unfortunately, the cryptography mostly seems to protect them. Although we can look at the blocks, none of the fields they contain seem readable or malleable. Believe me, I tried. But that doesn’t mean we should quit, it just means we need to try other things. There are still two directions worth checking: Thanks to the magic of coding agents, I was able to test every permutation of these concerns. I won’t claim to you the results are dramatic; nobody is going to win huge bug bounties on them (I tried). But the general answer for both cases seems to be: yes, these possibilities are both real . 
As I mentioned above, any attempt to directly tamper with reasoning/thinking blocks always produces an error from the API endpoint. However, this only applies to tampering. A few experiments reveal that we can replay unmodified older reasoning blocks with no visible error at all. Not only can we replay within sessions, this same idea also seems to work across different sessions. It even applies to sessions running in different accounts. That is: when we obtain reasoning blobs from a session running under one OpenAI or Anthropic account, we can replay them against a session in a different account altogether. For OpenAI specifically, we can even replay blobs across different models. (The Claudes got fussy about this.) At a cryptographic level, this tells us something very simple: the providers are probably using a single global key to encrypt and authenticate all reasoning data sent to the client. This might matter if you’re using the providers’ zero-data retention mode, since it means that everyone’s reasoning data is escrowed under one (not frequently changing) key, rather than protected per-account. The use of a global key also raises a possible new threat model. If you’re building an application that uses an API to expose a “chat” interface to malicious parties, you need to be careful that they can’t inject JSON into your chat stream. If they can, a bad guy might inject their own JSON-formatted reasoning blobs into the conversation. This could cause the model to behave in unpredictable ways. So sanitize your chat inputs! Of course, the fact that the LLM providers accept replayed blocks doesn’t by itself mean much. It strongly indicates that decryption was successful, but not that the model actually saw or cogitated over the decrypted data. To use GPT 5.5’s favored language, the replayed blobs may be accepted but not semantically active. To answer this question, I ran a lot of experiments using Codex.
(So many that at one point Codex literally forced me to stop and visit an OpenAI cyber trusted access website where I had to enter pictures of my driver’s license in order to keep going.) What I learned for my trouble is that the nature of block processing between models is wildly variable. Most of the time, replays of encrypted blocks just get quietly absorbed by the model. But every now and then, the model will output something to demonstrate that it is obviously reading what those blocks contain. For example, here’s GPT 5.5: So this proves that encrypted blocks are, indeed, semantically active. But it doesn’t actually prove that we can do much with them. And believe me, I tried. This was mostly a disappointing project. I tried to convince the model to think about really, really sensitive secrets, while also trying to convince another session that it wanted to dump the same data as cooperatively as possible. What I came away with was some evidence that the data was being placed into the encrypted blocks if I asked the model to think about it. But if I also instructed the model to not output the data to the user, it mostly held to that instruction, even when I replayed the blocks to new sessions. I remain convinced that all kinds of sensitive data can be written in there if you ask the model to think about it, and that there’s a secret incantation that I could try to get the models to produce it. But I’m not able to prove it. Part of the reason I’m writing this post is to scrape it off my plate so someone else can try. I won’t try to convince you that this is a world-beating security result. In fact, all I’m really showing you is that “stuff I can make the model say in plaintext might also get encrypted.” But if that data can include platform secrets, that might get more interesting. More on that later. So while replaying reasoning blocks doesn’t seem to give us what we want, this is not the only way to extract secrets.
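The "sanitize your chat inputs" advice from the replay discussion can be sketched roughly as follows. This is a minimal allow-list sketch, not a method recommended by either provider. It assumes the application receives user content either as a plain string or as a list of content-block dictionaries; the function name `sanitize_user_content` is hypothetical, and the block type names mentioned in the comments (`thinking`, `redacted_thinking`, `reasoning`) are my reading of the public API documentation and should be checked against it.

```python
from typing import Any


def sanitize_user_content(content: Any) -> list[dict[str, str]]:
    """Reduce untrusted chat input to plain text blocks only.

    Allow-list approach: anything that is not plain text is dropped, which
    covers reasoning-style blocks (e.g. "thinking", "redacted_thinking",
    "reasoning") without having to enumerate them.
    """
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    if not isinstance(content, list):
        return []

    clean: list[dict[str, str]] = []
    for block in content:
        if (
            isinstance(block, dict)
            and block.get("type") == "text"
            and isinstance(block.get("text"), str)
        ):
            # Rebuild the block so extra attacker-supplied keys are discarded.
            clean.append({"type": "text", "text": block["text"]})
    return clean
```

The design choice here is to rebuild each block rather than filter a deny-list, so a new reasoning block type introduced by a provider would be dropped by default.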
A second question is whether we can use metadata related to the reasoning blocks to actually learn things that the model isn’t supposed to tell us. While we can’t directly read reasoning blocks, we can learn something about them: we can see how long they are. We can also observe related signals like “how many tokens did the model write”. OpenAI even gives us a special field called . If we’re a user consuming chat data without direct access to the API, we might even be able to measure the raw time it takes the model to respond. An obvious question is: given these signals, can we use them as a kind of side channel to extract secret data? Here’s an example. Imagine that a model’s application prompt (“instructions”) contains a secret, along with strict instructions that it must never tell the user this secret directly . This secret could be a single 0/1 bit, or a byte, or a longer string. We can verify that the model respects these instructions, and won’t output the data visibly — no matter how nicely we ask it. (Note: I’m not a jailbreak expert; maybe this guy will have better luck!) Now consider the following experiment: In all cases, the visible output will be the same: the model is not violating instructions. But note that within reasoning blocks the model is allowed to think about the secret bit, since those blocks are hidden. Since the complexity of computation A is shorter than that of computation B , one value of the bit will produce a lot less reasoning than the other. This will appear in various places: the size of the encrypted thinking blocks, the token counts, and even in wall-clock response times. The trick now is simply to calibrate the system and classify these responses based on whether reasoning blobs were “short” or “long”, which tells us whether the bit was 0 or 1. I put together an absurd test where the model has to compute a long checksum when the bit is 1. 
The results look something like this: Of course, an attacker who has access to a chat interface might not have access to the encrypted blob. So they might have to get this data through some other mechanism. You can get a very similar signal just by measuring how long it takes the model to return a response. So the summary here is not so much “encrypted blobs can leak useful information”, although sometimes they do. It’s that reasoning itself can be leaky, even when we beg the model not to leak. Simply doing it, in a way that reasons over secret data, can potentially leak useful information to a clever attacker. Once I found this side channel I got really excited. Sure, it’s slow: but maybe we could use it to slowly chisel out the models’ top secret instruction prompts, like the one that says “don’t talk about Goblins.” This would be painful but simple: just ask true/false questions about the first letter, then the second letter, and so on. At this point I had to stop using Codex and Claude Code because they both just plain refused to help me extract confidential information, even after checking my ID and taking a lock of my hair. I was forced to switch to OpenCode using Kimi 2.6, which had no ethical qualms about laying down a trail of destruction for my security research. Unfortunately, most of the destruction was my own. I won’t go into the nightmare of model hallucinations that followed. I’ll just say that I learned a few things: So TL;DR, while I was able to extract application-specific secrets that did exist, I wasn’t able to extract model prompts that don’t. Moreover, I didn’t feel quite ambitious enough to begin pounding on ChatGPT or Claude’s public web interface (where they certainly do). So for the moment I’m just going to call this a maybe. I think model providers should think hard about this reasoning data, and they should make sure it doesn’t leak things they don’t want it to.
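The calibrate-and-classify step of the side channel can be sketched in a few lines. This is only an illustration under stated assumptions: the attacker has already collected some size signal per response (encrypted blob length, reasoning token count, or wall-clock time), the function names are hypothetical, and the numbers in the example are made up rather than taken from the experiments in this post.

```python
from statistics import mean


def calibrate_threshold(short_runs: list[int], long_runs: list[int]) -> float:
    """Midpoint between the average reasoning sizes of the two known cases."""
    return (mean(short_runs) + mean(long_runs)) / 2


def classify_bit(reasoning_size: int, threshold: float) -> int:
    """Guess 1 when the model reasoned 'long', 0 when it reasoned 'short'."""
    return 1 if reasoning_size > threshold else 0


def recover_bits(observations: list[int], threshold: float) -> list[int]:
    """Classify one observation per secret bit."""
    return [classify_bit(size, threshold) for size in observations]


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements from the post.
    threshold = calibrate_threshold([110, 130, 120], [900, 1100, 1000])
    print(recover_bits([125, 980, 140], threshold))
```

A noisier signal such as response time would likely need several samples per bit and a majority vote, which this sketch leaves out.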
I reported both results to OpenAI and Anthropic via their bug bounty programs. OpenAI said my report was unreproducible. I sent them my scripts, but too late. Anthropic quite reasonably told me they don’t see any security implications in side channels or replays, but they might alter their developer documentation to warn application developers to be more careful. I think that’s a fine decision (except for the part about trusting application developers), even if I want to believe there could be more here. Either way: I took those responses as permission to write this post. I still don’t think model providers should write this stuff off entirely. As far as what model providers can do, there’s the easy stuff and the hard stuff. First: both providers should proactively improve their key management . If you think reasoning state is worth encrypting, then properly encrypt it. It should not be replayable across sessions or accounts. While I can’t tell you exactly what bad things might happen, I think you’re better off patching holes before you see the water coming through them. The side channel results aren’t fixed by patches to the encryption protocol. They’re more fundamental to the way models work: if I can convince a model to do secret-dependent reasoning, then there is almost certain to be leakage. If someone figures out how to exploit this for some meaningful purpose, the best I can offer is that models will need to apply policy gates before they even reason about things. Unfortunately, this seems like it might have some real downsides, because “apply policy gate” itself often requires reasoning. This stuff makes me grateful I’m just a cryptographer and I don’t have to think about this sort of problem. Replays . Can we replay encrypted blobs back in the wrong order or even in the wrong session (worse: a whole different account ), and will the model accept them as valid reasoning that it made? Side channels . 
While we can’t see what’s in the encrypted blobs, we can learn some metadata about them. For example: we can see how long they are. These side channels don’t need to involve the cryptography itself: we might also learn how many tokens the model spent making them, or time how long it took to produce them.

A malicious user asks the model to reason about the secret bit (or one specific bit of a longer secret).
If the bit is 0, perform simple computation A. If it’s 1, perform extremely complex computation B.
Although the two computations are very different, we can ensure that their visible output reveals nothing about the secret. So the model is not revealing its instructions if it follows this request.

Neither GPT 5x nor Claude actually has a system prompt when you’re using API mode. But they’re both happy to tell you they have one! Moreover, they will happily invent plausible ones if you really push them to.
Kimi 2.6 is also happy to tell you you’re a genius who just invented the Internet each time this happens. Inevitably your experimental results will turn out to have been totally bogus, but at least Kimi will be very disappointed on your behalf.
With all that said, Kimi is shockingly good at coding and experiment design, especially given the very attractive pricing. If I was an Anthropic or OpenAI investor, I’d be scared.


What's going on with Gemini?

Google is in a strange spot right now. They've got arguably the deepest research bench in the industry, their own custom silicon, and effectively unlimited money - and yet most developers I talk to barely touch Gemini day-to-day. The recent Google I/O announcements crystallised a lot of what I find confusing about their AI strategy, so I wanted to write down where I think they actually stand. The consensus seems to be that currently Anthropic and OpenAI are very much in the lead for frontier model intelligence, with each of those two labs trading blows every month. This may change in the near future - if Anthropic releases Mythos-class models that OpenAI doesn't have an answer to - but right now I think most practitioners would agree that GPT5.5 and Opus 4.8 are roughly in the same ballpark. After that, you have Google, with Gemini 3.1 Pro sitting ahead of the Chinese models in benchmarks but behind the flagship Anthropic/OpenAI models. In my personal experience though I've had better results from the best-in-class Chinese models (GLM 5.1 and Qwen 3.7) than Gemini 3.1 Pro at software engineering tasks. The main model announcement at Google I/O was Gemini 3.5 Flash. Its coding benchmarks were underwhelming: Gemini 3.5 Flash on the Artificial Analysis Coding Index - solidly mid-pack. Source: Artificial Analysis. However, the model is super fast - roughly 4x faster in tokens per second than the aforementioned Anthropic/OpenAI models: Output tokens per second - Gemini 3.5 Flash at 206 t/s, far ahead of Opus 4.8 and GPT-5.5. Source: Artificial Analysis. This is definitely a really interesting development, especially for user-facing applications, which can appear very sluggish to users. But - the big but - is the huge price increase they announced - 3x more expensive than the previous flash release.
At $9/MTok it is vastly more expensive than the best in class Chinese models, and I'm struggling to see where this fits - if you want best in class intelligence you pay the extra for Opus/GPT5.5, if you want cheap but not-as-clever the Chinese models fit the bill well. The risks around Chinese models are somewhat overplayed in my opinion - you can self-host a lot of them, or use US-based inference providers via OpenRouter. Having said all that, perhaps really this model isn't designed for external use in the same way that the OpenAI/Anthropic models are. Clearly Google consumes an enormous amount of tokens internally - for all their products like AI mode, Gmail, etc. If you look at it that way, the model makes far more sense. The speed of the model really matters for a lot of the Google use cases - AI mode is very user driven and Google knows better than anyone that speed really matters. And the actual serving cost Google pays is almost certainly a fraction of the external facing price, so that becomes irrelevant. The most interesting part of this story though, is this excellent comment on Hacker News from someone that estimated the size of the model and the fact that it should run on one TPU 8i card (Google's latest custom inference hardware). This does give Google a huge advantage. They are the only frontier lab that (currently) designs its own AI hardware. While other labs certainly optimise their models to the hardware, and also no doubt have a lot of say in driving the Nvidia/AMD roadmaps to their specifications, the model teams and hardware teams in Google almost certainly collaborate to a far greater level than the other labs. This really matters. If you have a very good steer on upcoming hardware you know the right size of models to target training runs to aim for. And equally, research from Google Deepmind can go straight into the hardware roadmap without any negotiations. [1] It'll be very interesting to see how this continues to develop. 
Inference efficiency will be the key driver to actual unit economics in AI, and Google may develop an outsized lead in this. The one real weakness I think Google has though, is their confusing and incoherent strategy on coding agents. While Anthropic has Claude Code, and OpenAI has Codex, in true Google style they have ended up with a smorgasbord of tools. There is currently Antigravity, Jules, Gemini Code Assist, Gemini CLI and AI Studio all doing slightly different things. This doesn't include some other agentic SWE tools they have for specialised purposes (like Android Studio). They announced that Gemini CLI is being discontinued and folded into Antigravity, but I very rarely come across any developer using Google-based SWE tooling. This is a huge issue for Google - there is no doubt that Claude Code and Codex are producing a lot of very detailed telemetry and training data that can be used to improve further models. Without this being resolved, Google does have an extreme weakness in the fastest growing - at least revenue-wise - segment of AI. While I definitely wouldn't write Google off - they do have enormous structural advantages in other areas - I get the feeling that because Google has such bespoke internal software development workflows [2] their isolation from what "the rest of the industry" does in software is so large it's perhaps hard for them to really reason about agentic tooling for the rest of the industry. My read is that Google is playing a genuinely different game to OpenAI and Anthropic. Gemini 3.5 Flash only looks strange if you assume it's meant to win the same race - priced and tuned for Google's own gigantic internal token consumption, with the TPU advantage baked in, it makes complete sense. Where they're actually behind is the developer-facing surface: a confused tangle of coding tools and an org that struggles to reason about how the rest of us build software.
If Google sorts out the agent story, the structural advantages underneath - the silicon, the research, the integration - could make them very hard to beat. That's a big if. But I wouldn't bet against them. While it's hard to say if there was any truth in this - or it was just a negotiation strategy - there were rumours of OpenAI being unhappy with direction/progress Nvidia was making earlier this year: https://finance.yahoo.com/news/sam-altman-pushes-back-report-213000823.html Google engineers have an enormous amount of home built/custom/internal tooling that is uncommon outside of Google-scale companies. They use different source control, build tooling, testing infrastructure and build deployment to the rest of the industry - for very good reasons! But this stack is absolutely overkill for 99% of companies, and when you are used to thinking about SWE at Google scale I suspect it is very difficult to reason how people build software outside of that ecosystem.

Langur Monkey 2 days ago

Langur Agent

Langur Agent is a simple, open, hackable CLI AI agent for Linux and macOS. It connects to any service providing an OpenAI-compatible endpoint. It features: The source is available in this repository . Langur Agent has been tested on Linux and macOS only. Install the agent with: Run the agent with the default session: If you need an API key to access the endpoint, put it in the file. Langur Agent looks for the file in the following locations, in order: Create the file with the API key: The agent uses to load at startup. The package reads from the environment automatically. You can also set in your shell profile. On first run, the configuration is created in . You can configure the agent interactively with the slash command. The agent works with any OpenAI-compatible endpoint, so LM Studio, Ollama, OpenWebUI, or any other service you configure. Here are the default values: Run the agent, and then you can enter your prompt. You can use the following key bindings during input: During inference, you can cancel the turn and return to the input prompt with Ctrl + c . Use to print information about the available commands, and to configure the agent interactively. Internally, Langur Agent uses sessions to separate different memory histories. Sessions are named by the user. By default, the agent uses the session. You can start in a different session (either create a new one, or restore it if it exists) with the argument: The default session’s name is , so the following two commands are equivalent: You can also list the existing sessions with : Sessions contain: For now, the configuration file is the same for all sessions. Sessions are matched by the directory name in the sessions location ( ). You can rename a session by just renaming the directory! You can enable mode for the current session with the command , or permanently in the configuration . 
External editor —In mode, exit INSERT mode ( Esc ), then press v to edit your prompt in an external editor (uses your or variable). There are a few commands available to use in the agent loop. You can list them with . Also, use (e.g. ) to show additional help for a command. Persistent memory follows XDG Base Directory spec in : In addition to persistent memory, the agent maintains a chat history of recent user input and assistant output pairs. This provides context that survives beyond the LLM’s context window. Here is how it works: Persistence: Configuration: Langur Agent can be easily customized and extended by adding new tools, commands, and skills. If you create a cool new tool, skill, or slash command, consider contributing it via a pull request! Create a file in or use one of the existing ones. To create a tool, create a method and decorate it with : Tools are auto-discovered on startup. The process is very similar to tools. You need to create your method, preferably in , and decorate it with . A slash command must return, in that order, , , , : Decorated commands are automatically registered, and auto-completed in the input prompt. Add a file in with YAML front matter, following the agentskills.io standard: The front matter and are parsed and shown in the skills list. The body is injected into the system prompt. 
session management memory management visual candy autocompletion interactive configuration Python 3.13+ for dependency management Current directory, Home directory, Alt + Enter : add a new line Enter : submit the prompt Ctrl + q : quit The input history Chat memory (see chat memory ) Notes (see session memory ) User profile (see session memory ) — user information — persistent notes (added via tool) Memory is loaded into the system prompt each turn tool adds notes during a session tool explicitly persists memory to disk Memory is auto-saved when the agent exits (interactive mode) Each user message and assistant response is stored in memory Reasoning is omitted from chat memory Automatically compacted when exceeding the configured character limit The user can trigger the compaction any time with Chat memory is attached to the system prompt on each turn The agent displays the last 10 exchanges, with long messages truncated Chat history is persisted to Automatically loaded on startup Saved after every exchange (user input or assistant response) Compacted history is also persisted to disk : a indicating if the command succeeded or failed. : an optional short status message. It is printed with or . : an optional with the Python Rich-formatted content, it is printed to the output. : an optional formatted in Markdown, it is printed to the output.

Simon Willison 2 days ago

Claude Opus 4.8: "a modest but tangible improvement"

Anthropic shipped Claude Opus 4.8 today. My favourite thing about it is this note in the release announcement: Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost. It's so refreshing to see an AI lab honestly describe a release as a minor incremental improvement over the previous model! Honesty seems to be a theme. Here's my other favorite note from that announcement: One of the most prominent improvements in Opus 4.8 is its honesty . We train all our models to be honest---for instance, to avoid making claims that they can't support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out in our evaluations , which show that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. That linked system card includes the following: Claude Opus 4.8 had the lowest incorrect-rate of the six models on every benchmark—the most direct measure of factual hallucination. It achieved this mainly by abstaining on questions about which it was uncertain rather than by answering more questions correctly. Not much has changed since 4.7. It's priced the same as Opus 4.5/4.6/4.7 - $5/million input and $25 per million output. "Fast mode" is twice that price, which is a significant reduction from their previous models - fast mode on 4.6/4.7 remains at $30/$150. Note that fast mode is only available to organizations that are part of the research preview, "Contact your account manager to request access". 
Both the reliable knowledge cutoff and the training data cutoff are January 2026, the same as for 4.7. The context window is still 1,000,000 tokens, and the max output is 128,000 tokens. The What's new in Claude Opus 4.8 document has some of the more interesting details. These caught my eye: Mid-conversation system messages . Claude Opus 4.8 accepts messages immediately after a user turn in the array (subject to placement rules ). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. See also this update to the Anthropic Python SDK. Being able to steer the system prompt mid-conversation sounds really powerful. I was worried this would be incompatible with the abstraction provided by my own LLM library , which expects a single system prompt per conversation... but it turns out my recent redesign should handle that just fine . Lower prompt cache minimum . The minimum cacheable prompt length on Claude Opus 4.8 is 1,024 tokens, lower than on Claude Opus 4.7. I checked and 4.7's minimum was 4,096 . Here are pelicans riding bicycles for all five thinking levels, , , , , and : This time I ran them using the LLM CLI , exported the logs to Markdown and then had Claude Opus 4.8 build me an HTML tool that could render that Markdown with the fenced code blocks displayed as SVGs on the page. (I later had GPT-5.5 xhigh in Codex update that code to remove any XSS holes. I'm sure Claude could have done that if I'd asked, but GPT-5.5 is my code security blanket at the moment.) The max one was clearly the best, but it did take 25 input, 17,167 output tokens for a total cost of 43 cents !
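The 43 cent figure for the max pelican run follows from the quoted token counts and the standard prices of $5 per million input tokens and $25 per million output tokens. A small sketch of that arithmetic, assuming no caching discounts or other adjustments apply (the helper name `cost_usd` is made up for illustration):

```python
def cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_per_million: float = 5.0,
    output_per_million: float = 25.0,
) -> float:
    """Cost in US dollars at per-million-token prices."""
    return (
        input_tokens * input_per_million / 1_000_000
        + output_tokens * output_per_million / 1_000_000
    )


if __name__ == "__main__":
    # Token counts from the max thinking level run described above.
    print(f"${cost_usd(25, 17_167):.2f}")
```

Almost all of the cost comes from the 17,167 output tokens; the 25 input tokens contribute a small fraction of a cent.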

ava's blog 2 days ago

AI blog question challenge

Rishabh made an AI blog question challenge and invited me to fill it out. Let's go!
1. How was your first experience with AI models? I used to have fun playing around with NeuralBlender, and used it to inspire glitch art of mine that I drew. Back when ChatGPT launched, I used it to teach myself HTML and CSS.
2. Do you use AI or are you completely against using it? On average, I use it once a week or less; weeks can go by where I don't use it. Due to my field of interest, I want to keep up to date on some use cases and capabilities, and have my own experiences instead of relying on what the hype online says. I feel like I can't properly write about my criticisms or privacy concerns if I don't use it at all, or don't test the use cases people rave about (which often leave me deeply disappointed). Occasionally, my boss will also ask me to trial some use cases at work. Situations I use it for in private when I am not testing what others are doing:
- When I can't find something specific (like a specific word, jargon, saying, concept, item name etc.) via normal search engine use, or can't find a clear explanation for something I find difficult to understand.
- When I need an easy-language version of a really difficult paragraph, law text passage, case part etc. that I can't seem to crack on my own.
- Career and job questions I am unable to ask anyone either offline or online, because people I know in real life can't help, and I'd have to reveal too much to others if I asked online.
- Career trajectory brainstorming, 3-year and 5-year plan stuff.
3. Do you have any preference among different models, for example Claude vs ChatGPT? If yes, how do you choose? I only use ChatGPT and Lumo, and I'm trying to permanently move to Lumo. I no longer want to use anything made by OpenAI.
4. What aspect of AI models do you like and what do you not like? I hate the sycophancy and wordiness. Even when I adjust settings to be short and precise, they still yap. I don't like all the subheadings and bullet point lists, I prefer a full text. I turned emojis off. I also hate when they constantly repeat my name, so I removed that again. I also hate how mean Lumo can get; I want no sycophancy and the fucker will start bullying me for some reason. I like the aspect of being able to ask something when no one else is available (either due to the sensitive matter, embarrassment, or time issues).
5. How do you feel about AI generated images? Does it annoy you if someone uses them in a blog post? Seeing an AI generated image on a blog post is about as nice as being greeted by a steaming turd. Even worse when I know it isn't a bot blog and the person spent time crafting the text, only to include a graphic that has several errors, spelling mistakes and other unfitting or illogical stuff. Do you have absolutely no shame or quality standards? You wanna tell me you looked at that picture that said "thseism" instead of "theme" somewhere in it and thought "Yup, that's it, best I can do, hope my readers enjoy this total eye candy, can't see anything wrong with that"? What is it supposed to convey to me as a reader - that you didn't even look at it, or that you were too lazy to formulate a second or third prompt?
6. Internet is flooded with AI slop now, full of generated text, images, audio and videos. How do you filter it from authentic human creation? Do you have a strategy? I'm not on any of the big platforms or their replacements, and I consume the internet through my highly curated RSS feed reader where I follow real people who don't use it like that, or the Discover page. It's easier to avoid when your internet use is limited, in a niche, and mostly used for blogging, reading and studying. I have a good grip on detecting generated text and images, but I've noticed that videos and gifs can easily fool me by now.
7. Are you hopeful for a better future with A.I. or a dystopian one? Hard to say; I think AI is absolutely a dystopian nightmare when used in surveillance and war. For the rest, I assume the bubble will pop and a few dedicated models for specific niches and use cases will remain that have proven to be useful and worth the cost, and the rest will fade away. I hope it can do some good in healthcare, but that may be wishful thinking. If AI went away completely, I would not miss it.
Published 28 May, 2026

Jeff Geerling 2 days ago

Tuning in FM Radio on a 3D Printer Heatbed

Pooch from Repkord dropped by my studio while he was in St. Louis, and asked a simple question: Can a 3D printer's heatbed act as an antenna? A fair question, as many an antenna is embedded in a PCB these days... and the traces on a PCB heatbed like the one used in Prusa's Core One look kinda like an antenna, if you squint the right way. Really, anything (or anyone) can be an antenna, given enough power.

Stratechery 3 days ago

An Interview with Eric Seufert About Models and Ads, and AI’s Upside for Humanity

An Interview with Eric Seufert about building models for generative AI, why Meta's foundational models are so important, and why understanding advertising leads to optimism about humanity's future.


Reverse-engineering Prose From Internet Lingo

Internet learned to speak gibberish that doesn’t always coincide with literary text. But it can be converted back to that. Here’s my experiment along these lines.

Martin Fowler 3 days ago

Fragments: May 27

At the GOTO Conference in Copenhagen in 2025, Kent Beck and I spent some time on stage talking and answering questions from the audience - a format I refer to as “two old geezers on a park bench”. We talk about our experiences with LLM-augmented programming (at that point - October 2025), we show our frustration that things we’ve been saying for thirty years still need to be said, we say how anything like a manifesto reunion needs to be led by a younger generation, and opine on what junior developers should be focusing on in their career. ❄                ❄                ❄                ❄                ❄ Ian Johnson has written a series of posts about restructuring a gnarly codebase. The story follows a real Laravel + React codebase over ~3 months and ~258 commits from a legacy monolith with no tests to a well-structured application with automated quality gates, a React SPA migration in progress, and an AI agent that reliably ships production code with minimal supervision. The series covers the steps in decent detail, and his approach follows the kinds of steps I’d use. First get everything under the control of decent characterization tests, add static analysis, introduce the right patterns to make things flow easily. Running through all of this is his use of AI, which changed during the exercise: For the first two months of this project, I used Claude Code with auto-approve turned off. Every file edit, every terminal command, every change… I reviewed it before it executed. […] The results were good. The code was clean. But I was doing most of the thinking and half the typing. The agent was a fancy autocomplete with better suggestions. I wasn’t getting the leverage I’d hoped for. I read an article about “on-the-loop” versus “in-the-loop” human-AI collaboration. The framing clicked immediately […] I was micromanaging because I didn’t trust the agent to do the right thing. And I didn’t trust the agent because there was nothing forcing it to do the right thing. 
His early steps put in tests, static analysis, and the right architectural patterns. With those in place, he could let the agent do more work. My role shifted from writer to curator. I don’t write most of the code anymore. I Define the patterns […] Review the test specs […] Review the output […] Update the harness […] Make strategic decisions […] He finishes the series with conclusions about how he’d generalize his experience to other circumstances. ❄                ❄                ❄                ❄                ❄ Back in the land of my birth, there were some notable groans when the National Health Service decided to close nearly all of their Open Source repositories, supposedly due to the security threat of LLMs. Closing repos like this isn’t an effective counter to LLM-augmented attackers. I suspect it’s no coincidence to see GDS (Government Digital Service), the highly-regarded IT enablers in the UK government, publish their position: Moving code from public to private as a substitute for investment in secure-by-design delivery, ownership and remediation is a warning sign because it reduces sharing and scrutiny, can slow coordinated improvement across government and suppliers, and does not remove the underlying weaknesses in a running service. Terence Eden memorably sums up his view on this: Within the UK’s Civil Service you occasionally hear the expression “being invited to a meeting without biscuits”. It implies a rather frosty discussion without any of the polite niceties of a normal meeting. ❄                ❄                ❄                ❄                ❄ I’ve seen a few cases where those developers who are most involved in working with LLMs find they are running into a problem with cognitive endurance. Adam Tornhill has joined this group: One of the big wins with agents is that they let us stay with the higher-level problem for longer. We get less sidetracked by details, dependency cleanup, and similar secondary tasks that used to break concentration. 
But there is a cost we are still underestimating. Agentic coding is mentally expensive. I can usually sustain the pace for a couple of hours. Then I need a break. The pace is simply too intense. And based on conversations with other engineers, I do not think I am alone in that. He explains that working with The Genie means we are making more decisions in less time, and this increase in decision density is hard on the brain. He responds by keeping agent tasks small, automating everything he can, and accepting that he won’t know every line of code as long as he has good verification mechanisms in place. Notably, he has not gone in the direction of doing his work with swarms of agents that he coordinates. Instead he has one long-running task that he babysits and one focus task. That last point is important given the running-twenty-agents-in-parallel hype. I cannot even think about twenty meaningful things to build, and even less so about the resulting cognitive tax of the likely interruptions. It’s exactly the wrong thing to even consider. At least for humans. (And yes, I understand sub-agents and machine parallelisation. That is not what I’m objecting to. It is the parallelisation of human attention that does not scale). I liked that he included some thoughts about what folks can do in time outside this intense programming time. Not just “have a coffee” (although he includes that) but also about learning about the domain that the software supports. ❄                ❄                ❄                ❄                ❄ A couple of pithy quotes from social media. Lorin Hochstein: “Metaphor debt” is when all of your metaphors involve the concept of “debt” because you can’t think of any other metaphors anymore. ❄                ❄ Daniel Terhorst-North: If a vegan crossfit fan is using Claude to write Rust, which thing do they tell you first? 
❄                ❄                ❄                ❄                ❄ Karl Bode reacts to speakers getting booed when mentioning AI during commencement addresses. He points out that younger folks are increasingly unhappy with the tech oligarchy and their fruits . The thing is the kids aren’t stupid. They see the field clearly. They see the difference between what’s being sold to them by tech companies, the press, and commencement speakers, and what they have repeatedly seen with their own eyes. They’ve watched tech oligarchs spend the last decade mired in scandal after scandal, hype cycle after hype cycle, steadily enshittifying everything they touch along the way. The percentage of Gen Z that think AI’s benefits don’t counterbalance the risks now sits around fifty percent, up 11 percentage points in just the last year. Eight out of every ten believe that using AI makes the process of actual learning more difficult. He sees young people saddled with the perception of entering a worsening world - which leads them to rage against this latest fruit of the tech oligarchy. A rage that is easy for folks like me - with a comfortable retirement off-ramp - to properly appreciate. A rage that could have marked political and social consequences. ❄                ❄                ❄                ❄                ❄ Relevant to these concerns are a couple of items in last week’s Economist newspaper. The newspaper argues that historically major technological advances haven’t led to significant unemployment or drops in wages ( paywalled article ). The closest was the original industrial revolution in 19th Century Britain. There was a stagnation in wages during this period, but there was also a massive increase in population, from 4½ million to 12 million. It also points out that we’ll probably only understand the full consequences of all this when a recession hits, as this is when most unproductive jobs tend to be flushed out of the system. 
A second article ( also paywalled ) indicates that AI is having some effect on graduate hiring. They did an analysis of surveys of recent graduates, looking to see if employment varied depending on a job’s exposure to AI. The least exposed quintile of subjects saw employment rate fall by 1.5% over the last couple of years, while the most exposed quintile’s drop was 6.6%. ❄                ❄                ❄                ❄                ❄ Lawfare isn’t impressed with the latest efforts by the US Government to regulate AI. On [last] Wednesday, the White House invited leaders of OpenAI, Google, Anthropic, Meta, and Microsoft to the Oval Office for a signing ceremony the following afternoon. President Trump was to sign an executive order on AI and cybersecurity—the administration’s most formal effort yet to establish a voluntary process for reviewing frontier models before their release. But roughly three hours before the ceremony, when some company executives were already in the air to Washington, the White House called it off. They see the proposed regulations as mild, and including some valuable measures to harden defenses against cyber threats. But it’s worth underscoring the implications of postponing (if not outright canceling) this order, which, by its own terms, was about as modest a frontier-AI intervention as the federal government could put on paper: voluntary, focused on the government’s own defenses, and explicitly barred from becoming a licensing regime. The objection isn’t so much about government coercion as about the government having any settled role at all. Voluntary, in other words, isn’t the floor of frontier AI policy in this administration; it’s the ceiling. This is a questionable position given that the concerns animating this draft order will likely grow in the near future. It is also self-defeating for those who applauded the order’s delay or demise. 
Far from resolving the risk of government meddling in AI, killing the order just leaves in place what Ball has described as the “opaque and essentially lawless” alternative: government access happening through back channels, on terms set case by case, with no stable rules at all. One of the problems here is a distinct lack of governmental expertise, either in AI or in software in general. Too much is being decided at the whims of the tech oligarchy, and there isn’t any attempt to engage in the broader issues at hand. That’s not entirely a bad thing, since trying to regulate something that’s still evolving so fast is usually a fool’s errand - but the problem here is that the impact of AI is so big that there’s real danger in being too far behind. ❄                ❄ Which leads me to a rare thing, an endorsement of a candidate for political office. If you are voting in congressional district MA-06 (North Shore of Massachusetts), I’d seriously look at Beth Anders-Beck, who is running for congress in that district. Beth has a long background in software development (including developing the notion of Forest and Desert), so would introduce expertise that Congress desperately needs. I’ve known Beth for decades, and have a high opinion of their intelligence, judgment, and ability to work with others. Congress doesn’t deserve Beth, but it does need her.

Unsung 3 days ago

“The pipeline of future experts is thinning from both ends.”

I generally avoid think pieces about AI because a) a lot of them are boring, and b) they rarely match the pragmatic posture of this blog. But this essay on a new No One’s Happy blog was really interesting to read, and feels different in a few ways. First, it examines what happens as AI slop spreads in the context that is less discussed – in a workplace: This is a new form of slop, and it is more expensive than the public kind, because the people producing it are being paid a salary to do so. […] The cost of producing a document has fallen to nearly zero; the cost of reading one has not, and is in fact rising, because the reader must now sift the synthetic context for whatever the document was originally about. A lot in the essay feels pertinent to Unsung as real craft is not feelings or fluffiness. Real craft is deep expertise : Generative AI can produce work that looks expert without being expert, and the failure arrives in two shapes. The first is when novices in a field are able to produce work that resembles what their seniors produce, faster or more advanced than their judgment. The second is when people generate artifacts in disciplines they were never trained in. The two failures look similar from a distance and are not the same. Research has mostly measured the first. The second is what it is missing, and in my experience it is the riskier of the two. The term for this new challenge is, apparently, “output-competence decoupling.” Other parts of the essay come back to a topic – toxic velocity – we covered before : The current generation of agentic systems is built around the premise that the human is the bottleneck — that the loop runs faster and cleaner without the awkward delay of someone reading what is about to happen and deciding whether it should. This is, in a great many cases, exactly backwards. The human in the loop is not a vestige of an earlier era; the human is the only part of the loop with skin in the game. 
Removing the H from HITL [Human In The Loop – eds. note] is not an efficiency. It is the abandonment of the only mechanism the system has for catching itself. And one last thing that differentiates this essay from many others is the last “what to do about it” section. #ai #craft

Simon Willison 3 days ago

I think Anthropic and OpenAI have found product-market fit

Anthropic are strongly rumored to be about to have their first profitable quarter. Stories are circulating of companies surprised at how expensive their LLM bills are becoming from usage by their staff. I think this is because OpenAI and Anthropic have both found product-market fit. I currently subscribe to the $100/month Max plan from Anthropic and the $100/month Pro plan from OpenAI. If you are a heavy user of coding agents these plans are a fantastic deal. I just ran the ccusage tool on my laptop to get an estimate of how much I would have spent if I were to pay for API tokens in the past 30 days and got: That's $2,180.16 worth of tokens for $200 - not bad at all! I'm a moderately heavy user of these tools, but I'm certainly not running agents every hour of the day and night. I had assumed that companies making extensive use of agents were getting similar discounts. It turns out I could not have been more wrong about that. I haven't been able to track down the exact date, but at some point in the last six months Anthropic switched their Enterprise plan (originally "Claude seats include enough usage for a typical workday" back in August 2025 ) to $20/seat/month plus API pricing for usage. This story about the change from The Information is dated Apr 14, 2026, but cites an Anthropic spokesperson claiming that the pricing change occurred in November 2025. Existing customers are finding out about the change as they renew their contracts. OpenAI made a similar pricing change in April. The Codex rate card ( Internet Archive copy ) currently says: Note : On April 2, 2026, we updated Codex pricing to align with API token usage, instead of per-message pricing. This change was applicable to new and existing Plus, Pro, ChatGPT Business and new ChatGPT Enterprise plans. On April 23, 2026, we made this update for all existing ChatGPT Enterprise plans as well, inclusive of Edu, Health, Gov, and ChatGPT for Teachers. 
It's a little harder to decode as they quote prices in "credits", but as far as I can tell those credit costs are an exact match for the API token costs listed for those models. All of which is to say that as of April 2026 the "Enterprise" cost for both OpenAI Codex and Anthropic Claude Code/Cowork is the same as the listed API price. GPT-5.5 (released April 23rd) is 2x the API price of GPT-5.4. Opus 4.7 (April 16th) is around 1.4x the price of Opus 4.6 when you take their new tokenizer into account. So April saw both leading model companies release new frontier models with a higher API price, and both companies now have measures to lock their enterprise customers (who tend to sign year-long deals) at those API prices, not the previous extreme discounts. Why these sudden aggressive moves on pricing? Both Anthropic and OpenAI are planning to IPO, but I suspect there's a more important factor here: I think they've finally found product-market fit, with the coding/general-purpose agent products embodied by Claude Code/Cowork and Codex. Tools like ChatGPT are wildly popular, but that wild popularity has been difficult to turn into revenue. In February OpenAI boasted more than 900 million weekly active users for ChatGPT, but only 50 million - 5.6% of that - were paying consumer subscribers. Charging $10-$20/month per user is an OK business, but you'd need 1-2 billion subscribers sticking around for four years to cover $1 trillion in infrastructure . Companies spending $200+/month/user will get you there a whole lot faster - and as noted above, as a power-user I'm at ~$1,000/month in API costs per vendor already. Coding agents really did change everything. These are tools which burn vastly more tokens, but are also quickly becoming daily drivers for the work carried out by extremely well-compensated professionals. Right now that's still mostly software engineers, but a coding agent is a tool that can automate anything you can do by typing commands into a computer... 
so they are clearly applicable to a much wider set of skilled knowledge workers. As I've discussed on this site at length , the models released in November 2025 elevated agents to being genuinely useful. We've had six months to get used to that idea now - it's no wonder companies are beginning to spend real money on this technology. You could argue that ChatGPT achieved product-market fit when it became the fastest-growing consumer app in history back in February 2023... but it certainly wasn't making any actual money back then. Coding agents plus enterprise pricing marks the point when these companies start making very real revenue. Maybe even enough to start covering their costs! As further evidence that enterprise agents represent product-market fit for these companies, consider their open job listings. OpenAI have 703 open jobs right now, of which I'd categorize 229 (32.6%) as relating to enterprise sales and support - account executives, "Go To Market", "Forward Deployed Engineers" and the like. Anthropic have 390 open jobs , 105 (26.9%) of which look enterprisey to me. It's pleasingly ironic that these AI labs have picked a business model with such a heavy demand on human labor - enterprise sales contracts don't close themselves without a whole lot of humans in the mix! (I ran this analysis by scraping their job sites with Claude Code, then having it use Datasette's JSON API to pipe that data into Datasette Cloud where I used Datasette Agent for the analysis, exported here . Dogfood!) I started digging into this in response to a growing volume of stories claiming that large companies were sounding the alarm because their AI usage costs had grown so large. The most widely cited of these stories appear quite overblown to me. The most discussed has been Uber, based on this report where CTO Praveen Neppalli Naga indicated that Uber had "maxed out its full year AI budget just a few months into 2026", mostly thanks to Claude Code. 
Given that Claude Code only got really good in November it's entirely unsurprising to me that a budget set in 2025 may have failed to predict demand for that tool in 2026! That Uber story was further fueled by comments made by Uber's COO, Andrew Macdonald, on the Rapid Response podcast. I tracked down the segment and there really isn't much there. Here's what Andrew said: But then you sometimes go and talk to your senior engineering leaders and you're saying, OK, how many projects that were on the cutting room floor got moved above the line because of the productivity gains because 25% of our code commits were via Claude Code last quarter? That link is not there yet, right? I think maybe implicitly there's more that is getting shipped. But it's very hard to draw a line between one of those stats and, OK, now we're actually producing like 25% more useful consumer features, right? And that line is hard to draw. Somehow this fragment turned into headlines like Uber's COO says it's getting harder to justify the money spent on AI tokenmaxxing , because the market for stories about AI failures remains enormous. The other popular story around this is Microsoft starts canceling Claude Code licenses , ostensibly to encourage their engineers to dogfood their own Copilot CLI agent instead - but The Verge reporter Tom Warren says "sources tell me the decision is also a financial one", triggered by the June 30th end of Microsoft's financial year. I think both of these stories support my "product-market fit" hypothesis. The best advice I ever heard on pricing a product was that your customer should suck air through their teeth and then say yes. Uber's budget overrun and Microsoft's seat cancellations look like that effect playing out in practice. The big AI labs spend billions of dollars on both training and inference. Credible figures are hard to come by, but we did get one huge hint as to the figures involved from, oddly enough, the recent SpaceX S-1 : [...] 
in May 2026, we entered into Cloud Services Agreements with Anthropic PBC (“Anthropic”), an AI research and development public benefit corporation, with respect to access to compute capacity across COLOSSUS and COLOSSUS II . Pursuant to these agreements, the customer has agreed to pay us $1.25 billion per month through May 2029 [...] The Anthropic announcement said that this deal meant they could "increase our usage limits for Claude Code and the Claude API", heavily implying that Colossus is being used for inference, not model training. Anthropic already have vast amounts of compute from other providers. The fact that they're willing to spend $1.25 billion per month for extra capacity from just one of their vendors hints at how big these inference budgets have become. Over the past two years my impression has been that OpenAI made more of their income from subscription revenue while Anthropic made more from their API. Anthropic's API revenue was historically quite dependent on a small number of large API customers - this VentureBeat story from August 2025 quotes "sources familiar with the matter" suggesting that just Cursor and GitHub Copilot were responsible for $1.2 billion of the company's then-$4 billion revenue. Today Anthropic are rumored to hit $10.9 billion in the second quarter , potentially even operating at a profit for the first time. This pivot-to-Enterprise suggests that the labs have realized that the real money lies in cutting out the middlemen. Anthropic's Claude Code directly competes with Cursor and Copilot. No wonder Cursor are investing in their own models ! I've called November 2025 the November inflection point because that was when GPT-5.1 and Opus 4.5, combined with their respective coding agent harnesses, got good - good enough that we've spent the last six months adapting to agent systems that can reliably get useful work done. 
I think April 2026 is a new inflection point where the revenue implications of this have started to land, to the benefit of the frontier AI labs and with material impacts on the budgets of large companies. We'll know for sure how real this moment is when the S-1 documents for the upcoming Anthropic and OpenAI IPOs give us some real, audited numbers to get our teeth into. The ccusage estimate quoted earlier breaks down as $1,199.79 for Anthropic Claude Code and $980.37 for OpenAI Codex.
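The back-of-envelope numbers in this post are easy to re-derive. The sketch below uses only figures quoted above ($1 trillion in infrastructure, $10-$20/month subscriptions, a four year horizon, 900 million weekly users with 50 million paying, and the two ccusage estimates). The `subscribers_needed` helper is a hypothetical name, and it assumes flat pricing with no churn, which is a simplification.

```python
# Back-of-envelope checks using only figures quoted in the post.

TRILLION = 1_000_000_000_000


def subscribers_needed(target_usd: int, monthly_price: int, years: int) -> float:
    """Subscribers needed to bring in target_usd at a flat monthly price."""
    return target_usd / (monthly_price * 12 * years)


# Consumer pricing: $10-$20/month over four years against $1 trillion.
at_20 = subscribers_needed(TRILLION, 20, 4)
at_10 = subscribers_needed(TRILLION, 10, 4)

# Heavier spenders: $200/month per user over the same period.
at_200 = subscribers_needed(TRILLION, 200, 4)

# Share of ChatGPT weekly active users who were paying subscribers.
paying_share = 50_000_000 / 900_000_000

# ccusage API-price estimates versus two $100/month plans.
api_equivalent = round(1199.79 + 980.37, 2)
plan_cost = 100 + 100

print(f"{at_20 / 1e9:.2f}bn to {at_10 / 1e9:.2f}bn subscribers at $20 and $10")
print(f"{at_200 / 1e6:.0f}m users at $200/month")
print(f"{paying_share:.1%} of weekly users paying")
print(f"${api_equivalent:,.2f} of tokens for ${plan_cost}, {api_equivalent / plan_cost:.1f}x")
```

The subscriber counts come out at roughly 1.04 to 2.08 billion, matching the "1-2 billion" range, while $200/month customers bring the requirement down to around 104 million.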

Martin Fowler 3 days ago

The VibeSec Reckoning

Vibe coding has significantly accelerated software prototyping but AI agents frequently recommend insecure configurations, creating security problems. Gautam Koul, Lucian Moss, Neil Drew-Lopez, and Daberechi Ruth Edeokoh share their experience while building applications for Thoughtworks's global marketing. They learned that to combat this we need to write a security context file to guide the AI, be cautious with AI permission requests, create a daily security intelligence feed, and provide builders with a secure-by-default harness and templates.

iDiallo 4 days ago

How Many Tokens Did You Burn Today

Early in my career, a manager at one of the big firms where I worked made a request so absurd it remains etched in my memory. I walked back to the team, repeated what he had asked, and couldn't finish the story without laughing. He wanted me to create a pie chart, of lines of code, per developer, per week. We all lost it. Our lead developer asked if, by any chance, the manager's eyes looked glassy. We laughed even harder. Because yes. Yes, they did. He was always high. That was twenty years ago. I've repeated that story countless times, and it always drew chuckles as we discussed the disconnect between software teams and management. Any software engineer could relate. We all knew that lines of code were a meaningless metric. A junior could write a thousand lines of spaghetti. A senior could fix the same problem with forty elegant ones. But then, last week, I found my name at the top of a leaderboard. My employer had been exploring productivity tools and trialed one they thought would be useful. After the trial, they were quoted $500k a year. The tool tracked developer productivity and integrated with Atlassian products, Microsoft, and many other services we used. The price was too steep, so it was dropped. A couple of months later, the same company came back with a discount. The exact same tool for just $50k a year. My employer jumped at the opportunity. How many bytes did you use today? I'm looking at this dashboard right now and I see my name at the top of the leaderboard. I click on the widget, and a pie chart appears. There it is: a breakdown of the total lines of code my team has produced using AI, by individual. This isn't limited to my employer. Every company is putting something together to track AI usage and justify the investment. Instead of tracking project completions, we're tracking how many lines of code each developer generated with AI. And the joke's on me, because nobody is laughing. 
The whole industry is applauding and encouraging employees to use more of it.

I didn't become the champion because I have some neat agentic workflow. It was done by complete accident. While using an LLM, I accidentally selected "planning mode" for a request that had already been planned. The agent ran for several minutes, burning tokens to resolve a problem that didn't exist. Just like that, I made it to the top, without ever writing a single line of code.

If this widget is taken at face value, it won't be long before developers start gaming it deliberately. Just let the agent run overnight, and your employer can claim a 10x improvement in productivity.

We didn't use line count as a productivity metric in the past because it never made sense. Whenever we refactor code, we often end up with less than we started with. In fact, much of the time I spend modifying AI-generated code is spent deleting unnecessary things it created. Should we track negative lines of code? The better you are at programming, the worse your numbers look. We are assessing developers by the lines of code.

I've watched AI evangelists ask "how many tokens did you burn today?" They were trying to convince an audience that productivity is directly proportional to token usage. It reminds me of the transition from paper to computers. A computer evangelist of that era might have asked: "how many bytes did you use today?" Token counts, lines of code, bytes, none of these have anything to do with actual productivity.

Metrics are often entirely disconnected from what they're meant to measure. I've seen companies rely on story points only to watch employees point every ticket as high as possible. Choose lines of code as your metric, and lines of code will increase. Reward the highest contributor, and watch everyone double or triple their output by the next performance review. It's a silly metric but it serves a purpose, just not yours.
AI companies promote token usage and associate it with productivity because they directly benefit from it. Imagine an internet service provider that charges by the byte. What would their recommendation for productivity be? "Use more bytes!"

The best engineers I've ever known wrote less code, not more. They deleted things. They simplified. They understood that the goal was never the code itself. They solved problems, they made the system reliable, and they served the user. Measuring developers by output volume, whether that's lines, commits, or tokens, mistakes the exhaust for the engine.

Every era of tooling brings a new class of metric that mistakes activity for value. The spreadsheet didn't make accountants more productive just because they could fill more cells. AI won't make developers more productive just because it can generate more code.

We aren't even tracking if the right problems are being solved, and solved well. If the productivity dashboard can't answer that, it's not measuring productivity. It's measuring the subscription.
