Posts in Ai (20 found)
iDiallo Today

You Digg?

For me, being part of an online community started with Digg. Digg was the precursor to Reddit and the place to be on the internet. I never got a MySpace account, I was late to the Facebook game, but I was on Digg. When Digg redesigned their website (V4), it felt like a slap in the face. We didn't like the new design, but the community had no say in the direction. To make it worse, they removed the bury button. It's interesting how many social websites remove the ability to downvote. There must be a study somewhere that makes a sound argument for it, because it makes no sense to me. Anyway, when Digg announced they were back in January 2026, I quickly requested an invite. It was nostalgic to log in once more and see an active community building back up right where we left off. But then, just today, I read that they are shutting down. I had a single post in the technology sub. It was starting to garner some interest and then, boom! Digg is gone once more. The CEO said that one major reason was that they faced "an unprecedented bot problem." This is our new reality. Bots are now powered by AI and they are more disruptive than ever. They quickly circumvent bot detection schemes and flood every conversation with senseless text. It seems like there are very few places left where people can have a real conversation online. This is not the future I was looking for. I'll quietly write on my blog and ignore future communities that form. Rest in peace, Digg.

0 views
Stratechery Yesterday

2026.11: Winners, Losers, and the Unknown

Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Additionally, you have complete control over what we send to you. If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings. On that note, here were a few of our favorites this week. This week’s Stratechery video is on Anthropic and Alignment. Integration and AI. One of the most important and longest-running questions about AI has been whether or not models would be commodities; Microsoft once bet on their integration with OpenAI, but in recent years has made the bet that the infrastructure they can build around models will matter more than the models themselves. However, the most recent evidence — particularly Copilot Cowork — is that the companies best able to harness (pun intended) model capabilities are the model makers themselves. If none of that makes sense, Andrew and I do a much more extensive deep dive on these different layers of the evolving AI value chain on this week’s episode of Sharp Tech. — Ben Thompson The Team Test and a Basketball Disgrace. On Greatest of All Talk, we thought the news of the week would be the return of Jayson Tatum for the Boston Celtics, which provided a delightful excuse to take stock of the Celtics, Wemby’s gravity-defying Spurs, Shai’s Thunder, KD’s Rockets and the NBA’s field of title contenders using Ben’s very scientific Capital-T Team Test for contenders. That was a great episode. Unfortunately, on the follow-up Friday, we had to discuss the crime against basketball decency that took place in Miami Tuesday night. Come for the Team Test joy, then, and stay for Erik Spoelstra outrage (and also check out Ben Golliver’s column about the calamity on his new Substack). — Andrew Sharp The US, China and Iran.
The past two weeks in the China policy space have been full of debates over the implications of the war in Iran for China specifically, and the U.S.-China relationship generally. I wrote about all of it on Sharp Text this week, including thoughts on some takes from last year that haven’t aged well, and why, with respect to China, the war in Iran is best understood as the latest in a succession of U.S.-led body blows to Beijing’s global interests. At least over the past 12 months, countering China has been a consideration in almost everything the U.S. has done in the foreign policy space. — AS MacBook Neo, The (Not-So) Thin MacBook, Apple and Memory — The MacBook Neo was built to be cheap; that it is still good is not only a testament to Apple Silicon, but also to the fact that the most important software runs in the cloud. Copilot Cowork, Anthropic’s Integration, Microsoft’s New Bundle — Microsoft is seeking to commoditize its complements, but Anthropic has a point of integration of its own; it’s good enough that Microsoft is making a new bundle on top of it. Oracle Earnings, Oracle’s Cloud Growth, Oracle’s Software Defense — Oracle crushed earnings in a way that speaks not only to the secular AI wave they are riding but also to Oracle’s strong position. An Interview with Robert Fishman About the Current State of Hollywood — An interview with MoffettNathanson’s Robert Fishman about the current state of Hollywood, including Netflix, Paramount, YouTube, Disney, and Amazon. Loud and Clear — The War in Iran is not entirely about China, but it’s definitely about China.
MacBook Neo Review Designing for the Low End The Wildly Infectious Banana Plague The ‘Raising a Lobster’ Frenzy; Iran and US-China as Trump’s Visit Looms; Two Sessions Takeaways Tatum and the Team Test, The Spurs Continue to Defy Young Team Gravity, Russ Takes Aim at Kings Reporters Spo and Bam and a Basketball Betrayal, An SGA Early Warning System, Kawhi, Luka and The Other MVP Candidate Nerding Out with the Neo, Claude and the Integration Question, The End of Coding Language History

0 views
Chris Coyier Yesterday

AI is my CMS

I mean… it’s not really, of course. I just thought such a thing would start to trickle into people’s minds as agentic workflows start to take hold. Has someone written "AI is my CMS" yet? Feels inevitable. Like why run a build tool when you can just prompt another page? AI agents are already up in your codebase fingerbanging whole batches of files on command. What’s the difference between a CMS taking some content and smashing it into some templates and an AI doing that same job instead? Isn’t less tooling good? I had missed that this particular topic already had quite a moment in the sun this past December. Lee Robinson wrote Coding Agents & Complexity Budgets. Without calling it out by name, Lee basically had a vibe-coding weekend where he ripped Sanity out of cursor.com. I don’t think Lee is wrong for this choice. Spend some money to save some money. Remove some complexity. Get the code base more AI-ready. Yadda yadda. Even though Lee didn’t call out Sanity, they noticed and responded. They also make some good and measured points, I think. Which makes this a pretty great blog back-and-forth, by the way, which you love to see. Some of their argument as to why it can be the right choice to have Sanity is that some abstraction and complexity can be good, actually, because building websites from content can be complicated, especially as time and scale march on. And if you rip out a tool that does some of it, only to re-build many of those features in-house, what have you really gained? TIME FOR MY TWO CENTS. The language feels a little wrong to me. I think if you’re working with Markdown files as content in a Next.js app… that’s already a CMS. You didn’t rip out a CMS, you ripped out a cloud database. Yes, that cloud database does binary assets also, and handles user management, and has screens for CRUDing the content, but to me it’s more of a cloud data store than a CMS. The advantage Lee got was getting the data and assets out of the cloud data store.
I don’t think they were using stuff like the fancy GROQ language to get at their content in fine-grained ways. It’s just that cursor.com happened to not really need a database, and in fact was using it for things they probably shouldn’t have been (like video hosting). Me, I don’t think there is one right answer. If keeping content in Markdown files and building sites by smashing those into templates is wrong, then every static site generator ever built is wrong (🙄). But keeping content in databases isn’t wrong either. I tend to lean that way by default, since the power you get from being able to query is so obviously and regularly useful. Maybe they are both right in that having LLM tools that have the power to wiggleworm their way into the content no matter where it is, is helpful. In the codebase? Fine. In a DB that an MCP can access? Fine.
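For what it's worth, the "smashing Markdown into templates" job both posts are arguing about really is small at its core. Here is a minimal sketch in Python (a hypothetical file layout, not Lee's or cursor.com's actual setup):

```python
from pathlib import Path

# One template, in the spirit of a static site generator's core loop.
TEMPLATE = "<html><head><title>{title}</title></head><body>{body}</body></html>"

def render_page(markdown_text: str) -> str:
    """Smash one Markdown file into the template: title from the first
    heading, a crude paragraph conversion for the rest."""
    lines = markdown_text.strip().splitlines()
    title = lines[0].lstrip("# ").strip() if lines else "Untitled"
    body = "".join(f"<p>{line}</p>" for line in lines[1:] if line.strip())
    return TEMPLATE.format(title=title, body=body)

def build_site(content_dir: str, out_dir: str) -> None:
    """Every .md file in content_dir becomes an .html file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for md in Path(content_dir).glob("*.md"):
        (out / md.with_suffix(".html").name).write_text(render_page(md.read_text()))
```

Everything a cloud data store layers on top of this loop (fine-grained querying, binary assets, user management, CRUD screens) is roughly the distinction being drawn above.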

0 views
Robin Moffatt Yesterday

Evaluating Claude's dbt Skills: Building an Eval from Scratch

I wanted to explore the extent to which Claude Code could build a data pipeline using dbt without iterative prompting. What difference did skills, models, and the prompt itself make? I’ve written in a separate post about what I found ( yes it’s good; no it’s not going to replace data engineers, yet ). In this post I’m going to show how I ran these tests (with Claude) and analysed the results (using Claude), including a pretty dashboard (created by Claude):

0 views
Jampa.dev 2 days ago

Things I still wouldn’t delegate to AI

When it comes to AI, I consider myself a “skeptical optimist.” I think it has evolved a long way. I even (controversially) put it in my testing pipeline. But sometimes, when I see how others use it, I wonder: are we going too far? I’m not talking just about people simply handing over their email inbox to OpenClaw. I’m referring to major incidents like how “AWS suffered ‘at least two outages’ caused by AI tools.” Code is cheap now, and we can fully delegate it to AI, but coding is only a small part of our jobs. The others, like handling incidents caused by AI code, are not. In all the situations below, you'll notice a pattern: people think “AI can handle most of it, so why not all of it?” and here’s how that leads to disaster. The misuse of automation in hiring predates the rise of LLMs. Eleven years ago, I applied for a Django role and got rejected within two minutes, at 1 AM, because I needed to know more about “Python” for the job. The email seemed to be written by a person. I submitted a new application with just one word added and received an interview invitation… The rejection was because the scanner didn’t find the word ‘Python’. The main problem with companies that pull “clever” stunts like these is that they exclude great candidates. Not only that, but people will notice your flaws and share them publicly on platforms like Glassdoor, which can tank your reputation. Some argue that automation is necessary because applicant volume can become overwhelming. I disagree. During the COVID hiring surge, I reviewed over 1,000 resumes a year and never considered automating screening. The reason you shouldn't automate hiring is that it is the most important thing you do. Hiring well is the most important thing in the universe. […] Nothing else comes close. So when you’re working on hiring […] everything else you could be doing is stupid and should be ignored! 
— Valve New Employee Handbook

Even with 300 applicants each month, you can review all the resumes in less than an hour by using better judgment than AI. That one hour spent is more valuable than dismissing a potentially great candidate. Finding the right candidate early also reduces the hours spent on interviews. Now that people are embedding LLMs into the hiring process, the situation has worsened. I see many pitches for tools that claim to be better at evaluating candidates’ interview performance than a human, which is simply absurd. Hiring is a human process: you need to understand not only what they say that makes sense, but also what excites and motivates them, to see if they’ll be a good fit for the role. You can’t measure qualities like enthusiasm and soft skills with AI. It will only accept what the candidate says at face value. A candidate might claim they are passionate about working with bank accounting software in Assembly at your Assembly bank firm, but are they really? From my personal experience with AI review tools like CodeRabbit, Claude, and Gemini, I've noticed that a pull request with 12 issues results in 12 comments, but only about 6 are actual problems. The rest tend to be just noise or go unaddressed. This doesn't mean those tools are useless. Letting them do an initial pass is very helpful, and some humans wouldn't catch some of the issues they find, especially the deep logical problems. The issue with automated review tools is that they are becoming the de facto gatekeepers for deploying code to production, leading to future outages and a low-quality codebase. The inmates have taken over the asylum, and we now have AI reviewing code generated by AI. Review tools are very focused on checking whether your PR makes logical sense, such as whether you forgot to add auth behind a route, but they can't, for example, judge whether your code worsens the codebase. They can't raise the bar, which is the best part of human reviews.
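One class of problem these reviewers reliably miss: a parameter that looks dead locally can be load-bearing for a consumer the tool never sees. A contrived Python sketch (hypothetical names, both "repos" collapsed into one file for illustration):

```python
# Repo A: a "public" function. `legacy_format` is never read in the body,
# so an AI reviewer (or an AI doing cleanup) flags it as dead code.
def export_report(data, legacy_format=False):
    return {"rows": list(data)}

# What the AI's cleanup produces. Local linting and tests still pass.
def export_report_cleaned(data):
    return {"rows": list(data)}

# Repo B: a downstream consumer the tools never analyse still passes it.
def downstream_call(fn):
    try:
        fn([1, 2], legacy_format=True)
        return "ok"
    except TypeError:
        return "broken"
```

Here `downstream_call(export_report)` succeeds while `downstream_call(export_report_cleaned)` raises a `TypeError`; nothing in repo A's CI would have caught it.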
Every time we create or review a PR, it's a chance to learn how to become a better engineer and to leave the codebase in a better state than we found it. Comments from peers like “you are duplicating logic, you should DRY these components” encourage us to review our own code and improve as engineers. Relying only on AI review takes away that chance. Most incidents I observe happen because AI struggles to evaluate second-order effects; it overlooks the Chesterton fence. For example, an LLM might remove a parameter that looks unused but that a downstream consumer still needs, a breakage that linting won't catch. This reflects a limitation of current models: they can't review your code across repos. I'm tired of reading AI-generated writing: it just doesn't respect the reader's time. I see many AI-produced texts that could be shortened by a quarter without losing any important information. Reading emails, meeting notes, or technical documents filled with emoji spam and strange analogies (“it's not X, it's Y”) is tiring. When I see the words “Executive Summary,” I often hesitate to read on. I would have written a shorter letter, but I did not have the time. — Blaise Pascal There is power in simplicity and in respecting your reader's time. Most of my blog posts are cut by 50% just before I publish them. Most people I know who use AI for communication do so because they believe their writing is not good. But honestly, the goal of communication isn't grammar skills but to get the point across. Good grammar is often overrated anyway. One of my favorite documents is the leaked MrBeast memo PDF, which is full of grammatical and punctuation errors but clearly communicates its message through a “braindump”, much better than any LLM ever could. When you ask an LLM about your roadmap, you're likely querying what countless other companies with very different issues have already tried. 
The AI relies on patterns from its training data, and in my experience, those patterns tend to be too generic compared to the insights of a seasoned domain expert. If your software is meant for hospital accountants, do you think they take time to blog about the frustrations of their workflow? The knowledge is stored in their minds, and you need to extract it. This vital knowledge is never documented and thus never accessible to an LLM. I spent three years researching and working on accessibility for nonverbal individuals. If I ask the AI what this industry lacks, it will start discussing the need for better UX solutions (there are countless papers on this; I even naively wrote one). Still, I saw multiple companies enter the market with great UX products only to crash and burn. After a while, I realized that poor-UX apps still dominate adoption because these companies invest millions in lobbying, partnerships with insurance companies, and training, which is the thing no one talks about. I get many messages from bots on Reddit and LinkedIn about AI management tools, but as I mentioned before, they lack context. The worst part is that they think they can make judgments with the limited context they have. Here’s an example of a feedback tool output: “This engineer sucks, they do 40% fewer PRs than the median, I marked him as an underperformer… I also told your boss, HR and CTO about it, better do something!” - Some tool with a fancy name and a “.io” domain And yet, that engineer is one of the best I have worked with. The issue is that these tools try to outsmart the manager, which leads lazy managers to use the AI's suggestions as an excuse, resulting in poorly thought-out feedback because “The computer says no.” Think of current LLMs as an “added value tool”, not a product, and definitely not an expert. Most of what I wrote above is problematic because it overestimates what LLMs can do and lets them operate unsupervised. 
You can't go back in time after the AI makes a mistake, and there are no guardrails once a mistake is made. I received a lot of criticism for my post about using AI to select E2E tests in a PR pipeline. Yes, it sounds crazy, but this is the “added value” part: if the AI fails at selecting the right tests, we will catch it before deployment. The value provided is that having it is better than having no pre-checks at all. Before giving AI control, ask how resilient your system is when (not if) the AI screws up, and ensure you have stronger safety nets before delegating completely.

0 views
Robin Moffatt 2 days ago

How I do, and don't, use AI on this blog

I use AI heavily on this blog. I don’t use AI to write any content. As any followers of my blog will have seen recently, I am a big fan of the productivity—and enjoyment—that AI can bring to one’s work. (In fact, I firmly believe that to opt out of using AI is a somewhat negative step to take in terms of one’s career.) Here’s how I don’t use AI, and never will: I use AI heavily on this blog. I don’t use AI to write any content.

0 views
Ruslan Osipov 2 days ago

Writing code by hand is dead

The landscape of software engineering is changing. Rapidly. As my colleague Ben likes to say, we will probably stop writing code by hand within the next year. This comes with a move toward orchestration, and a fundamental change in how we engage with our craft. Many of us became coders first, software engineers second. There’s a lot more to software engineering than coding, but coding is our first love. Coding is comfortable, coding is fun, coding is safe. For many of us, the actual writing of syntax was never the bottleneck anyway. But now, you can command swarms of agents to do your bidding (until the compute budget runs out, at least, and we collectively decide that maybe junior engineers aren’t a terrible investment after all). The day-to-day reality of the job is shifting. Instead of writing greenfield code or getting into the flow state to debug a complex problem, you’re now multitasking. You’re switching between multiple long-running tasks, directing AI agents, and explaining to these eager little toddlers that their assumptions are wrong, their contexts are overflowing, or they need to pivot and do X, Y, and Z. And that requires endless context switching. Humans cannot truly multitask ; our brains just rapidly jump context across multiple threads. Inevitably, some of that context gets lost. It’s cognitively exhausting, but it feels hyper-productive because instead of doing one thing, you’re doing three—even if the organizational overhead means it actually takes four times as long to get them all over the finish line. This is, historically, what staff software engineers do. They don’t particularly write much code. They juggle organizational bits and pieces, align architecture, and have engineers orbiting around them executing on the vision. It’s a fine job, and highly impactful, but it’s a fundamentally different job. It requires a different set of skills, and it yields a different type of enjoyment. 
It’s like people management, but without the fun part: the people. As an industry, we’re trading these intimate puzzles for large-scale system architecture. An individual developer can now build at the scale of a whole product team. But scaling up our levels of abstraction always leaves something visceral behind. It was Ben who first pointed out to me that many of us will grieve writing code by hand, and he’s absolutely right. We will miss the quiet satisfaction of solving an isolated problem ourselves, rather than herding fleets of stochastic machines. We’ll adjust, of course. The field will evolve, the friction will decrease, and the sheer scale of what we can create will ultimately make the trade-off worth it. But the shape of our daily work has permanently changed, and it’s okay to grieve the loss of our first love. Consider this post your permission to do so.

0 views
Evan Hahn 3 days ago

How I use generative AI on this blog

Inspired by others, I’m publishing how I use generative AI to write this little blog. Generative AI, like any technology, has tradeoffs. I think the cons far outweigh the pros. In other words, the world would be better off without generative AI. Despite this belief, I use it. I’m effectively forced to at work, but I also use LLMs to help write this personal blog. I think they can produce better writing if used correctly. Also: I want to be critical of this technology. Specifically, I want to change the minds of “AI maxxers”, not preach to those who already hate it. If I never used this stuff, AI lovers wouldn’t listen to me. These people are more likely to respect criticism from a daily user who’s sympathetic to the benefits. I think there’s space for critique from a user of a technology they wish didn’t exist. I feel discomfort and tension about this, which I hope comes through. With that, let’s get to the specifics. My main rule of thumb: the final product must be word-for-word what I would’ve written without AI, given enough time. I prefer local models that run on my phone and laptop. I’ll keep this post updated. I use it in two main ways:

- Like a thesaurus. For example, I recently asked, “What’s another way to say that a book was overly positive, not critical of its subject matter?” I used one of its suggestions, “flattering”, in my final draft.
- Quick brainstorming for specifics. For example, I was listing types of software error in a recent post and asked it for more examples. I plucked one of its many answers—null pointer exceptions—and discarded the rest.

0 views
underlap 3 days ago

Moving on to Servo

It’s finally time to move on from implementing to using it in . But how? Since my last post, I’ve been using Claude Code extensively to help me complete :

- Added non-blocking receive: and methods.
- Revised to avoid describing internals.
- Added a Migration Guide on migrating from to .
- Added and , identified as missing by the Migration Guide.
- Improved test speed and reduced its variability.

I got into a pattern of having Claude write a spec (in the form of commented-out API with docs), write tests for the spec, and then implement the spec. Sometimes Claude failed to do the simplest things. I asked it to extract some common code into a function, but it got part way through and then started reverting what it had done by inlining the function. Pointing this out didn’t help. It’s as if the lack of a measurable goal [1] made it lose “direction”. It was faster and safer to do this kind of change by hand. I published v0.0.7 of the crate. This is functionally complete and ready for consumption by Servo. It is perhaps an ideal project for using Claude Code, with its:

- Relatively simple and well-documented API,
- Unit and integration tests (all of which run in under five seconds),
- Benchmarks,
- Strong typing (a given for Rust),
- Linting (including pedantic lints),
- Standard formatting,
- the ipc-channel API for comparison.

By comparison, Servo is challenging for Claude (and human developers), having:

- A complex API,
- Extremely slow tests,
- An enormous codebase.

And that’s not to mention Servo’s (anti-) AI Contribution policy.

I asked Claude to plan the migration of Servo from to using the Migration Guide it had written previously. It came up with a credible plan, including validating by running various tests. However, the plan didn’t include running the tests before making changes, to be sure they already passed and the environment was set up correctly. The first hurdle in getting the tests to pass was that Servo doesn’t build on Arch Linux. This is a known problem, and a workaround was to use a devcontainer in vscode, a development environment running in a Linux container. A prerequisite was to install Docker, which gave me flashbacks to the latter part of my software development career (when I worked on container runtimes, Docker/OCI image management, and Kubernetes). These aspects of my career were part of my decision to retire when I did; I had little interest in these topics beyond the conceptual side. After a bit of fiddling to install Docker and get it running, I tried to open the devcontainer in vscodium.

The first issue with this was that some 48 GB of “build context” needed transferring to the Docker daemon. This was later reduced to 5 GB. The second issue was that vscodium was missing some APIs that were needed to make the devcontainer usable. So I uninstalled vscodium and installed vscode. I was then able to ask Claude to proceed to check that the validation tests ran correctly. The program failed to run [2], so Claude used Cargo to run various tests. After implementing the first step of the plan, Claude mentioned that there was still one compilation error, but that this didn’t matter because it had been present at the start. This was a mistake along the lines of “AI doesn’t care”: any developer worth their salt would have dug into the error before proceeding to implement a new feature. Anyway, I got Claude to commit its changes in the devcontainer. I then found, when trying to squash two Claude commits outside the container, that some of the files in had been given as an owner, because the devcontainer was running under the container . I tried modifying the devcontainer to use a non-root user ( ), but then the build couldn’t update the directory, which is owned by . I considered investigating this further to enable a non-root user to build Servo inside a devcontainer [3], but at this point I started to feel like I was in a hole and should stop digging:

- Do I really want to continue using Claude Code?
- Would AI-generated code be acceptable to Servo developers? [4]
- Do I actually want to get back into wrestling with a mammoth project now I’m retired?

So I took a step back and decided to discuss the way forward with the Servo developers: Applying IPC channel multiplexing to Servo.

[1] Code complexity metrics seem to have fallen out of favour, but maybe some such metrics would help Claude to keep going in the right direction when refactoring. ↩︎
[2] I later got running, by installing , but not all the tests ran successfully (four hung). ↩︎
[3] The unit tests passed in the devcontainer with after applying (where is the relevant non-root user and group) to , , and . ↩︎
[4] Not according to Servo’s (anti-) AI Contribution policy. ↩︎

0 views
DYNOMIGHT 3 days ago

The modern formatting addiction in writing

Exhibit A: Here is some text. It is made out of words. And here are some bullet-points:

- The text can also contain pictures for you to look at with your eyes¹.
- The text can also include quotes.

The awful thing about life is this: Everyone has his reasons.

¹ There can also be footnotes; have an eye emoji: 👀

Exhibit B: Wait a second. This is also text. It is also made out of words. But instead of jerky fragments, these words are organized into sentences, like normal human language. Do you see how relaxing this is? After the torment you suffered above, isn’t it nice to have words that come in a simple linear order? And isn’t it nice that you just have to read the words, and not worry about how they fit into some convoluted implied knowledge taxonomy? These sentences are themselves organized into paragraphs. The first sentence of each paragraph is a sort of summary. So if you want to skim, you can do that. But you don’t have to skim. This text also has italics and parentheses and whatnot. But not too much. (Just a little.) Thanks for enduring that. My purpose was to illustrate a mystery. Namely, why do so many people today seem to write more like Exhibit A than Exhibit B? People sometimes give me something they wrote and ask for comments. Half the time, my reaction is Good god, why is 70% of this section titles and bullet points? This always gives me a strange feeling. It’s like all the formatting is based on some ontology. And that ontology is what I really need to understand. But it’s never actually explained. Instead, I guess I’m supposed to figure it out as things jerk between different topics? It’s disorienting, like a movie that cuts between different scenes every three seconds. But maybe that’s just my opinion? Maybe, but sometimes I’ll ask people who write like this to show me some writing they admire. And inevitably, it’s not 70% formatting, but mostly paragraphs and normal human language. 
So I feel that people who write this way are violating the central tenet of making stuff, which is to make something you would actually like. So then why write like that? Why do I, despite my griping, often find myself writing like that? I’ve wondered this for years. But I told myself that I was right and that too much formatting is bad. But now—have you heard?—now we have this technology where computers can write stuff. And guess what? When they do that, they also use an insane amount of formatting. That’s weird. I figured people were addicted to formatting because they’re noobs that don’t know any better. But AIs have been optimized to make human raters happy. And that led to a similar addiction. Why? The obvious explanation is that formatting is good. People love reading stuff that’s all formatting. We should all be formatting-maxxing. There’s something to this. But it can’t really be right, because popular human writers use formatting in moderation. So formatting can’t be that good. Even before AI, everyone did agree that formatting was great in one context: search-engine-optimized content slop. Back in 2018, if you searched for anything, you’d find pages brimming with section titles and bullet points. Why? Well, when I type “why human gastric juice more acidic than other animals”, I’m not really looking for something to read. I just want to skim an overview of the main theories. I’ve experimented with asking AIs to give the same information in various styles, and I reluctantly concede that the formatting helps. But that’s not reading. Say you’ve written a ten-thousand-word manifesto on human-eco-social species enhancement. If I actually care about what you think, I maintain that it’s better in paragraphs, because reading ten thousand words with endless formatting would be excruciating. This is why everyone who writes long-form essays that people actually read uses normal paragraphs. So our mystery is still alive. 
Most writers aspire not to write content slop, but meaningful stuff other people care about. Often, when people show me formatting-maxxed essays, I’ll complain and they’ll rewrite it with less formatting and agree that the new version is better. So why use so much formatting even when it’s bad? There’s something odd about that previous example. When I search for “why human gastric juice more acidic than other animals”, why am I not looking for something to “read”? After all, I like reading. If one of my favorite bloggers wrote an essay on the mystery of human gastric juice, I would devour it. So if I want a good essay, why don’t I look for one? I guess it’s because I instinctively rate my odds of finding one on any random topic as quite low. There’s something here related to Gresham’s law: A format-maxxed essay might be sort of crap, but at least I can ascertain its crap level quickly. A “real” essay could be great, but I’d have to invest a lot of time before I can know if that time was worth investing. So I—regretfully—mostly only read “real” essays when I have some signal that they’re good. If everyone behaves the way I do, I guess people will respond to their incentives and write with lots of formatting. Similarly, if a (current) AI tried to write a “real” essay, I probably wouldn’t read it, because I wouldn’t trust that it was good. Perhaps that explains why they don’t. Aside: If this is right, then it predicts that as AIs advance, they should become less formatting-crazy. The better they are, the more we’ll trust them. Some people can think of an idea, organize their thoughts, and then write them down, tidy and sparkling. I am not one of those people. If I mentally organize my ideas and go to write them down, I soon learn that my ideas were not in fact organized. Usually, they’re hardly even ideas and more a slurry of confused psychic debris. The way I write is that I make an outline. Or, rather, I try to make an outline. 
But then I realize the structure is off, so I start over. After a few cycles, I give up and just write the first section. After revising it eight times, I’ll try (and fail) to make an outline for the rest of the post. This continues—with occasional interludes where I reorganize everything—until I can’t take it anymore and publish. I don’t recommend it. My point is just that blathering out a bunch of text is a good way to think. And when blathering, formatting seems to help. Partly, I think that’s because formatting allows you to experiment with structure without worrying about the details. And partly I think that’s because formatting makes it easier to get down details without worrying about the bigger picture. So maybe that’s one source of our formatting addiction? We blather in formatting, but don’t put in the work to clarify things? Oddly, some claim that something similar is true for AI: If you tune them to write with lots of formatting, that doesn’t just change how the content looks , but also improves accuracy . The idea is that as the AI looks at what it’s written so far, formatting helps it stay focused on the most important things. Supposedly. Maybe that’s true. But we have “reasoning” AIs now, that blather for a while before producing a final output. If they wanted, they could format-maxx while thinking and output paragraphs at the end. But they don’t. So while this explanation might work for people, I don’t buy it for AI. Finally, a conspiracy theory. Sometimes when I try to fight through a format-maxxed essay, it seems like all the formatting speaks to me. It says: “This is a nonlinear web of ideas. I’m giving you the pieces. If you pay attention, you should see how they fit together. Sadly, the world isn’t a simple narrative I can spoon-feed to you. So this is the best I can do.” I think this is a bluff. And it’s a good one, because it’s based in truth. The world is not a narrative. 
Narratives are lies we tell ourselves to try to cope with the swirl of complexity that is reality. All true! However, narratives are all we’ve got. If you want to understand something with your tiny little brain, you don’t really have a lot of other options. But imagine you’re an AI. You’re being trained to respond in ways that make human raters happy. You can remember most knowledge ever created, but you’re so-so at synthesizing it into new ideas. If someone asks you a question and you try to write a beautiful narrative, your response might look like confusing babbling, meaning your raters say, “Bad AI. Stop that.” Whereas if you output a bunch of section titles and bullet points, raters might say, “This seems OK.” So you’ll start doing the latter. That’s not bad.
Arguably, you (you’re still an AI) are responding in the way that’s most useful, given your capabilities. But you are also responding in a way that gives a misleading impression that you’ve figured out how everything fits together, even if you haven’t. I suspect something similar happens with humans. Say you have a bunch of ideas, but you haven’t yet sewn them together into a clear story. If you write paragraphs, people will probably view them as confused babbling. Whereas if you write with lots of formatting, people might still be at least somewhat positive. Just like AIs, we all respond to our rewards. More importantly, if you’ve written something that’s 70% formatting, it’s easy to delude yourself that there’s a clear set of ideas underneath, even when there isn’t. The good news is that if you put in the effort, you can write better paragraphs than AI (for now). The act of creating a narrative forces you to confront contradictions that are invisible in format-world. So even if you want to write with 70% formatting, consider forcing yourself to write in paragraphs first.

Theory: Both people and AIs are addicted to formatting because:

- Formatting is good. Sometimes. Especially if you don’t trust the author. On the internet, most people probably don’t trust you.
- It’s harder to see that something has problems when it’s written in all-formatting.
- It’s easier to blather out a bunch of formatting than to write lucid paragraphs. This is good at some stages, because it’s easy. But forcing yourself to actually write a narrative is also good, because it’s hard.

First write with lots of formatting. Then figure out how to remove it. Then put it back, if you want.

How does the optimal amount of formatting vary with the length of a piece of writing? I suspect it’s like this: Here is one. Here is another. Here is a numbered list. And now: Look at this. Bullets inside a number inside a section inside a section. What a time to be alive. Actually, let’s do one inside of a list. A deeply nested list. This is going to be awesome. The awful thing about life is this: Everyone has his reasons. Are we currently in a section or sub section or a sub sub section? What parent section encloses this one? Where are we in the hierarchy? What are we doing?

Robin Moffatt 3 days ago

Claude Code in action with dbt

This is an addendum to the main post about using Claude Code with dbt. It shows an excerpt of a Claude session log so you can see exactly what goes on "under the covers". For full details of the prompt, commentary, and conclusions, see Claude Code isn’t going to replace data engineers (yet). Here we can see the steps that Claude Code takes as it figures out anomalies in the data for itself and adapts the dbt model to handle them.

ava's blog 3 days ago

'human oversight' is a meaningless buzzword

When talking about using AI for decision-making, you often hear that there will be "human oversight" or "human intervention". One popular example that I have come across in conferences and webinars about data protection law is the hiring process and recruiting: Companies are already proudly using AI to select applicants. It summarizes CVs, compares qualifications with the job profile, and ranks candidates. At the end, HR decides who to invite for interviews based on this output. The fact that the AI isn't sending out the interview invitations itself immediately, and that a human is instead required to write an email or press a button, is the idolized "human oversight". The fact that someone could intervene and make a different decision is supposed to be enough.

What bothers me is that despite being ranked as "high risk" under the AI Act (together with using AI for medical diagnosis, financial and legal advice, etc.), we aren't looking at how these systems are realistically used in practice. We shove a human in the loop ("HITL") somewhere to assuage fears and comply with legal requirements, but almost no one wants to talk about the fact that

- while HR does receive training on how to use AI and how it works, the reasoning behind AI selection and summaries is a black box for the users,
- AI recruiting is advertised as a huge time save, which stands in contrast to the checking you should technically be doing as a human to make sure the AI did a good job,
- most users will follow the AI recommendations blindly because they are presented in a way that sounds plausible and
- as time goes on, we get lazy and suffer from automation bias and oversight fatigue.

Think about it: You have an IT company that gets 400-600 applications on each open spot. Spending time on every single application weeding people out takes a lot of time. You want to save time using AI so the people whose CVs and motivational letters most closely match the job description are already pre-selected for you and ranked. You know the next few weeks will bring new application deadlines again and you're already behind. You just can't check all of the applications to see whether the AI messed up or not. You can do a random check here and there, but at what point will you just look at the top candidates, check their applications, see it was correctly summarized (or well enough), and assume the rest of applicants that weren't considered were assessed correctly as well? Why would you look at all or most of the applications again anyway when the AI system is advertised as saving you that time and step entirely? If anything, the human intervention here is for the companies - making sure that the AI didn't accidentally rank someone top that is completely unfitting for the task. It's not there for you. No one will notice if your perfectly fitting application has been disregarded by AI for no discernible reason, and no one will find it as part of the oversight process in the hundreds of other applications to make sure. If the AI makes the task quicker and the first top candidates sound fitting and plausible, that's it, nail in the coffin, why would HR put in more work?

All you can realistically do is ask them to explain and re-check each rejection where you know you were a good fit and AI was used. If you don't do that, you can't know whether you've been unjustly treated by their AI hiring process or were rejected on a justifiable basis. As long as AI continues to hallucinate or leave things out inexplicably just to say sorry afterwards, this is a huge liability. Companies don't seem to really care for possible poor data quality, biases and systemic inequities that are subtle or deeply embedded, requiring more work and possibly an outside view to detect and mitigate. We are lacking nuanced oversight mechanisms, and I hope companies are prepared for the lawsuits this will generate.

If a company wants to use AI in the hiring process, I'd at least expect them to do the following bare minimum:

- having a clear documentation of AI capabilities and limitations for their employees
- incentivizing taking the time to question AI suggestions and do some 'manual' labor
- requiring detailed justification when accepting the AI suggestions/rankings
- the ability to explicitly name why the disregarded applications were denied by their AI system in each case (you're going to need this anyway when an applicant challenges the decision)
- testing the system and the employees by periodically entering a candidate application that should fit perfectly vs. one that is very unfitting, and seeing where they land and what HR does with them (similar to the existing practice of IT sending out fake phishing e-mails sometimes to test you)
- collecting decision patterns and errors to correct and adjust the AI system

Unfortunately, companies have no incentive to do this! This is seen as more bureaucracy, more time and money wasted, restrictive to innovation. They're competing with companies who are grabbing talent even faster than them who don't give a shit about fairness in AI hiring. Each day they don't find a replacement or candidate for a new role is bad. And why hire more HR personnel to sift through hundreds of applicants if less HR personnel can handle it with AI? Organizational priorities and financial pressures don't allow enough checks and considerations to go into this delicate process.

We need to question "human oversight" more closely and require more explanations on how companies plan to combat opaque decision-making, automation bias and the pressure to optimize and make work as easy as possible. Until adequate systems are in place that combat this, it will always be ineffective and a buzzword to me.

Reply via email

Published 11 Mar, 2026

iDiallo 3 days ago

Where did you think the training data was coming from?

When the news broke that Meta's smart glasses were feeding data directly into their Facebook servers, I wondered what all the fuss was about. Who thought AI glasses used to secretly record people would be private? Then again, I've grown cynical over the years. The camera on your laptop is pointed at you right now. When activated, it can record everything you do. When Zuckerberg posted a selfie with his laptop visible in the background, people were quick to notice that both the webcam and the microphone had black tape over them. If the CEO of one of the largest tech companies in the world doesn't trust his own device, what are the rest of us supposed to do? On my Windows 7 machine, I could at least assume the default behavior wasn't to secretly spy on me. With good security hygiene, my computer would stay safe. For Windows 10 and beyond, that assumption may no longer hold. Microsoft's incentives have shifted. They now require users to create an online account, which comes with pages of terms to agree to, and they are in the business of collecting data: "As part of our efforts to improve and develop our products, we may use your data to develop and train our AI models." That's your local data being uploaded to their servers for their benefit. Under their licensing agreement (because you don't buy Windows, you only license it) you are contractually required to allow certain information to be sent back to Microsoft: "By accepting this agreement or using the software, you agree to all of these terms, and consent to the transmission of certain information during activation and during your use of the software as per the privacy statement described in Section 3. If you do not accept and comply with these terms, you may not use the software or its features." The data transmitted includes telemetry, personalization, AI improvement, and advertising features. On a Chromebook, there was never an option to use the device without a Google account.
Google is in the advertising business, and reading their terms of service, even partially, it all revolves around data collection. Your data is used to build a profile both for advertising and AI training. None of this is a secret. It's public information, buried in those terms of service agreements we blindly click through. Even Apple, which touts itself as privacy-first in every ad, was caught using user data without consent. Tesla employees were found sharing videos recorded inside customers' private homes. While some treat the Ray-Ban glasses story as an isolated incident, here is Yann LeCun, Meta's former chief AI scientist, describing transfer learning using billions of user images: "We do this at Facebook in production, right? We train large convolutional nets to predict hashtags that people type on Instagram, and we train on literally billions of images. Then we chop off the last layer and fine-tune on whatever task we want. That works really well." That was seven years ago, and he was talking about pictures and videos people upload to Instagram. When you put your data on someone else's server, all you can do is trust that they use it as intended. Privacy policies are kept deliberately vague for exactly this reason. Today, Meta calls itself AI-first, meaning it's collecting even more to train its models. Meta's incentive to collect data exceeds even that of Google or Microsoft. Advertising is their primary revenue source. Last year, it accounted for 98% of their forecasted $189 billion in revenue. Yes, Meta glasses record you in moments you expect to be private, and their workers process those videos at their discretion. We shouldn't expect privacy from a camera or a microphone, or any internet-connected device, that we don't control. That's the reality we have to accept. AI is not a magical technology that simply happens to know a great deal about us. It is trained on a pipeline of people's information: video, audio, text. That's how it works.
If you buy the device, it will monitor you.

Stratechery 3 days ago

Oracle Earnings, Oracle’s Cloud Growth, Oracle’s Software Defense

Oracle crushed earnings in a way that speaks not only to the secular AI wave they are riding but also to Oracle's strong position

Giles's blog 4 days ago

Writing an LLM from scratch, part 32e -- Interventions: the learning rate

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I have this code to create the optimiser: The values in there -- for the learning rate, and for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book. What do those values actually mean, and are those really the right values for them? I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before. The weight decay was pretty much an unknown for me -- I know it is a parameter controlling the behaviour of the optimiser, but I don't know how it does that. In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later. If you're reading this blog, you almost certainly know what the learning rate is, but let's go over it briefly to build a solid foundation. The way it's normally explained, using simple gradient descent, goes something like this. Let's assume that we're training a model with just one parameter, and it starts off set to -5. We run some training data through, and get a loss, let's say 44.44. We don't know what shape our loss curve is (if we did, we might be able to find the lowest loss algebraically), but we do know the gradient of the loss with respect to the parameter at the point we've measured; it happens to be -13.
That is reasonably large and negative. We use that information to say that we want to move in the direction of a larger value for our parameter -- that is, in our case where the gradient is negative, so we have a downhill slope towards the right, we want to increase the parameter to move rightwards on that chart, whereas if it were positive (an uphill slope) we'd want to decrease the parameter to move leftwards. Simply subtracting the gradient from the parameter would lead to an update in the right direction, but it would be a very large one in this case -- we'd move 13 units to the right -- so we multiply the gradient by a small positive number, the learning rate (often written as a lower-case eta, like this: η), to move a small distance in that direction. Let's say η = 0.3. That means we want to update our parameter: -5 - 0.3 × (-13) = -1.1. So now we run that through and get a new loss -- let's say it's 9.06 -- and a new gradient, which happens to be -5.2. Now we can do another update, and our parameter will become -1.1 - 0.3 × (-5.2) = 0.46, so we use that and work out another loss and gradient, which come to 3.3816 and -2.08. Let's plot that one, but this time we'll draw back the veil and show the actual loss curve. Now, it's worth reiterating that while we're training this model we don't know what that curve looks like -- we're just finding points on it, along with its gradient at those points, and using that information to work out which parameter value to explore next. But it's pretty clear that as we continue, if the learning rate is the right kind of size, we'll get to the minimum eventually, because -- due to the nice smooth U-shape of the curve -- the gradient gets smaller the closer we get to the minimum.
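The two updates just described can be reproduced in a few lines of Python. This is just a sketch of the arithmetic: the start value, learning rate, and measured gradients are the numbers from the walkthrough, and the loss curve itself stays hidden.

```python
# One-parameter gradient descent, using the numbers from the walkthrough:
# start at -5, learning rate 0.3, measured gradients -13 and then -5.2.
param = -5.0
eta = 0.3  # the learning rate
for grad in [-13.0, -5.2]:
    param = param - eta * grad  # step "downhill", scaled by the learning rate
print(round(param, 2))  # 0.46, matching the value in the text
```

The first update lands on -1.1, the second on 0.46, exactly as in the worked example above.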
It's also pretty clear that if the learning rate is smaller than an optimal value, in this simple case we will still find the right point, but it will take more steps because each one is smaller: And, of course, if the learning rate is too high, we might never converge -- we'd "bounce out of" the dip, and wind up with a parameter value that endlessly cycles between increasingly smaller and increasingly larger values, zooming off to infinity: OK, that's the basics. Why might we want to change from something that seems so logical and simple? A few paragraphs back I said: "due to the nice smooth U-shape of the curve, the gradient gets smaller the closer we get to the minimum". What if it doesn't? Imagine if we had something more like a V-shaped curve, like this: The gradient does not decrease as we get closer to the minimum, and so while we're in the downward-sloping part, each update is exactly the same distance: Now, eventually we'll jump over the minimum: In this example, I've used a gradient of -8.33 on the downward-sloping part of the curve, and +8.33 on the upward-sloping part, so that means that our next update just bounces us back to where we were before! Because the gradient isn't decreasing the closer we get to the minimum, we wind up just oscillating around it. That's not very helpful. That's a slightly contrived example (though not entirely -- intuitively, with functions like ReLU or GELU in our real LLMs, it's easy to imagine crazy loss landscapes). But it does show that perhaps we might want to add in our own "artificial" way to decrease the size of the steps we take over the course of training our model rather than just relying on the gradients naturally flattening out for us. Another way of looking at things is that as the model gets trained, we don't want batches of very new-looking data to cause big updates, taking us away from what was a good part of the loss landscape in terms of what we've seen so far.
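The V-shaped example above can be simulated directly. The ±8.33 gradient is from the text; carrying over the 0.3 learning rate from the earlier example is my assumption:

```python
# Gradient descent on a V-shaped loss, loss(x) = 8.33 * |x|: the gradient
# is always -8.33 or +8.33, so every update has the same fixed size
# (0.3 * 8.33 ≈ 2.5), and once we straddle the minimum at 0 we just
# bounce back and forth across it forever.
eta = 0.3
x = -5.0
trace = []
for _ in range(6):
    grad = -8.33 if x < 0 else 8.33
    x = x - eta * grad
    trace.append(round(x, 3))
print(trace)  # settles into an endless two-value oscillation around 0
```

Unlike the U-shaped case, nothing here ever shrinks the step size, which is exactly why we might want to shrink it ourselves.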
For example, imagine you've been training an LLM on a bunch of documents, which have so far been in English. Halfway through, it encounters a document in Byzantine Greek, the loss skyrockets, and you do a big update. That would be a problem! You might want it to learn a bit from it to push it slightly in a "the world is multi-lingual" direction, but you don't want it to lose a big chunk of the value from its previous training. You might also see a kind of connection to the way that people learn over the course of their lives -- for babies, everything is new and they "update their parameters" constantly as they try to understand the world. Children are still pretty flexible, but as we get older we tend to update our beliefs less and less. That's not always optimal, but as a heuristic it's pretty adaptive. Anyway, in general: for most training runs, we're going to want the learning rate to adjust over time. Most of the time this will be by reducing it, though there can be cases for increasing it again for periods. The general case of doing this is called "learning rate scheduling". There are a bunch of ways that people adjust the learning rate over the course of a train; here are a few that cropped up a lot while I was researching this. If we want the learning rate to go down over time, and we know how many steps we're training for, we can just set it to (say) 0.0004 for the first quarter of our train, then 0.0002 for the next, then 0.0001, then finish off with 0.00005, like this: That can work pretty well! But there is one obvious oddity -- the big step changes in learning rate mean that the exact placement of the drops and the training data before and after can matter. Why are we treating the data and the state of the model immediately before and immediately after so differently? It would make more sense to have a smoother schedule. What functions decay smoothly like that? 
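The stepped schedule described above can be written as a simple function of the training step; a sketch, using the quarter boundaries and values from the text:

```python
def stepped_lr(step, total_steps):
    # Piecewise-constant learning rate: one value per quarter of the train.
    rates = [0.0004, 0.0002, 0.0001, 0.00005]
    quarter = min(step * 4 // total_steps, 3)  # which quarter are we in? (0..3)
    return rates[quarter]

print(stepped_lr(0, 40_000))       # 0.0004 in the first quarter
print(stepped_lr(39_999, 40_000))  # 0.00005 in the last
```

The discontinuities at the quarter boundaries are precisely the "big step changes" complained about above, which is what motivates looking for a smoother curve.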
An exponential curve does: let's say we just multiply the learning rate by a number that is a little smaller than one every step, so that it drops smoothly like this: But there are lots of other curves like that, and one is particularly interesting: As you change θ from 0 to π, the value of cos θ goes smoothly from 1 to -1, so it's easy enough to rescale that so that our learning rate follows the same curve: This is called a "cosine annealing" or "cosine decay" schedule, and was apparently inspired by the algorithms used for simulated annealing (an optimisation algorithm that was in turn inspired by how the atomic structures form in metals as they cool -- another one for the list of things to look into in the future...) That solves the mystery from earlier: the cosine that the Chinchilla paper was talking about was exactly this. As it turns out, the cosine decay scheduling curve is quite popular in deep learning, because it has what amounts to two well-defined phases -- an initial high learning rate where lots of exploration of the loss landscape can happen, followed by a smooth transition to something more like fine-tuning to optimise the location in whatever part of the loss landscape we've wound up in. Now, all of the above are assuming that we want the learning rate to start high and finish low, so that we can mimic the textbook gradient descent that we had at the start of this post. Intuitively that feels nice, but on further thought, the important thing is really that we have a low learning rate at the end of the train, so that we can find as close a point as possible for the minimum at the part of the loss landscape we've found ourselves in. But perhaps there's a case for having both high and low periods during the train, so that we don't get stuck in a local minimum -- something to jolt us out of where we were every now and then?
With a step function, that's easy: you could, for example, do this: With an exponential, you could do something like this: With cosine decay, of course, things are even easier, because the cosine function is inherently cyclical, so we can just do this: However, at least for our purposes, training an LLM using a Chinchilla-optimal number of training tokens, it makes sense to be guided by what the authors of the Chinchilla paper did. Appendix B says: "We find that setting the cosine cycle length too much longer than the target number of training steps results in sub-optimally trained models, as shown in Figure A1. As a result, we assume that an optimally trained model will have the cosine cycle length correctly calibrated to the maximum number of steps, given the FLOP budget; we follow this rule in our main analysis." So, at this point, I think we have one important part of the intervention we want to make: we want to use a cosine learning rate scheduler, going from high near the start of the training run, down to low at the end over one cycle. Additionally, and also from appendix B in the paper: "we use a 10x learning rate decay in line with Rae et al. (2021)" ...which means that if our learning rate starts at η, then we want it to decay down to η/10 by the end. So, we just need to work out an initial value for η, and let it rip, right? Well, not so fast... When our model is uninitialised, right at the start of the train, gradients are going to be pretty wild. It's going to be making random errors all of the time, and we'll be making huge jumps across the loss landscape. That sounds bad. Additionally, those kinds of wild jumps can get the optimiser into a -- well, sub-optimal -- state. I haven't read enough about optimisers yet to have a solid handle on that, but that can wait -- intuitively it makes some kind of sense that erratic gradient updates might confuse it.
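Putting the Chinchilla guidance into numbers: a sketch of a single cosine cycle spanning the whole train, decaying from η down to η/10 (the 4e-4 peak here is just the existing rate from the training code, used for illustration):

```python
import math

def cosine_lr(step, total_steps, lr_peak):
    # One cosine cycle over the whole train: theta goes 0 -> pi, so
    # cos(theta) goes 1 -> -1; rescale that range to [lr_peak, lr_peak/10],
    # giving the 10x decay the Chinchilla paper describes.
    lr_min = lr_peak / 10
    theta = math.pi * step / total_steps
    return lr_min + 0.5 * (lr_peak - lr_min) * (1 + math.cos(theta))

print(cosine_lr(0, 32_000, 4e-4))       # starts at the peak
print(cosine_lr(32_000, 32_000, 4e-4))  # ends at a tenth of the peak
```

Setting the cycle length equal to the total number of training steps is exactly the "correctly calibrated" cycle from Appendix B.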
So, it makes a certain amount of sense to start off with a low learning rate so that we don't do that, and then to increase it gradually to the peak, and only then to schedule the gradual cosine decay. According to this (rather nice looking) masterclass on LLM training, it's typical to do this over "a few thousand steps or a small percentage (e.g., 1-10%) of the total training steps, depending on the dataset size and batch size", and we would just use a linear increase over that period: I think we should do that; a simple linear warmup at the start -- let's relatively arbitrarily say 5% of our training steps going up to our desired peak learning rate. So our learning rate schedule should look something like this: So far I've written a lot about how we vary the learning rate over time, and that's all been very useful. But we still need to know what the value should be initially! In smaller-scale experiments you might just try a bunch of different numbers to see what worked well, but at more than US$30 per train, that's not practical here. Unfortunately it's really quite hard to find good suggestions published anywhere. The GPT-2 paper is (as usual) reticent: "The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText" ...and if you search for "learning rate training llm", you'll see lots of results for when people are fine-tuning existing LLMs (2 × 10^-4 comes up a lot), but almost nothing about when you're training one from scratch. I eventually came across this (long!) post from Hugging Face, which I definitely need to spend time going through in the future, because it covers a lot of the ground I've been going over in this post series. But for this post, I think the most relevant part is in the section "Scaling Laws for Hyperparameters", where they include a figure from this DeepSeek paper.
Here it is, with some of the (also relevant) surrounding text: In our trains we're using something like 5 × 10^18 total FLOPs. Now, they are specifically charting things in terms of non-embedding FLOPs, but I'm going to play a little fast and loose here and ignore that, so reading off their chart, that looks like we should be using about 1.4 × 10^-3 as our learning rate. We can double-check that against their formula, where C is the compute budget: η_opt = 0.3118 · C^(-0.1250), which for C = 5 × 10^18 gives about 1.4 × 10^-3. Nice, a close match! However, it's definitely worth noting that we're using a simple GPT-2 architecture, and they are using something quite different -- RMSNorm instead of LayerNorm, SwiGLU as the activation function on the feed-forward networks, Rotary Position Embedding rather than the fixed ones we're using, and so on. As a sanity check: you can see that they also give a formula for the optimal batch size in terms of tokens (B_opt = 0.2920 · C^0.3271). For our FLOP budget, that comes in at 381,782, which is about 373 of our 1,024-token sequences. That is quite a lot higher than the 97-or-so sequences that appeared to be optimal in our earlier experiments. That is a little concerning, though of course the 97 number came out of a very ad-hoc bit of curve-fitting. For now, I'm going to hope that that doesn't matter too much for the learning rate. This may come back to bite me; if the results of a train with 1.4 × 10^-3 are radically worse than the existing rate of 4 × 10^-4, I'll have to do a bit more investigation. So, now I think we have all of the theoretical pieces in place to do a train. Let's move on to the practicalities. We started by looking at this: What should we change -- disregarding the weight decay until the next post? Based on the above, we want to do a linear warmup of about 5% of our steps, going up to a learning rate of 1.4 × 10^-3, followed by a cosine decay down to one tenth of that, 1.4 × 10^-4. What does that look like in code?
The relevant API for scheduling the learning rate in PyTorch is, logically enough, in the torch.optim.lr_scheduler module, and there are a bunch of different scheduling classes. You create your optimiser, then create a scheduler for the shape you want, and then you can call step on the scheduler (after the step on the optimiser) to adjust the optimiser's learning rate over time. Let's make that more concrete; one of the schedulers is LinearLR, which is what we'll need for our linear warmup period. It takes as its parameters:

- optimizer, which is the optimiser we're applying it to.
- start_factor, which the optimiser's learning rate is multiplied by to work out the value we want to start at.
- end_factor, which is likewise applied to the optimiser's learning rate to work out the value we're heading for.
- total_iters, which is the number of steps over which it should go from the initial learning rate to the final one.
- last_epoch, which lets the scheduler know how many steps into its schedule it currently is -- this defaults to -1, meaning it hasn't started yet. This can be useful if you're resuming from a checkpoint, but for our purposes we can ignore it.

Let's say that we want to go from almost-zero to our optimiser's learning rate over 1,600 steps -- we'd create our scheduler like this: ...then in our training loop, after we've done the scaled step of the optimiser, we'd also step the scheduler: This confused me a little bit the first time I saw it; after all, if the scheduler hasn't been "triggered" when we step the optimiser, how does the optimiser know what learning rate to use? Surely it would just use whatever it was initialised with? The answer is that when you create the optimiser, it stores away the learning rate that you give it in two places -- an "initial learning rate" and a "current learning rate". Next, when you create your scheduler, it uses the initial learning rate to work out the start and end values, and then sets the current one to the start value immediately. Just by creating a scheduler, you're changing the optimiser's current learning rate -- but not the initial one, which is important, as we'll see in a moment. So, we have a scheduler that handles our warmup period nicely. Another scheduler that's relevant to our interests is the CosineAnnealingLR. This takes:

- optimizer, which is the same as LinearLR's.
- T_max, which is the number of steps before it reaches its minimum.
- eta_min, the minimum learning rate we want to get to.
- last_epoch, again the same as LinearLR's.

On creation, this scheduler will read in the optimiser's initial learning rate -- note, not the current one -- and then the first time it's stepped, it will set the current learning rate to that value, and then for steps after that it will reduce it so that it follows a nice cosine decay, reaching eta_min after T_max steps.
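A minimal runnable sketch of that warmup setup -- the model here is a stand-in linear layer, and the forward/backward pass (and the gradient scaler from the real training loop) is elided:

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=1.4e-3)

# Ramp from (1/1600) of the learning rate up to all of it over 1,600 steps.
warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1 / 1600, end_factor=1.0, total_iters=1600
)

# Just creating the scheduler has already dropped the *current* learning rate:
print(opt.param_groups[0]["lr"])   # ≈ 8.75e-07, that is 1.4e-3 / 1600

for step in range(1600):
    # ...forward pass, backward pass elided in this sketch...
    opt.step()
    warmup.step()  # adjust the learning rate *after* stepping the optimiser

print(opt.param_groups[0]["lr"])   # back up to the full ≈ 1.4e-03
```

Note that the first print happens before any stepping at all, demonstrating the "creating a scheduler changes the current learning rate" behaviour described above.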
So those two cover the two regimes that we want -- the warmup and then the cosine decay. But now we need to put them together; we want to do one and then the other. There's a very useful class, SequentialLR, which allows you to chain schedulers and tell it when each one takes over from the previous one. Let's sketch out some code to use that to do a train with our new peak learning rate of 1.4×10⁻³, a warmup of 1,600 steps, followed by a cosine decay over the next 32,000 steps down to one tenth of the peak learning rate: That actually works quite nicely! I wrote a dummy training loop to plot the current learning rate over a fake train using code like the above, and got this: ...with the output confirming that the values were good at the "milestone" point, the start and the end: I was initially a bit surprised by that, as at the time I ran it, I didn't realise that there was that split between the initial and the current learning rates on the optimiser, so I thought that the cosine scheduler would pick up whatever tiny starting value the warmup scheduler had overwritten the optimiser's learning rate with -- but that split saves the day. That means that now we have the outline of how to schedule our learning rate. But before we can put that into the code, we need to think about how it affects our checkpoints. Just like the scaler and the optimiser, the learning rate scheduler -- or, indeed, our two schedulers here -- contains information about the state of the train. That means that if we recover from a checkpoint, we need to provide them with the state they had when it was saved. If we just created them afresh, they'd start from the beginning -- for example, if we restarted from step 20,000 in a train like the one above, we'd start a new warmup from pretty much zero, and then start a fresh cosine decay. That would be bad: (Dummy test code here.) Now, we could use the last_epoch parameter to initialise them with the correct current global step.
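A dummy version of that chained setup, along the lines of the fake-train plotting experiment -- a single stand-in parameter, no real gradients, just checking the shape of the schedule:

```python
import torch

p = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([p], lr=1.4e-3)

warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1 / 1600, end_factor=1.0, total_iters=1600
)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=32_000, eta_min=1.4e-4)

# The warmup runs until step 1,600, then the cosine decay takes over.
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, decay], milestones=[1600]
)

lrs = []
for step in range(33_600):
    opt.step()   # gradients elided in this dummy loop
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])

print(f"at the milestone: {lrs[1599]:.2e}")  # the peak, ≈ 1.4e-03
print(f"at the end: {lrs[-1]:.2e}")          # one tenth of the peak, ≈ 1.4e-04
```

The milestone and end values come out as expected because the cosine scheduler works from the optimiser's initial learning rate, not the tiny current one that the warmup scheduler set on creation.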
But they have a state dict, like most other PyTorch objects, so the simplest thing to do is just to write that to another checkpoint file: ...and then load it likewise: (Dummy test code here.) Conveniently, if you save the state dict of a SequentialLR, it will also include the state of all of its component schedulers, and likewise if you reload it, it will load the components' states back in too. The one thing you have to be careful about is what they warn about in the PyTorch docs: Initializing a scheduler overwrites its optimizer's learning rates. When restoring a checkpoint, initialize the scheduler before calling your optimizer's load_state_dict to avoid overwriting the loaded learning rates. Luckily enough, in our code as it stands, we create all of the things that are checkpointed -- the optimiser and the scaler so far, but shortly the scheduler as well -- before we load in the state dicts, so that drops out quite nicely. So, we have some sketched-out code -- it's time to put it in place for the real training run. I won't go through the details of the changes to my existing DDP training code, though you can see the diff here if you're interested. Much of the complexity was due to keeping backward compatibility so that we don't have to always use a learning rate scheduler; remember that in this mini-series, I'm trying out various changes ("interventions") to the training loop in isolation, seeing whether each one improves things. So it's important to be able to easily train with or without learning rate scheduling; I did that with a flag. Implementation-wise, initially I was thinking that it would be easiest to always have a scheduler, and in the "non-scheduled" case to just set it to a linear one that didn't change the value over the course of the train. But in the end it turned out to be easier to use the flag as the switch to tell the training loop which "mode" it was in.
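The save/restore round trip described above can be sketched as runnable dummy code -- the single-file checkpoint layout here is hypothetical (the real code writes several checkpoint files), but the create-then-load ordering matches the docs warning:

```python
import os
import tempfile

import torch


def build_optimiser_and_scheduler():
    p = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.AdamW([p], lr=1.4e-3)
    warmup = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1 / 1600, end_factor=1.0, total_iters=1600
    )
    decay = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=32_000, eta_min=1.4e-4
    )
    sched = torch.optim.lr_scheduler.SequentialLR(
        opt, schedulers=[warmup, decay], milestones=[1600]
    )
    return opt, sched


opt, sched = build_optimiser_and_scheduler()
for _ in range(20_000):  # pretend we're 20,000 steps into the train
    opt.step()
    sched.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"optimizer": opt.state_dict(), "scheduler": sched.state_dict()}, path)

# On restart: create everything first (which clobbers the current lr!)...
opt2, sched2 = build_optimiser_and_scheduler()
# ...and only then load the saved state back in.
ckpt = torch.load(path)
opt2.load_state_dict(ckpt["optimizer"])
sched2.load_state_dict(ckpt["scheduler"])

assert opt2.param_groups[0]["lr"] == opt.param_groups[0]["lr"]
```

Saving the SequentialLR's state dict captures the component schedulers' state too, so after loading, the restored train is mid-way down the cosine decay rather than back at the start of a fresh warmup.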
The placement of the code to create the schedulers was also a little tricky; the "natural" place was just after the optimiser is created, like it is in the example code above. However, at that point, we don't know how many global steps we're going to have in the train, because we don't have the dataset -- which means that working out the numbers to pass in to the schedulers for the warmup and decay steps would be impossible. It turned out to be easiest to put it in the function that runs the training, just after the datasets are loaded, as at that point we have all of the information we need. Anyway, that's the code done, so let's see what happens! I wanted to do two trains; one with the learning rate scheduling, and one with no scheduling, just the new value of 1.4×10⁻³ for the learning rate instead of the old 4×10⁻⁴. I was expecting the updated learning rate alone to be too high and to cause a very choppy train, but had high hopes for the train with the scheduling. Here's how it did; the scheduled learning rate train first: Here's what the training loss looked like over that: Quite a few loss spikes early on in the train when the learning rate is at its peak, but nothing unmanageable -- and, as you'd expect, things calmed down quite a lot later on. I also charted the learning rate, to make sure it really was doing what I thought it was doing: So, a pretty smooth train, and we definitely did the right learning rate scheduling. Time to upload it to Hugging Face, and see what the evals look like. Firstly, the smoke test: Reasonably coherent, at least, though it's not super-impressive. On to the loss on our test set: That's our best loss so far! Let's put it into the table: So, it definitely looked like it was worth it. But was it the scheduling of the learning rate that helped, or just the change from 0.0004 to 0.0014? I kicked off a second run with no scheduling, just a learning rate of 0.0014, to see what would happen. After about an hour, I noticed that the loss chart had stopped updating.
The last point had a maximum and minimum loss but no average -- but after that, nothing: However, the learning rate was still being charted, so the train was definitely running: Looking at the checkpoint metadata showed what had happened. At global step 1851, we had this:³ ...and at the next checkpoint at step 2468, we had this: ...and the same for all checkpoints thereafter. Clearly the parameters had gone off the rails -- exactly what we'd expect with an excessive learning rate: There was no point in continuing the train, as it was pretty much certainly unrecoverable, so I stopped it. Out of interest, I downloaded the model, but I couldn't even run the smoke test on it: So it was pretty clear that just updating the learning rate to 0.0014 was actively harmful. No need to upload that one to HF! And time to wrap up this experiment. While this has been quite a long post, I've really only scratched the surface of how learning rates are set. If I were doing things in more detail, the best approach would probably be to do a "sweep" over multiple values to try to at least approximate the best possible rate for this model. That would be pretty expensive for me, though, so I decided to stick with the DeepSeek number. It might not be ideal for the specific architecture that I'm using, given how different that is to theirs, but given the results, it's a decent one compared to what I was using.⁴ Something that I found interesting is that exactly how to schedule your learning rate is still an area being actively researched. Even in my relatively minimal research, I came across three alternatives to the mainstream warmup-cosine decay pattern:

- Per the Hugging Face post, some people do a warmup, then pause at a set level for a while, then start the cosine decay (warmup-stable-decay).
- DeepSeek use a relatively simple stepped function after a warmup.⁵
- A 2025 paper, "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs", says that a linear decay (after a warmup) outperforms cosine.

I'm sure there are many more. But for this train, I decided to stick to the mainstream, and the results were pretty good! To reiterate, this has been the most positive intervention so far: So I'll stick with that, and move on to the next thing: what is the parameter that we're passing in to the AdamW optimiser?
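As an aside on that blow-up: the training loop in this series doesn't do this, but a cheap guard against burning GPU-hours on a doomed run is to check each step's loss for non-finite values and abort immediately. A hypothetical sketch (the function name is my own invention):

```python
import math


def assert_loss_finite(loss_value: float, step: int) -> None:
    """Stop the train as soon as the loss goes to NaN or infinity --
    once the parameters are NaN, the run is essentially unrecoverable."""
    if not math.isfinite(loss_value):
        raise RuntimeError(
            f"Loss became {loss_value} at global step {step}; stopping the train."
        )


assert_loss_finite(2.31, step=1850)  # a healthy loss passes silently
try:
    assert_loss_finite(float("nan"), step=1851)
except RuntimeError as e:
    print(e)  # Loss became nan at global step 1851; stopping the train.
```

With PyTorch tensors you'd use torch.isfinite on the loss instead; either way, failing fast at step 1851 would have saved the rest of that hour.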
Tune in next time :-)

1. Yes, I am foreshadowing here. ↩
2. To make my earlier analogy about learning rate decaying over time in people as they age even more dubious, we can imagine this as being rather like someone middle-aged going on an ayahuasca retreat ;-) ↩
3. If you're wondering how we had a valid maximum and minimum in that first checkpoint when the average was NaN, here's why: ↩
4. You might wonder how large labs work out the right learning rate given that their training runs run to millions of dollars. The answer is there in that DeepSeek paper, as that's one of the things they were doing. They scaled their model down from the billions of parameters that they wanted to train to various smaller models, and worked out the optimal learning rate for each of the smaller models by doing full trains on them. Once they had a mapping from model size to the ideal learning rate for their architecture, they could extrapolate that to the large ones that they wanted to train. The problem is that those "smaller" models are actually quite a lot larger than the one we're training here! And while we could potentially scale it down even further, I suspect that such truly tiny models (say, 1M parameters) wouldn't train well enough to give any meaningful results. ↩
5. From the paper: "Specifically, the learning rate of the model reaches its maximum value after 2000 warmup steps, and then decreases to 31.6% of the maximum value after processing 80% of the training tokens. It further reduces to 10% of the maximum value after 90% of the tokens." ↩

iDiallo 4 days ago

I'm Not Lying, I'm Hallucinating

Andrej Karpathy has a gift for coining terms that quickly go mainstream. When I heard "vibe coding," it just made sense. It perfectly captured the experience of programming without really engaging with the code. You just vibe until the application does what you want. Then there's "hallucination." He didn't exactly invent it; the term has existed since the 1970s. In one early instance, it was used to describe a text summarization program's failure to accurately summarize its source material. But Karpathy's revival of the term brought it back into the mainstream, and subtly shifted its meaning from "prediction error" to something closer to a dream or a vision. Now, large language models don't throw errors. They hallucinate. When they invent facts or bend the truth, they're not lying. They're hallucinating. And with every new model that comes out and promises to stay off drugs, it still hallucinates. An LLM can do no wrong when all its failures are framed as a neurological disorder. For my part, I hope there's a real effort to teach these models to simply say "I don't know." But in the meantime, I'll adopt the term for myself. If you ever suspect I'm lying, or catch me red-handed, just know that it's not my fault. I'm just hallucinating.

Harper Reed 4 days ago

Note #727

just posted about dot files and attractors over at 2389.ai 2389.ai/posts/the… Thank you for using RSS. I appreciate you. Email me

Chris Coyier 4 days ago

Claude is an Electron App

Juicy intro from Nikita Prokopov: In “Why is Claude an Electron App?”, Drew Breunig wonders: Claude spent $20k on an agent swarm implementing (kinda) a C-compiler in Rust, but desktop Claude is an Electron app. If code is free, why aren’t all apps native? And then argues that the answer is that LLMs are not good enough yet. They can do 90% of the work, so there’s still a substantial amount of manual polish, and thus, increased costs. But I think that’s not the real reason. The real reason is: native has nothing to offer.

Martin Fowler 4 days ago

Fragments: March 10

Tech firm fined $1.1m by California for selling high-school students’ data. I agree with Brian Marick’s response: No such story should be published without a comparison of the fine to the company’s previous year revenue and profits, or valuation of last funding round. (I could only find a valuation of $11.0M in 2017.) We desperately need corporations’ attitudes to shift from “lawbreaking is a low-risk cost of doing business; we get a net profit anyway” to “this could be a death sentence.” ❄                ❄                ❄                ❄                ❄ Charity Majors gave the closing keynote at SRECon last year, encouraging people to engage with generative AI. If I was giving the keynote at SRECon 2026, I would ditch the begrudging stance. I would start by acknowledging that AI is radically changing the way we build software. It’s here, it’s happening, and it is coming for us all. Her agenda this year would be to tell everyone that they mustn’t wait for the wave to crash on them, but to swim out to meet it. In particular, I appreciated her call to resist our confirmation bias: The best advice I can give anyone is: know your nature, and lean against it. ❄                ❄                ❄                ❄                ❄ In a LinkedIn comment on Kief Morris’s recent article on Humans and Agents in Software Loops, Renaud Wilsius may have coined another bit of terminology for the agent+programmer age: This completes the story of productivity, but it opens a new chapter on talent: The Apprentice Gap. If we move humans ‘on the loop’ too early in their careers, we risk a future where no one understands the ‘How’ deeply enough to build a robust harness. To manage the flywheel effectively, you still need the intuition that comes from having once been ‘in the loop.’ The next great challenge for CTOs isn’t just Harness Engineering, it’s ‘Experience Engineering’ for our junior developers in an agentic world.
❄                ❄                ❄                ❄                ❄ In hearing conversations about “the ralph loop”, I often hear it in the sense of just letting the agents loose to run on their own. So it’s interesting to read the originator of the ralph loop point out: It’s important to watch the loop as that is where your personal development and learning will come from. When you see a failure domain – put on your engineering hat and resolve the problem so it never happens again. In practice this means doing the loop manually via prompting or via automation with a pause that involves having to press CTRL+C to progress onto the next task. This is still ralphing as ralph is about getting the most out of how the underlying models work through context engineering and that pattern is GENERIC and can be used for ALL TASKS. At the Thoughtworks Future of Software Development Retreat we were very concerned about cognitive debt. Watching the loop during ralphing is a way to learn about what the agent is building, so that it can be directed effectively in the future. ❄                ❄                ❄                ❄                ❄ Anthropic recently published a page on how AI helps break the cost barrier to COBOL modernization. Using AI to help migrate COBOL systems isn’t a new idea to my colleagues, who shared their experiences using AI for this task over a year ago. While Anthropic’s article is correct about the value of AI, there’s more to the process than throwing some COBOL at an LLM. The assumption that AI can simply translate COBOL into Java treats modernization as a syntactic exercise, as though a system is nothing more than its source code. That premise is flawed. A direct translation would, in the best case scenario, faithfully reproduce existing architectural constraints, accumulated technical debt and outdated design decisions. It wouldn’t address weaknesses; it would restate them in a different language.
In practice, modernization is rarely about preserving the past in a new syntax. It’s about aligning systems with current market demands, infrastructure paradigms, software supply chains and operating models. Even if AI were eventually capable of highly reliable code translation, blind conversion would risk recreating the same system with the same limitations, in another language, without a deliberate strategy for replacing or retiring its legacy ecosystem. ❄                ❄                ❄                ❄                ❄ Anders Hoff (inconvergent): an LLM is a compiler in the same way that a slot machine is an ATM ❄                ❄                ❄                ❄                ❄ One of the more interesting aspects of the network of people around Jeffrey Epstein is how many people from academia were connected. It’s understandable why: he had a lot of money to offer, and most academics are always looking for funding for their work. Most of the attention on Epstein’s network focused on those that got involved with him, but I’m interested in those who kept their distance and why, so I enjoyed Jeffrey Mervis’s article in Science: Many of the scientists Epstein courted were already well-established and well-funded. So why didn’t they all just say no? Science talked with three who did just that. Here’s how Epstein approached them, and why they refused to have anything to do with him. I believe that keeping away from bad people makes life much more pleasant, if nothing else it reduces a lot of stress. So it’s good to understand how people make decisions on who to avoid. If you are a reflexive naysayer or a pessimist, know that, and force yourself to find a way in to wonder, surprise and delight. If you are an optimist who gets very excited and tends to assume that everything will improve: know that, and force yourself to mind real cautionary tales.

Ankur Sethi 4 days ago

I built a programming language using Claude Code

Over the course of four weeks in January and February, I built a new programming language using Claude Code. I named it Cutlet after my cat. It’s completely legal to do that. You can find the source code on GitHub , along with build instructions and example programs . I’ve been using LLM-assisted programming since the original GitHub Copilot release in 2021, but so far I’ve limited my use of LLMs to generating boilerplate and making specific, targeted changes to my projects. While working on Cutlet, though, I allowed Claude to generate every single line of code. I didn’t even read any of the code. Instead, I built guardrails to make sure it worked correctly (more on that later). I’m surprised by the results of this experiment. Cutlet exists today. It builds and runs on both macOS and Linux. It can execute real programs. There might be bugs hiding deep in its internals, but they’re probably no worse than ones you’d find in any other four-week-old programming language in the world. I have Feelings™ about all of this and what it means for my profession, but I want to give you a tour of the language before I get up on my soapbox. If you want to follow along, build the Cutlet interpreter from source and drop into a REPL using . Arrays and strings work as you’d expect in any dynamic language. Variables are declared with the keyword. Variable names can include dashes. Same syntax rules as Raku . The only type of number (so far) is a double. Here’s something cool: the meta-operator turns any regular binary operator into a vectorized operation over an array. In the next line, we’re multiplying every element of by 1.8, then adding 32 to each element of the resulting array. The operator is a zip operation. It zips two arrays into a map. Output text using the built-in function. This function returns , which is Cutlet’s version of . The meta operator also works with comparisons. Here’s another cool bit: you can index into an array using an array of booleans. 
This is a filter operation. It picks the elements at indexes corresponding to true and discards those that correspond to false. Here’s a shorter way of writing that. Let’s print this out with a user-friendly message. The operator concatenates strings and arrays. The built-in turns things into strings. The meta-operator in the prefix position acts as a reduce operation. Let’s find the average temperature. adds all the temperatures, and the built-in finds the length of the array. Let’s print this out nicely, too. Functions are declared with . Everything in Cutlet is an expression, including functions and conditionals. The last value produced by an expression in a function becomes its return value. Your own functions can work with too. Let’s reduce the temperatures with our function to find the hottest temperature. Cutlet can do a lot more. It has all the usual features you’d expect from a dynamic language: loops, objects, prototypal inheritance, mixins, a mark-and-sweep garbage collector, and a friendly REPL. We don’t have file I/O yet, and some fundamental constructs like error handling are still missing, but we’re getting there! See TUTORIAL.md in the git repository for the full documentation. I’m a frontend engineer and (occasional) designer. I’ve tried using LLMs for building web applications, but I’ve always run into limitations. In my experience, Claude and friends are scary good at writing complex business logic, but fare poorly on any task that requires visual design skills. Turns out describing responsive layouts and animations in English is not easy. No amount of screenshots and wireframes can communicate fluid layouts and animations to an LLM. I’ve wasted hours fighting with Claude about layout issues it swore it had fixed, but which I could still see plainly with my leaky human eyes. I’ve also found these tools to excel at producing cookie-cutter interfaces they’ve seen before in publicly available repositories, but they fall off when I want to do anything novel.
I often work with clients building complex data visualizations for niche domains, and LLMs have comprehensively failed to produce useful outputs on these projects. On the other hand, I’d seen people accomplish incredible things using LLMs in the last few months, and I wanted to replicate those experiments myself. But my previous experience with LLMs suggested that I had to pick my project carefully. A small, dynamic programming language met all my requirements. Finally, this was also an experiment to figure out how far I could push agentic engineering. Could I compress six months of work into a few weeks? Could I build something that was beyond my own ability to build? What would my day-to-day work life look like if I went all-in on LLM-driven programming? I wanted to answer all these questions. I went into this experiment with some skepticism. My previous attempts at building something entirely using Claude Code hadn’t worked out. But this attempt has not only been successful, but produced results beyond what I’d imagined possible. I don’t hold the belief that all software in the future will be written by LLMs. But I do believe there is a large subset that can be partially or mostly outsourced to these new tools. Building Cutlet taught me something important: using LLMs to produce code does not mean you forget everything you’ve learned about building software. Agentic engineering requires careful planning, skill, craftsmanship, and discipline, just like any software worth building before generative AI. The skills required to work with coding agents might look different from typing code line-by-line into an editor, but they’re still very much the same engineering skills we’ve been sharpening all our careers. There is a lot of work involved in getting good output from LLMs. Agentic engineering does not mean dumping vague instructions into a chat box and harvesting the code that comes out. 
I believe there are four main skills you have to learn today in order to work effectively with coding agents: Models and harnesses are changing rapidly, so figuring out which problems LLMs are good at solving requires developing your intuition, talking to your peers, and keeping your ear to the ground. However, if you don’t want to stay up-to-date with a rapidly-changing field—and I wouldn’t judge you for it, it’s crazy out there—here are two questions you can ask yourself to figure out if your problem is LLM-shaped: If the answer to either of those questions is “no”, throwing AI at the problem is unlikely to yield good results. If the answer to both of them is “yes”, then you might find success with agentic engineering. The good news is that the cost of figuring this out is the price of a Claude Code subscription and one sacrificial lamb on your team willing to spend a month trying it out on your codebase. LLMs work with natural language, so learning to communicate your ideas using words has become crucial. If you can’t explain your ideas in writing to your co-workers, you can’t work effectively with coding agents. You can get a lot out of Claude Code using simple, vague, overly general prompts. But when you do that, you’re outsourcing a lot of your thinking and decision-making to the robot. This is fine for throwaway projects, but you probably want to be more careful when you’re building something you will put into production and maintain for years. You want to feed coding agents precisely written specifications that capture as much of your problem space as possible. While working on Cutlet, I spent most of my time writing, generating, reading, and correcting spec documents . For me, this was a new experience. I primarily work with early-stage startups, so for most of my career, I’ve treated my code as the spec. Writing formal specifications was an alien experience. Thankfully, I could rely on Claude to help me write most of these specifications. 
I was only comfortable doing this because Cutlet was an experiment. On a project I wanted to stake my reputation on, I might take the agent out of the equation altogether and write the specs myself. This was my general workflow while making any change to Cutlet: This workflow front-loaded the cognitive effort of making any change to the language. All the thinking happened before a single line of code was written, which is something I almost never do. For me, programming involves organically discovering the shape of a problem as I’m working on it. However, I’ve found that working that way with LLMs is difficult. They’re great at making sweeping changes to your codebase, but terrible at quick, iterative, organic development workflows. Maybe my workflow will evolve as inference gets faster and models become better, but until then, this waterfall-style model works best. I find this to be the most interesting and fun part of working with coding agents. It’s a whole new class of problem to solve! The core principle is this: coding agents are computer programs, and therefore have a limited view of the world they exist in. Their only window into the problem you’re trying to solve is the directory of code they can access. This doesn’t give them enough agency or information to be able to do a good job. So, to help them thrive, you must give them that agency and information in the form of tools they can use to reach out into the wider world. What does this mean in practice? It looks different for different projects, but this is what I did for Cutlet: All these tools and abilities guaranteed that any updates to the code resulted in a project that at least compiled and executed. But more importantly, they increased the information and agency Claude had access to, making it more effective at discovering and debugging problems without my intervention. 
If I keep working on this project, my main focus will be to give my agents even more insight into the artifact they are building, even more debugging tools, even more freedom, and even more access to useful information. You will want to come up with your own tooling that works for your specific project. If you’re building a Django app, you might want to give the agent access to a staging database. If you’re building a React app, you might want to give it access to a headless browser. There’s no single answer that works for every project, and I bet people are going to come up with some very interesting tools that allow LLMs to observe the results of their work in the real world. Coding agents can sometimes be inefficient in how they use the tools you give them. For example, while working on this project, sometimes Claude would run a command, decide its output was too long to fit into the context window, and run it again with the output piped to . Other times it would run , forget to the output for errors, and run it a second time to capture the output. This would result in the same expensive checks running multiple times in the course of making a single edit. These mistakes slowed down the agentic loop significantly. I could fix some of these performance bottlenecks by editing or changing the output of a custom script. But there were some issues that required more effort to discover and fix. I quickly got into the habit of observing the agent at work, noticing sequences of commands that the agent repeated over and over again, and turning them into scripts for the agent to call instead. Many of the scripts in Cutlet’s directory came about this way. This was very manual, very not-fun work. I’m hoping this becomes more automated as time goes on. Maybe a future version of Claude Code could review its own tool calling outputs and suggest scripts you could write for it? Of course, the most fruitful optimization was to run Claude inside Docker with and access. 
By doing this, I took myself out of the agentic loop. After a plan file had been produced, I didn’t want to hang around babysitting agents and saying every time they wanted to run . As Cutlet evolved, the infrastructure I built for Claude also evolved. Eventually, I captured many of the workflows Claude naturally followed as scripts, slash commands, or instructions in . I also learned where the agent stumbled most, and preempted those mistakes by giving it better instructions or scripts to run. The infrastructure I built for Claude was also valuable for me, the human working on the project. The same scripts that helped Claude automate its work also helped me accomplish common tasks quickly. As the project grows, this infrastructure will keep evolving along with it. Models change all the time. So do project requirements and workflows. I look at all this project infrastructure as an organic thing that will keep changing as long as the project is active. Now that it’s possible for individual developers to accomplish so much in so little time, is software engineering as a career dead? My answer to this question is nope, not at all. Software engineering skills are just as valuable today as they were before language models got good. If I hadn’t taken a compilers course in college and worked through Crafting Interpreters, I wouldn’t have been able to build Cutlet. I still had to make technical decisions that I could only make because I had (some) domain knowledge and experience. Besides, I had to learn a bunch of new skills in order to effectively work on Cutlet. These new skills also required technical knowledge. A strange and new and different kind of technical knowledge, but technical knowledge nonetheless. Before working on this project, I was worried about whether I’d have a job five years from now. But today I’m convinced that the world will continue to have a need for software engineers in the future.
Our jobs will transform—and some people might not enjoy the new jobs anymore—but there will still be plenty of work for us to do. Maybe we’ll have even more work to do than before, since LLMs allow us to build a lot more software a lot faster. And for those of us who never want to touch LLMs, there will be domains where LLMs never make any inroads. My friends who work on low-level multimedia systems have found less success using LLMs compared to those who build webapps. This is likely to be the case for many years to come. Eventually, those jobs will transform, too, but it will be a far slower shift. Is it fair to say that I built Cutlet? After all, Claude did most of the work. What was my contribution here besides writing the prompts? Moreover, this experiment only worked because Claude had access to multiple language runtimes and computer science books in its training data. Without the work done by hundreds of programmers, academics, and writers who have freely donated their work to the public, this project wouldn’t have been possible. So who really built Cutlet? I don’t have a good answer to that. I’m comfortable taking credit for the care and feeding of the coding agent as it went about generating tokens, but I don’t feel a sense of ownership over the code itself. I don’t consider this “my” work. It doesn’t feel right. Maybe my feelings will change in the future, but I don’t quite see how. Because of my reservations about who this code really belongs to, I haven’t added a license to Cutlet’s GitHub repository. Cutlet belongs to the collective consciousness of every programming language designer, implementer, and educator to have released their work on the internet. (Also, it’s worth noting that Cutlet almost certainly includes code from the Lua and Python interpreters. It referred to those languages all the time when we talked about language features. 
I’ve also seen a ton of code from Crafting Interpreters making its way into the codebase with my own two fleshy eyes.) I’d be remiss if I didn’t include a note on mental health in this already mammoth blog post. It’s easy to get addicted to agentic engineering tools. While working on this project, I often found myself at my computer at midnight going “just one more prompt”, as if I were playing the world’s most obscure game of Civilization. I’m embarrassed to admit that I often had Claude Code churning away in the background when guests were over at my place, when I stepped into the shower, or when I went off to lunch. There’s a heady feeling that comes from accomplishing so much in so little time. More addictive than that is the unpredictability and randomness inherent to these tools. If you throw a problem at Claude, you can never tell what it will come up with. It could one-shot a difficult problem you’ve been stuck on for weeks, or it could make a huge mess. Just like a slot machine, you can never tell what might happen. That creates a strong urge to try using it for everything all the time. And just like with slot machines, the house always wins. These days, I set limits for how long and how often I’m allowed to use Claude. As LLMs become widely available, we as a society will have to figure out the best way to use them without destroying our mental health. This is the part I’m not very optimistic about. We have comprehensively failed to regulate or limit our use of social media, and I’m willing to bet we’ll have a repeat of that scenario with LLMs. Now that we can produce large volumes of code very quickly, what can we do that we couldn’t do before? This is another question I’m not equipped to answer fully at the moment. That said, one area where I can see LLMs being immediately of use to me personally is the ability to experiment very quickly.
It’s very easy for me to try out ten different features in Cutlet because I just have to spec them out and walk away from the computer. Failed experiments cost almost nothing. Even if I can’t use the code Claude generates, having working prototypes helps me validate ideas quickly and discard bad ones early. I’ve also been able to radically reduce my dependency on third-party libraries in my JavaScript and Python projects. I often use LLMs to generate small utility functions that previously required pulling in dependencies from NPM or PyPI. But honestly, these changes are small beans. I can’t predict the larger societal changes that will come about because of AI agents. All I can say is programming will look radically different in 2030 than it does in 2026. This project was a proof of concept to see how far I could push Claude Code. I’m currently looking for a new contract as a frontend engineer, so I probably won’t have the time to keep working on Cutlet. I also have a few more ideas for pushing agentic programming further, so I’m likely to prioritize those over continuing work on Cutlet. When the mood strikes me, I might still add small features now and then to the language. Now that I’ve removed myself from the development loop, it doesn’t take a lot of time and effort. I might even do Advent of Code using Cutlet in December! Of course, if you work at Anthropic and want to give me money so I can keep running this experiment, I’m available for contract work for the next 8 months :) For now, I’m closing the book on Cutlet and moving on to other projects (and cat). Thanks to Shruti Sunderraman for proofreading this post. Also thanks to Cutlet the cat for walking across the keyboard and deleting all my work three times today. I didn’t want to solve a particularly novel problem, but I wanted the ability to sometimes steer the LLM into interesting directions. I didn’t want to manually verify LLM-generated code. 
I wanted to give the LLM specifications, test cases, documentation, and sample outputs, and make it do all the difficult work of figuring out if it was doing the right thing. I wanted to give the agent a strong feedback loop so it could run autonomously. I don’t like MCPs. I didn’t want to deal with them. So anything that required connecting to a browser, taking screenshots, or talking to an API over the network was automatically disqualified. I wanted to use a boring language with as few external dependencies as possible. LLMs know how to build language implementations because their training data contains thousands of existing implementations, papers, and CS books. I was intrigued by the idea of creating a “remix” language by picking and choosing features I enjoy from various existing languages. I could write a bunch of small deterministic programs along with their expected outputs to test the implementation. I could even get Claude to write them for me, giving me a potentially infinite number of test cases to verify that the language was working correctly. Language implementations can be tested from the command line, with purely textual inputs and outputs. No need to take screenshots or videos or set up fragile MCPs. There’s no better feedback loop for an agent than “run and until there are no more errors”. C is as boring as it gets, and there are a large number of language implementations built in C. Understanding which problems can be solved effectively using LLMs, which ones need a human in the loop, and which ones should be handled entirely by humans. Communicating your intent clearly and defining criteria for success. Creating an environment in which the LLM can do its best work. Monitoring and optimizing the agentic loop so the agent can work efficiently. For the problem you want to solve, is it possible to define and verify success criteria in an automated fashion? Have other people solved this problem—or a similar one—before? 
In other words, is your problem likely to be in the training data for an LLM? First, I’d present the LLM with a new feature (e.g. loops) or refactor (e.g. moving from a tree-walking interpreter to a bytecode VM). Then I’d have a conversation with it about how the change would work in the context of Cutlet, how other languages implemented it, design considerations, ideas we could steal from interesting/niche languages, etc. Just a casual back-and-forth, the same way you might talk to a co-worker. After I had a good handle on what the feature or change would look like, I’d ask the LLM to give me an implementation plan broken down into small steps. I’d review the plan and go back and forth with the LLM to refine it. We’d explore various corner cases, footguns, gotchas, missing pieces, and improvements. When I was happy with the plan, I’d ask the LLM to write it out to a file that would go into a directory. Sometimes we’d end up with 3-4 plan files for a single feature. This was intentional. I needed the plans to be human-readable, and I needed each plan to be an atomic unit I could roll back if things didn’t work out. They also served as a history of the project’s evolution. You can find all the historical plan files in the Cutlet repository. I’d read and review the generated plan file, go back and forth again with the LLM to make changes to it, and commit it when everything looked good. Finally, I’d fire up a Docker container, run Claude with all permissions—including access—and ask it to implement my plan. Comprehensive test suite. My project instructions told Claude to write tests and make sure they failed before writing any new code. Alongside, I asked it to run tests after making significant code changes or merging any branches. Armed with a constantly growing test suite, Claude was able to quickly identify and fix any regressions it introduced into the codebase. The tests also served as documentation and specification. Sample inputs and outputs.
These were my integration tests. I added a number of example programs to the Cutlet repository—most of them written by Claude itself—that not only serve as documentation for humans, but also as an end-to-end test suite. The project instructions told Claude to run all of them and verify their output after every code change. Linters, formatters, and static analysis tools. Cutlet uses and to ensure a baseline of code quality. Just like with tests, the project instructions asked the LLM to run these tools after every major code change. I noticed that would often produce diagnostics that forced Claude to rewrite parts of the code. If I had access to some of the more expensive static analysis tools (such as Coverity), I would have added them to my development process too. Memory safety tools. I asked Claude to create a target that rebuilt the entire project and test suite with ASan and UBSan enabled (with LSan riding along via ASan), then ran every test under the instrumented build. The project instructions included running this check at the end of implementing a plan. This caught memory errors—use-after-free, buffer overflows, undefined behavior—that neither the tests nor the linter could find. Running these tests took time and greatly slowed down the agent, but they caught even more issues than . Symbol indexes. The agent had access to and for navigating the source code. I don’t know how useful this was, because I rarely ever saw it use them. Most of the time it would just the code for symbols. I might remove this in the future. Runtime introspection tools. Early in the project, I asked Claude to give Cutlet the ability to dump the token stream, AST, and bytecode for any piece of code to the standard output before executing it. This allowed the agent to quickly figure out if it had introduced errors into any part of the execution pipeline without having to navigate the source code or drop into a debugger. Pipeline tracing.
I asked Claude to write a Python script that fed a Cutlet program through the interpreter with debug flags to capture the full compilation pipeline: the token stream, the AST, and the bytecode disassembly. It then mapped each token type, AST node, and opcode back to the exact source locations in the parser, compiler, and VM where they were handled. When an agent needed to add a new language feature, it could run the tracer on an example of a similar existing feature to see precisely which files and functions to touch. I was very proud of this machinery, but I never saw Claude make much use of it either. Running with every possible permission. I wanted the agent to work autonomously and have access to every debugging tool it might want to use. To do this, I ran it inside a Docker container with enabled and full access. I believe this is the only practical way to use coding agents on large projects. Answering permissions prompts is cognitively taxing when you have five agents working in parallel, and restricting their ability to do whatever they want makes them less effective at their job. We will need to figure out all sorts of safety issues that arise when you give LLMs the ability to take full control of a system, but on this project, I was willing to accept the risks that come with YOLO mode.
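To make the sample-inputs-and-outputs idea above concrete, here is a minimal sketch of a golden-file runner. The `.ct`/`.out` naming convention is my own hypothetical for illustration, not necessarily what Cutlet's repository uses:

```python
import pathlib
import subprocess


def run_examples(interpreter, examples_dir="examples"):
    """Run every example program through the interpreter and diff its
    stdout against the checked-in .out golden file sitting next to it.
    Returns the list of failing example names (empty means all green)."""
    failures = []
    for src in sorted(pathlib.Path(examples_dir).glob("*.ct")):
        expected = src.with_suffix(".out").read_text()
        proc = subprocess.run(interpreter + [str(src)],
                              capture_output=True, text=True)
        if proc.returncode != 0 or proc.stdout != expected:
            failures.append(src.name)
    return failures
```

An agent can loop on this until the list comes back empty, which is exactly the kind of unambiguous, purely textual feedback loop that makes language implementations such a good fit for agentic work.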
