GreatReads - Blog Aggregator · Phoenix Framework

Data locality (sometimes) beats algorithmic complexity

I've been ECS-curious ever since I learned about it in the Bevy game engine documentation. The ECS architecture predictably improves performance in languages that give you low-level control over memory (C, C++, Rust, Zig, and friends). But how does it fare when used in high-level, dynamic, garbage-collected languages such as JavaScript?

This is the question Dan Murphy set out to answer in The Physics of Memory:

Is it possible to use an ECS-style architecture in Javascript? And for applicable operations, does that actually do better than objects + V8’s garbage collection?

To answer the question, Murphy built a 2D physics simulation of 15,000 balls bouncing around in a box using several different techniques. He found that a JavaScript implementation of the simulation that used ECS outperformed the usual "giant graph of objects" OOP implementation by 24x. He writes:

- Cache Locality > Algorithmic Complexity: At 15,000 entities, pointer-chasing and unpredictable tree branching cannot compete with the contiguous L1/L2 cache locality of a flat 1D array sort—even though trees have a better theoretical Big-O complexity.
- You Don’t Need WASM for ECS Wins: Simply switching your JavaScript codebase to a flat Structure of Arrays (SoA) layout yields up to a 24x speedup over OOP. WASM is the cherry on top (another 2.5x), not the entry ticket.
- Pragmatism Wins: While a hand-tuned SoA is the absolute fastest, using a production ECS library like still gives you a massive 14x speedup over OOP while providing a clean, scalable API. IMO, for 99% of applications using a library is the correct engineering choice.

It's also worth noting how the usual OOP implementation creates GC pressure:

In OOP, entities are scattered across the heap. As they move and interact, the JavaScript engine’s garbage collector is constantly triggered, and the CPU frequently stalls waiting for pointer lookups. This causes sporadic frame drops (micro-stutter). Because ECS uses pre-allocated, flat TypedArrays, memory access is 100% predictable and GC overhead is zero, guaranteeing perfectly smooth frame delivery.

My favorite thing about Murphy's post is that you can run all his benchmarks in your own browser. I love it when technical explanations or benchmarks are accompanied by embedded "apps" you can play around with.

I'm surprised at how much data locality matters for performance. An algorithm with worse big-O complexity can outperform one with better complexity if it makes good use of the CPU's L1/L2 caches. Very cool.
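To make the "flat Structure of Arrays" idea concrete, here is a minimal TypeScript sketch of the two layouts side by side. This is not Murphy's benchmark code: the names `Ball`, `BallStore`, `createStore`, `stepObjects`, and `stepSoA` are invented for illustration, and the sketch makes no performance claims of its own. It uses `Float64Array` so both versions compute identical values; a real implementation might well pick `Float32Array` instead.

```typescript
// Array-of-objects layout: every ball is a separate heap object.
interface Ball {
  x: number;
  y: number;
  vx: number;
  vy: number;
}

function stepObjects(balls: Ball[], dt: number, size: number): void {
  for (const b of balls) {
    b.x += b.vx * dt;
    b.y += b.vy * dt;
    // Reverse direction when a ball leaves the size x size box.
    if (b.x < 0 || b.x > size) b.vx = -b.vx;
    if (b.y < 0 || b.y > size) b.vy = -b.vy;
  }
}

// Structure-of-arrays layout: one pre-allocated typed array per field.
// An "entity" is just an index i shared by all four arrays.
interface BallStore {
  count: number;
  x: Float64Array;
  y: Float64Array;
  vx: Float64Array;
  vy: Float64Array;
}

function createStore(count: number): BallStore {
  return {
    count,
    x: new Float64Array(count),
    y: new Float64Array(count),
    vx: new Float64Array(count),
    vy: new Float64Array(count),
  };
}

function stepSoA(s: BallStore, dt: number, size: number): void {
  const { count, x, y, vx, vy } = s;
  // Each array is contiguous in memory, so this loop walks memory linearly
  // and allocates nothing per frame.
  for (let i = 0; i < count; i++) {
    x[i] += vx[i] * dt;
    y[i] += vy[i] * dt;
    if (x[i] < 0 || x[i] > size) vx[i] = -vx[i];
    if (y[i] < 0 || y[i] > size) vy[i] = -vy[i];
  }
}
```

Both functions implement the same update rule; the only difference is where the numbers live. That is the design choice the quoted results are about.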

Programming

Rust

JavaScript

C++

Ankur Sethi 2 weeks ago

Your analytics are lying to you

Alistair Davidson writes about migrating a form-heavy web application from a React SPA to a traditional server-rendered HTML-first website. The entire article is worth reading, but I want to draw attention to this bit about analytics (emphasis mine):

The results? When we launched, the number of people completing the form doubled. The analytics people didn’t even know where these users were coming from. Of course, your javascript-based analytics package doesn’t see the users you are bouncing because of javascript failures. It was a flood! We also saw my “keep a backend session, never lose user data” approach pay off. In one case, someone completed a form a month after starting it.

Web analytics are fragile. They fail in so many ways that making product decisions based wholly on your Google Analytics or Plausible data is folly of the highest degree. Here's a subset of all the reasons your analytics package undercounts or miscounts visitors:

- Network errors prevent your analytics script from loading.
- Ad-blockers and tracking prevention block your script from loading (enabled by default on many browsers today).
- A JavaScript error in an unrelated part of the page prevents the analytics script from working correctly.
- The user loses network connectivity before the analytics script can send data to the server.
- The user gets impatient and bounces off your website before the page can load fully and start collecting data.
- Too much JavaScript on the page causes the browser tab to crash (a common issue on low-end devices).
- The analytics script is blocked by a DNS rule, corporate proxy, firewall, or VPN.
- The user has disabled JavaScript.
- The user's browser has limited or no support for JavaScript (Opera Mini still has more than half a million downloads on Android, and it's still widely used in Africa).
- The user is accessing your content using a service that strips JavaScript (e.g. an RSS reader, a web archiving tool, Telegram Instant View, AMP, a read-later service, or a bookmarking service).
- You only test your app in Chrome, so you don't realize that your website is entirely broken in Firefox and Safari.

Web analytics can only give you an approximation of what your web traffic looks like. Even when they work correctly, they paint an incomplete picture.

As I said in my post about share buttons, the number one referrer for pages on this website is "Direct/none". It's impossible for Plausible to figure out where those users are coming from.

Further, my server logs report three times as much traffic as my Plausible dashboard over a seven day window. Some of this might be bot traffic and thus irrelevant, but I know for a fact that a large chunk of this traffic comes from RSS readers. Plausible will never have insight into these users.

My point is, if you rely on your analytics dashboard to make product decisions, you're excluding a large chunk of potential users who simply don't show up in your graphs. You might be missing out on serving thousands of potential users because you can't see them in your data. These are users who want to sign up for your newsletter, buy your app, subscribe to your service. These are human beings you could help, whose lives you could improve.

I'm not saying that analytics are completely useless. They can and should have a place in your decision-making process. Just don't treat analytics data as gospel, because there will always be massive blind spots in what it tells you.

To get a real understanding of how users experience your products, test them on real devices under real conditions as much as possible. And as always, get out there and talk to your users.
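One rough way to see the gap between server logs and a JavaScript-based dashboard is to count page requests in the access log yourself and split them by kind of client. The TypeScript sketch below is only an illustration under several assumptions: the log is in Combined Log Format, the user-agent patterns are crude guesses rather than a real bot list, and the names `classifyUserAgent` and `countPageRequests` are hypothetical.

```typescript
type TrafficKind = "bot" | "feedReader" | "other";

interface TrafficCounts {
  bot: number;
  feedReader: number;
  other: number;
}

// Captures the path, status, and user agent from the tail of a
// Combined Log Format line: "METHOD path proto" status bytes "referer" "ua"
const LOG_LINE = /"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"$/;

// Assumption: these substrings are a crude heuristic, not an exhaustive list.
function classifyUserAgent(ua: string): TrafficKind {
  if (/bot|crawler|spider/i.test(ua)) return "bot";
  if (/feed|rss|reader/i.test(ua)) return "feedReader";
  return "other";
}

function countPageRequests(lines: string[]): TrafficCounts {
  const counts: TrafficCounts = { bot: 0, feedReader: 0, other: 0 };
  for (const line of lines) {
    const m = LOG_LINE.exec(line);
    if (m === null) continue; // not a line we understand
    const [, path, status, ua] = m;
    if (status !== "200") continue; // only count successful responses
    // Skip static assets so that one page view is not counted many times.
    if (/\.(css|js|png|jpg|svg|ico|woff2?)$/i.test(path)) continue;
    counts[classifyUserAgent(ua)] += 1;
  }
  return counts;
}
```

Comparing `other + feedReader` against the dashboard's page view count for the same period gives a ballpark for how much traffic the analytics script never saw. It is still an approximation: user agents can be spoofed, and caches or CDNs may hide requests from the origin server's logs.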

Data Analysis

HTML

Web Development Analytics

JavaScript

Ankur Sethi 3 weeks ago

Deno Desktop

From the Desktop apps section of the Deno documentation:

turns a Deno project (anything from a single TypeScript file to a Next.js app) into a self-contained desktop application. The output is a redistributable binary that bundles your code, the Deno runtime, and a web rendering engine into one bundle per platform.

I'm happy to see another attempt at solving the biggest issues with Electron apps (other notable attempts being Tauri, Electrobun, and Neutralinojs).

According to the docs, Deno Desktop is only available in Deno's channel at the moment. So I obviously installed it (version ) and tried running a Hello World example app. On first run, Deno spent a few minutes downloading , then packaged the example into an app bundle weighing 308.8MB.

I was curious about that download. A quick Kagi search led me to the homepage for a Rust/C library called laufey, which appears to be the tech underpinning Deno Desktop.

Running the app bundle popped open a window that looked like this:

This is clearly a work in progress. If somebody who works on Deno is reading this, here's a list of bugs I noticed:

- The app window had a dark background by default, even though the demo app didn't contain any styles. Browsers don't default to a dark background unless you explicitly opt in using . Even so, opting into dark mode inverts all the default colors, not just the page background. Something is off here.
- Running the bundle triggered a macOS permissions dialog for and , both of them asking for notification permissions. The demo app didn't use the notifications API (it didn't even contain any JavaScript), so seeing two permission dialogs felt aggressive.
- Hitting didn't quit the app.
- The app always opened on the top left of the screen.

Deno uses Chromium as the default webview (via Chromium Embedded Framework). But you can also use the system webview instead:

When I ran that command, it downloaded and produced a much slimmer app bundle at 68.5MB. This is what the window looked like:

This version of the app exhibited none of the bugs I noticed in the CEF version, except it doesn't have a title.

Deno Desktop also has a backend that skips bundling the webview altogether. I didn't try it, but here's what the docs say:

No web engine. Provides window management, input events, clipboard, and the native API surface, but no webview, no auto-binding, and no proxy. Useful for apps that draw their own UI (WebGPU, Skia, custom rendering) or as a foundation for non-web desktop programs. The backend is selected through the field in ; the flag accepts only and .

A major difference between Deno Desktop and its competition is how it communicates between the code running in the webview and the code running in the Deno runtime:

Bindings are not IPC. The Deno runtime and the rendering backend run as threads / processes inside the same address space (CEF) or coordinated process group (WebView). Calls go through in-process channels, and the backend dispatches them from its run loop. This avoids the cross-process round-trip that socket-based IPC frameworks (Electron's ipcMain / ipcRenderer, Tauri's invoke) impose. Arguments and results are still encoded as they cross the realm boundary, but the transport is in-process: no socket, no cross-process scheduling. In practical terms: bindings are fast enough that you do not need to worry about call frequency for typical app workloads.

The docs are light on how they pull this off. I'd love to read more about this.

There's a built-in auto-update mechanism, including rollbacks if updates fail:

Deno.autoUpdate() polls a release server for new versions, downloads binary-diff patches, applies them to the runtime dylib, and stages the result for the next launch. If the next launch fails, the runtime rolls back to the previous version automatically. Updates ship as small bsdiff patches instead of full binary downloads, with rollback baked into the launcher.

The comparison page has this bullet-point under the section titled "What doesn't have yet":

Shared CEF runtime across apps. Every app currently bundles its own CEF copy. A managed shared runtime would drop binary sizes to a few MB per app. On the roadmap.

Does this mean all Deno apps on my computer could potentially share a single CEF runtime? If yes, that would mean massive disk space savings. But it's unclear if the developers intend to ship this feature in a future release or if it's just a wishlist item that may or may not see the light of day.

Deno Desktop is, of course, heavily under development. Some important features are still missing (platform native file dialogs), and it's not clear if others are on the roadmap or not (mobile support). I'm sure many of the missing features will make their way into the final release, and we'll get a clearer idea of future plans in a release announcement.

I have a personal interest in anything that aims to replace Electron, so I'll be keeping an eye out for Deno 2.9.
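The "stage an update, then roll back if the next launch fails" behavior quoted from the docs can be pictured as a small state machine. The TypeScript sketch below is a generic illustration of that idea only. It is not Deno's implementation and uses no Deno API; `UpdateState`, `stageUpdate`, `beginLaunch`, and `finishLaunch` are hypothetical names, and downloading or applying patches is left out entirely.

```typescript
// Hypothetical bookkeeping a launcher might persist between runs.
interface UpdateState {
  current: string;         // version that will be run
  staged: string | null;   // downloaded version waiting for the next launch
  previous: string | null; // fallback kept while a new version is on trial
}

// Record a freshly prepared version; it is not used until the next launch.
function stageUpdate(s: UpdateState, version: string): UpdateState {
  return { ...s, staged: version };
}

// At startup: promote the staged version and remember what to fall back to.
function beginLaunch(s: UpdateState): UpdateState {
  if (s.staged === null) return s;
  return { current: s.staged, staged: null, previous: s.current };
}

// After startup: keep the new version on success, restore the old one on failure.
function finishLaunch(s: UpdateState, ok: boolean): UpdateState {
  if (ok || s.previous === null) return { ...s, previous: null };
  return { current: s.previous, staged: null, previous: null };
}
```

The point of the sketch is that the old version is kept until the new one has proven it can start, so a broken update costs one failed launch rather than a reinstall.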

Deno

Web Development

Rust

JavaScript

Ankur Sethi 3 weeks ago

What was nice about the UI of Windows 2000

Among tech and design folks of a certain vintage, there is a fondness for the design language used by Windows from Windows 95 to Windows 2000. While you could argue that this is mere nostalgia, I've always thought there is more to it than that.

The Windows 95 UI was famously the result of comprehensive UX studies and careful design. While Microsoft has always been Microsoft (a company defined by a lack of care in everything it puts out into the world), the developers and designers who worked on early versions of Windows really did care about what they were building.

movq writes about what made the Windows 2000 UI great. Things that stood out to me:

- The Start menu (in my opinion, this is the greatest idea in the design of desktop operating systems).
- Icons are colored and have unique shapes, making them easy to identify at a glance.
- All interactive elements share visual similarities.
- The color palette consists of sharply contrasting colors, making it easy to distinguish between different parts of the UI.
- Scrollbars are always visible.
- Options dialogs across the OS have a similar layout, with options grouped into tabs and the same Apply/Ok/Cancel buttons at the bottom of the dialogs.
- UI controls that are part of the same logical group are contained inside clearly demarcated frames.

Incidentally, some of the things that made the Windows 2000 UI great were also the things that made Apple's Platinum and Aqua design languages great. We lost our way somewhere in the 2010s with Apple's "flat design" and Microsoft's Metro, with the rest of the industry following suit soon after.

Today, many of our desktop UIs have no distinction between interactive and non-interactive elements, use color palettes that are just different shades of gray, give all icons the same silhouette by putting them in squircle jail, and make scrollbars nearly inaccessible by autohiding them at every opportunity.

movq ends with:

This trend of slowly removing visual clues continued and, today, you have no idea anymore which elements on the screen might be interactive. I had this conversation a while ago:

Me: "I don't like smartphone UIs. Everything is flat, nothing indicates where you can touch or not. I have to randomly try everything on the screen."

Response by non-tech person: "Well, yeah, of course you have to try everything? How else would this work?"

The entire idea of having clear, consistent visual clues is lost. Nobody but old tech people even expects that anymore. I wish we wouldn't shy away so much from bevels, reliefs, lines, and frames, just because those "scream Windows 95".

Design

Ankur Sethi 1 month ago

Nobody clicks your share buttons

Link: https://derekhanson.blog/nobody-clicks-your-share-buttons/

(Via rendezvous with cassidoo.)

I've always wondered if anyone actually used the social sharing buttons embedded on news sites and (some) WordPress blogs. Derek Hanson digs into the numbers:

The UK government ran one of the most thorough studies on this. When GOV.UK added social sharing buttons, they tracked usage for 10 weeks across 6.8 million pageviews. The share buttons got clicked 14,078 times. That’s a 0.21% usage rate, which works out to about 1 in 476 visitors. The most telling part: the feature sat in their backlog for ages because zero end users had ever requested it. In their user testing, people just copied and pasted links.

Moovweb found the same thing when they analyzed 61 million mobile sessions. Only 0.2% of mobile users interacted with social sharing at all. Visitors were twelve times more likely to click an advertisement. Luke Wroblewski, the interaction designer and author, crowdsourced data from his readers and landed on an average of 0.25% across 18 million pageviews. Different organizations, different audiences, same number.

What do people do instead? They copy and paste URLs or use the share button in their browser. In 2012, Alexis Madrigal at The Atlantic noticed a huge chunk of the magazine’s web traffic showing up as “direct” in Google Analytics. Those visitors weren’t typing URLs or using bookmarks. They were clicking links that someone had pasted into a text thread, an email chain, a Slack channel.

This reflects my own experience. "Direct/none" is the number one referrer on this very website.
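The GOV.UK figures in the quote are easy to check. The short TypeScript sketch below redoes the arithmetic from the two raw numbers given above (the variable names are mine). It suggests that the quoted "1 in 476" comes from inverting the rounded 0.21% figure, while the raw counts work out to roughly 1 in 483. Both are per pageview, since the study counted pageviews rather than unique visitors.

```typescript
// Raw figures from the GOV.UK study as quoted above.
const pageviews = 6800000;
const shareClicks = 14078;

// Share clicks as a fraction of pageviews, then as a percentage string.
const usageRate = shareClicks / pageviews;
const usagePercent = (usageRate * 100).toFixed(2);

// "1 in N" computed two ways: from the raw counts, and from the rounded 0.21%.
const oneInFromCounts = Math.round(pageviews / shareClicks);
const oneInFromRoundedRate = Math.round(1 / 0.0021);

console.log(`${usagePercent}% of pageviews had a share click`);
console.log(`about 1 in ${oneInFromCounts} (raw counts)`);
console.log(`about 1 in ${oneInFromRoundedRate} (from the rounded rate)`);
```

Whichever figure you prefer, the conclusion is unchanged: roughly one pageview in five hundred involves a share button.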

Analytics

Marketing

Ankur Sethi 1 month ago

So you want to write a GUI framework

Link: https://www.cmyr.net/blog/gui-framework-ingredients.html

There are a handful of technical blog posts in my bookmarks that made me go oh, I never thought of it that way when I first read them. I'm talking about posts like Parse, don't validate, Text Editing Hates You Too, Choose Boring Technology, or Making illegal states unrepresentable. I consider these required reading for any working programmer.

To me, Colin Rofls' So you want to write a GUI framework falls into the same category. Before reading this post, I'd never considered how much work goes into building a GUI framework. There's a reason even trillion-dollar megacorporations use web technologies to build their apps, ship buggy frameworks year after year, or drop support for platforms with no concern for their users. Building a brand-new GUI framework in 2026 is a long slog, and you don't get to reap the fruits of your labor until you've solved every single problem on Colin's list.

Colin writes:

Regardless of the specifics, there is one major dividing line to recognize, and this is whether or not a framework is expected to integrate closely into an existing platform or environment.

On one side of this line, then, are tools for building games, embedded applications, and (to a lesser degree) web apps. In this world, you are responsible for providing almost everything your applications will need, and you will be interacting closely with the underlying hardware: accepting raw input events, and outputting your UI to some sort of buffer or surface. (The web is different; here the browser vendors have done that integration work for you.)

On the other side of this line are tools for building traditional desktop applications. In this world, you must integrate tightly into a large number of existing platform APIs, design patterns, and conventions, and it is this integration that is the source of most of your design complexity.

In general, a game or an embedded application is a self-contained world; there is a single ‘window’, and the application is responsible for drawing everything in it. The application doesn’t need to worry about menus or sub-windows; it doesn’t need to worry about the compositor, or integrating with the platform’s IME system. Although they maybe should, they often don’t support complex scripts. They can ignore rich text editing. They likely don’t need to support font enumeration or fallback. They often ignore accessibility.

He goes on to enumerate all the integration points a GUI framework has with its host platform, including windowing, menus, 2D graphics, text rendering, accessibility, user input, and a bunch more. Each of these problems is hard on its own, but to build a GUI framework that people will want to use, you must solve all of these problems simultaneously.

A few surprising things that stood out to me from the post:

- Dropdowns and select menus are actually tiny windows. If they weren't, they would be constrained to live inside your app's main window. You can see this in action when a web application cobbles together a custom select box using a bunch of s. Those custom selects can never overflow the boundaries of your browser.
- Building an abstraction that supports all the different 2D drawing APIs across platforms (CoreGraphics on Mac, Direct2D on Windows, Cairo on Linux, etc.) is difficult. To get around this, many cross-platform apps bundle Skia, which adds ~17MB to the application's binary. The article is from 2021, so that footprint is probably larger now.
- GPUs are built to render 3D scenes, which makes them worse at rendering 2D scenes. Rendering 2D scenes on GPUs is an area of active research.
- If you only ever write English, you've probably never thought about IMEs. I write Hindustani and Punjabi, and broken support for the macOS IMEs for those languages immediately tells me that an app is built using a non-native GUI framework.
- Replicating the native behavior and conventions of a platform is difficult but possible. Replicating the native appearance of a platform (down to the animation curves, gradients, border radii) is a fool's errand. In my opinion, if you're building a cross-platform app, it's better to have it look completely alien than to try to mimic the platform's native widgets. But respecting the platform's conventions for things like drag and drop, scroll acceleration, etc. is non-negotiable.

We don't have too many viable cross-platform GUI frameworks today, especially if you want to target desktop computers. It takes too much time, money, and specialized expertise to build one. If I were starting a desktop app business today, there are only two frameworks I'd feel comfortable relying on: Electron and Qt. Nothing else is mature enough.

Programming

Web Development

Tutorial

HTML

Ankur Sethi 1 month ago

Using SwiftUI to Build a Mac-assed App in 2026

Link: https://pfandrade.me/blog/mac-assed-swiftui-app/

Paulo Andrade, creator of Secrets and Shopie:

There was a time when Mac apps felt unapologetically Mac. Panic, Omni, Cultured Code, Bare Bones, Sofa. The years just before the iPhone SDK were probably peak Mac-assedness. Then Apple's center of gravity shifted toward the iPhone. Now we have Electron, Catalyst, and iPadOS apps on the Mac. And even Apple's SwiftUI apps often sand off the very behaviors that made Mac software feel great in the first place.

SwiftUI was announced at WWDC in 2019, almost exactly 7 years ago now. It was meant to be a unified toolkit that would allow you to build apps for Mac, iPhone, iPad, Apple Watch, and any future platforms Apple might release.

Most Apple developers would agree that SwiftUI has failed to deliver on that promise. In fact, Paulo's post is not the first I've read about SwiftUI's various inadequacies. Michael Tsai recently made a list of grievances professional SwiftUI developers have with the framework.

I've been personally interested in getting back into building native Mac apps since at least the COVID lockdowns. But every time I've asked for advice on whether I should learn SwiftUI or AppKit, I've been met with the same answer: learn both.

For somebody who has a full-time job and somewhat of a social life, this is untenable. It's just not possible for me to learn two new UI frameworks just as a cost of entry into the Apple developer ecosystem, no matter how motivated or skilled I might be.

Meanwhile, long-time Mac users complain that nobody builds native apps anymore. To be fair, diehard Mac users have always complained about this, but I believe this time their complaint has legs. I don't see too many native Mac apps being built in 2026. The old stalwarts are still going strong (BBEdit, Things, Transmit, iA Writer, and all the rest), but pretty much every recent app I've used is built on top of Electron.

It's easy to point the finger at Electron and React, or at CXOs that want to hire cheap frontend developers over expensive native developers, or at developers themselves, but I feel Apple is at least partially to blame for the state of the ecosystem today. I don't want to invest my time in an incomplete and buggy UI framework, and I certainly don't want to learn two UI frameworks just to try my hand at building a native app. I suspect most developers feel the same.

Paulo ends his post with:

You can see the result everywhere. SwiftUI is productive, modern, and often delightful, right up until you try to make a really good Mac app. Then suddenly you're fighting the framework for things the Mac solved 20 years ago.

WWDC starts in two hours from the time I'm writing this post. Perhaps today we'll see some announcements that address some of these issues? Perhaps the Apple of 2026 will finally catch up with the Apple of 2006 in terms of software quality?

Whether Apple cleans up their mess or not, Electron exists today and works fine. It lets you get your work out the door and into the hands of your users. It lets you build your business without worrying about what Apple will or will not do. As does React, which hasn't changed significantly since SwiftUI was announced.

Programming

Tutorial

Ankur Sethi 2 months ago

Land and expand

Link: https://lobste.rs/s/oznirn/redis_cost_ambition#c_dzrja0

Mitchell Hashimoto (founder of HashiCorp, creator of Vagrant and Ghostty) commenting on why software products often lose their core identity and grow irrelevant features:

The cost (cognitive, time, risk, money, etc.) of adopting a new thing is significantly higher than expanding an old thing. You see this even without any commercial interests. For example, one I've spoken publicly on is how many programming languages became a least-common-denominator of everything features rather than hold strong to a core identity. And many/most of these have no commercial motive, its just laziness.

Commercial interests of course definitely push this though. At a certain points its all about horizontal expansion. Or, in more businessy terms: "land and expand." You have the P&P (pricing/packaging) for land deals that explicitly aim to get someone to use your software, usually lead by a flagship functionality that your product is truly probably best in class or nearly at. Then once the deal is landed, you have a cadre of add-on functionality that you're probably just average at at best, but its easier for procurement (the department that handles software purchasing in a business) to upgrade an existing closed deal than to engage in a new one. So you can sell mediocre stuff.

I recently heard a different term for the "land and expand" idea in The Positioning Manual for Indie Consultants: "creating a beachhead". I find it interesting (and off-putting) that much of business vocabulary borrows from military operations. But that's a post for another day.

The "land and expand" strategy doesn't always result in bad products. But when it's done badly, you end up with Zoom Mail, Microsoft Teams, and JIRA.

Business

Ankur Sethi 2 months ago

Selling to practitioners vs. selling to technical decision makers

Link: https://lobste.rs/s/oznirn/redis_cost_ambition#c_dzrja0

Mitchell Hashimoto (founder of HashiCorp, creator of Vagrant and Ghostty) commenting on Lobste.rs about how software products are sold:

For software solutions, there are two main groups: practitioners and technical decision makers (TDMs). Practitioners are the main users of a piece of software (and in the case of OSS, adopters, though not the case always). TDMs are the higher level management with budgetary discretion that are making broad stroke technical decisions.

The Redis landing page to me looks like a TDM-oriented site. And the "real-time context engine for AI" and AI focus feels correct for that target user.

You know the phrase "no one ever got fired for choosing IBM?" The thing about 90% of TDMs is that they're motivated primarily by NOT GETTING FIRED. These aren't people who browser Lobsters or push to GH on the weekend. These are people that work 9 to 5, get paid, go home, and NEVER THINK ABOUT WORK AGAIN.

So to achieve all that, they follow secular trends supported by analysts and broad public sentiment. Oh, Gartner said that "AI strategy" is most important? McKinsey said "context" needs to be managed? Well, "Context Engine for AI Apps" is going to be defensible. Buy it.

On the surface, this might sound like a dismissal of TDMs as people who don't care about the job, but I don't think Mitchell meant it that way. TDMs are doing their best with the information they have. They're paying attention to signals that are high quality in their estimation, but not necessarily high quality in the estimation of their technical co-workers.

I personally would never use a Gartner report to make technical decisions, in the same way that the CFO at your company would never use a Hacker News comment to make financial decisions. And you know what? It's okay if your CFO doesn't care about what Hacker News thinks about Redis. That's not their job. That's your job. Their job is to make sure the business doesn't go bankrupt.

If I want my company to pick Valkey over Redis, the onus for communicating that to management is entirely on me. It's my job to explain why it's valuable not just from a technical point of view, but also from a business point of view. Will it help the company ship faster? Save money on AWS bills? Build new features we couldn't build before? Will it help reduce liability, create better audit trails, onboard new engineers faster?

TDMs can't make good decisions based on information they can't parse, so it's my job to make sure they can parse the differences between two relatively similar products. If I refuse to do this job properly, the marketing department at Redis Ltd. will do it in a way that serves their business needs rather than mine.

There are economic, social, legal, and political dimensions to picking technology. It's never just about the quality of the product in isolation.

Business

Marketing

Ankur Sethi 2 months ago

Mythos finds a curl vulnerability

Link: https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/

Daniel Stenberg, creator and lead developer of cURL:

My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing.

I signed the contract for getting access, but then nothing happened. Weeks went past and I was told there was a hiccup somewhere and access was delayed. Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn’t that important. It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway. Getting the tool to generate a first proper scan and analysis would be great, whoever did it. I happily accepted this offer.

So Daniel didn't have access to Mythos. Someone else ran the analysis on his behalf. It's unclear what methodology this "someone else" used, how familiar they were with the cURL codebase, or how well they were acquainted with the sort of security issues the project has seen before.

What if Daniel had run the scan himself? I'm willing to bet the results would've been radically different.

I'm not saying all the hype around Mythos is necessarily justified (Anthropic is an AI lab after all, and AI labs lie). However, it's becoming clear that LLMs are remarkably effective at finding bugs and security issues as long as they have the right guidance. For an example of what Claude can do with expert guidance and access to custom tools, see Using LLMs to find Python C-extension bugs.

Broadly speaking, I believe Daniel would agree with this sentiment. He writes:

But allow me to highlight and reiterate what I have said before: AI powered code analyzers are significantly better at finding security flaws and mistakes in source code than any traditional code analyzers did in the past. All modern AI models are good at this now. Anyone with time and some experimental spirits can find security problems now. The high quality chaos is real. Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws, bugs and possible vulnerabilities with this new generation of tools. Mythos will, and so will many of the others. Not using AI code analyzers in your project means that you leave adversaries and attackers time and opportunity to find and exploit the flaws you don’t find.

Lately I find myself drawn to how LLMs can help improve existing human-authored (or mostly human-authored) code. I'm no longer thrilled with the idea of using them to write most of my code for me (been there, dealt with the cognitive debt), but I'm intrigued by how I could use them as superhuman code reviewers to catch my mistakes. What would a coding harness designed primarily around improving code quality look like?

Open Source

Python

Security

Ankur Sethi 2 months ago

Using LLMs to find Python C-extension bugs

Link: https://lwn.net/Articles/1067234/ Jake Edge , LWN.net: […] Hobbyist Daniel Diniz used Claude Code to find more than 500 bugs of various sorts across nearly a million lines of code in 44 extensions; he has been working with maintainers to get fixes upstream and his methodology serves as a great example of how to keep the human in the loop—and the maintainers out of burnout—when employing LLMs. It's worth reading Daniel Diniz's post on the Python forums in full. This is a great example of an engineer with specific domain expertise using LLMs to augment and amplify his abilities. Not just that, he's working closely with maintainers to ensure he's not inundating them with slop PRs or unreproducible bug reports. The part I find most interesting is how Daniel's Claude Code plugin works. He writes in his forum post : I built a Claude Code plugin called cext-review-toolkit . The key difference from traditional static analysis is that this system tracks Python-specific invariants (refcounts, GIL discipline, exception state) across control flow, and validates findings with targeted reproducers. That is done by 13 specialized analysis agents analyzing the C extension source code in parallel, with each agent targeting a different bug class. The agents use Tree-sitter for C/C++ parsing, which enables analysis that pattern matching can’t do, like tracking borrowed reference lifetimes across function calls, or cross-referencing type slot definitions with struct members. Each agent can run a scanner script to find candidates, then performs qualitative review of each candidate to confirm or dismiss it. The scripts have a ~20-40% false positive rate and the agents are there to bring that down. After the agents finish, I try to reproduce every finding from pure Python and write a reproducer appendix. Later from the same post: Traditional tools like clang-tidy, Coverity, and sanitizers struggle with Python C API semantics (reference ownership, exception state, GIL constraints). 
The analyses cext-review-toolkit performs target those invariants specifically. Besides that, the tool uses guided semantic analysis (LLM-assisted) to analyze aspects like “was that bugfix complete, and do similar bugs still lurk in the codebase?” that other tools cannot cover. The rich set of agents cover:

Reference counting: leaked refs, borrowed-ref-across-callback, stolen-ref misuse.
Error handling: missing NULL checks, return without exception, exception clobbering.
NULL safety: unchecked allocations, dereference-before-check.
GIL discipline: API calls without GIL, blocking with GIL held.
Type slots: dealloc bugs, missing traverse/clear, -without- safety.
PyErr_Clear: unguarded exception swallowing (MemoryError, KeyboardInterrupt).
Module state: single-phase init, global PyObject* state.
Version compatibility: deprecated APIs, dead version guards.
Git history: fix completeness (same bug fixed in one place but not another).
Plus: stable ABI compliance, resource lifecycle, complexity analysis.

So the toolkit is not just a set of prompts that tell Claude to go find bugs. It combines detailed descriptions of specific classes of bugs with scripts powered by Tree-sitter that allow Claude to extract rich semantic data from the codebase it's analyzing. The LLM is not doing all of the heavy lifting here. It works in tandem with human expertise encoded in prompts and deterministic scripts custom-built for acting on those prompts.

To me, this feels like the most effective use of LLMs for domain-specific tasks that don't exist in training data: encode as much of your logic into deterministic tools as you can, encode the squishier parts of your domain into prompts, and let an agent drive those tools. I can see a possible future where every project has its own version of this toolkit that encodes common classes of bugs the project deals with repeatedly. How much would something like this improve code quality? How much better would it be versus the generic PR review agents we use today?
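To make the "scanner script finds candidates, then a reviewer confirms or dismisses them" split concrete, here is a toy sketch of the first half. It is not how cext-review-toolkit works: the real tool parses C with Tree-sitter, while this sketch swaps in a naive regular expression, and every name in it (`scan_unchecked_allocations`, `Candidate`, the sample C source) is hypothetical.

```python
import re
from dataclasses import dataclass

# Deliberately naive pattern: a variable assigned the result of an allocation call.
ALLOC_RE = re.compile(r"\b(\w+)\s*=\s*(?:PyMem_Malloc|malloc)\s*\(")


@dataclass
class Candidate:
    line_no: int   # 1-based line number of the allocation
    variable: str  # name of the variable receiving the pointer
    text: str      # the source line, stripped


def scan_unchecked_allocations(c_source: str, lookahead: int = 3) -> list[Candidate]:
    """Flag allocations that have no NULL check within the next few lines.

    This is a heuristic, so it will produce false positives and negatives;
    each candidate still needs a qualitative review.
    """
    lines = c_source.splitlines()
    candidates = []
    for i, line in enumerate(lines):
        match = ALLOC_RE.search(line)
        if not match:
            continue
        var = re.escape(match.group(1))
        window = "\n".join(lines[i + 1 : i + 1 + lookahead])
        # Accept either `if (!var` or `if (var == NULL` as a NULL check.
        checked = re.search(rf"if\s*\(\s*(?:!\s*{var}\b|{var}\s*==\s*NULL)", window)
        if not checked:
            candidates.append(Candidate(i + 1, match.group(1), line.strip()))
    return candidates


# Hypothetical C snippet: the first allocation is used unchecked, the second is checked.
SAMPLE = """\
char *buf = malloc(n);
buf[0] = 0;
char *tmp = PyMem_Malloc(n);
if (tmp == NULL) {
    return PyErr_NoMemory();
}
"""

for candidate in scan_unchecked_allocations(SAMPLE):
    print(f"line {candidate.line_no}: {candidate.variable} may be used unchecked")
```

A cheap scanner like this over-reports by design. The point of the second stage, whether a human or an agent, is to read each candidate in context and throw out the false positives.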

C++

Python

Ankur Sethi 2 months ago

A broken 404 template in Django can swallow your backtraces

I recently migrated this website from Astro to Wagtail . The reason why I did it is a story for another day. In this post, I want to talk about a bug that took me far too long to figure out. In his (verifiably incorrect) post about making chai , Abhigyan linked to my own (verifiably correct) post on the topic . While linking to my post, he accidentally omitted the trailing slash from the URL. This shouldn't have been a problem. By default, Django automatically redirects a URL without a trailing slash to the same URL with the trailing slash appended, provided the original URL returns a . For example, if you try to access the following URL on my website: Django automatically performs a redirect to: This is the default behavior, controlled by the setting . However, when Abhigyan linked to my (verifiably correct) post about making chai, my server returned a error instead. I'd never have discovered this error myself, but Shubh pointed it out to me on the IndieWebClub chat last week. Thanks Shubh! I started investigating the issue by checking the Gunicorn logs on my VPS. I was hoping they would contain a backtrace that would help me pinpoint the exact problem, but the logs only printed the string whenever the broken URL was accessed. I ran my app with production settings inside a Docker container to see if I could trigger the same behavior. And sure enough, the Dockerized app produced the same error with the same mysterious in the Gunicorn logs. My first instinct was that I had somehow messed up my logging configuration. I'd surely introduced a bug in some Python code somewhere, and my logging configuration was failing to log the backtrace because of a misconfiguration. But tweaking Django's setting didn't change anything. I could see backtraces from the exceptions I inserted at random points in my code, but accessing a URL without a trailing slash would still only produce the string in the logs. 
After a lot of head scratching, reading the docs, and yelling at Claude, I wondered if something in my 404 template could be responsible for the error. My 404 template was fairly complex, loading and calling several template tags, inheriting from a chain of templates, rendering a few s, and concatenating assets using django-compressor. I started by deleting everything from and reducing it to a single tag. Sure enough, this fixed the issue! Then I slowly added some of the code back until I found the one custom template tag that was throwing an exception, but only when called in the context of a 404 page. Fixing the tag and redeploying fixed the issue for good.

But what about the logs? An error in my 404 template not only caused my server to return a 500, but also suppressed any backtraces that might have helped me diagnose the issue. That's weird, right?

I might be wrong, but I believe the sequence of events that can lead to this issue is as follows:

Somebody accesses a URL without a trailing slash.
Django tries to find that URL in its . Since this is a Wagtail installation, it also tries to find a page in the URLs known to Wagtail.
All the URLs in my have trailing slashes. Wagtail also appends trailing slashes to all its URLs when is true. So trying to access a page without a trailing slash returns a 404.
You would expect Django's redirect logic to kick in at this point, trying to append a trailing slash to the original URL and performing a redirect. But that's not what happens! The redirect logic lives in , which can only perform the redirect after the entire handling chain has finished running.
This means regardless of what happens, Django will always render your 404 template when an unknown URL is accessed. Yes, even if redirecting to the same URL with a trailing slash produces a known, correct URL!
This means if your 404 template errors out, doesn't even get a chance to run. Django encounters an unknown URL, tries to render the 404 template, fails, and turns the 404 into a 500.
When this happens, Django only logs the 500, not the template failing to render. This happens even if you're logging template rendering errors in your logging configuration.
From what I can tell, there is no way to get Django to log an error in without creating a custom view, manually catching errors, logging the caught errors, and re-raising them so that Django can turn them into 500s.

The lessons I learned from this frustrating scenario were:

Always render your 404 and 500 pages in unit tests to make sure they can never error out.
Keep your error pages as simple as possible. Ideally, they should only contain HTML and inlined CSS, nothing more.
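The "catch, log, re-raise" workaround mentioned above can be sketched without any framework code. The decorator below is an illustration only, not the fix used on this site: `log_and_reraise`, the logger name, and `broken_not_found_view` are all hypothetical. In a Django project you would presumably apply something like it to the view you register as `handler404` in your URLconf, which Django calls with the request and the exception.

```python
import functools
import logging

logger = logging.getLogger("error_pages")  # hypothetical logger name


def log_and_reraise(view):
    """Wrap an error-page view so a rendering failure is logged before it propagates."""

    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        try:
            return view(*args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and attaches the full traceback.
            logger.exception("Error page view %r failed to render", view.__name__)
            # Re-raise so the framework can still turn the failure into a 500.
            raise

    return wrapper


@log_and_reraise
def broken_not_found_view(request, exception=None):
    # Stand-in for a 404 view whose template raises while rendering.
    raise ValueError("custom template tag blew up")
```

With this in place, the traceback of the template failure reaches the logs even though the client still only sees a generic server error.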

HTML

Backend

Django

CSS

Web Development

Ankur Sethi 3 months ago

I'm no longer using coding assistants on personal projects

I’ve spent the last few months figuring out how best to use LLMs to build software. In January and February, I used Claude Code to build a little programming language in C. In December I used a local LLM to analyze all the journal entries I wrote in 2025, and then used Gemini to write scripts that could visualize that data. Besides what I’ve written about publicly, I’ve also used Claude Code to:

Write and debug Emacs Lisp for my personal Emacs configuration.
Write several Alfred workflows (in Bash, AppleScript, and Swift) to automate tasks on my computer.
Debug CSS issues on this very website.
Generate React components for a couple of throwaway side projects.
Generate Django apps for a couple of throwaway side projects.
Port color themes between text editors.
A lot more that I’m forgetting now.

I won’t lie, I started off skeptical about the ability of LLMs to write code, but I can’t deny the fact that, in 2026, they can produce code that’s as good as or better than that of a junior-to-intermediate developer for most programming domains. If you’re abstaining from learning about or using LLMs in your own work, you’re doing a disservice to yourself and your career. It’s a very real possibility that in five years, most of the code we write will be produced using an LLM. It’s not a certainty, but it’s a strong possibility.

However, I’m not going to stop writing code by hand. Not anytime soon. As long as there are computers to program, I will be programming them using my own two fleshy human hands.

I started programming computers because I enjoy the act of programming. I enjoy thinking through problems, coming up with solutions, evolving those solutions so that they are as correct and clear as possible, and then putting them out into the world where they can be of use to people. It’s a fun and fulfilling profession.

Some people see the need for writing code as an impediment to getting good use out of a computer. In fact, some of the most avid fans of generative AI believe that the act of actually doing the work is a punishment. They see work as unnecessary friction that must be optimized away. Truth is, the friction inherent in doing any kind of work (writing, programming, making music, painting, or any other creative activity generative AI purports to replace) is the whole point. The artifacts you produce as the result of your hard work are not important. They are incidental. The work itself is the point. When you do the work, you change and grow and become more yourself. Work, especially creative work, is an act of self-love if you choose to see it that way.

Besides, when you rely on generative AI to do the work, you miss out on the pleasurable sensations of being in flow state. Your skills atrophy (no, writing good prompts is not a skill, any idiot can do it). Your brain gets saturated with dopamine the same way it does when you gamble, doomscroll, or play a gacha game. Using Claude Code as your main method of producing code is like scrolling TikTok eight hours a day, every day, for work.

And the worst part? The code you produce using LLMs is pure cognitive debt. You have no idea what it’s doing, only that it seems to be doing what you want it to do. You don’t have a mental model for how it works, and you can’t fix it if it breaks in production. Such a codebase is not an asset but a liability. I predict that in 1-3 years we’re going to see organizations rewrite their LLM-generated software using actual human programmers.

Personally, I’ve stopped using generative AI to write code for my personal projects. I still use Claude Code as a souped-up search engine to look up information, or to help me debug nasty errors. But I’m manually typing every single line of code in my current Django project, with my own fingers, using a real physical keyboard. I’m even thinking up all the code using my own brain. Miraculous!

For the commercial projects I work on for my clients, I’m going to follow whatever the norms around LLM use happen to be at my workplace. If a client requires me to use Claude Code to write every single line of code, I’ll be happy to oblige. If they ban LLMs outright, I’m fine with that too. After spending hundreds of hours yelling at Claude, I’m dangerously proficient at getting it to do the right thing. But I haven’t lost my programming skills yet, and I don’t plan to. I’m flexible.

Given the freedom to choose, I’d probably pick a middle path: use LLMs to generate boilerplate code, write tricky test cases, debug nasty issues I can’t think of, and quickly prototype ideas to test. I’m not an AI vegan. But when it comes to code I write for myself, which includes the code that runs this website, I’m going to continue writing it myself, line by line, like I always did. Somebody has to clean up after the robots when they make a mess, right?

Bash

CSS

Python

Programming

Ankur Sethi 3 months ago

Waiting is fun

I enjoy waiting. I enjoy waiting at the doctor’s office, at the dentist’s, at the hairdresser’s. I enjoy waiting in queue for my airplane to board, and I enjoy sitting in airplanes on long flights where I have nothing to do and nowhere to go. I enjoy long drives across the city. I even enjoy being stuck in traffic. I enjoy all these moments of waiting for something to happen. Yes, they rob me of my agency to do the things I want or need to do, but they are enjoyable for that very reason. When I’m waiting for something to happen, that time is already spoken for. It’s earmarked for sitting in the doctor’s office, or in an airplane, or for the long drive to a friend’s place. I’ve scheduled nothing “productive” in that time, because it’s not possible to get anything useful done during that time. It’s dead time. It’s time where I’m not eating or sleeping or watching TV or working. Where I’m not pressured to be productive, because there’s no way to get anything useful done while I’m in a waiting room and the doctor’s assistant is interrupting me over and over again. And so I slip into a state of simply being . Of observing the people and events around me without feeling a pressing need to do anything about them. It’s when I notice all the little things people do. It’s when I can laugh at and fall in love with our collective humanity. Sometimes I judge people—for wearing Crocs, watching reels on their phones too loudly. Sometimes I notice heartfelt moments—a kid reaching for their parent’s hand, somebody getting a glass of water for their partner, somebody else leaning their head on their parent’s shoulder. I overhear conversations and shelve them away to recount to my friends later. I notice weird labels on machinery, funny signs, wildlife, people falling asleep in chairs, spelling mistakes on forms. But this isn’t just the time for me to observe the world passively. This is also a time to think. 
I often get lost in reverie while I’m at the dentist’s, thinking about somebody I love, writing projects I’m working on, programming problems I’m trying to solve. Sometimes I get so lost it takes me a few minutes to come back to reality when I’m finally called inside the doctor’s office.

This is also a time to read. It’s my second favorite thing to do on flights (my favorite is sleeping). Reading is an activity that, for me, is uniquely resilient to constant interruption. I can read a few paragraphs, attend to something else, then come back and continue where I left off. I can’t do that when I’m writing code or working on a blog post. I can’t even do that when I’m playing a video game. Reading is woven so deep into my life that dipping in and out of it doesn’t take much cognitive effort, nor does it bother me that much.

In these states of waiting (I’m really trying not to use the word “liminal” in this blog post, I hate how it feels on my tongue) I often come up with new ideas, make new connections, plan for the future, solve problems that had plagued me for weeks. If I allow my brain to roll along with whatever thought flits into it, just maintaining a soft focus on it without trying to guide it into any specific direction, some strange alchemy happens. I think thoughts I’d never thought I could have thunked.

When the waiting finally ends, it feels like the end of playtime. Like my grandpa is standing in the verandah, yelling at me to come back indoors from the park because it’s 7pm. It’s time to say goodbye to all my friends, wipe the mud and grass and bugs off my clothes, wash my feet, and go back indoors. It’s time for homework, preparing for the upcoming school day, brushing my teeth, and going to bed.

All that is to say that I like being bored. I like waiting. As an adult, it’s one of the few times I allow myself to simply exist without feeling the pressure to do something “useful”.

Culture

Writing

Ankur Sethi 4 months ago

I built a programming language using Claude Code

Over the course of four weeks in January and February, I built a new programming language using Claude Code. I named it Cutlet after my cat. It’s completely legal to do that. You can find the source code on GitHub , along with build instructions and example programs . I’ve been using LLM-assisted programming since the original GitHub Copilot release in 2021, but so far I’ve limited my use of LLMs to generating boilerplate and making specific, targeted changes to my projects. While working on Cutlet, though, I allowed Claude to generate every single line of code. I didn’t even read any of the code. Instead, I built guardrails to make sure it worked correctly (more on that later). I’m surprised by the results of this experiment. Cutlet exists today. It builds and runs on both macOS and Linux. It can execute real programs. There might be bugs hiding deep in its internals, but they’re probably no worse than ones you’d find in any other four-week-old programming language in the world. I have Feelings™ about all of this and what it means for my profession, but I want to give you a tour of the language before I get up on my soapbox. If you want to follow along, build the Cutlet interpreter from source and drop into a REPL using . Arrays and strings work as you’d expect in any dynamic language. Variables are declared with the keyword. Variable names can include dashes. Same syntax rules as Raku . The only type of number (so far) is a double. Here’s something cool: the meta-operator turns any regular binary operator into a vectorized operation over an array. In the next line, we’re multiplying every element of by 1.8, then adding 32 to each element of the resulting array. The operator is a zip operation. It zips two arrays into a map. Output text using the built-in function. This function returns , which is Cutlet’s version of . The meta operator also works with comparisons. Here’s another cool bit: you can index into an array using an array of booleans. 
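For readers more at home in Python, here is a rough analog of the array operations described so far. This is Python, not Cutlet syntax, and the city names and temperatures are made up for illustration.

```python
# Hypothetical sample data, not taken from the Cutlet tour.
cities = ["Delhi", "Oslo", "Cairo"]
temps_c = [30.0, 5.0, 35.0]

# Vectorized arithmetic: multiply every element by 1.8,
# then add 32 to each element of the resulting array.
temps_f = [t * 1.8 + 32 for t in temps_c]

# Zip two arrays into a map.
temp_by_city = dict(zip(cities, temps_c))

# A vectorized comparison produces an array of booleans.
is_hot = [t > 25 for t in temps_c]

# Indexing an array with an array of booleans keeps only the
# elements whose corresponding mask entry is true.
hot_temps = [t for t, keep in zip(temps_c, is_hot) if keep]

print(temp_by_city)
print(is_hot)
print(hot_temps)
```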
This is a filter operation. It picks the element indexes corresponding to and discards those that correspond to . Here’s a shorter way of writing that. Let’s print this out with a user-friendly message. The operator concatenates strings and arrays. The built-in turns things into strings. The meta-operator in the prefix position acts as a reduce operation. Let’s find the average temperature. adds all the temperatures, and the built-in finds the length of the array. Let’s print this out nicely, too. Functions are declared with . Everything in Cutlet is an expression, including functions and conditionals. The last value produced by an expression in a function becomes its return value. Your own functions can work with too. Let’s reduce the temperatures with our function to find the hottest temperature. Cutlet can do a lot more. It has all the usual features you’d expect from a dynamic language: loops, objects, prototypal inheritance, mixins, a mark-and-sweep garbage collector, and a friendly REPL. We don’t have file I/O yet, and some fundamental constructs like error handling are still missing, but we’re getting there! See TUTORIAL.md in the git repository for the full documentation. I’m a frontend engineer and (occasional) designer. I’ve tried using LLMs for building web applications, but I’ve always run into limitations. In my experience, Claude and friends are scary good at writing complex business logic, but fare poorly on any task that requires visual design skills. Turns out describing responsive layouts and animations in English is not easy. No amount of screenshots and wireframes can communicate fluid layouts and animations to an LLM. I’ve wasted hours fighting with Claude about layout issues it swore it had fixed, but which I could still see plainly with my leaky human eyes. I’ve also found these tools to excel at producing cookie-cutter interfaces they’ve seen before in publicly available repositories, but they fall off when I want to do anything novel. 
I often work with clients building complex data visualizations for niche domains, and LLMs have comprehensively failed to produce useful outputs on these projects. On the other hand, I’d seen people accomplish incredible things using LLMs in the last few months, and I wanted to replicate those experiments myself. But my previous experience with LLMs suggested that I had to pick my project carefully. A small, dynamic programming language met all my requirements. Finally, this was also an experiment to figure out how far I could push agentic engineering. Could I compress six months of work into a few weeks? Could I build something that was beyond my own ability to build? What would my day-to-day work life look like if I went all-in on LLM-driven programming? I wanted to answer all these questions. I went into this experiment with some skepticism. My previous attempts at building something entirely using Claude Code hadn’t worked out. But this attempt has not only been successful, but produced results beyond what I’d imagined possible. I don’t hold the belief that all software in the future will be written by LLMs. But I do believe there is a large subset that can be partially or mostly outsourced to these new tools. Building Cutlet taught me something important: using LLMs to produce code does not mean you forget everything you’ve learned about building software. Agentic engineering requires careful planning, skill, craftsmanship, and discipline, just like any software worth building before generative AI. The skills required to work with coding agents might look different from typing code line-by-line into an editor, but they’re still very much the same engineering skills we’ve been sharpening all our careers. There is a lot of work involved in getting good output from LLMs. Agentic engineering does not mean dumping vague instructions into a chat box and harvesting the code that comes out. 
I believe there are four main skills you have to learn today in order to work effectively with coding agents: Models and harnesses are changing rapidly, so figuring out which problems LLMs are good at solving requires developing your intuition, talking to your peers, and keeping your ear to the ground. However, if you don’t want to stay up-to-date with a rapidly-changing field—and I wouldn’t judge you for it, it’s crazy out there—here are two questions you can ask yourself to figure out if your problem is LLM-shaped: If the answer to either of those questions is “no”, throwing AI at the problem is unlikely to yield good results. If the answer to both of them is “yes”, then you might find success with agentic engineering. The good news is that the cost of figuring this out is the price of a Claude Code subscription and one sacrificial lamb on your team willing to spend a month trying it out on your codebase. LLMs work with natural language, so learning to communicate your ideas using words has become crucial. If you can’t explain your ideas in writing to your co-workers, you can’t work effectively with coding agents. You can get a lot out of Claude Code using simple, vague, overly general prompts. But when you do that, you’re outsourcing a lot of your thinking and decision-making to the robot. This is fine for throwaway projects, but you probably want to be more careful when you’re building something you will put into production and maintain for years. You want to feed coding agents precisely written specifications that capture as much of your problem space as possible. While working on Cutlet, I spent most of my time writing, generating, reading, and correcting spec documents . For me, this was a new experience. I primarily work with early-stage startups, so for most of my career, I’ve treated my code as the spec. Writing formal specifications was an alien experience. Thankfully, I could rely on Claude to help me write most of these specifications. 
I was only comfortable doing this because Cutlet was an experiment. On a project I wanted to stake my reputation on, I might take the agent out of the equation altogether and write the specs myself. This was my general workflow while making any change to Cutlet: This workflow front-loaded the cognitive effort of making any change to the language. All the thinking happened before a single line of code was written, which is something I almost never do. For me, programming involves organically discovering the shape of a problem as I’m working on it. However, I’ve found that working that way with LLMs is difficult. They’re great at making sweeping changes to your codebase, but terrible at quick, iterative, organic development workflows. Maybe my workflow will evolve as inference gets faster and models become better, but until then, this waterfall-style model works best. I find this to be the most interesting and fun part of working with coding agents. It’s a whole new class of problem to solve! The core principle is this: coding agents are computer programs, and therefore have a limited view of the world they exist in. Their only window into the problem you’re trying to solve is the directory of code they can access. This doesn’t give them enough agency or information to be able to do a good job. So, to help them thrive, you must give them that agency and information in the form of tools they can use to reach out into the wider world. What does this mean in practice? It looks different for different projects, but this is what I did for Cutlet: All these tools and abilities guaranteed that any updates to the code resulted in a project that at least compiled and executed. But more importantly, they increased the information and agency Claude had access to, making it more effective at discovering and debugging problems without my intervention. 
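As one illustration of what such a tool might look like, here is a minimal sketch of a check runner an agent could call after every edit. It is not one of Cutlet's actual scripts: `run_checks`, the step names, and the stand-in commands are all hypothetical. The idea is to run each check exactly once, stop at the first failure, and hand back a short summary instead of pages of output.

```python
import subprocess
import sys


def run_checks(steps, tail_lines=5):
    """Run each (name, argv) step once, in order, stopping at the first failure.

    Returns (ok, report). The report is a short text summary that keeps only
    the last few output lines of the failing step.
    """
    report = []
    for name, argv in steps:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            output = (result.stdout + result.stderr).strip().splitlines()
            report.append(f"FAIL {name} (exit {result.returncode})")
            report.extend(output[-tail_lines:])
            return False, "\n".join(report)
        report.append(f"ok   {name}")
    return True, "\n".join(report)


# Hypothetical steps standing in for a real build, test suite, and linter.
STEPS = [
    ("build", [sys.executable, "-c", "print('compiled')"]),
    ("tests", [sys.executable, "-c", "import sys; print('1 test failed'); sys.exit(1)"]),
    ("lint", [sys.executable, "-c", "print('never reached')"]),
]

if __name__ == "__main__":
    ok, report = run_checks(STEPS)
    print(report)
```

Keeping the report short matters as much as the checks themselves, since everything the tool prints ends up in the agent's context window.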
If I keep working on this project, my main focus will be to give my agents even more insight into the artifact they are building, even more debugging tools, even more freedom, and even more access to useful information. You will want to come up with your own tooling that works for your specific project. If you’re building a Django app, you might want to give the agent access to a staging database. If you’re building a React app, you might want to give it access to a headless browser. There’s no single answer that works for every project, and I bet people are going to come up with some very interesting tools that allow LLMs to observe the results of their work in the real world. Coding agents can sometimes be inefficient in how they use the tools you give them. For example, while working on this project, sometimes Claude would run a command, decide its output was too long to fit into the context window, and run it again with the output piped to . Other times it would run , forget to the output for errors, and run it a second time to capture the output. This would result in the same expensive checks running multiple times in the course of making a single edit. These mistakes slowed down the agentic loop significantly. I could fix some of these performance bottlenecks by editing or changing the output of a custom script. But there were some issues that required more effort to discover and fix. I quickly got into the habit of observing the agent at work, noticing sequences of commands that the agent repeated over and over again, and turning them into scripts for the agent to call instead. Many of the scripts in Cutlet’s directory came about this way. This was very manual, very not-fun work. I’m hoping this becomes more automated as time goes on. Maybe a future version of Claude Code could review its own tool calling outputs and suggest scripts you could write for it? Of course, the most fruitful optimization was to run Claude inside Docker with and access. 
By doing this, I took myself out of the agentic loop. After a plan file had been produced, I didn’t want to hang around babysitting agents and saying every time they wanted to run . As Cutlet evolved, the infrastructure I built for Claude also evolved. Eventually, I captured many of the workflows Claude naturally followed as scripts, slash commands, or instructions in . I also learned where the agent stumbled most, and preempted those mistakes by giving it better instructions or scripts to run. The infrastructure I built for Claude was also valuable for me, the human working on the project. The same scripts that helped Claude automate its work also helped me accomplish common tasks quickly. As the project grows, this infrastructure will keep evolving along with it. Models change all the time. So do project requirements and workflows. I look at all this project infrastructure as an organic thing that will keep changing as long as the project is active. Now that it’s possible for individual developers to accomplish so much in such little time, is software engineering as a career dead? My answer to this question is nope, not at all . Software engineering skills are just as valuable today as they were before language models got good. If I hadn’t taken a compilers course in college and worked through Crafting Interpreters , I wouldn’t have been able to build Cutlet. I still had to make technical decisions that I could only make because I had (some) domain knowledge and experience. Besides, I had to learn a bunch of new skills in order to effectively work on Cutlet. These new skills also required technical knowledge. A strange and new and different kind of technical knowledge, but technical knowledge nonetheless. Before working on this project, I was worried about whether I’d have a job five years from now. But today I’m convinced that the world will continue to have a need for software engineers in the future. 
Our jobs will transform—and some people might not enjoy the new jobs anymore—but there will still be plenty of work for us to do. Maybe we’ll have even more work to do than before, since LLMs allow us to build a lot more software a lot faster. And for those of us who never want to touch LLMs, there will be domains where LLMs never make any inroads. My friends who work on low-level multimedia systems have found less success using LLMs compared to those who build webapps. This is likely to be the case for many years to come. Eventually, those jobs will transform, too, but it will be a far slower shift. Is it fair to say that I built Cutlet? After all, Claude did most of the work. What was my contribution here besides writing the prompts? Moreover, this experiment only worked because Claude had access to multiple language runtimes and computer science books in its training data. Without the work done by hundreds of programmers, academics, and writers who have freely donated their work to the public, this project wouldn’t have been possible. So who really built Cutlet? I don’t have a good answer to that. I’m comfortable taking credit for the care and feeding of the coding agent as it went about generating tokens, but I don’t feel a sense of ownership over the code itself. I don’t consider this “my” work. It doesn’t feel right. Maybe my feelings will change in the future, but I don’t quite see how. Because of my reservations about who this code really belongs to, I haven’t added a license to Cutlet’s GitHub repository. Cutlet belongs to the collective consciousness of every programming language designer, implementer, and educator to have released their work on the internet. (Also, it’s worth noting that Cutlet almost certainly includes code from the Lua and Python interpreters. It referred to those languages all the time when we talked about language features. 
I’ve also seen a ton of code from Crafting Interpreters making its way into the codebase with my own two fleshy eyes.) I’d be remiss if I didn’t include a note on mental health in this already mammoth blog post. It’s easy to get addicted to agentic engineering tools. While working on this project, I often found myself at my computer at midnight going “just one more prompt”, as if I was playing the world’s most obscure game of Civilization . I’m embarrassed to admit that I often had Claude Code churning away in the background when guests were over at my place, when I stepped into the shower, or when I went off to lunch. There’s a heady feeling that comes from accomplishing so much in such little time. More addictive than that is the unpredictability and randomness inherent to these tools. If you throw a problem at Claude, you can never tell what it will come up with. It could one-shot a difficult problem you’ve been stuck on for weeks, or it could make a huge mess. Just like a slot machine, you can never tell what might happen. That creates a strong urge to try using it for everything all the time. And just like with slot machines, the house always wins. These days, I set limits for how long and how often I’m allowed to use Claude. As LLMs become widely available, we as a society will have to figure out the best way to use them without destroying our mental health. This is the part I’m not very optimistic about. We have comprehensively failed to regulate or limit our use of social media, and I’m willing to bet we’ll have a repeat of that scenario with LLMs. Now that we can produce large volumes of code very quickly, what can we do that we couldn’t do before? This is another question I’m not equipped to answer fully at the moment. That said, one area where I can see LLMs being immediately of use to me personally is the ability to experiment very quickly. 
It’s very easy for me to try out ten different features in Cutlet because I just have to spec them out and walk away from the computer. Failed experiments cost almost nothing. Even if I can’t use the code Claude generates, having working prototypes helps me validate ideas quickly and discard bad ones early. I’ve also been able to radically reduce my dependency on third-party libraries in my JavaScript and Python projects. I often use LLMs to generate small utility functions that previously required pulling in dependencies from NPM or PyPI. But honestly, these changes are small beans. I can’t predict the larger societal changes that will come about because of AI agents. All I can say is programming will look radically different in 2030 than it does in 2026. This project was a proof of concept to see how far I could push Claude Code. I’m currently looking for a new contract as a frontend engineer, so I probably won’t have the time to keep working on Cutlet. I also have a few more ideas for pushing agentic programming further, so I’m likely to prioritize those over continuing work on Cutlet. When the mood strikes me, I might still add small features now and then to the language. Now that I’ve removed myself from the development loop, it doesn’t take a lot of time and effort. I might even do Advent of Code using Cutlet in December! Of course, if you work at Anthropic and want to give me money so I can keep running this experiment, I’m available for contract work for the next 8 months :) For now, I’m closing the book on Cutlet and moving on to other projects (and cat). Thanks to Shruti Sunderraman for proofreading this post. Also thanks to Cutlet the cat for walking across the keyboard and deleting all my work three times today. I didn’t want to solve a particularly novel problem, but I wanted the ability to sometimes steer the LLM into interesting directions. I didn’t want to manually verify LLM-generated code. 
I wanted to give the LLM specifications, test cases, documentation, and sample outputs, and make it do all the difficult work of figuring out if it was doing the right thing. I wanted to give the agent a strong feedback loop so it could run autonomously. I don’t like MCPs. I didn’t want to deal with them. So anything that required connecting to a browser, taking screenshots, or talking to an API over the network was automatically disqualified. I wanted to use a boring language with as few external dependencies as possible. LLMs know how to build language implementations because their training data contains thousands of existing implementations, papers, and CS books. I was intrigued by the idea of creating a “remix” language by picking and choosing features I enjoy from various existing languages. I could write a bunch of small deterministic programs along with their expected outputs to test the implementation. I could even get Claude to write them for me, giving me a potentially infinite number of test cases to verify that the language was working correctly. Language implementations can be tested from the command line, with purely textual inputs and outputs. No need to take screenshots or videos or set up fragile MCPs. There’s no better feedback loop for an agent than “run and until there are no more errors”. C is as boring as it gets, and there are a large number of language implementations built in C. Understanding which problems can be solved effectively using LLMs, which ones need a human in the loop, and which ones should be handled entirely by humans. Communicating your intent clearly and defining criteria for success. Creating an environment in which the LLM can do its best work. Monitoring and optimizing the agentic loop so the agent can work efficiently. For the problem you want to solve, is it possible to define and verify success criteria in an automated fashion? Have other people solved this problem—or a similar one—before? 
In other words, is your problem likely to be in the training data for an LLM? First, I’d present the LLM with a new feature (e.g. loops) or refactor (e.g. moving from a tree-walking interpreter to a bytecode VM). Then I’d have a conversation with it about how the change would work in the context of Cutlet, how other languages implemented it, design considerations, ideas we could steal from interesting/niche languages, etc. Just a casual back-and-forth, the same way you might talk to a co-worker. After I had a good handle on what the feature or change would look like, I’d ask the LLM to give me an implementation plan broken down into small steps. I’d review the plan and go back and forth with the LLM to refine it. We’d explore various corner cases, footguns, gotchas, missing pieces, and improvements. When I was happy with the plan, I’d ask the LLM to write it out to a file that would go into a directory. Sometimes we’d end up with 3-4 plan files for a single feature. This was intentional. I needed the plans to be human-readable, and I needed each plan to be an atomic unit I could roll back if things didn’t work out. They also served as a history of the project’s evolution. You can find all the historical plan files in the Cutlet repository. I’d read and review the generated plan file, go back and forth again with the LLM to make changes to it, and commit it when everything looked good. Finally, I’d fire up a Docker container, run Claude with all permissions—including access—and ask it to implement my plan. Comprehensive test suite . My project instructions told Claude to write tests and make sure they failed before writing any new code. Alongside, I asked it to run tests after making significant code changes or merging any branches. Armed with a constantly growing test suite, Claude was able to quickly identify and fix any regressions it introduced into the codebase. The tests also served as documentation and specification. Sample inputs and outputs . 
These were my integration tests. I added a number of example programs to the Cutlet repository—most of them written by Claude itself—that not only serve as documentation for humans, but also as an end-to-end test suite. The project instructions told Claude to run all of them and verify their output after every code change. Linters, formatters, and static analysis tools . Cutlet uses and to ensure a baseline of code quality. Just like with tests, the project instructions asked the LLM to run these tools after every major code change. I noticed that would often produce diagnostics that would force Claude to rewrite parts of the code. If I had access to some of the more expensive static analysis tools (such as Coverity ), I would have added them to my development process too. Memory safety tools . I asked Claude to create a target that rebuilt the entire project and test suite with ASan and UBSan enabled (with LSan riding along via ASan), then ran every test under the instrumented build. The project instructions included running this check at the end of implementing a plan. This caught memory errors—use-after-free, buffer overflows, undefined behavior—that neither the tests nor the linter could find. Running these tests took time and greatly slowed down the agent, but they caught even more issues than . Symbol indexes . The agent had access to and for navigating the source code. I don’t know how useful this was, because I rarely ever saw it use them. Most of the time it would just the code for symbols. I might remove this in the future. Runtime introspection tools . Early in the project, I asked Claude to give Cutlet the ability to dump the token stream, AST, and bytecode for any piece of code to the standard output before executing it. This allowed the agent to quickly figure out if it had introduced errors into any part of the execution pipeline without having to navigate the source code or drop into a debugger. Pipeline tracing . 
I asked Claude to write a Python script that fed a Cutlet program through the interpreter with debug flags to capture the full compilation pipeline : the token stream, the AST, and the bytecode disassembly. It then mapped each token type, AST node, and opcode back to the exact source locations in the parser, compiler, and VM where they were handled. When an agent needed to add a new language feature, it could run the tracer on an example of a similar existing feature to see precisely which files and functions to touch. I was very proud of this machinery, but I never saw Claude make much use of it either. Running with every possible permission . I wanted the agent to work autonomously and have access to every debugging tool it might want to use. To do this, I ran it inside a Docker container with enabled and full access. I believe this is the only practical way to use coding agents on large projects. Answering permissions prompts is cognitively taxing when you have five agents working in parallel, and restricting their ability to do whatever they want makes them less effective at their job. We will need to figure out all sorts of safety issues that arise when you give LLMs the ability to take full control of a system, but on this project, I was willing to accept the risks that come with YOLO mode.
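The "sample inputs and outputs" feedback loop described above can be sketched as a small script. This is only an illustration, not Cutlet's actual tooling: the function name `run_examples`, the `.cutlet` and `.expected` file suffixes, and the directory layout are all hypothetical. The idea is simply to run every example program and compare its standard output with a stored expectation, so the agent gets a pass/fail signal from the command line.

```python
import subprocess
from pathlib import Path


def run_examples(interpreter, examples_dir, suffix=".cutlet"):
    """Run every example program and compare stdout with its .expected file.

    `interpreter` is a command list such as ["./cutlet"] (a hypothetical path).
    Returns a list of (program name, reason) tuples, one per failure.
    """
    failures = []
    for program in sorted(Path(examples_dir).glob(f"*{suffix}")):
        expected_file = program.with_suffix(".expected")
        if not expected_file.exists():
            failures.append((program.name, "missing .expected file"))
            continue
        result = subprocess.run(
            [*interpreter, str(program)],
            capture_output=True,
            text=True,
            timeout=30,  # a hung example raises TimeoutExpired instead of blocking
        )
        if result.returncode != 0:
            failures.append((program.name, f"exit code {result.returncode}"))
        elif result.stdout != expected_file.read_text():
            failures.append((program.name, "output mismatch"))
    return failures
```

An agent (or a human) could call `run_examples(["./cutlet"], "examples")` after each change and treat an empty list as success. Comparing exact text keeps the check deterministic, which is what makes it usable as an unattended feedback loop.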

AI Lua

Open Source

Programming

JavaScript


Ankur Sethi 5 months ago

I used a local LLM to analyze my journal entries

In 2025, I wrote 162 journal entries totaling 193,761 words. In December, as the year came to a close and I found myself in a reflective mood, I wondered if I could use an LLM to comb through these entries and extract useful insights. I’d had good luck extracting structured data from web pages using Claude, so I knew this was a task LLMs were good at. But there was a problem: I write about sensitive topics in my journal entries, and I don’t want to share them with the big LLM providers. Most of them have at least a thirty-day data retention policy, even if you call their models using their APIs, and that makes me uncomfortable. Worse, all of them have safety and abuse detection systems that get triggered if you talk about certain mental health issues. This can lead to account bans or human review of your conversations. I didn’t want my account to get banned, and the very idea of a stranger across the world reading my journal mortifies me. So I decided to use a local LLM running on my MacBook for this experiment. Writing the code was surprisingly easy. It took me a few evenings of work—and a lot of yelling at Claude Code—to build a pipeline of Python scripts that would extract structured JSON from my journal entries. I then turned that data into boring-but-serviceable visualizations. This was a fun side-project, but the data I extracted didn’t quite lead me to any new insights. That’s why I consider this a failed experiment. The output of my pipeline only confirmed what I already knew about my year. Besides, I didn’t have the hardware to run the larger models, so some of the more interesting analyses I wanted to run were plagued with hallucinations. Despite how it turned out, I’m writing about this experiment because I want to try it again in December 2026. I’m hoping I won’t repeat my mistakes again. Selfishly, I’m also hoping that somebody who knows how to use LLMs for data extraction tasks will find this article and suggest improvements to my workflow. 
I’ve pushed my data extraction and visualization scripts to GitHub. It’s mostly LLM-generated slop, but it works. The most interesting and useful parts are probably the prompts . Now let’s look at some graphs. I ran 12 different analyses on my journal, but I’m only including the output from 6 of them here. Most of the others produced nonsensical results or were difficult to visualize. For privacy, I’m not using any real names in these graphs. Here’s how I divided time between my hobbies through the year: Here are my most mentioned hobbies: This one is media I engaged with. There isn’t a lot of data for this one: How many mental health issues I complained about each day across the year: How many physical health issues I complained about each day across the year: The big events of 2025: The communities I spent most of my time with: Top mentioned people throughout the year: I ran all these analyses on my MacBook Pro with an M4 Pro and 48GB RAM. This hardware can just barely manage to run some of the more useful open-weights models, as long as I don’t run anything else. For running the models, I used Apple’s package . Picking a model took me longer than putting together the data extraction scripts. People on /r/LocalLlama had a lot of strong opinions, but there was no clear “best” model when I ran this experiment. I just had to try out a bunch of them and evaluate their outputs myself. If I had more time and faster hardware, I might have looked into building a small-scale LLM eval for this task. But for this scenario, I picked a few popular models, ran them on a subset of my journal entries, and picked one based on vibes. This project finally gave me an excuse to learn all the technical terms around LLMs. What’s quantization ? What does the number of parameters do? What does it mean when a model has , , , or in its name? What is a reasoning model ? What’s MoE ? What are active parameters? This was fun, even if my knowledge will be obsolete in six months. 
In the beginning, I ran all my scripts with Qwen 2.5 Instruct 32b at 8-bit quantization as the model. This fit in my RAM with just enough room left over for a browser, text editor, and terminal. But Qwen 2.5 didn’t produce the best output and hallucinated quite a bit, so I ran my final analyses using Llama-3.3 70B Instruct at 3bit quantization. This could just about fit in my RAM if I quit every other app and increased the amount of GPU RAM a process was allowed to use . While quickly iterating on my Python code, I used a tiny model: Qwen 3 4b Instruct quantized to 4bits. A major reason this experiment didn’t yield useful insights was that I didn’t know what questions to ask the LLM. I couldn’t do a qualitative analysis of my writing—the kind of analysis a therapist might be able to do—because I’m not a trained psychologist. Even if I could figure out the right prompts, I wouldn’t want to do this kind of work with an LLM. The potential for harm is too great, and the cost of mistakes is too high. With a few exceptions, I limited myself to extracting quantitative data only. From each journal entry, I extracted the following information: None of the models was as accurate as I had hoped at extracting this data. In many cases, I noticed hallucinations and examples from my system prompt leaking into the output, which I had to clean up afterwards. Qwen 2.5 was particularly susceptible to this. Some of the analyses (e.g. list of new people I met) produced nonsensical results, but that wasn’t really the fault of the models. They were all operating on a single journal entry at a time, so they had no sense of the larger context of my life. I couldn’t run all my journal entries through the LLM at once. I didn’t have that kind of RAM and the models didn’t have that kind of context window. I had to run the analysis one journal entry at a time. 
Even then, my computer choked on some of the larger entries, and I had to write my scripts in a way that I could run partial analyses or continue failed analyses. Trying to extract all the information listed above in one pass produced low-quality output. I had to split my analysis into multiple prompts and run them one at a time. Surprisingly, none of the models I tried had an issue with the instruction . Even the really tiny models had no problems following the instruction. Some of them occasionally threw in a Markdown fenced code block, but it was easy enough to strip using a regex. My prompts were divided into two parts: The task-specific prompts included detailed instructions and examples that made the structure of the JSON output clear. Every model followed the JSON schema mentioned in the prompt, and I rarely ever ran into JSON parsing issues. But the one issue I never managed to fix was the examples from the prompts leaking into the extracted output. Every model insisted that I had “dinner with Sarah” several times last year, even though I don’t know anybody by that name. This name came from an example that formed part of one of my prompts. I just had to make sure the examples I used stood out—e.g., using names of people I didn’t know at all or movies I hadn’t watched—so I could filter them out using plain old Python code afterwards. Here’s what my prompt looked like: To this prompt, I appended task-specific prompts. Here’s the prompt for extracting health issues mentioned in an entry: You can find all the prompts in the GitHub repository . The collected output from all the entries looked something like this: Since my model could only look at one journal entry at a time, it would sometimes refer to the same health issue, gratitude item, location, or travel destination using different synonyms. For example, “exhaustion” and “fatigue” should refer to the same health issue, but they would appear in the output as two different issues. 
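The two cleanup steps mentioned above (stripping an occasional Markdown fence with a regex, and filtering out prompt examples that leaked into the output) might look roughly like this. It is a minimal sketch, not the code from the repository: the function names, the `SENTINELS` set, and the exact regex are assumptions made for illustration.

```python
import json
import re

# Matches a reply that is entirely wrapped in one Markdown code fence,
# with an optional language tag (e.g. "json") after the opening fence.
FENCE_RE = re.compile(r"^\s*```[a-zA-Z]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)

# Hypothetical sentinel values: names used only in prompt examples,
# chosen so they cannot collide with anything in the real journal.
SENTINELS = {"sarah"}


def parse_model_json(reply):
    """Strip a surrounding Markdown fence, if present, then parse the JSON."""
    match = FENCE_RE.match(reply)
    if match:
        reply = match.group(1)
    return json.loads(reply)


def drop_leaked_examples(items, sentinels=SENTINELS):
    """Remove extracted items that mention any sentinel from the prompt."""
    return [
        item for item in items
        if not any(sentinel in item.lower() for sentinel in sentinels)
    ]
```

The filter is a plain substring match, so it only works if the example names really do stand out. A sentinel that is also a common word or a real acquaintance's name would remove legitimate data.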
My first attempt at de-duplicating these synonyms was to keep a running tally of unique terms discovered during each analysis and append them to the end of the prompt for each subsequent entry. Something like this: But this quickly led to some really strange hallucinations. I still don’t understand why. This list of terms wasn’t even that long, maybe 15-20 unique terms for each analysis. My second attempt at solving this was a separate normalization pass for each analysis. After an analysis finished running, I extracted a unique list of terms from its output file and collected them into a prompt. Then asked the LLM to produce a mapping to de-duplicate the terms. This is what the prompt looked like: There were better ways to do this than using an LLM. But you know what happens when all you have is a hammer? Yep, exactly. The normalization step was inefficient, but it did its job. This was the last piece of the puzzle. With all the extraction scripts and their normalization passes working correctly, I left my MacBook running the pipeline of scripts all day. I’ve never seen an M-series MacBook get this hot. I was worried that I’d damage my hardware somehow, but it all worked out fine. There was nothing special about this step. I just decided on a list of visualizations for the data I’d extracted, then asked Claude to write some code to generate them for me. Tweak, rinse, repeat until done. I’m underwhelmed by the results of this experiment. I didn’t quite learn anything new or interesting from the output, at least nothing I didn’t already know. This was only partly because of LLM limitations. I believe I didn’t quite know what questions to ask in the first place. What was I hoping to discover? What kinds of patterns was I looking for? What was the goal of the experiment besides producing pretty graphs? 
I went into the project with a cool new piece of tech to try out, but skipped the important up-front human-powered thinking work required to extract good insights from data. I neglected to sit down and design a set of initial questions I wanted to answer and assumptions I wanted to test before writing the code. Just goes to show that no amount of generative AI magic will produce good results unless you can define what success looks like. Maybe this year I’ll learn more about data analysis and visualization and run this experiment again in December to see if I can go any further. I did learn one thing from all of this: if you have access to state-of-the-art language models and know the right set of questions to ask, you can process your unstructured data to find needles in some truly massive haystacks. This allows you to analyze datasets that would take human reviewers months to comb through. A great example is how the NYT monitors hundreds of podcasts every day using LLMs. For now, I’m putting a pin in this experiment. Let’s try again in December.

Information extracted from each journal entry:
List of things I was grateful for, if any
List of hobbies or side-projects mentioned
List of locations mentioned
List of media mentioned (including books, movies, games, or music)
A boolean answer to whether it was a good or bad day for my mental health
List of mental health issues mentioned, if any
A boolean answer to whether it was a good or bad day for my physical health
List of physical health issues mentioned, if any
List of things I was proud of, if any
List of social activities mentioned
Travel destinations mentioned, if any
List of friends, family members, or acquaintances mentioned
List of new people I met that day, if any

The two parts of my prompts:
A “core” prompt that was common across analyses
Task-specific prompts for each analysis
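Once the normalization pass has produced a synonym mapping, applying it is ordinary Python. The sketch below assumes the mapping is a flat dictionary from variant term to canonical term. The original prompt and output format are not shown here, so that shape and the function names are hypothetical.

```python
from collections import Counter


def apply_normalization(terms, mapping):
    """Replace each term with its canonical form; unmapped terms pass through."""
    cleaned = (term.strip().lower() for term in terms)
    return [mapping.get(term, term) for term in cleaned]


def tally(entries, mapping):
    """Count how many entries mention each canonical term."""
    counts = Counter()
    for terms in entries:
        # set() ensures a term is counted at most once per journal entry,
        # even if two of its synonyms appear in the same entry.
        counts.update(set(apply_normalization(terms, mapping)))
    return counts
```

For example, with `{"exhaustion": "fatigue"}` as the mapping, an entry that mentions "exhaustion" and another that mentions "fatigue" both count toward the single issue "fatigue".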

JSON

Python

Data Analysis


Ankur Sethi 5 months ago

The only correct recipe for making chai

All my friends have their own personal recipes for making chai. I love my friends, so it hurts me to say that they’re wrong. My friends are, unfortunately, wrong about chai. I’m still coming to terms with this upsetting fact, but I’ll live. What follows is the only correct recipe for making chai. The only correct choice of tea leaves is Tata Tea Gold. Keep it in an airtight jar. Shake it up a bit so there’s an even mix of smaller grains and whole tea leaves. The smaller grains make for a stronger chai and they tend to settle at the bottom, so take that into account when measuring. You need full-cream milk for this recipe. Amul Gold is a good choice. I buy the tetrapacks because they survive in the fridge for longer, but the plastic bags work as well. According to the pack, Amul Gold has 6% fat. If you can’t find Amul Gold, try to find an equivalent milk. For a basic chai, you only need tea leaves, water, sugar, and milk. But we don’t want to make a basic chai, do we? No. So we’re going to add some elaichi (green cardamom) and saunf (fennel). Try to find fresh spices, if you can. I don’t have recommendations for specific brands here because most of them are fine. I learned the hard way that you get two kinds of saunf in the supermarket: green and brown. Green saunf tastes sweet and fresh, almost like a dessert. The brown saunf has a stronger flavor but is also bitter. We want the green saunf. Sometimes you find old elaichi at the store that’s gone a bit brown. Don’t buy that. Your elaichi should be green in color, just like the saunf. This recipe makes three cups of chai. Why three? Because that’s how much chai I drink every day. You can adjust this recipe to make more or fewer cups, as long as you keep all the ratios the same. Dig out your mortar and pestle from the drawer it has been languishing in. Add six pods of elaichi—two for each cup. Add half a tablespoon of saunf. You can use a bit more of both these spices if you want a more flavorful chai. 
Grind the spices into a semi-powdery mix. You don’t have to turn it into a fine powder, just grind them enough so that the flavors come through. Put two cups of water in a saucepan and add the spice mix. Put it on a high flame until boiling. When the water is boiling, reduce the flame to medium. Add three dessert spoons full of tea leaves to the boiling water. A dessert spoon is slightly smaller than a tablespoon. If all you have is a tablespoon, try about 3/4 tablespoons of tea leaves for each cup. Then add the same amount of sugar. You can adjust the amount of sugar based on how sweet you want your chai, but if you don’t add enough sugar the flavors won’t come through. Allow the mixture to boil on the stove for about 3-4 minutes. Then add a cup of milk. At this stage you should add a tiny bit of extra milk to account for the water evaporating, otherwise you won’t have three full cups of chai. About 1/5 of a cup should be enough, but I’ve been known to add a bit more to make the chai richer. Stir the mixture a bit to ensure everything is properly mixed together, then allow it to sit on the stove until the milk boils over. This next step is crucial. It will make or break your chai. I swear it’s not superstition. When the milk boils over, turn the stove to simmer. Allow it to settle back down into the pan. Then turn it up to medium heat again until it boils over once more. Repeat one more time. The milk should boil over and settle down three times total. Your chai is ready! Use a strainer to strain it into cups and enjoy. Should you eat a Parle-G with your chai? Maybe a Rusk? I have strong opinions on this matter but I’m running out of time, so I’ll leave that decision up to you.

Tutorial

Culture

Entertainment


Ankur Sethi 5 months ago

Write quickly, edit lightly, prefer rewrites, publish with flaws

Over two years of consistent writing and publishing, I’ve internalized a few lessons for producing satisfying—if not necessarily “good”—work: I covered similar ground previously in Writing without a plan . This post builds on the same idea. If I want to see the shape of the idea I’m trying to communicate in my writing, I must get it down on paper as quickly as possible. This is similar to how painters lay down underdrawings on canvas before applying paint. I can’t judge the quality of my idea unless I finish this underdrawing. Without this basic sketch to guide me, I might end up writing the wrong thing altogether. More than once, I’ve slaved away at a long blog post for days, only to realize that my core thesis was bunk. Writing quickly allows me to see the idea in its entirety before I waste time and energy refining it. How do I define quickly ? For blog posts like this one, I try to produce a first draft in about 45 minutes. For longer pieces, I take about the same time but work in broad strokes and make heavy use of placeholders. It’s easy to edit the life and vitality out of a piece by over-editing it. I’ve done it many times. I’m prone to spending hours upon hours polishing the same few paragraphs in a work, complicating my sentences by attaching a hundred sub-clauses, burying important ideas under mountains of caveats, turning direct writing into purple prose, and inflating my word counts to planetary proportions. Light edits to a first draft improve my writing. If I keep going, I reach a point of diminishing returns where every new edit feels like busywork. And then, if I keep going some more, I start making the writing worse rather than better. Spending too much time editing puts me in a mental state that’s similar to semantic satiation , but at the scale of a full essay or story. The words in front of my eyes begin to lose their meaning, ideas become muddled, and I can no longer tell if anything I’ve written makes sense at all. 
At that point, I have no choice but to walk away from the work and come back to it another day. It’s no fun. I try to spend a little more time editing than I do writing, but only a little. I’ve learned to recognize that if editing a draft takes me significantly longer than it took me to write it, there’s probably something wrong with the piece. If editing takes too long, it’s better to throw it away and redo from start . If it’s taking too long to edit, rewrite. By writing quickly, I’ve convinced my brain that rewriting something wholesale is cheap and easy. It’s profitable and practical for me to write out a single idea multiple times, exploring it from different angles, finding new insight and depth every time I take a fresh stab at it. If writing a first draft takes 45 minutes, making multiple attempts at the same idea is no big deal. If it takes four hours, I’m more likely to go with my first attempt. Spending too much time on first drafts is a good way for me to get married to bad ideas. I wrote this very blog post three times because I couldn’t quite capture what I wanted to say in the first two drafts. The content of the post changed entirely with every new attempt, but the core ideas remained the same. No piece of writing is ever perfect. If I keep looking, I can find flaws in every single piece of writing I’ve ever published. I find it a waste of time to keep refining my work once it reaches the good enough stage. If I’ve communicated my ideas clearly and haven’t misrepresented any facts, I can allow a few clumsy sentences or a bad opening paragraph to slide. Even as I publish imperfect work, I try to look back at my past writing, notice the mistakes I keep repeating, and try to do better next time. I find that publishing a lot of bad work and learning from each mistake is a better way to learn and grow compared to writing a small number of “perfect” pieces. 
By working quickly, I’ve been able to produce a lot of bad-to-mediocre writing, but I feel satisfied. As I keep saying, finding joy in the work I do is more important to me than producing something extraordinary. I’d rather write a hundred bad essays with gleeful abandon than slave over a single perfect manuscript. There’s joy in finishing something, closing the book on it, calling it a day, and moving on. There’s joy in trying out different styles, voices, subjects, ideas, personalities. There’s joy in knowing that there will always be a next thing to write, and the next, and the next. When I’m stuck writing something that’s not fun to work on, I find a certain consolation in knowing that I’ll be done soon. That my sloppy writing process means I’m allowed to finish my piece quickly, put it out into the world, and move on to something more enjoyable. Now you’ve reached the end of this post, and I don’t quite know how to leave you with a solid kicker. Instead of doing a good job, I’ll end with this Ray Bradbury quote that I copied off somebody’s blog:

Don’t think. Thinking is the enemy of creativity. It’s self-conscious and anything self-conscious is lousy. You can’t “try” to do things. You simply “must” do things.

Perfect. I’ve never liked thinking anyway.

The lessons referenced at the top of this post:
Write quickly
Edit lightly
Prefer rewriting to editing
Publish with flaws

Writing

Career


Ankur Sethi 5 months ago

Generative AI and the era of increased gatekeeping

Generative AI models can create text, images, code, and music faster and in larger quantities than we can absorb them. Before ChatGPT was introduced to the world in November 2022, producing a piece of media took longer than consuming it. In 2026, the equation has been turned on its head. If your job, thus far, involved curating or evaluating the work of other humans, this is a problem. Generative AI is bad news for teachers and professors, editors at magazines and publishing companies, maintainers of open-source projects, academics doing peer-review or replicating studies, and anyone else who must review the work of their peers in order to give them feedback or as quality control. If you have a job that fits this description, then you’ve probably been inundated with a deluge of low-quality AI-generated content in the past few months. It only takes a few minutes for somebody to “write” a short story using an LLM; it takes a human hours to read and evaluate it. In response to the increasing burden on curators, organizations are tightening the rules around how they handle submissions. Some are taking the moderate stance of asking that AI-generated submissions be identified and cleaned up before submission, but many are banning outside contributions altogether. For example: Other organizations are placing strict restrictions on the number of submissions and making submission rules more stringent: This is a net negative for society. Organizations lose out on potentially good contributions, people early in their careers lose out on a chance to get feedback from experienced professionals, and the rest of us lose because fewer good works make their way into publications and the commons. I see three possible futures ahead of us. First: the novelty of using ChatGPT to produce work and throw it over the wall without reading it wears off. It becomes a social faux pas to submit AI-generated work for publication without extensively vetting and editing it. 
Enough people are named and shamed that new social norms around the use of generative AI emerge. Our societies adapt so that putting your name on a work without verifying its quality is an act that destroys your reputation. Second: we come up with methods to prove that you have in fact done the work you claim to have done. Like proof of work in cryptography 1 , but for humans. Submitting anything without proof of work becomes an automatic rejection. I can’t imagine what this would look like, though. More importantly, I can’t imagine that we will collectively agree to put ourselves through the indignity of being judged by an algorithm. But hey points to everything look at the world we’ve made. Society has a high tolerance for algorithmically inflicted indignities. Third: we enter a new era of gatekeeping, in which most of us can no longer fix a bug in our favorite open-source projects, submit stories to literary magazines, apply for public job postings, or get peer-review on our papers. Unless you’re a well-known name, or you know somebody who knows somebody, or you can get somebody to vouch for the veracity of your work, you’re considered a nonentity. An era of eroding trust, where anything created by a stranger you don’t personally know is considered suspect. An era of increased gatekeeping that only allows some of us to publish, and the rest of us perish. Personally, I think we’ll land on a combination of the three possible outcomes. Some organizations will name and shame, some will ask for proof of work, and yet others will step up their gatekeeping. And who knows, there’s probably a secret fourth option that I haven’t thought of. I’ve never been great at predicting the future. That said, I remain optimistic 2 about our ability to handle this situation. I believe people are generally nice and just want to help, even the ones sending 5,000 line vibecoded pull requests to open-source projects. 
Our societies are still adjusting to a strange new technology, and the social norms around its use have not been written yet. Until we collectively figure out how to behave reasonably, we might see slightly increased gatekeeping, but my hunch is that it’ll be temporary. I believe we’ll eventually get to a point where we all learn to be editors and reviewers and slush-pile readers of our own AI-generated work. That’s an interesting future to consider: one in which generative AI has turned us all into more discerning readers. Cryptography, not cryptocurrency. The crypto-bros have given perfectly reasonable mathematical techniques a bad name, so I feel it’s important to mention that here. ↩ Sloptimistic? Ha. ↩ arXiv Changes Rules After Getting Spammed With AI-Generated ‘Research’ Papers Curl ending bug bounty program after flood of AI slop reports Sci-fi magazine ‘Clarkesworld’ stops submissions after a rush of AI-made stories : NPR (although this was a temporary measure) Flood of AI-Written Fiction Shuts Down Clarkesworld Submissions – Black Gate (this, too, was a temporary measure) CVPR 2025 Changes ICLR 2026: Submissions, LLM Disclosures, and the Peer Review Shuffle Fearful of AI-generated grant proposals, NIH limits scientists to six applications per year Cryptography, not cryptocurrency. The crypto-bros have given perfectly reasonable mathematical techniques a bad name, so I feel it’s important to mention that here. ↩ Sloptimistic? Ha. ↩

Career

0 views

Ankur Sethi 6 months ago

Pushing the smallest possible change to production

I wrote this post as an exercise during a meeting of IndieWebClub Bangalore.

During my first week of work with a new client, I like to push a very small, almost-insignificant change into production. Something that makes zero difference to the org’s product, but allows me to learn how things are done in the new environment I’m going to be working in.

If my client already has a working webapp, this change could be as simple as fixing a typo. If they don’t, I might build a tiny “Hello world!” app using a framework of their choice and make it available at a URL that’s accessible to everyone within the company (or at least everyone who is involved in the project I’m working on).

This exercise helps me figure out everything I need to navigate the workplace and be productive within its constraints. It’s better than any amount of documentation, meetings, or one-on-ones. It answers questions such as:

- Where is the source code hosted? How do I get access to it? Who will give me access?
- How do I build and run the software on my dev machine? Is there documentation? Is there somebody who can guide me through the process?
- What does the version control strategy look like? What workflows am I expected to follow? Are there special conventions for naming branches?
- Does the codebase have automated tests? Is there a CI server?
- What’s the process for getting a change merged? Should I open a PR and wait for a code review? How long do code reviews typically take? Who reviews my code?
- Is there a staging server? When does staging get merged into production?
- How can I provision new servers if I need them? Who will help me do that?
- Are there any third-party services in play? What providers does the org use for auth, CDN, media transformation, LLMs? How do I get access to these services? Alternatively, how do I mock them in development?

Doing this work after I’ve already spent weeks building out features is frustrating. When I’m in the middle of solving a problem, I want to iterate fast and get my work in front of users and managers as quickly as possible. I like to go into meetings and stand-up calls with working prototypes that people can play with on their own computers, not with vague promises of code that kinda-sorta works on my own machine.

This work also brings me in contact with a variety of people from across the organization, which is always helpful. I like being able to reach out to my co-workers when I’m stuck. As an independent contractor, I can only do that if I put in the effort to build relationships with my team. I also like to have a sense of camaraderie with my co-workers. I want to see them as more than just names on a Slack channel, which is only possible if I actually talk to them.

Pushing the smallest possible change into production helps me do all this and sets the tone for a fruitful working relationship. Plus, it’s always satisfying to end your first week of work at a new workplace with something tangible to show for it.
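
As a rough illustration of how small the “Hello world!” app mentioned earlier can be, here is a minimal sketch. It assumes Python's standard-library `http.server` purely for the sake of example; the post doesn't name a framework, and in practice the client's stack would dictate the choice. The names `HelloHandler` and `make_server`, the host, and the port are all hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET request with a fixed plain-text greeting."""

    def do_GET(self):
        body = b"Hello world!\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging; a real deployment would keep logs.
        pass


def make_server(host="127.0.0.1", port=8000):
    """Build (but do not start) the server. Port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), HelloHandler)


if __name__ == "__main__":
    server = make_server()
    host, port = server.server_address
    print(f"Serving on http://{host}:{port}")
    server.serve_forever()
```

The code itself is beside the point. Getting even this much checked in, reviewed, built, and reachable at an internal URL is what forces answers to the access, CI, review, and deployment questions.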

DevOps

Web Development

Career

2 views