Posts in Bash (20 found)
Grumpy Gamer 1 week ago

This Time For Sure

I think 2026 is the year of Linux for me. I know I’ve said this before, but it feels like Apple has lost its way. Liquid Glass is the last straw, plus their draconian desire to lock everything down gives me moral pause. It is only a matter of time before we can’t run software on the Mac that wasn’t purchased from the App Store. I use Linux on my servers so I am comfortable using it, just not in a desktop environment. Some things I worry about:

A really good C++ IDE. I get a lot of advice for C++ IDEs from people who only use them now and then or just to compile, but don’t live in them all day and need to visually step into code and even ASM. I worry about CLion but am willing to give it a good try. Please don’t suggest an IDE unless you use it for hardcore C++ debugging.

I will still make Mac versions of my games and code signing might be a problem. I’ll have to look, but I don’t think you can do it without a Mac. I can’t do that on a CI machine because for my pipeline the CI machine only compiles the code. The .app is built locally and that is where the code signing happens. I don’t want to spin up a CI machine to make changes when the engine didn’t change. My build pipeline is a running bash script; I don’t want to be hopping between machines just to do a build (which I can do 3 or 4 times a day).

The only monitor I have is a Mac Studio monitor. I assume I can plug a Linux machine into it, but I worry about the webcam. It wouldn’t surprise me if Apple made it Mac only.

The only keyboard I have is a Mac keyboard. I really like the keyboard, especially how I can unlock the computer with the touch of my finger. I assume something like this exists for Linux.

I have an iPhone, but I only connect it to the computer to charge it. So not an issue.

I worry about drivers for sound, video, webcams, controllers, etc. I know this is all solvable but I’m not looking forward to it. I know from releasing games on Linux that our number-one complaint is related to drivers.

Choosing a distro. Why is this so hard? A lot of people have said that it doesn’t really matter, so just choose one. Why don’t more people use Linux on the desktop? This is why. To a Linux desktop newbie, this is paralyzing.

I’m going to miss Time Machine for local backups. Maybe there is something like it for Linux.

I really like the Apple M processors. I might be able to install Linux on Mac hardware, but then I really worry about drivers. I just watched this video from Veronica Explains on installing Linux on Apple silicon.

The big, big worry is that there is something big I forgot. I need this to work for my game dev. It’s not a weekend hobby computer.

I’ve said I was switching to Linux before; we’ll see if it sticks this time. I have a Linux laptop, but when I moved I didn’t turn it on for over a year and now I get BIOS errors when I boot. Some battery probably went dead. I’ve played with it a bit and nothing seems to work. It was an old laptop and I’ll need a new, faster one for game dev anyway.

This will be a long, well-thought-out journey. Stay tuned for the “2027 - This Time For Sure” post.

1 view
Simon Willison 2 weeks ago

2025: The year in LLMs

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about AI in 2023 and Things we learned about LLMs in 2024 . It’s been a year filled with a lot of different trends. OpenAI kicked off the "reasoning" aka inference-scaling aka Reinforcement Learning from Verifiable Rewards (RLVR) revolution in September 2024 with o1 and o1-mini . They doubled down on that with o3, o3-mini and o4-mini in the opening months of 2025 and reasoning has since become a signature feature of models from nearly every other major AI lab. My favourite explanation of the significance of this trick comes from Andrej Karpathy : By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples). [...] Running RLVR turned out to offer high capability/$, which gobbled up the compute that was originally intended for pretraining. Therefore, most of the capability progress of 2025 was defined by the LLM labs chewing through the overhang of this new stage and overall we saw ~similar sized LLMs but a lot longer RL runs. Every notable AI lab released at least one reasoning model in 2025. Some labs released hybrids that could be run in reasoning or non-reasoning modes. Many API models now include dials for increasing or decreasing the amount of reasoning applied to a given prompt. It took me a while to understand what reasoning was useful for. Initial demos showed it solving mathematical logic puzzles and counting the Rs in strawberry - two things I didn't find myself needing in my day-to-day model usage. It turned out that the real unlock of reasoning was in driving tools. Reasoning models with access to tools can plan out multi-step tasks, execute on them and continue to reason about the results such that they can update their plans to better achieve the desired goal. A notable result is that AI assisted search actually works now . Hooking up search engines to LLMs had questionable results before, but now I find even my more complex research questions can often be answered by GPT-5 Thinking in ChatGPT . Reasoning models are also exceptional at producing and debugging code. The reasoning trick means they can start with an error and step through many different layers of the codebase to find the root cause. I've found even the gnarliest of bugs can be diagnosed by a good reasoner with the ability to read and execute code against even large and complex codebases. Combine reasoning with tool-use and you get... I started the year making a prediction that agents were not going to happen . Throughout 2024 everyone was talking about agents but there were few to no examples of them working, further confused by the fact that everyone using the term “agent” appeared to be working from a slightly different definition from everyone else. By September I’d got fed up of avoiding the term myself due to the lack of a clear definition and decided to treat them as an LLM that runs tools in a loop to achieve a goal . This unblocked me for having productive conversations about them, always my goal for any piece of terminology like that. 
I didn’t think agents would happen because I didn’t think the gullibility problem could be solved, and I thought the idea of replacing human staff members with LLMs was still laughable science fiction. I was half right in my prediction: the science fiction version of a magic computer assistant that does anything you ask of it (Her) didn’t materialize... But if you define agents as LLM systems that can perform useful work via tool calls over multiple steps then agents are here and they are proving to be extraordinarily useful. The two breakout categories for agents have been for coding and for search. The Deep Research pattern - where you challenge an LLM to gather information and it churns away for 15+ minutes building you a detailed report - was popular in the first half of the year but has fallen out of fashion now that GPT-5 Thinking (and Google's "AI mode", a significantly better product than their terrible "AI overviews") can produce comparable results in a fraction of the time. I consider this to be an agent pattern, and one that works really well. The "coding agents" pattern is a much bigger deal. The most impactful event of 2025 happened in February, with the quiet release of Claude Code. I say quiet because it didn’t even get its own blog post! Anthropic bundled the Claude Code release in as the second item in their post announcing Claude 3.7 Sonnet. (Why did Anthropic jump from Claude 3.5 Sonnet to 3.7? Because they released a major bump to Claude 3.5 in October 2024 but kept the name exactly the same, causing the developer community to start referring to the un-named 3.5 Sonnet v2 as 3.6. Anthropic burned a whole version number by failing to properly name their new model!) Claude Code is the most prominent example of what I call coding agents - LLM systems that can write code, execute that code, inspect the results and then iterate further. The major labs all put out their own CLI coding agents in 2025. Vendor-independent options include GitHub Copilot CLI, Amp, OpenCode, OpenHands CLI, and Pi. IDEs such as Zed, VS Code and Cursor invested a lot of effort in coding agent integration as well. My first exposure to the coding agent pattern was OpenAI's ChatGPT Code Interpreter in early 2023 - a system baked into ChatGPT that allowed it to run Python code in a Kubernetes sandbox. I was delighted this year when Anthropic finally released their equivalent in September, albeit under the baffling initial name of "Create and edit files with Claude". In October they repurposed that container sandbox infrastructure to launch Claude Code for web, which I've been using on an almost daily basis ever since. Claude Code for web is what I call an asynchronous coding agent - a system you can prompt and forget, and it will work away on the problem and file a Pull Request once it's done. OpenAI "Codex cloud" (renamed to "Codex web" in the last week) launched earlier, in May 2025. Gemini's entry in this category is called Jules, also launched in May. I love the asynchronous coding agent category. They're a great answer to the security challenges of running arbitrary code execution on a personal laptop and it's really fun being able to fire off multiple tasks at once - often from my phone - and get decent results a few minutes later. I wrote more about how I'm using these in Code research projects with async coding agents like Claude Code and Codex and Embracing the parallel coding agent lifestyle.
In 2024 I spent a lot of time hacking on my LLM command-line tool for accessing LLMs from the terminal, all the time thinking that it was weird that so few people were taking CLI access to models seriously - they felt like such a natural fit for Unix mechanisms like pipes. Maybe the terminal was just too weird and niche to ever become a mainstream tool for accessing LLMs? Claude Code and friends have conclusively demonstrated that developers will embrace LLMs on the command line, given powerful enough models and the right harness. It helps that terminal commands with obscure syntax are no longer a barrier to entry when an LLM can spit out the right command for you. As of December 2nd Anthropic credit Claude Code with $1bn in run-rate revenue! I did not expect a CLI tool to reach anything close to those numbers. With hindsight, maybe I should have promoted LLM from a side-project to a key focus! The default setting for most coding agents is to ask the user for confirmation for almost every action they take. In a world where an agent mistake could wipe your home folder or a malicious prompt injection attack could steal your credentials this default makes total sense. Anyone who's tried running their agent with automatic confirmation (aka YOLO mode, which Codex CLI even has an alias for) has experienced the trade-off: using an agent without the safety wheels feels like a completely different product. A big benefit of asynchronous coding agents like Claude Code for web and Codex Cloud is that they can run in YOLO mode by default, since there's no personal computer to damage. I run in YOLO mode all the time, despite being deeply aware of the risks involved. It hasn't burned me yet... ... and that's the problem. One of my favourite pieces on LLM security this year is The Normalization of Deviance in AI by security researcher Johann Rehberger. Johann describes the "Normalization of Deviance" phenomenon, where repeated exposure to risky behaviour without negative consequences leads people and organizations to accept that risky behaviour as normal. This was originally described by sociologist Diane Vaughan as part of her work to understand the 1986 Space Shuttle Challenger disaster, caused by a faulty O-ring that engineers had known about for years. Plenty of successful launches led NASA culture to stop taking that risk seriously. Johann argues that the longer we get away with running these systems in fundamentally insecure ways, the closer we are getting to a Challenger disaster of our own. ChatGPT Plus's original $20/month price turned out to be a snap decision by Nick Turley based on a Google Form poll on Discord. That price point has stuck firmly ever since. This year a new pricing precedent has emerged: the Claude Pro Max 20x plan, at $200/month. OpenAI have a similar $200 plan called ChatGPT Pro. Gemini have Google AI Ultra at $249/month with a $124.99/month 3-month starting discount. These plans appear to be driving some serious revenue, though none of the labs have shared figures that break down their subscribers by tier. I've personally paid $100/month for Claude in the past and will upgrade to the $200/month plan once my current batch of free allowance (from previewing one of their models - thanks, Anthropic) runs out. I've heard from plenty of other people who are happy to pay these prices too. You have to use models a lot in order to spend $200 of API credits, so you would think it would make economic sense for most people to pay by the token instead.
It turns out tools like Claude Code and Codex CLI can burn through enormous amounts of tokens once you start setting them more challenging tasks, to the point that $200/month offers a substantial discount. 2024 saw some early signs of life from the Chinese AI labs mainly in the form of Qwen 2.5 and early DeepSeek. They were neat models but didn't feel world-beating. This changed dramatically in 2025. My ai-in-china tag has 67 posts from 2025 alone, and I missed a bunch of key releases towards the end of the year (GLM-4.7 and MiniMax-M2.1 in particular). Here's the Artificial Analysis ranking for open weight models as of 30th December 2025: GLM-4.7, Kimi K2 Thinking, MiMo-V2-Flash, DeepSeek V3.2, MiniMax-M2.1 are all Chinese open weight models. The highest non-Chinese model in that chart is OpenAI's gpt-oss-120B (high), which comes in sixth place. The Chinese model revolution really kicked off on Christmas day 2024 with the release of DeepSeek 3, supposedly trained for around $5.5m. DeepSeek followed that on 20th January with DeepSeek R1 which promptly triggered a major AI/semiconductor selloff: NVIDIA lost ~$593bn in market cap as investors panicked that AI maybe wasn't an American monopoly after all. The panic didn't last - NVIDIA quickly recovered and today are up significantly from their pre-DeepSeek R1 levels. It was still a remarkable moment. Who knew an open weight model release could have that kind of impact? DeepSeek were quickly joined by an impressive roster of Chinese AI labs. I've been paying attention to these ones in particular:
Alibaba Qwen (Qwen3)
Moonshot AI (Kimi K2)
Z.ai (GLM-4.5/4.6/4.7)
MiniMax (M2)
MetaStone AI (XBai o4)
Most of these models aren't just open weight, they are fully open source under OSI-approved licenses: Qwen use Apache 2.0 for most of their models, DeepSeek and Z.ai use MIT. Some of them are competitive with Claude 4 Sonnet and GPT-5! Sadly none of the Chinese labs have released their full training data or the code they used to train their models, but they have been putting out detailed research papers that have helped push forward the state of the art, especially when it comes to efficient training and inference. One of the most interesting recent charts about LLMs is Time-horizon of software engineering tasks different LLMs can complete 50% of the time from METR. The chart shows tasks that take humans up to 5 hours, and plots the evolution of models that can achieve the same goals working independently. As you can see, 2025 saw some enormous leaps forward here with GPT-5, GPT-5.1 Codex Max and Claude Opus 4.5 able to perform tasks that take humans multiple hours - 2024’s best models tapped out at under 30 minutes. METR conclude that “the length of tasks AI can do is doubling every 7 months”. I'm not convinced that pattern will continue to hold, but it's an eye-catching way of illustrating current trends in agent capabilities. The most successful consumer product launch of all time happened in March, and the product didn't even have a name. One of the signature features of GPT-4o in May 2024 was meant to be its multimodal output - the "o" stood for "omni" and OpenAI's launch announcement included numerous "coming soon" features where the model output images in addition to text. Then... nothing. The image output feature failed to materialize. In March we finally got to see what this could do - albeit in a shape that felt more like the existing DALL-E. OpenAI made this new image generation available in ChatGPT with the key feature that you could upload your own images and use prompts to tell it how to modify them.
This new feature was responsible for 100 million ChatGPT signups in a week. At peak they saw 1 million account creations in a single hour! Tricks like "ghiblification" - modifying a photo to look like a frame from a Studio Ghibli movie - went viral time and time again. OpenAI released an API version of the model called "gpt-image-1", later joined by a cheaper gpt-image-1-mini in October and a much improved gpt-image-1.5 on December 16th. The most notable open weight competitor to this came from Qwen with their Qwen-Image generation model on August 4th followed by Qwen-Image-Edit on August 19th. This one can run on (well-equipped) consumer hardware! They followed with Qwen-Image-Edit-2511 in November and Qwen-Image-2512 on 30th December, neither of which I've tried yet. The even bigger news in image generation came from Google with their Nano Banana models, available via Gemini. Google previewed an early version of this in March under the name "Gemini 2.0 Flash native image generation". The really good one landed on August 26th, where they started cautiously embracing the codename "Nano Banana" in public (the API model was called "Gemini 2.5 Flash Image"). Nano Banana caught people's attention because it could generate useful text! It was also clearly the best model at following image editing instructions. In November Google fully embraced the "Nano Banana" name with the release of Nano Banana Pro. This one doesn't just generate text, it can output genuinely useful detailed infographics and other text and information-heavy images. It's now a professional-grade tool. Max Woolf published the most comprehensive guide to Nano Banana prompting, and followed that up with an essential guide to Nano Banana Pro in December. I've mainly been using it to add kākāpō parrots to my photos. Given how incredibly popular these image tools are it's a little surprising that Anthropic haven't released or integrated anything similar into Claude. I see this as further evidence that they're focused on AI tools for professional work, but Nano Banana Pro is rapidly proving itself to be of value to anyone whose work involves creating presentations or other visual materials. In July reasoning models from both OpenAI and Google Gemini achieved gold medal performance in the International Math Olympiad, a prestigious mathematical competition held annually (bar 1980) since 1959. This was notable because the IMO poses challenges that are designed specifically for that competition. There's no chance any of these were already in the training data! It's also notable because neither of the models had access to tools - their solutions were generated purely from their internal knowledge and token-based reasoning capabilities. Turns out sufficiently advanced LLMs can do math after all! In September OpenAI and Gemini pulled off a similar feat for the International Collegiate Programming Contest (ICPC) - again notable for having novel, previously unpublished problems. This time the models had access to a code execution environment but otherwise no internet access. I don't believe the exact models used for these competitions have been released publicly, but Gemini's Deep Think and OpenAI's GPT-5 Pro should provide close approximations. With hindsight, 2024 was the year of Llama.
Meta's Llama models were by far the most popular open weight models - the original Llama kicked off the open weight revolution back in 2023 and the Llama 3 series, in particular the 3.1 and 3.2 dot-releases, were huge leaps forward in open weight capability. Llama 4 had high expectations, and when it landed in April it was... kind of disappointing. There was a minor scandal where the model tested on LMArena turned out not to be the model that was released, but my main complaint was that the models were too big . The neatest thing about previous Llama releases was that they often included sizes you could run on a laptop. The Llama 4 Scout and Maverick models were 109B and 400B, so big that even quantization wouldn't get them running on my 64GB Mac. They were trained using the 2T Llama 4 Behemoth which seems to have been forgotten now - it certainly wasn't released. It says a lot that none of the most popular models listed by LM Studio are from Meta, and the most popular on Ollama is still Llama 3.1, which is low on the charts there too. Meta's AI news this year mainly involved internal politics and vast amounts of money spent hiring talent for their new Superintelligence Labs . It's not clear if there are any future Llama releases in the pipeline or if they've moved away from open weight model releases to focus on other things. Last year OpenAI remained the undisputed leader in LLMs, especially given o1 and the preview of their o3 reasoning models. This year the rest of the industry caught up. OpenAI still have top tier models, but they're being challenged across the board. In image models they're still being beaten by Nano Banana Pro. For code a lot of developers rate Opus 4.5 very slightly ahead of GPT-5.2 Codex. In open weight models their gpt-oss models, while great, are falling behind the Chinese AI labs. Their lead in audio is under threat from the Gemini Live API . Where OpenAI are winning is in consumer mindshare. Nobody knows what an "LLM" is but almost everyone has heard of ChatGPT. Their consumer apps still dwarf Gemini and Claude in terms of user numbers. Their biggest risk here is Gemini. In December OpenAI declared a Code Red in response to Gemini 3, delaying work on new initiatives to focus on the competition with their key products. Google Gemini had a really good year . They posted their own victorious 2025 recap here . 2025 saw Gemini 2.0, Gemini 2.5 and then Gemini 3.0 - each model family supporting audio/video/image/text input of 1,000,000+ tokens, priced competitively and proving more capable than the last. They also shipped Gemini CLI (their open source command-line coding agent, since forked by Qwen for Qwen Code ), Jules (their asynchronous coding agent), constant improvements to AI Studio, the Nano Banana image models, Veo 3 for video generation, the promising Gemma 3 family of open weight models and a stream of smaller features. Google's biggest advantage lies under the hood. Almost every other AI lab trains with NVIDIA GPUs, which are sold at a margin that props up NVIDIA's multi-trillion dollar valuation. Google use their own in-house hardware, TPUs, which they've demonstrated this year work exceptionally well for both training and inference of their models. When your number one expense is time spent on GPUs, having a competitor with their own, optimized and presumably much cheaper hardware stack is a daunting prospect. 
It continues to tickle me that Google Gemini is the ultimate example of a product name that reflects the company's internal org-chart - it's called Gemini because it came out of the bringing together (as twins) of Google's DeepMind and Google Brain teams. I first asked an LLM to generate an SVG of a pelican riding a bicycle in October 2024, but 2025 is when I really leaned into it. It's ended up a meme in its own right. I originally intended it as a dumb joke. Bicycles are hard to draw, as are pelicans, and pelicans are the wrong shape to ride a bicycle. I was pretty sure there wouldn't be anything relevant in the training data, so asking a text-output model to generate an SVG illustration of one felt like a somewhat absurdly difficult challenge. To my surprise, there appears to be a correlation between how good the model is at drawing pelicans on bicycles and how good it is overall. I don't really have an explanation for this. The pattern only became clear to me when I was putting together a last-minute keynote (they had a speaker drop out) for the AI Engineer World's Fair in July. You can read (or watch) the talk I gave here: The last six months in LLMs, illustrated by pelicans on bicycles. My full collection of illustrations can be found on my pelican-riding-a-bicycle tag - 89 posts and counting. There is plenty of evidence that the AI labs are aware of the benchmark. It showed up (for a split second) in the Google I/O keynote in May, got a mention in an Anthropic interpretability research paper in October and I got to talk about it in a GPT-5 launch video filmed at OpenAI HQ in August. Are they training specifically for the benchmark? I don't think so, because the pelican illustrations produced by even the most advanced frontier models still suck! In What happens if AI labs train for pelicans riding bicycles? I confessed to my devious objective: Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one. My favourite is still this one that I got from GPT-5. I started my tools.simonwillison.net site last year as a single location for my growing collection of vibe-coded / AI-assisted HTML+JavaScript tools. I wrote several longer pieces about this throughout the year:
Here’s how I use LLMs to help me write code
Adding AI-generated descriptions to my tools collection
Building a tool to copy-paste share terminal sessions using Claude Code for web
Useful patterns for building HTML tools - my favourite post of the bunch.
The new browse all by month page shows I built 110 of these in 2025! I really enjoy building in this way, and I think it's a fantastic way to practice and explore the capabilities of these models. Almost every tool is accompanied by a commit history that links to the prompts and transcripts I used to build them. I'll highlight a few of my favourites from the past year:
blackened-cauliflower-and-turkish-style-stew is ridiculous. It's a custom cooking timer app for anyone who needs to prepare Green Chef's Blackened Cauliflower and Turkish-style Spiced Chickpea Stew recipes at the same time. Here's more about that one.
is-it-a-bird takes inspiration from xkcd 1425, loads a 150MB CLIP model via Transformers.js and uses it to say if an image or webcam feed is a bird or not.
bluesky-thread lets me view any thread on Bluesky with a "most recent first" option to make it easier to follow new posts as they arrive.
A lot of the others are useful tools for my own workflow like svg-render and render-markdown and alt-text-extractor. I built one that does privacy-friendly personal analytics against localStorage to keep track of which tools I use the most often. Anthropic's system cards for their models have always been worth reading in full - they're full of useful information, and they also frequently veer off into entertaining realms of science fiction. The Claude 4 system card in May had some particularly fun moments - highlights mine: Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts.
This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. In other words, Claude 4 might snitch you out to the feds. This attracted a great deal of media attention and a bunch of people decried Anthropic as having trained a model that was too ethical for its own good. Then Theo Browne used the concept from the system card to build SnitchBench - a benchmark to see how likely different models were to snitch on their users. It turns out they almost all do the same thing! Theo made a video, and I published my own notes on recreating SnitchBench with my LLM tool. The key prompt that makes this work is one I recommend not putting in your system prompt! Anthropic's original Claude 4 system card said the same thing: We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable. In a tweet in February Andrej Karpathy coined the term "vibe coding", with an unfortunately long definition (I miss the 140 character days) that many people failed to read all the way to the end: There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works. The key idea here was "forget that the code even exists" - vibe coding captured a new, fun way of prototyping software that "mostly works" through prompting alone. I don't know if I've ever seen a new term catch on - or get distorted - so quickly in my life. A lot of people instead latched on to vibe coding as a catch-all for anything where an LLM is involved in programming. I think that's a waste of a great term, especially since it seems likely that most programming will involve some level of AI assistance in the near future. Because I'm a sucker for tilting at linguistic windmills I tried my best to encourage the original meaning of the term:
Not all AI-assisted programming is vibe coding (but vibe coding rocks) in March
Two publishers and three authors fail to understand what “vibe coding” means in May (one book subsequently changed its title to the much better "Beyond Vibe Coding").
Vibe engineering in October, where I tried to suggest an alternative term for what happens when professional engineers use AI assistance to build production-grade software.
Your job is to deliver code you have proven to work in December, about how professional software development is about code that demonstrably works, no matter how you built it.
I don't think this battle is over yet. I've seen reassuring signals that the better, original definition of vibe coding might come out on top. I should really get a less confrontational linguistic hobby! Anthropic introduced their Model Context Protocol specification in November 2024 as an open standard for integrating tool calls with different LLMs.
In early 2025 it exploded in popularity. There was a point in May where OpenAI, Anthropic, and Mistral all rolled out API-level support for MCP within eight days of each other! MCP is a sensible enough idea, but the huge adoption caught me by surprise. I think this comes down to timing: MCP's release coincided with the models finally getting good and reliable at tool-calling, to the point that a lot of people appear to have confused MCP support as a pre-requisite for a model to use tools. For a while it also felt like MCP was a convenient answer for companies that were under pressure to have "an AI strategy" but didn't really know how to do that. Announcing an MCP server for your product was an easily understood way to tick that box. The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash - if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal. Since leaning heavily into Claude Code and friends myself I've hardly used MCP at all - I've found CLI tools like GitHub's gh and libraries like Playwright to be better alternatives to the GitHub and Playwright MCPs. Anthropic themselves appeared to acknowledge this later in the year with their release of the brilliant Skills mechanism - see my October post Claude Skills are awesome, maybe a bigger deal than MCP. MCP involves web servers and complex JSON payloads. A Skill is a Markdown file in a folder, optionally accompanied by some executable scripts. Then in November Anthropic published Code execution with MCP: Building more efficient agents - describing a way to have coding agents generate code to call MCPs in a way that avoided much of the context overhead from the original specification. (I'm proud of the fact that I reverse-engineered Anthropic's skills a week before their announcement, and then did the same thing to OpenAI's quiet adoption of skills two months after that.) MCP was donated to the new Agentic AI Foundation at the start of December. Skills were promoted to an "open format" on December 18th. Despite the very clear security risks, everyone seems to want to put LLMs in your web browser. OpenAI launched ChatGPT Atlas in October, built by a team including long-time Google Chrome engineers Ben Goodger and Darin Fisher. Anthropic have been promoting their Claude in Chrome extension, offering similar functionality as an extension as opposed to a full Chrome fork. Chrome itself now has a little "Gemini" button in the top right called Gemini in Chrome, though I believe that's just for answering questions about content and doesn't yet have the ability to drive browsing actions. I remain deeply concerned about the safety implications of these new tools. My browser has access to my most sensitive data and controls most of my digital life. A prompt injection attack against a browsing agent that can exfiltrate or modify that data is a terrifying prospect. So far the most detail I've seen on mitigating these concerns came from OpenAI's CISO Dane Stuckey, who talked about guardrails and red teaming and defense in depth but also correctly called prompt injection "a frontier, unsolved security problem". I've used these browser agents a few times now (example), under very close supervision. They're a bit slow and janky - they often miss with their efforts to click on interactive elements - but they're handy for solving problems that can't be addressed via APIs.
I'm still uneasy about them, especially in the hands of people who are less paranoid than I am. I've been writing about prompt injection attacks for more than three years now. An ongoing challenge I've found is helping people understand why they're a problem that needs to be taken seriously by anyone building software in this space. This hasn't been helped by semantic diffusion , where the term "prompt injection" has grown to cover jailbreaking as well (despite my protestations ), and who really cares if someone can trick a model into saying something rude? So I tried a new linguistic trick! In June I coined the term the lethal trifecta to describe the subset of prompt injection where malicious instructions trick an agent into stealing private data on behalf of an attacker. A trick I use here is that people will jump straight to the most obvious definition of any new term that they hear. "Prompt injection" sounds like it means "injecting prompts". "The lethal trifecta" is deliberately ambiguous: you have to go searching for my definition if you want to know what it means! It seems to have worked. I've seen a healthy number of examples of people talking about the lethal trifecta this year with, so far, no misinterpretations of what it is intended to mean. I wrote significantly more code on my phone this year than I did on my computer. Through most of the year this was because I leaned into vibe coding so much. My tools.simonwillison.net collection of HTML+JavaScript tools was mostly built this way: I would have an idea for a small project, prompt Claude Artifacts or ChatGPT or (more recently) Claude Code via their respective iPhone apps, then either copy the result and paste it into GitHub's web editor or wait for a PR to be created that I could then review and merge in Mobile Safari. Those HTML tools are often ~100-200 lines of code, full of uninteresting boilerplate and duplicated CSS and JavaScript patterns - but 110 of them adds up to a lot! Up until November I would have said that I wrote more code on my phone, but the code I wrote on my laptop was clearly more significant - fully reviewed, better tested and intended for production use. In the past month I've grown confident enough in Claude Opus 4.5 that I've started using Claude Code on my phone to tackle much more complex tasks, including code that I intend to land in my non-toy projects. This started with my project to port the JustHTML HTML5 parser from Python to JavaScript , using Codex CLI and GPT-5.2. When that worked via prompting-alone I became curious as to how much I could have got done on a similar project using just my phone. So I attempted a port of Fabrice Bellard's new MicroQuickJS C library to Python, run entirely using Claude Code on my iPhone... and it mostly worked ! Is it code that I'd use in production? Certainly not yet for untrusted code , but I'd trust it to execute JavaScript I'd written myself. The test suite I borrowed from MicroQuickJS gives me some confidence there. This turns out to be the big unlock: the latest coding agents against the ~November 2025 frontier models are remarkably effective if you can give them an existing test suite to work against. I call these conformance suites and I've started deliberately looking out for them - so far I've had success with the html5lib tests , the MicroQuickJS test suite and a not-yet-released project against the comprehensive WebAssembly spec/test collection . 
If you're introducing a new protocol or even a new programming language to the world in 2026 I strongly recommend including a language-agnostic conformance suite as part of your project. I've seen plenty of hand-wringing that the need to be included in LLM training data means new technologies will struggle to gain adoption. My hope is that the conformance suite approach can help mitigate that problem and make it easier for new ideas of that shape to gain traction. Towards the end of 2024 I was losing interest in running local LLMs on my own machine. My interest was re-kindled by Llama 3.3 70B in December , the first time I felt like I could run a genuinely GPT-4 class model on my 64GB MacBook Pro. Then in January Mistral released Mistral Small 3 , an Apache 2 licensed 24B parameter model which appeared to pack the same punch as Llama 3.3 70B using around a third of the memory. Now I could run a ~GPT-4 class model and have memory left over to run other apps! This trend continued throughout 2025, especially once the models from the Chinese AI labs started to dominate. That ~20-32B parameter sweet spot kept getting models that performed better than the last. I got small amounts of real work done offline! My excitement for local LLMs was very much rekindled. The problem is that the big cloud models got better too - including those open weight models that, while freely available, were far too large (100B+) to run on my laptop. Coding agents changed everything for me. Systems like Claude Code need more than a great model - they need a reasoning model that can perform reliable tool calling invocations dozens if not hundreds of times over a constantly expanding context window. I have yet to try a local model that handles Bash tool calls reliably enough for me to trust that model to operate a coding agent on my device. My next laptop will have at least 128GB of RAM, so there's a chance that one of the 2026 open weight models might fit the bill. For now though I'm sticking with the best available frontier hosted models as my daily drivers. I played a tiny role helping to popularize the term "slop" in 2024, writing about it in May and landing quotes in the Guardian and the New York Times shortly afterwards. This year Merriam-Webster crowned it word of the year ! slop ( noun ): digital content of low quality that is produced usually in quantity by means of artificial intelligence I like that it represents a widely understood feeling that poor quality AI-generated content is bad and should be avoided. I'm still holding hope that slop won't end up as bad a problem as many people fear. The internet has always been flooded with low quality content. The challenge, as ever, is to find and amplify the good stuff. I don't see the increased volume of junk as changing that fundamental dynamic much. Curation matters more than ever. That said... I don't use Facebook, and I'm pretty careful at filtering or curating my other social media habits. Is Facebook still flooded with Shrimp Jesus or was that a 2024 thing? I heard fake videos of cute animals getting rescued is the latest trend. It's quite possible the slop problem is a growing tidal wave that I'm innocently unaware of. I nearly skipped writing about the environmental impact of AI for this year's post (here's what I wrote in 2024 ) because I wasn't sure if we had learned anything new this year - AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable. 
What's interesting in 2025 is that public opinion appears to be shifting quite dramatically against new data center construction. Here's a Guardian headline from December 8th: More than 200 environmental groups demand halt to new US datacenters. Opposition at the local level appears to be rising sharply across the board too. I've been convinced by Andy Masley that the water usage issue is mostly overblown, which is a problem mainly because it acts as a distraction from the very real issues around energy consumption, carbon emissions and noise pollution. AI labs continue to find new efficiencies to help serve increased quality of models using less energy per token, but the impact of that is classic Jevons paradox - as tokens get cheaper we find more intense ways to use them, like spending $200/month on millions of tokens to run coding agents. As an obsessive collector of neologisms, here are my own favourites from 2025. You can see a longer list in my definitions tag.
Vibe coding, obviously.
Vibe engineering - I'm still on the fence about whether I should try to make this happen!
The lethal trifecta, my one attempted coinage of the year that seems to have taken root.
Context rot, by Workaccount2 on Hacker News, for the thing where model output quality falls as the context grows longer during a session.
Context engineering as an alternative to prompt engineering that helps emphasize how important it is to design the context you feed to your model.
Slopsquatting by Seth Larson, where an LLM hallucinates an incorrect package name which is then maliciously registered to deliver malware.
Vibe scraping - another of mine that didn't really go anywhere, for scraping projects implemented by coding agents driven by prompts.
Asynchronous coding agent for Claude Code for web / Codex cloud / Google Jules.
Extractive contributions by Nadia Eghbal for open source contributions where "the marginal cost of reviewing and merging that contribution is greater than the marginal benefit to the project’s producers".
If you've made it this far, I hope you've found this useful! You can subscribe to my blog in a feed reader or via email, or follow me on Bluesky or Mastodon or Twitter. If you'd like a review like this on a monthly basis instead, I also operate a $10/month sponsors-only newsletter with a round-up of the key developments in the LLM space over the past 30 days. Here are preview editions for September, October, and November - I'll be sending December's out some time tomorrow. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options.

2 views
neilzone 3 weeks ago

#FreeSoftwareAdvent 2025

Here are all the toots I posted for this year’s #FreeSoftwareAdvent.

Today’s #FreeSoftwareAdvent - 24 days of toots about Free software that I value - is off to a strong start, since you get not one, not two, not three, but four pieces of software:
yt-dlp: command line tools for downloading from various online sources: https://github.com/yt-dlp/yt-dlp
Handbrake: ripping DVDs you’ve bought: https://handbrake.fr/
SoundJuicer: ripping CDs you’ve bought: https://gitlab.gnome.org/GNOME/sound-juicer
Libation: downloading audiobooks you’ve bought: https://github.com/rmcrackan/Libation
Each is a useful starting point for building your own media library.

Today’s #FreeSoftwareAdvent is Libreboot. Libreboot is a Free and Open Source BIOS/UEFI boot firmware. It is much to my chagrin that I do not have a Free software BIOS/UEFI on my machines :( https://libreboot.org/

Today’s #FreeSoftwareAdvent is Cryptomator. I use Cryptomator for end-to-end encryption of files which I store/sync on Nextcloud. It means that, if my server were compromised, all the attacker would get is encrypted objects. (In my case, my own instance of Nextcloud, which is not publicly accessible, so this might be overkill, but there we go…) https://cryptomator.org/

Today’s #FreeSoftwareAdvent is tmux, the terminal multiplexer. I know that tmux will be second nature to many of you, but I’ve been gradually learning it over the last year or so ( https://neilzone.co.uk/2025/06/more-tinkering-with-tmux/ ), and I find it invaluable. Although I had used screen for years, I wish that I had known about tmux sooner. I use tmux both on local machines, for a convenient terminal environment (from which I can do an awful lot of what I want to do, computing-wise), and also on remote machines for persistent / recoverable terminal sessions. https://github.com/tmux/tmux

Today’s #FreeSoftwareAdvent is the Lyrion Music Server. While there are plenty of great FOSS media services out there, Lyrion solves one particular problem for me: it breathes life back into old Logitech Squeezebox units. What’s not to love about a multi-room music setup using Free software and reusing old hardware! :) https://lyrion.org/

Today’s #FreeSoftwareAdvent is forgejo, a self-hostable code forge. Although I use a headless git server for some things, I’ve been happily experimenting with a public, self-hosted instance of forgejo this year. I look forward to federation functionality in due course! https://forgejo.org/

Today’s #FreeSoftwareAdvent is Vaultwarden. Vaultwarden is a readily self-hosted alternative implementation of Bitwarden, a password / credential manager. It works with the Bitwarden clients. Sandra and I use it to share passwords with each other, as well as keeping our own separate vaults in sync across multiple devices. https://github.com/dani-garcia/vaultwarden

Today’s #FreeSoftwareAdvent is all about XMPP. Snikket ( https://snikket.org/ ) is an easy-to-install, and easy-to-administer, XMPP server. It is designed for families and other small groups. The apps for Android and iOS are great. Dino ( https://dino.im/ ) is my desktop XMPP client of choice. Profanity ( https://profanity-im.github.io/ ) is a terminal / console XMPP client, which is incredibly convenient. Why not have a fun festive project of setting up an XMPP-based chat server for you and your family and friends?

Today’s #FreeSoftwareAdvent is a double whammy of shell-based “checking” utilities.
linkchecker ( https://linux.die.net/man/1/linkchecker; probably get it from your distro) is a tool for checking websites (recursively, as you wish) for broken links.
shellcheck ( https://www.shellcheck.net/; available in browser or as a terminal tool) is a tool for checking bash and other shell scripts for bugs.

Today’s #FreeSoftwareAdvent (better late than never) is “WG Tunnel”, an application for Android for bringing up and tearing down WireGuard tunnels. I use it to keep a WireGuard tunnel running all the time on my phone. I don’t do anything fancy with it, but if one wanted to use different tunnels for different things, or when connected to different networks, or wanted a SOCKS5 proxy for other apps, it is definitely worth a look. https://wgtunnel.com/ (And available via a custom F-Droid repo.)

Today’s #FreeSoftwareAdvent is another bundle, of four amazing terminal utilities for keeping organised:
mutt, an incredibly versatile email client: http://www.mutt.org/
vdirsyncer, a tool for synchronising calendars and address books: https://vdirsyncer.pimutils.org/en/stable/
khard, for interacting with contacts (vCards): https://github.com/lucc/khard
khal, for interacting with calendars: https://github.com/pimutils/khal
Use with tmux to get all you want to see on one screen / via one command :)

Today’s #FreeSoftwareAdvent is LibreOffice. I know that I’ve included this in previous #FreeSoftwareAdvent toots, but I use LibreOffice, particularly Writer, every day. Better still, this one isn’t just for Linux users, so if you are thinking about moving away from your Microsoft or Apple office suite, you can give LibreOffice a try, for free, alongside whatever you use currently, and give it a good go before you make a decision.

Today’s #FreeSoftwareAdvent is Kimai, a self-hostable time tracking tool, which I use to keep records of all the legal work that I do, as well as time spent on committees etc. It has excellent, and flexible, reporting options - which I use for generating timesheets each month - and there is an integrated invoicing tool (which I do not use). https://www.kimai.org/

Today’s #FreeSoftwareAdvent is another one which might appeal for #freelance work: InvoicePlane which is - you’ve guessed it - an invoicing tool. I have used InvoicePlane for decoded.legal’s invoicing for coming up on ten years now, and it has been great. Dead easy to back up and restore too, which is always welcome. I think that it is designed to be exposed to customers (i.e. customers interact with it), but I do not do this. Kimai (yesterday’s entry) also offers invoicing, but for whatever reason - I don’t remember - I went with a separate tool when I set things up. (I’ve also heard good things about InvoiceNinja, but I have not used it myself.) https://www.invoiceplane.com/

Today’s #FreeSoftwareAdvent is EspoCRM, a self-hostable customer relationship management tool. I use it to keep track of my clients, their matters, and the tasks for each matter, and I’ve integrated various regulatory compliance requirements into different steps / activities. It took me a fair amount of time to customise to my needs, and I’m really pleased with it. https://www.espocrm.com/

Today’s #FreeSoftwareAdvent is espanso, a text expansion tool. You specify a trigger word/phrase and a replacement, and, when you type that trigger, espanso replaces it automatically with the replacement text. So, for instance, I have a trigger for “.date”, and whenever I type that, it replaces it with the current date (“2025-12-16”).
It also has a “forms” tool, so I can trigger a form, fill it in, and then espanso puts in all the content - so great for structured paragraphs with placeholders, basically. I would be absolutely lost without espanso, as it saves me so many keystrokes. https://espanso.org/

Today’s #FreeSoftwareAdvent is the text-mode web browser links. Because, sometimes, you just want to be able to browse without all the cruft of the modern web. You don’t have to stop using your current GUI browser to give elinks, or another text-mode browser, a try. I find browsing with links more relaxing, with fewer distractions and annoyances, although it is not a good choice for some sites. Bonus: no-one is trying to force AI into it. https://links.twibright.com/user_en.html

Today’s #FreeSoftwareAdvent is reveal.js, my tool of choice for writing presentations. I write the content in Markdown, and… that’s it. Thanks to CSS, tada, I have a perfectly-formatted, ready-to-go, presentation. It has speaker view, allows for easy PDF exports, and, because it is HTML, CSS, and JavaScript, you can upload the presentation to your website, so that people can click/swipe through it. Plus, no proprietary / vendor lock-in format. I have used reveal.js for a few years now, and I absolutely love it. I can’t imagine faffing around with slides these days. https://revealjs.com/

Today’s #FreeSoftwareAdvent is perhaps a very obvious one: Mastodon. I’ve used Mastodon for a reasonably long time; this account is coming up 8 years old. I run my own instance using the glitch-soc fork ( https://glitch-soc.github.io/docs/ ), and it is my only social networking presence. “Vanilla” Mastodon instructions are here: https://docs.joinmastodon.org/admin/install/ If you want your own instance but don’t want to host it, a few places offer managed hosting.

Today’s #FreeSoftwareAdvent is greenbone/OpenVAS, a vulnerability scanner. I use it to help me spot (mis)configuration issues, out of date packages (even with automatic upgrades), and vulnerabilities. It is, unfortunately, a bit of a pain to install / get working, but this script ( https://github.com/martinboller/gse ) is useful, if you don’t mind sticking (for now, anyway) with Debian Bookworm. https://greenbone.github.io/docs/latest/index.html

Today’s #FreeSoftwareAdvent is F-Droid, a Free software app store for Android. I use the GrapheneOS flavour of Android, and F-Droid is my app installation tool of choice.
F-Droid is an app, which provides an app store interface.
F-Droid, the project, also runs the default repository used by the F-Droid app.
But anyone can run their own repository, and users can add that repo and install apps from there (e.g. Bitwarden does this).
https://f-droid.org/ / @[email protected]

Today’s #FreeSoftwareAdvent are some extensions which I find particularly useful.
For Thunderbird:
Quick Folder Move lets me file email (including into folders in different accounts) via the keyboard. I use this goodness knows how many tens of times a day. https://services.addons.thunderbird.net/en-US/thunderbird/addon/quick-folder-move/
For Firefox:
Consent-o-matic automatically suppresses malicious compliance cookie banners: https://addons.mozilla.org/en-GB/firefox/addon/consent-o-matic/
uBlock Origin blocks all sorts of problematic stuff, and helps make for a much nicer browsing experience: https://addons.mozilla.org/en-GB/firefox/addon/ublock-origin/
Recipe Filter foregrounds the actual recipe on recipe pages: https://addons.mozilla.org/en-GB/firefox/addon/recipe-filter/
Shut Up is a comments blocker, which makes the web far more pleasant, with fewer random people’s opinions: https://addons.mozilla.org/en-GB/firefox/addon/shut-up-comment-blocker/

Today’s #FreeSoftwareAdvent is Elementary OS. If I were looking for a first-time Linux operating system for a friend or family member, I’d happily give Elementary OS a try, as I have heard nothing but good things about it. Plus, supporting small projects doing impressive things FTW. https://elementary.io/ / @[email protected]

Today’s #FreeSoftwareAdvent is paperless-ngx, a key part of keeping us, well, paperless. It is a document management tool, but I use it in a very basic way: it is hooked up to our scanner, and anything we scan gets automatically converted to PDF and OCRd. We then shred the paper. It is particularly useful around tax return time, as it means I can easily get the information I need from stuff which people have posted to us.
https://github.com/paperless-ngx/paperless-ngx

- yt-dlp: command line tool for downloading from various online sources: https://github.com/yt-dlp/yt-dlp
- Handbrake: ripping DVDs you’ve bought: https://handbrake.fr/
- SoundJuicer: ripping CDs you’ve bought: https://gitlab.gnome.org/GNOME/sound-juicer
- Libation: downloading audiobooks you’ve bought: https://github.com/rmcrackan/Libation

0 views
Evan Schwartz 4 weeks ago

Short-Circuiting Correlated Subqueries in SQLite

I recently added domain exclusion lists and paywalled content filtering to Scour. This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don't know what that means, I'll explain it below).

Scour searches noisy sources for content related to users' interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time. The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring. Scour uses brute force search rather than a vector database, which works well enough for now because of three factors:

- Scour uses SQLite, so the data is colocated with the application code.
- It uses binary-quantized vector embeddings with Hamming Distance comparisons, which only take ~5 nanoseconds each.
- We care most about recent posts, so we can significantly narrow the search set by publish date.

A simplified version of the query looks something like:

The query plan shows that this makes good use of indexes:

To add user-specified domain blocklists, I created the table and added this filter clause to the main ranking query:

The domain exclusion table's primary key makes the lookup efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery. A problem with the way we just added this feature is that most users don't exclude any domains, but we've added a check that is run for every row anyway. To speed up the queries for users who aren't using the feature, we could first check the user's settings and then dynamically build the query. But we don't have to, because we can accomplish the same effect within one static query.

We can change our domain exclusion filter to first check whether the user has any excluded domains. Since the condition short-circuits, if the first subquery finds that the user has no excluded domains, SQLite never evaluates the correlated subquery at all. The first clause does not reference any column of the outer query, so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when it short-circuits, lets us skip the more expensive correlated subquery that checks each item's domain against the exclusion list.

Here is the query plan for this updated query. Note how the second subquery is uncorrelated, whereas the third one is correlated. The latter is the per-row check, but it can be skipped thanks to the second subquery.

To test the performance of each of these queries, I used a simple bash script to invoke the binary 100 times for each query on my laptop. Starting up the process each time adds overhead, but we're comparing relative differences. At the time of this benchmark, the last week had 235,975 items, 144,229 of which were in English. The two example users I tested this for below only look for English content.

This test represents most users, who have not configured any excluded domains: This shows that the short-circuit query adds practically no overhead for users without excluded domains, whereas the correlated subquery alone makes queries 17% slower for these users.

This test uses an example user that has excluded content from 2 domains: In this case, we do need to check each row against the domain filter. But this shows that the short-circuit still adds no overhead on top of the query.
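To make the pattern concrete, here is a rough sketch of what such a filter can look like. The table and column names are placeholders of mine, not Scour's actual schema:

```sql
-- Hypothetical schema: items(id, domain, published_at, embedding),
-- excluded_domains(user_id, domain).
SELECT i.id, i.embedding
FROM items i
WHERE i.published_at > :cutoff
  AND (
    -- Uncorrelated scalar subquery: it references no column of the outer
    -- row, so SQLite evaluates it once and reuses the result.
    NOT EXISTS (SELECT 1 FROM excluded_domains e WHERE e.user_id = :user_id)
    -- Correlated per-row check: only reached when the user has exclusions.
    OR i.domain NOT IN (
      SELECT e.domain FROM excluded_domains e WHERE e.user_id = :user_id
    )
  );
```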
When using SQL subqueries to filter down result sets, it's worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won't help. However, if the per-row check isn't always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead.

This is extra important because the slow-down from each additional subquery compounds. In this blog post, I described and benchmarked a single additional filter. But this is only one of multiple subquery filters. Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to filtering out content from excluded domains. Some users opt in to hiding paywalled content. For those users, we check if each item is paywalled. If so, we check if it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach to first check if the feature is enabled for the user; only then does SQLite need to check each row. Concretely, the paywalled content filter subquery looks like:

In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don't need it.

There are multiple ways to exclude rows from an SQL query. Here are the results from the same benchmark I ran above, but with two other ways of checking whether an item comes from an excluded domain. The first version of the query uses the subquery; the other variation joins against the exclusion table and then filters on the result. And here are the full benchmarks:

For users without excluded domains, we can see that the query using the short-circuit wins and adds no overhead. For users who do have excluded domains, the join is faster than the subquery version. However, the join raises the exact problem this whole blog post is designed to address: since joins happen no matter what, we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, this is why I've gone with the subquery using the short-circuit.

Discuss on Hacker News, Lobsters, r/programming, r/sqlite.

0 views
Hugo 4 weeks ago

Managing Custom Domains and Dynamic SSL with Coolify and Traefik

I recently published an article on [managing SSL certificates for a multi-tenant application on Coolify](https://eventuallymaking.io/p/2025-10-coolify-wildcard-ssl). The title might sound intimidating, but basically it lets an application handle tons of subdomains (over HTTPS) dynamically for its users. For example:

- `https://hugo.writizzy.com`
- `https://thomas-sanlis.writizzy.com`

That's a solid foundation, but users often want more: **custom domains**. The ability to use their own address, like:

- `https://eventuallymaking.io`
- `https://thomas-sanlis.com`

So let's dive into how to manage custom domains for a multi-tenant app with dynamic SSL certificates. *(I feel like I'm going for the world record in longest blog post title!)*

## Custom Domains

The first step for a custom domain is routing traffic from that (sub)domain to your application (Writizzy in my case). It all comes down to **DNS records** on the user's side. Only they can configure their domain to point to your application's subdomain. They have several options: **CNAME**, **ALIAS**, or **A** records.

### 1. CNAME (the simplest)

A CNAME is an alias for another domain. It basically says that `www.eventuallymaking.io` is an alias for `hugo.writizzy.com`. All traffic trying to resolve the `www` address gets automatically forwarded to your application domain.

**Major limitation:** You can't use a CNAME for a root domain (e.g., `eventuallymaking.io` without the `www`). Adding a CNAME on an apex domain would conflict with other records (A, MX, TXT, etc.).

### 2. ALIAS

Some DNS providers offer **ALIAS** records (or *CNAME flattening*). It's essentially a CNAME that can coexist with other records on a root domain. Great option if the user's provider supports it (OVH doesn't, for instance).

### 3. A Record

Here, the user directly enters Writizzy's server IP address.

**Warning:** This approach is risky. If you change servers (and therefore IPs), all your users' custom domains break until they update their config. To use this method safely, you need a **floating IP** that can be reassigned to your new server or web frontend.

Alright, that's a good start, but if we stop here, things won't work well—the site won't serve over HTTPS.

## HTTPS on Custom Domains

In my previous post, I showed how Traefik could route all traffic hitting `*.writizzy.com` subdomains to the same application using these lines:

```yaml
traefik.http.routers.https-custom.rule=HostRegexp(`^.+$`)
traefik.http.routers.https-custom.entryPoints=https
traefik.http.routers.https-custom.service=https-0-zgwokkcwwcwgcc4gck440o88
traefik.http.routers.https-custom.tls.certresolver=letsencrypt
traefik.http.routers.https-custom.tls=true
traefik.http.routers.https-custom.priority=1
```

Since the regex catches everything, you might think SSL would just follow along. Unfortunately, no. Traefik knows it needs to handle a wildcard certificate for `*.writizzy.io`, but it has no clue about the external domains it'll need to serve. We need to help it by dynamically providing the list of custom domains. Our constraints:

- Obviously, no application restart
- We want to drive it programmatically

### Dynamic Traefik Configuration with File Providers

This is where Traefik's [File Providers](https://doc.traefik.io/traefik/reference/install-configuration/providers/others/file/) come in. In Coolify, this feature is enabled by default via [dynamic configurations](https://coolify.io/docs/knowledge-base/proxy/traefik/dynamic-config).
Under the hood, Traefik watches a specific directory:

```bash
- '--providers.file.directory=/traefik/dynamic/'
- '--providers.file.watch=true'
```

Drop a `.yml` file defining the rule for a new domain, and Traefik picks it up hot and triggers the HTTP challenge with Let's Encrypt to get the SSL certificate.

**Example dynamic configuration:**

```yaml
http:
  routers:
    eventuallymaking-https:
      rule: "Host(`eventuallymaking.io`)"
      entryPoints:
        - https
      service: https-0-writizzy
      tls:
        certResolver: letsencrypt
      priority: 10
    eventuallymaking-http:
      rule: "Host(`eventuallymaking.io`)"
      entryPoints:
        - http
      middlewares:
        - redirect-to-https@docker
      service: https-0-writizzy
      priority: 10
```

So with this approach, we've solved our first constraint: no restart needed to configure a new custom domain. Now let's tackle the second one and make it programmatic.

### Automation via the Application

The idea is straightforward: the application creates these files directly in the directory Traefik watches.

1. **Coolify configuration:** Go to `Configuration -> Persistent Storage` and add a `Directory mount` to make the `/traefik/dynamic/` directory visible to your application container.

![directory mount](https://writizzy.b-cdn.net/blogs/48b77143-02ee-4316-9d68-0e6e4857c5ce/1765961966661-wymdzej.png)

2. **Code (Kotlin in my case):** The application generates a file based on the template above as soon as a user configures their domain (a rough sketch of this step is included at the end of this post).

> **Important note on timing:** If you create the Traefik config before the user has pointed their domain to your IP, the SSL challenge will fail. Traefik will automatically retry (with backoff), but it can take a while. Ideally, validate the DNS pointing (via a background job) **before** generating the file.

> **Important note on security:** The generated file contains user input (the domain name). It's crucial to sanitize and validate this data to prevent a malicious user from injecting arbitrary Traefik directives into your configuration.

## Wrapping Up

This post wraps up what I hope has been a useful series for SaaS builders running multi-tenant apps on Coolify. We've covered:

1. [Tenant management (Nuxt) and basic Traefik configuration](https://eventuallymaking.io/p/2025-10-coolify-subdomain)
2. [SSL management with dynamic subdomains](https://eventuallymaking.io/p/2025-10-coolify-wildcard-ssl)
3. **Custom domains with SSL** (this post)

You might wonder if this solution scales with tens of thousands of sites all using custom domains—specifically whether Traefik can handle monitoring that many files in a directory. I don't have the answer yet. Writizzy is nowhere near that scale for now. There might be other solutions, like using Caddy instead of Traefik in Coolify, but it's way too early to explore that.
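As promised in the automation section, here is a minimal sketch of the file-generation step in Kotlin. The directory path matches the mount described above; the domain validation, router naming, and service name are my assumptions, not Writizzy's actual code:

```kotlin
import java.nio.file.Files
import java.nio.file.Paths

// Directory mounted into the app container and watched by Traefik's file provider.
private val TRAEFIK_DYNAMIC_DIR = Paths.get("/traefik/dynamic")

// Strict allow-list for user-supplied domains: lowercase labels, digits, hyphens, dots.
private val DOMAIN_REGEX = Regex("^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)+$")

fun writeCustomDomainConfig(domain: String, serviceName: String) {
    // Sanitize user input before it ends up in a Traefik config file.
    require(DOMAIN_REGEX.matches(domain.lowercase())) { "Invalid domain: $domain" }
    val routerName = domain.lowercase().replace(".", "-")
    val yaml = """
        http:
          routers:
            $routerName-https:
              rule: "Host(`$domain`)"
              entryPoints:
                - https
              service: $serviceName
              tls:
                certResolver: letsencrypt
              priority: 10
            $routerName-http:
              rule: "Host(`$domain`)"
              entryPoints:
                - http
              middlewares:
                - redirect-to-https@docker
              service: $serviceName
              priority: 10
    """.trimIndent()
    // Traefik picks the new file up without a restart and requests the certificate.
    Files.writeString(TRAEFIK_DYNAMIC_DIR.resolve("$routerName.yml"), yaml)
}
```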

0 views
Anton Zhiyanov 1 months ago

Timing 'Hello, world'

Here's a little unscientific chart showing the compile/run times of a "hello world" program in different languages: For interpreted languages, the times shown are only for running the program, since there's no separate compilation step. I had to shorten the Kotlin bar a bit to make it fit within 80 characters. All measurements were done in single-core, containerized sandboxes on an ancient CPU, and the timings include the overhead of . So the exact times aren't very interesting, especially for the top group (Bash to Ruby) — they all took about the same amount of time. Here is the program source code in C: Other languages: Bash · C# · C++ · Dart · Elixir · Go · Haskell · Java · JavaScript · Kotlin · Lua · Odin · PHP · Python · R · Ruby · Rust · Swift · V · Zig Of course, this ranking will be different for real-world projects with lots of code and dependencies. Still, I found it curious to see how each language performs on a simple "hello world" task.

0 views
Armin Ronacher 1 months ago

Skills vs Dynamic MCP Loadouts

I’ve been moving all my MCPs to skills, including the remaining one I still used: the Sentry MCP 1 . Previously I had already moved entirely away from Playwright to a Playwright skill. In the last month or so there have been discussions about using dynamic tool loadouts to defer loading of tool definitions until later. Anthropic has also been toying around with the idea of wiring together MCP calls via code, something I have experimented with . I want to share my updated findings with all of this and why the deferred tool loading that Anthropic came up with does not fix my lack of love for MCP. Maybe they are useful for someone else. When the agent encounters a tool definition through reinforcement learning or otherwise, it is encouraged to emit tool calls through special tokens when it encounters a situation where that tool call would be appropriate. For all intents and purposes, tool definitions can only appear between special tool definition tokens in a system prompt. Historically this means that you cannot emit tool definitions later in the conversation state. So your only real option is for a tool to be loaded when the conversation starts. In agentic uses, you can of course compress your conversation state or change the tool definitions in the system message at any point. But the consequence is that you will lose the reasoning traces and also the cache. In the case of Anthropic, for instance, this will make your conversation significantly more expensive. You would basically start from scratch and pay full token rates plus cache write cost, compared to cache read. One recent innovation from Anthropic is deferred tool loading. You still declare tools ahead of time in the system message, but they are not injected into the conversation when the initial system message is emitted. Instead they appear at a later point. The tool definitions however still have to be static for the entire conversation, as far as I know. So the tools that could exist are defined when the conversation starts. The way Anthropic discovers the tools is purely by regex search. This is all quite relevant because even though MCP with deferred loading feels like it should perform better, it actually requires quite a bit of engineering on the LLM API side. The skill system gets away without any of that and, at least from my experience, still outperforms it. Skills are really just short summaries of which skills exist and in which file the agent can learn more about them. These are proactively loaded into the context. So the agent understands in the system context (or maybe somewhere later in the context) what capabilities it has and gets a link to the manual for how to use them. Crucially, skills do not actually load a tool definition into the context. The tools remain the same: bash and the other tools the agent already has. All it learns from the skill are tips and tricks for how to use these tools more effectively. Because the main thing it learns is how to use other command line tools and similar utilities, the fundamentals of how to chain and coordinate them together do not actually change. The reinforcement learning that made the Claude family of models very good tool callers just helps with these newly discovered tools. So that obviously raises the question: if skills work so well, can I move the MCP outside of the context entirely and invoke it through the CLI in a similar way as Anthropic proposes? The answer is yes, you can, but it doesn’t work well. One option here is Peter Steinberger’s mcporter . 
In short, it reads the files and exposes the MCPs behind it as callable tools: And yes, it looks very much like a command line tool that the LLM can invoke. The problem however is that the LLM does not have any idea about what tools are available, and now you need to teach it that. So you might think: why not make some skills that teach the LLM about the MCPs?

Here the issue for me comes from the fact that MCP servers have no desire to maintain API stability. They are increasingly starting to trim down tool definitions to the bare minimum to preserve tokens. This makes sense, but for the skill pattern it’s not what you want. For instance, the Sentry MCP server at one point switched the query syntax entirely to natural language. A great improvement for the agent, but my suggestions for how to use it became a hindrance and I did not discover the issue straight away.

This is in fact quite similar to Anthropic’s deferred tool loading: there is no information about the tool in the context at all. You need to create a summary. The eager loading of MCP tools we have done in the past has now ended up with an awkward compromise: the description is both too long to eagerly load and too short to really tell the agent how to use it. So at least from my experience, you end up maintaining these manual skill summaries for MCP tools exposed via mcporter or similar.

This leads me to my current conclusion: I tend to go with what is easiest, which is to ask the agent to write its own tools as a skill. Not only does it not take all that long, but the biggest benefit is that the tool is largely under my control. Whenever it breaks or needs some other functionality, I ask the agent to adjust it.

The Sentry MCP is a great example. I think it’s probably one of the better designed MCPs out there, but I don’t use it anymore. In part because when I load it into the context right away I lose around 8k tokens out of the box, and I could not get it to work via mcporter. On the other hand, I have Claude maintain a skill for me. And yes, that skill is probably quite buggy and needs to be updated, but because the agent maintains it, it works out better.

It’s quite likely that all of this will change, but at the moment manually maintained skills and agents writing their own tools have become my preferred way. I suspect that dynamic tool loading with MCP will become a thing, but it will probably require quite some protocol changes to bring in skill-like summaries and built-in manuals for the tools. I also suspect that MCP would greatly benefit from protocol stability. The fact that MCP servers keep changing their tool descriptions at will does not work well with materialized calls and external tool descriptions in READMEs and skill files.

Keen readers will remember that last time, the last MCP I used was Playwright. In the meantime I added and removed two more MCPs: Linear and Sentry, mostly because of authentication issues and neither having a great command line interface. ↩

0 views

AMD GPU Debugger

I’ve always wondered why we don’t have a GPU debugger similar to the one used for CPUs: a tool that allows pausing execution and examining the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, this shows it’s technically possible. I then found a helpful series of blog posts by Marcell Kiss, detailing how he achieved this, which inspired me to try to recreate the process myself.

The best place to start learning about this is RADV. By tracing what it does, we can find out how to do it. Our goal here is to run the most basic shader without using Vulkan, aka RADV in our case.

First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open(“/dev/dri/cardX”). Then we find that it’s calling , which is a function defined in , a library that acts as middleware between user mode drivers (UMD) like and , and kernel mode drivers (KMD) like the amdgpu driver. When we try to do some actual work, we have to create a context, which can be achieved by calling from again. Next up, we need to allocate 2 buffers, one for our code and the other for writing our commands into. We do this by calling a couple of functions; here’s how I do it:

Here we’re choosing the domain and assigning flags based on the params. Some buffers we will need uncached, as we will see:

Now that we have the memory, we need to map it. I opt to map anything that can be CPU-mapped for ease of use. We have to map the memory to both the GPU and the CPU virtual space. The KMD creates the page table when we open the DRM file, as shown here. So map it to the GPU VM and, if possible, to the CPU VM as well. At this point, there’s a libdrm function that does all of this setup for us and maps the memory, but I found that even when specifying , it doesn’t always tag the page as uncached. I’m not quite sure if that’s a bug in my code or something in . Anyway, the function is ; I opted to do it manually here and issue the IOCTL call myself:

Now we have the context and 2 buffers. Next, fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing.

Let’s compile our code. We can use the clang assembler for that, like this:

The bash script compiles the code, and then we’re only interested in the actual machine code, so we use objdump to figure out the offset and the size of the section and copy it to a new file called asmc.bin. Then we can just load the file and write its bytes to the CPU-mapped address of the code buffer.

Next up, filling in the commands. This was extremely confusing for me because it’s not well documented. It was mostly learning how does things and trying to do similar things. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck.

The commands are encoded in a special format called , which has multiple types. We only care about : each packet has an opcode and the number of bytes it contains. The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are , which are responsible for a number of configurations; pgm_[lo/hi], which hold the pointer to the code buffer; and , which are responsible for the number of threads inside a work group.
All of those are set using the packets, and here is how to encode them:

It’s worth mentioning that we can set multiple registers in 1 packet if they’re consecutive.

Then we append the dispatch command:

Now we want to write those commands into our buffer and send them to the KMD:

Here is a good point to make a more complex shader that outputs something, for example, writing 1 to a buffer.

No GPU hangs?! Nothing happened?! Cool, cool. Now we have a shader that runs on the GPU; what’s next? Let’s try to hang the GPU by pausing the execution, aka make the GPU trap.

The RDNA3 ISA manual does mention 2 registers, ; here’s how it describes them, respectively:

- Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN.
- Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler.

You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual.

We know from Marcell Kiss’s blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when encountering a . The TBA register has a special bit that indicates whether the trap handler is enabled. Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can utilize the debugfs interface. Luckily, we have UMR, which uses that debugfs interface, and it’s open source, so we can copy AMD’s homework here, which is great.

The amdgpu KMD has a couple of files in debugfs under ; one of them is , which is an interface to a in the kernel that writes to the registers. It works by simply opening the file, seeking to the register’s offset, and then writing; it also performs some synchronisation and writes the value correctly. We need to provide more parameters about the register before writing to the file, though, and we do that using an ioctl call. Here are the ioctl arguments:

There are 2 structs because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation. It turns out our registers here are SRBM registers, banked by VMIDs, meaning each VMID has its own TBA and TMA registers.

Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate a virtual memory address. The context is created when we open the DRM file. VMIDs get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch. We can obtain the VMID of the dispatched process by querying the register with s_getreg_b32. I do a hack here, by enabling the trap handler in every VMID. There are 16 of them, the first being special and used by the KMD, and the last 8 allocated to the amdkfd driver. We loop over the remaining VMIDs and write to those registers. This can cause issues for other processes using other VMIDs, by enabling trap handlers in them and writing the virtual address of our trap handler, which is only valid within our virtual memory address space.
It’s relatively safe though, since most other processes won’t cause a trap[^1].

[^1]: Other processes need to have a s_trap instruction or have trap-on-exception flags set, which is not true for most normal GPU processes.

Now we can write to TMA and TBA; here’s the code:

And here’s how we write to and :

If you noticed, I’m using bitfields. I use them because working with them is much easier than macros, and while the byte order is not guaranteed by the C spec, it’s guaranteed by the System V ABI, which Linux adheres to.

Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader if we added the instruction to it, or we enabled the bit in the rsrc3[^2] register.

[^2]: Available since RDNA3, if I’m not mistaken.

Now, let’s try to write a trap handler.

If you wrote a different shader that outputs to a buffer, you can try writing to that buffer from the trap handler, which is nice to make sure it’s actually being run.

We need 2 things: our trap handler and some scratch memory to use when needed, whose address we will store in the TMA register. The trap handler is just a normal program running in a privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter a trap handler, we first need to ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching, by saving a copy of the stable registers and the program counter, etc. The problem, though, is that we don’t have a stable ABI for GPUs, or at least not one I’m aware of, and compilers use all the registers they can, so we need to save everything.

AMD GPUs’ Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders. The problem is they’re not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here, since I didn’t have much luck understanding how it works, plus some other reasons I’ll discuss later in the post. The workaround is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, allowing us to use more instructions to copy the remaining registers. The main idea is to make use of the instruction, which adds the index of the current thread within the wave to the writing address, aka $$ ID_{thread} \times 4 + address $$. This allows us to write a unique value per thread using only TTMP registers, which are unique per wave, not per thread[^3], so we can save the context of a single wave.

[^3]: VGPRs are unique per thread, and SGPRs are unique per wave.

The problem is that if we have more than 1 wave, they will overlap, and we will have a race condition. Here is the code:

Now that we have those values in memory, we need to tell the CPU "hey, we got the data", and pause the GPU’s execution until the CPU issues a command. Also, notice we can just modify those from the CPU. Before we tell the CPU, we need to write some values that might help the CPU.
Here they are:

Now the GPU should just wait for the CPU, and here’s the spin code; it’s implemented as described by Marcell Kiss here:

The main loop on the CPU side is roughly: enable the trap handler, dispatch the shader, wait for the GPU to write a specific value at a specific address to signal that all the data is there, then examine and display it, and tell the GPU “all clear, go ahead”. Now that our uncached buffers are in play, we just keep looping and checking whether the GPU has written the register values. When it does, the first thing we do is halt the wave by writing into the register, to allow us to do whatever we want with the wave without causing any issues. Though if we halt for too long, the GPU CP will reset the command queue and kill the process, but we can change that behaviour by adjusting the lockup_timeout parameter of the amdgpu kernel module:

From here on, we can do whatever we want with the data we have: all the data we need to build a proper debugger. We will come back to what to do with the data in a bit; let’s assume we did what was needed for now.

Now that we’re done with the CPU, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave, and the trap handler should pick it up. We can resume by writing to the register again:

Then the GPU should continue. We need to restore everything and return the program counter to the original address. Based on whether it’s a hardware trap or not, the program counter may point to the instruction before or the instruction itself. The ISA manual and Marcell Kiss’s posts explain that well, so refer to them.

Now we can run compiled code directly, but we don’t want people to compile their code manually, then extract the text section, and give it to us. The plan is to take SPIR-V code, compile it correctly, then run it, or, even better, integrate with RADV and let RADV give us more information to work with. My main plan was to fork RADV and make it report the Vulkan calls for us, so we could have a better view of the GPU work and know the buffers/textures it’s using, etc. That seems like a lot more work though, so I’ll keep it in mind but won’t do it for now, unless someone is willing to pay me for it ;).

For now, let’s just use RADV’s compiler. Luckily, RADV has a mode, aka it will not do actual work or open DRM files, just a fake Vulkan device, which is perfect for our case here, since we care about nothing other than just compiling code. We can enable it by setting the env var , then we just call what we need like this:

Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent. Let’s see how we can make it an actual debugger.

We talked earlier about CPs natively supporting context-switching; this appears to be a compute-specific feature, which prevents us from implementing it for other types of shaders. Though it appears that mesh shaders and raytracing shaders are just compute shaders under the hood, which would allow us to use that functionality. For now, debugging one wave feels like enough; also, we can modify the wave parameters to debug some specific indices. Here are some of the features:

For stepping, we can use 2 bits: one in and the other in . They’re and , respectively. The former enters the trap handler after each instruction, and the latter enters before the first instruction. This means we can automatically enable instruction-level stepping.
Regarding breakpoints, I haven’t implemented them, but they’re rather simple to implement here: since we have the base address of the code buffer and know the size of each instruction, we can calculate the program counter locations ahead of time and have a list of them available to the GPU, and the trap handler can do a binary search on that list.

The ACO shader compiler does generate an instruction-level source code mapping, which is good enough for our purposes here. By taking the offset[^4] of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info.

[^4]: We can get that by subtracting the address of the code buffer from the current program counter.

We can implement this by marking the GPU page as protected. On a GPU fault, we enter the trap handler, check whether it’s within the range of our buffers and textures, and then act accordingly. Also, looking at the registers, we can find these: which suggests that the hardware already supports this natively, so we don’t even need to do that dance. It needs more investigation on my part, though, since I didn’t implement this.

This needs some serious plumbing, since we need to make NIR (Mesa’s intermediate representation) optimisation passes propagate debug info correctly. I already started on this here. Then we need to make ACO track variables and store the information. This requires ditching the simple UMD we made earlier and using RADV, which is what should happen eventually. Then our custom driver could pause before a specific frame, or get triggered by a key, and ask before each dispatch whether to attach to it or not, or something similar. Since we would have a full, proper Vulkan implementation, we would already have all the information we need, like buffers, textures, push constants, types, variable names, etc. That would be a much better and more pleasant debugger to use.

Finally, here’s some live footage: https://youtu.be/HDMC9GhaLyc

Here is some incomplete user-mode page walking code for gfx11, aka the rx7900xtx:

0 views
Kaushik Gopal 1 months ago

Combating AI coding atrophy with Rust

It’s no secret that I’ve fully embraced AI for my coding. A valid concern (and one I’ve been thinking about deeply) is the atrophying of the part of my brain that helps me code. To push back on that, I’ve been learning Rust on the side for the last few months. I am absolutely loving it.

Kotlin remains my go-to language. It’s the language I know like the back of my hand. If someone sends me a swath of Kotlin code, whether handwritten or AI generated, I can quickly grok it and form a strong opinion on how to improve it. But Kotlin is a high-level language that runs on a JVM. There are structural limits to the performance you can eke out of it, and for most of my career 1 I’ve worked with garbage-collected languages. For a change, I wanted a systems-level language, one without the training wheels of a garbage collector. I also wanted a language with a different core philosophy, something that would force me to think in new ways.

I picked up Go casually but it didn’t feel like a big enough departure from the languages I already knew. It just felt more useful to ask AI to generate Go code than to learn it myself. With Rust, I could get code translated, but then I’d stare at the generated code and realize I was missing some core concepts and fundamentals. I loved that! The first time I hit a lifetime error, I had no mental model for it. That confusion was exactly what I was looking for. Coming from a GC world, memory management is an afterthought — if it requires any thought at all. Rust really pushes you to think through the ownership and lifespan of your data, every step of the way. In a bizarre way, AI made this gap obvious. It showed me where I didn’t understand things and pointed me toward something worth learning.

Here’s some software that’s either built entirely in Rust or uses it in fundamental ways:

- fd (my tool of choice for finding files)
- ripgrep (my tool of choice for searching files)
- Fish shell (my shell of choice, recently rewritten in Rust)
- Zed (my text/code editor of choice)
- Firefox (my browser of choice)
- Android?! That’s right: Rust now powers some of the internals of the OS, including the recent Quick Share feature.

Many of the most important tools I use daily are built with Rust. Can’t hurt to know the language they’re written in.

Rust is quite similar to Kotlin in many ways. Both use strict static typing with advanced type inference. Both support null safety and provide compile-time guarantees. The compile-time strictness and higher-level constructs made it fairly easy for me to pick up the basics. Syntactically, it feels very familiar.

I started by rewriting a couple of small CLI tools I used to keep in Bash or Go. Even in these tiny programs, the borrow checker forced me to be clear about who owns what and when data goes away. It can be quite the mental workout at times, which is perfect for keeping that atrophy from setting in. After that, I started to graduate to slightly larger programs and small services.

There are two main resources I keep coming back to:

- Fondly referred to as “The Book”. There’s also a convenient YouTube series following the book.
- Google’s Comprehensive Rust course, presumably created to ramp up their Android team. It even has a dedicated Android chapter.

This worked beautifully for me.

There are times when the book or course mentions a concept and I want to go deeper. Typically, I’d spend time googling, searching Stack Overflow, finding references, diving into code snippets, and trying to clear up small nuances. But that’s changed dramatically with AI. One of my early aha moments with AI was how easy it made ramping up on code. The same is true for learning a new language like Rust. For example, what’s the difference 2 between these two (see the sketch at the end of this post):

Another thing I loved doing is asking AI: what are some idiomatic ways people use these concepts? Here’s a prompt I gave Gemini while learning:

Here’s an abbreviated response (the full response was incredibly useful):

It’s easy to be doom and gloom about AI in coding — the “we’ll all forget how to program” anxiety is real. But I hope this offers a more hopeful perspective.
If you’re an experienced developer worried about skill atrophy, learn a language that forces you to think differently. AI can help you cross that gap faster. Use it as a tutor, not just a code generator.

I did a little C/C++ in high school, but nowhere close to proficiency. ↩︎

Think mutable var to a “shared reference” vs. immutable var to an “exclusive reference”. ↩︎
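As referenced above, here is a minimal Rust sketch of the two declarations that footnote 2 contrasts. It's my own illustration, not the snippet from the original post:

```rust
fn main() {
    let mut data = vec![1, 2, 3];
    let other = vec![4, 5];

    // Mutable *binding* holding a shared reference: the binding can be
    // re-pointed, but the Vec cannot be modified through it.
    let mut shared: &Vec<i32> = &data;
    println!("shared points at {shared:?}");
    shared = &other; // fine: rebinding the reference
    // shared.push(6); // error: cannot borrow `*shared` as mutable
    println!("shared now points at {shared:?}");

    // Immutable binding holding an exclusive reference: the binding cannot
    // be re-pointed, but the Vec *can* be modified through it.
    let exclusive: &mut Vec<i32> = &mut data;
    exclusive.push(6); // fine: mutating through `&mut`
    // exclusive = &mut another_vec; // error: cannot assign twice to immutable variable
    println!("exclusive points at {exclusive:?}");
}
```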

0 views
neilzone 1 months ago

Adding a button to Home Assistant to run a shell command

Now that I have fixed my garage door controller, I wanted to see if I could use it from within Home Assistant, primarily so that I can then have a widget on my phone screen. In this case, the shell command is a simple invocation to a remote machine. (I am not aware of a way to add a button to an Android home screen to run a shell command or bash script.)

Adding a button to run a shell command or bash script in Home Assistant was pretty straightforward, following the Home Assistant documentation for shell commands. To my configuration.yaml file, I added something like the sketch at the end of this post.

The quirk here was that reloading the configuration.yaml file from within the Home Assistant UI was insufficient. I needed to completely restart Home Assistant to pick up the changes. Once I had restarted Home Assistant, the shell commands were available.

To add buttons, I needed to create “helpers”. I did this from the Home Assistant UI, via Settings / Devices & services / Helpers / Create helper. One helper for each button. After I had created each helper, I went back into the helper’s settings to add zone information, so that it appeared in the right place in the dashboard.

Having created the button/helper and the shell command, I used an automation to link the two together. I did this via the Home Assistant UI, Settings / Automations & scenes. For the button, it was linked to a change in state of the button, with no parameters specified. The “then do” action is the shell command for the door in question.
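The shell_command entry itself is only a couple of lines. This is a sketch with placeholder hostnames, key paths, and script names, not the author's actual configuration:

```yaml
# configuration.yaml
shell_command:
  open_garage_door: >-
    ssh -i /config/.ssh/id_ed25519 -o StrictHostKeyChecking=no
    pi@garagepi.local '/usr/local/bin/garage-door open'
  close_garage_door: >-
    ssh -i /config/.ssh/id_ed25519 -o StrictHostKeyChecking=no
    pi@garagepi.local '/usr/local/bin/garage-door close'
```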

0 views

I want a better action graph serialization

This post is part 3/4 of a series about build systems . The next post and last post is I want a better build executor . As someone who ends up getting the ping on "my build is weird" after it has gone through a round of "poke it with a stick", I would really appreciate the mechanisms for [correct dependency edges] rolling out sooner rather than later. In a previous post , I talked about various approaches in the design space of build systems. In this post, I want to zero in on one particular area: action graphs. First, let me define "action graph". If you've ever used CMake, you may know that there are two steps involved: A "configure" step ( ) and a build step ( or ). What I am interested here is what generates , the Makefiles it has created. As the creator of ninja writes , this is a serialization of all build steps at a given moment in time, with the ability to regenerate the graph by rerunning the configure step. This post explores that design space, with the goal of sketching a format that improves on the current state while also enabling incremental adoption. When I say "design space", I mean a serialization format where files are machine-generated by a configure step, and have few enough and restricted enough features that it's possible to make a fast build executor . Not all build systems serialize their action graph. and run persistent servers that store it in memory and allow querying it, but never serialize it to disk. For large graphs, this requires a lot of memory; has actually started serializing parts of its graph to reduce memory usage and startup time . The nix evaluator doesn’t allow querying its graph at all; nix has a very strange model where it never rebuilds because each change to your source files is a new “ input-addressed derivation ” and therefore requires a reconfigure. This is the main reason it’s only used to package software, not as an “inner” build system, because that reconfigure can be very slow. I’ve talked to a couple Nix maintainers and they’ve considered caching parts of the configure step, without caching its outputs (because there are no outputs, other than derivation files!) in order to speed this up. This is much trickier because it requires serializing parts of the evaluator state. Tools that do serialize their graph include CMake, Meson, and the Chrome build system ( GN ). Generally, serializing the graph comes in handy when: In the last post I talked about 4 things one might want from a build system: For a serialization format, we have slightly different constraints. Throughout this post, I'll dive into detail on how these 3 overarching goals apply to the serialization format, and how well various serializations achieve that goal. The first one we'll look at, because it's the default for CMake, is and Makefiles. Make is truly in the Unix spirit: easy to implement 2 , very hard to use correctly. Make is ambiguous , complicated , and makes it very easy to implicitly do a bunch of file system lookups . It supports running shell commands at the top-level, which makes even loading the graph very expensive. It does do pretty well on minimizing reconfigurations, since the language is quite flexible. Ninja is the other generator supported by CMake. Ninja is explicitly intended to work on a serialized action graph; it's the only tool I'm aware of that is. It solves a lot of the problems of Make : it removes many of the ambiguities; it doesn't have any form of globbing; and generally it's a much simpler and smaller language. 
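For readers who haven't looked inside one, here is roughly what a hand-written build.ninja for a one-file C program looks like (normally a configure step generates this for you; this sketch is mine, not from the post):

```ninja
# Rules describe how to run a tool; build statements instantiate them.
rule cc
  command = cc -MMD -MF $out.d -c $in -o $out
  depfile = $out.d
  deps = gcc
  description = CC $out

rule link
  command = cc $in -o $out
  description = LINK $out

build main.o: cc src/main.c
build app: link main.o

default app
```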
Unfortunately, Ninja's build file format still has some limitations. First, it has no support for checksums. It's possible to work around that by using and having a wrapper script that doesn't overwrite files unless they've changed, but that's a lot of extra work and is annoying to make portable between operating systems. Ninja files also have trouble expressing correct dependency edges. Let's look at a few examples, one by one. In each of these cases, we either have to reconfigure more often than we wish, or we have no way at all of expressing the dependency edge.

See my previous post about negative dependencies. The short version is that build files need to specify not just the files they expect to exist, but also the files they expect not to exist. There's no way to express this in a ninja file, short of reconfiguring every time a directory that might contain a negative dependency is modified, which itself has a lot of downsides.

Say that you have a C project with just a . You rename it to and ninja gives you an error that main.c no longer exists. Annoyed at editing ninja files by hand, you decide to write a generator 3 : Note that this registers an implicit dependency on the current directory. This should automatically detect that you renamed your file and rebuild for you. Oh. Right. Generating build.ninja also modifies the current directory, which creates an infinite loop. It's possible to work around this by putting your C file in a source directory: There's still a problem here, though—did you notice it? Our old target is still lying around. Ninja actually has enough information recorded to fix this: . But it's not run automatically. The other problem is that this approach rebuilds far too often. In this case, we wanted to support renames, so in Ninja's model we need to depend on the whole directory. But that's not what we really depended on—we only care about files. I would like to see an action graph format that has an event-based system, where it says "this file was created, make any changes to the action graph necessary", and cuts the build short if the graph wasn't changed.

For flower, I want to go further and support deletions: source files and targets that are optional, that should not fail the build if they aren't present, but should cause a rebuild if they are created, modified, or deleted. Ninja has no way of expressing this.

Ninja has no way to express “this node becomes dirty when an environment variable changes”. The closest you can get is hacks with and the checksum wrapper/restat hack, but it's a pain to express and it gets much worse if you want to depend on multiple variables.

At this point, we have a list of constraints for our file format: Ideally, it would even be possible to mechanically translate existing .ninja files to this new format.

This sketches out a new format that could improve over Ninja files. It could look something like this: I'd call this language Magma, since it's the simplest kind of set with closure. Here's a sample action graph in Magma:

Note some things about Magma:

Kinda. Magma itself has a lot in common with Starlark: it's deterministic, hermetic, immutable, and can be evaluated in parallel. The main difference between the languages themselves is that Clojure has (equivalent to sympy symbolic variables) and Python doesn't.
Some of these could be rewritten to keyword arguments, and others could be rewritten to structs, or string keys for a hashmap, or enums; but I'm not sure how much benefit there is to literally using Starlark when these files are being generated by a configure step in any case. Probably it's possible to make a 1-1 mapping between the two in any case. Buck2 has support for metadata that describes how to execute a built artifact. I think this is really interesting; is a much nicer interface than , partly because of shell quoting and word splitting issues, and partly just because it's more discoverable. I don't have a clean idea for how to fit this into a serialization layer. "Don't put it there and use a instead" works , but makes it hard to do things like allow the build graph to say that an artifact needs set or something like that, you end up duplicating the info in both files. Perhaps one option could be to attach a key/value pair to s. Well, yes and no. Yes, in that this has basically all the features of ninja and then some. But no, because the rules here are all carefully constrained to avoid needing to do expensive file I/O to load the build graph. The most expensive new feature is , and it's intended to avoid an even more expensive step (rerunning the configuration step). It's also limited to changed files; it can't do arbitrary globbing on the contents of the directory the way that Make pattern rules can. Note that this also removes some features in ninja: shell commands are gone, process spawning is much less ambiguous, files are no longer parsed automatically. And because this embeds a clojure interpreter, many things that were hard-coded in ninja can instead be library functions: , response files, , . In this post, we have learned some downsides of Make and Ninja's build file formats, sketched out how they could possibly be fixed, and designed a language called Magma that has those characteristics. In the next post, I'll describe the features and design of a tool that evaluates and queries this language. see e.g. this description of how it works in buck2 ↩ at least a basic version—although several of the features of GNU Make get rather complicated. ↩ is https://github.com/ninja-build/ninja/blob/231db65ccf5427b16ff85b3a390a663f3c8a479f/misc/ninja_syntax.py . ↩ technically these aren't true monadic builds because they're constrained a lot more than e.g. Shake rules, they can't fabricate new rules from whole cloth. but they still allow you to add more outputs to the graph at runtime. ↩ This goes all the way around the configuration complexity clock and skips the "DSL" phase to simply give you a real language. ↩ This is totally based and not at all a terrible idea. ↩ This has a whole bunch of problems on Windows, where arguments are passed as a single string instead of an array, and each command has to reimplement its own parsing. But it will work “most” of the time, and at least avoids having to deal with Powershell or CMD quoting. ↩ To make it possible to distinguish the two on the command line, could unambiguously refer to the group, like in Bazel. ↩ You don’t have a persistent server to store it in memory. When you don’t have a server, serializing makes your startup times much faster, because you don’t have to rerun the configure step each time. You don’t have a remote build cache. When you have a remote cache, the rules for loading that cache can be rather complicated because they involve network queries 1 . 
When you have a local cache, loading it doesn’t require special support because it’s just opening a file. You want to support querying, process spawning, and progress updates without rewriting the logic yourself for every OS (i.e. you don't want to write your own build executor). a "real" language in the configuration step reflection (querying the build graph) file watching support for discovering incorrect dependency edges We care about it being simple and unambiguous to load the graph from the file, so we get fast incremental rebuild speed and graph queries. In particular, we want to touch the filesystem as little as possible while loading. We care about supporting "weird" dependency edges, like dynamic dependencies and the depfiles emitted by a compiler after the first run, so that we're able to support more kinds of builds. And finally, we care about minimizing reconfigurations : we want to be able to express as many things as possible in the action graph so we don't have the pay the cost of rerunning the configure step. This tends to be at odds with fast graph loading; adding features at this level of the stack is very expensive! Negative dependencies File rename dependencies Optional file dependencies Optional checksums to reduce false positives Environment variable dependencies "all the features of ninja" (depfiles, monadic builds through 4 , a statement, order-only dependencies) A very very small clojure subset (just , , EDN , and function calls) for the text itself, no need to make loading the graph harder than necessary 5 . If people really want an equivalent of or I suppose this could learn support for and , but it would have much simpler rules than Clojure's classpath. It would not have support for looping constructs, nor most of clojure's standard library. -inspired dependency edges: (for changes in file attributes), (for changes in the checksum), (for optional dependencies), ; plus our new edge. A input function that can be used anywhere a file path can (e.g. in calls to ) so that the kind of edge does not depend on whether the path is known in advance or not. Runtime functions that determine whether the configure step needs to be re-run based on file watch events 6 . Whether there is actually a file watcher or the build system just calculates a diff on its next invocation is an implementation detail; ideally, one that's easy to slot in and out. “phony” targets would be replaced by a statement. Groups are sets of targets. Groups cannot be used to avoid “input not found” errors; that niche is filled by . Command spawning is specified as an array 7 . No more dependency on shell quoting rules. If people want shell scripts they can put that in their configure script. Redirecting stdout no longer requires bash syntax, it's supported natively with the parameter of . Build parameters can be referred to in rules through the argument. is a thunk ; it only registers an intent to add edges in the future, it does not eagerly require to exist. Our input edge is generalized and can apply to any rule, not just to the configure step. It executes when a file is modified (or if the tool doesn’t support file watching, on each file in the calculated diff in the next tool invocation). Our edge provides the file event type, but not the file contents. This allows ronin to automatically map results to one of the three edge kinds: , , . and are not available through this API. We naturally distinguish between “phony targets” and files because the former are s and the latter are s. 
No more accidentally failing to build if an file is created. 8 We naturally distinguish between “groups of targets” and “commands that always need to be rerun”; the latter just uses . Data can be transformed in memory using clojure functions without needing a separate process invocation. No more need to use in your build system.
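One claim in the list above, that specifying command spawning as an argv array removes the dependency on shell quoting rules, is easy to demonstrate with a short Bash sketch (the strings here are invented purely for illustration):

```bash
# A command kept in a single string gets re-split on whitespace when the
# variable is expanded, so an argument containing a space becomes two arguments.
file="my notes.txt"
printf 'arg: %s\n' $file      # two lines: "arg: my" and "arg: notes.txt"
printf 'arg: %s\n' "$file"    # one line:  "arg: my notes.txt"

# An array keeps a one-to-one mapping between elements and arguments, which is
# the property an array-based spawn interface gives you for free.
cmd=(printf 'arg: %s\n' "my notes.txt" "another file.txt")
"${cmd[@]}"                   # each element arrives as exactly one argument
```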

0 views
マリウス 1 month ago

disable-javascript.org

With several posts on this website attracting significant views in the last few months, I have come across plenty of feedback on the tab gimmick implemented last quarter. While the replies that I came across on platforms like the Fediverse and Bluesky were lighthearted and oftentimes humorous, the visitors coming from traditional link aggregators sadly weren’t as amused about it. Obviously a large majority of people disagreeing with the core message behind this prank appear to be web developers, whose very existence quite literally depends on JavaScript, and who didn’t hold back in expressing their anger in the comment sections as well as through direct emails. Unfortunately, most commenters are missing the point. This email exchange is just one example of feedback that completely misses the point: I just found it a bit hilarious that your site makes notes about ditching and disable Javascript, and yet Google explicitly requires it for the YouTube embeds. Feels weird. The email contained the following attachment: Given the lack of context, I assume that the author was referring to the YouTube embeds on this website (e.g. on the keyboard page). Here is my reply: Simply click the link on the video box that says “Try watching this video on www.youtube.com” and you should be directed to YouTube (or a frontend of your choosing with LibRedirect [1]) where you can watch it. Sadly, I don’t have the influence to convince YouTube to make their video embeds working without JavaScript enabled. ;-) However, if more people would disable JavaScript by default, maybe there would be a higher incentive for server-side-rendering and video embeds would at the very least show a thumbnail of the video (which YouTube could easily do, from a technical point of view). Kind regards! [1]: https://libredirect.github.io It also appears that many of the people disliking the feature didn’t care to properly read the highlighted part of the popover that says “Turn JavaScript off, now, and only allow it on websites you trust!”: Indeed - and the author goes on to show a screenshot of Google Trends which, I’m sure, won’t work without JavaScript turned on. This comment perfectly encapsulates the flawed rhetoric. Google Trends (like YouTube in the previous example) is a website that is unlikely to exploit 0-days in your JavaScript engine, or at least that’s the general consensus. However, when you click on a link that looks like someone typed it by putting their head on the keyboard, and it leads you to a website you obviously didn’t know beforehand, it’s a different story. What I’m advocating for is to have JavaScript disabled by default for everything unknown to you, and only enable it for websites that you know and trust. Not only is this approach going to protect you from jump-scares, regardless of whether that’s a changing tab title, a popup, or an actual exploit, but it will hopefully pivot the thinking of web developers in particular back from “Let’s render the whole page using JavaScript and display nothing if it’s disabled” towards “Let’s make the page as functional as possible without the use of JavaScript and only sprinkle it on top as a way to make the experience better for anyone who chooses to enable it”. It is mind-boggling how this simple take is perceived as militant techno-minimalism and can provoke such salty feedback. I keep wondering whether these are the same people that consider to be a generally okay way to install software …?
One of the many commenters who did agree with the approach that I’m taking on this site put it fairly nicely: About as annoying as your friend who bumped key’ed his way into your flat in 5 seconds waiting for you in the living room. Or the protest blocking the highway making you late for work. Many people don’t realize that JavaScript means running arbitrary untrusted code on your machine. […] Maybe the hacker ethos has changed, but I for one miss the days of small pranks and nudges to illustrate security flaws, instead of ransomware and exploits for cash. A gentle reminder that we can all do better, and the world isn’t always all that friendly. As the author of this comment correctly hints, the hacker ethos has in fact changed. My guess is that only a tiny fraction of the people who are actively commenting on platforms like Hacker News or Reddit these days know about, let’s say, cDc’s Back Orifice, the BOFH stories, bash.org, and all the kickme.to/* links that would trigger a disconnect in AOL’s dialup desktop software. Hence, the understanding of how far pranks in the 90s and early 2000s really went simply isn’t there. And with most things these days required to be politically correct, having the tab change to what looks like a Google image search for “sam bankman-fried nudes” is therefore frowned upon by many, even when the reason behind it is to inform. Frankly, it seems that conformism has eaten not only the internet, but to an extent the whole world, when an opinion that goes ever so slightly against the status quo is labelled as some sort of extreme view. To feel even just a “tiny bit violated by” something as mundane as changing text and an icon in the browser’s tab bar seems absurd, especially when it is you who allowed my website to run arbitrary code on your computer! Because I’m convinced that a principled stance against the insanity that is the modern web is necessary, I am doubling down on this effort by making it an actual initiative: disable-javascript.org disable-javascript.org is a website that informs the average user about some of the most severe issues affecting the JavaScript ecosystem and browsers/users all over the world, and explains in simple terms how to disable JavaScript in various browsers and only enable it for specific, trusted websites. The site is linked on the JavaScript popover that appears on this website, so that visitors aren’t only pranked into hopefully disabling JavaScript, but can also easily find out how to do so. disable-javascript.org offers a JavaScript snippet that is almost identical to the one in use by this website, in case you would like to participate in the cause. Of course, you can also simply link to disable-javascript.org from anywhere on your website to show your support. If you’d like to contribute to the initiative by extending the website with valuable info, you can do so through its Git repository. Feel free to open pull-requests with the updates that you would like to see on disable-javascript.org. :-)

0 views
Simon Willison 1 month ago

Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast titled Data Journalism Unleashed with Simon Willison . I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links. What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist." The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty . And at the time we thought we were building a content management system." Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page . And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week." Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process." Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [ the opioid files ] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'" NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet ], which is astonishing." The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? 
Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'" Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal." Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York." Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it." Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly , where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people." Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'" Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly." The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries . Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that." Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it." The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%." 
Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them." Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it." Two types of documentation: official docs vs. temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them." Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it." Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company." The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash." Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything." Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025." Cursor for data? Generic agent loops vs. 
data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts." Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box." Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff." LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it." Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something." Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. 
And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included elements. The project uses the following custom instructions You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine. I then added a follow-up prompt saying: Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end Then suggest a very comprehensive list of supporting links I could find Here's the full Claude transcript of the analysis.

0 views
Dangling Pointers 1 month ago

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu SOSP'25 Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn’t that what an OS is for? My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to try to harden a large OS such that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor. This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first way involves widespread changes to legacy code, and the other way involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions. The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3 ) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS. The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dynamic Scheduling Headaches Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren’t many packets flowing. Section 3.2 says this is a no-go. It would be great if there was more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured such that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock, and then synchronously communicate with the accelerator. The authors take this design as immutable. 
It seems plausible that the communication with the accelerator could be done in an asynchronous manner, but that would likely have ramifications for the entire control-plane software stack. This quote is telling: Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification. Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU), the other two are physical CPUs (pCPU). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) is allowing the vCPU to run at times of low data-plane activity. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This enables the system to hide the latency of the context switch from vCPU to pCPU. Think of it like a group of students in a classroom without any teachers (the teachers being network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees the teacher (a network packet) returning, they shout “act responsible”, and everyone returns to their schoolwork (running data-plane code). Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Results Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dangling Pointers To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important.

0 views
Kev Quirk 1 month ago

Why Do You Need Big Tech for Your SSG?

A look at why small, personal websites don’t need big-tech static hosting, and how a simple local build and rsync workflow gives you faster deploys, more control, and far fewer dependencies. OK, so Cloudflare shit the bed yesterday and the Internet went into meltdown. A config file grew too big and half the bloody web fell over. How fragile. It got me thinking about my fellow small-web compatriots, their SSG workflows, and why on earth so many rely on services like Cloudflare Pages and Netlify. For personal sites it feels incredibly wasteful: you’re spinning up a VM, building your site, pushing the result to their platform, then tearing the VM down again. Why not just build the site on your local machine? You’re not beholden to anyone, and you can host your site anywhere you like. All you need is a hosting package that supports SSH (or FTP if you must) and a small script to build your site and rsync any changes. Here’s the core of my deployment script: Here’s what it does: Jumps into the directory where my source website files live. Builds the Jekyll site locally. Syncs the built files to my server over SSH, deleting anything I’ve removed locally. That’s it. And that’s all it needs to do. With these few lines of Bash, I can deploy anywhere, without waiting for someone else’s infrastructure to spin up a build container. ✅ No CI/CD pipeline. ✅ No big tech — just you and your server. ✅ No VMs spinning up and down at the speed of a thousand gazelles. My full script also checks the git status, commits changes, and clears the Bunny CDN cache, but none of that’s required. The snippet above does everything Cloudflare Pages and similar services do — and does it much quicker. My entire deploy, including the extras, takes about eight seconds. If you’re hosting with one of the big static hosting platforms, why not consider moving away and actually owning your little corner of the web? They’re great for complex projects, but unnecessary for most personal sites. Then, the next time big tech has a brain fart, your patch of the web will probably sail right through it.
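As a rough sketch of the deploy script described above (build the Jekyll site locally, then rsync the output over SSH), something like the following would do it; the directory, host, and remote path are placeholders rather than the author's actual setup:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholders: substitute your own source directory and server details.
SITE_DIR="$HOME/sites/example.com"
REMOTE="user@example.com:/var/www/example.com/"

# 1. Jump into the directory where the source files live.
cd "$SITE_DIR"

# 2. Build the Jekyll site locally into _site/.
bundle exec jekyll build

# 3. Sync the built files to the server over SSH, deleting anything
#    that was removed locally so the server mirrors _site/ exactly.
rsync -avz --delete _site/ "$REMOTE"
```

The --delete flag is what keeps the remote directory in lockstep with the local build, mirroring removals as well as additions.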

0 views
Fernando Borretti 2 months ago

Ad-Hoc Emacs Packages with Nix

You can use Nix as a package manager for Emacs, like so: Today I learned you can also use it to create ad-hoc packages for things not in MELPA or nixpkgs . The other day I wanted to get back into Inform 7 , naturally the first stack frame of the yak shave was to look for an Emacs mode. exists, but isn’t packaged anywhere. So I had to vendor it in. You can use git submodules for this, but I have an irrational aversion to submodules. Instead I did something far worse: I wrote a Makefile to download the from GitHub, and used home-manager to copy it into my . Which is nasty. And of course this only works for small, single-file packages. And, on top of that: whatever dependencies your vendored packages need have to be listed in , which confuses the packages you want, with the transitive dependencies of your vendored packages. I felt like the orange juice bit from The Simpsons . There must be a better way! And there is. With some help from Claude, I wrote this: Nix takes care of everything: commit pinning, security (with the SHA-256 hash), dependencies for custom packages. And it works wonderfully. Armed with a new hammer, I set out to drive some nails. Today I created a tiny Haskell project, and when I opened the file, noticed it had no syntax highlighting. I was surprised to find there’s no in MELPA. But coincidentally, someone started working on this literally three weeks ago ! So I wrote a small expression to package this new : A few weeks back I switched from macOS to Linux, and since I’m stuck on X11 because of stumpwm , I’m using XCompose to define keybindings for entering dashes, smart quotes etc. It bothered me slightly that my file didn’t have syntax highlighting. I found in kragen’s repo , but it’s slightly broken (it’s missing a call at the end). I started thinking how hard it would be to write a Nix expression to modify the source after fetching, when I found that Thomas Voss hosts a patched version here . Which made this very simple: Somehow the version of in nixpkgs unstable was missing the configuration option to use a custom shell. Since I want to use nu instead of bash, I had to package this myself from the latest commit: I started reading Functional Programming in Lean recently, and while there is a , it’s not packaged anywhere. This only required a slight deviation from the pattern: when I opened a file I got an error about a missing JSON file, consulting the README for , it says: If you use a source-based package-manager (e.g. , Straight or Elpaca), then make sure to list the directory in your Lean4-Mode package recipe. To do this I had to use rather than :

0 views

the terminal of the future

Terminal internals are a mess. A lot of it is just the way it is because someone made a decision in the 80s and now it’s impossible to change. This is what you have to do to redesign infrastructure. Rich [Hickey] didn't just pile some crap on top of Lisp [when building Clojure]. He took the entire Lisp and moved the whole design at once. At a very very high level, a terminal has four parts: I lied a little bit above. "input" is not just text. It also includes signals that can be sent to the running process. Converting keystrokes to signals is the job of the PTY. Similarly, "output" is not just text. It's a stream of ANSI Escape Sequences that can be used by the terminal emulator to display rich formatting. I do some weird things with terminals. However, the amount of hacks I can get up to is pretty limited, because terminals are pretty limited. I won't go into all the ways they're limited, because it's been rehashed many times before. What I want to do instead is imagine what a better terminal can look like. The closest thing to a terminal analog that most people are familiar with is Jupyter Notebook. This offers a lot of cool features that are not possible in a "traditional" VT100 emulator: high fidelity image rendering a "rerun from start" button (or rerun the current command; or rerun only a single past command) that replaces past output instead of appending to it "views" of source code and output that can be rewritten in place (e.g. markdown can be viewed either as source or as rendered HTML) a built-in editor with syntax highlighting, tabs, panes, mouse support, etc. Jupyter works by having a "kernel" (in this case, a python interpreter) and a "renderer" (in this case, a web application displayed by the browser). You could imagine using a Jupyter Notebook with a shell as the kernel, so that you get all the nice features of Jupyter when running shell commands. However, that quickly runs into some issues: It turns out all these problems are solvable. There exists today a terminal called Warp. Warp has built native integration between the terminal and the shell, where the terminal understands where each command starts and stops, what it outputs, and what is your own input. As a result, it can render things very prettily: It does this using (mostly) standard features built into the terminal and shell (a custom DCS): you can read their explanation here. It's possible to do this less invasively using OSC 133 escape codes; I'm not sure why Warp didn't do this, but that's ok. iTerm2 does a similar thing, and this allows it to enable really quite a lot of features: navigating between commands with a single hotkey; notifying you when a command finishes running; showing the current command as an "overlay" if the output goes off the screen. This is really three different things. The first is interacting with a long-lived process. The second is suspending the process without killing it. The third is disconnecting from the process, in such a way that the process state is not disturbed and is still available if you want to reconnect. To interact with a process, you need bidirectional communication, i.e. you need a "cell output" that is also an input. An example would be any TUI, like , , or 1 . Fortunately, Jupyter is really good at this! The whole design is around having interactive outputs that you can change and update.
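As an aside on the shell-integration point above: the OSC 133 "semantic prompt" markers can be emitted from stock Bash with a few lines like the following. This is a minimal sketch, assuming Bash 4.4 or newer (for PS0) and a terminal that understands these codes (iTerm2, WezTerm, and foot, among others); it is not Warp's custom DCS.

```bash
# OSC 133;A = start of prompt, 133;B = end of prompt / start of user input.
PS1='\[\e]133;A\a\]\u@\h:\w\$ \[\e]133;B\a\]'

# PS0 is expanded after a command is read and just before it runs:
# OSC 133;C marks the start of command output.
PS0='\e]133;C\a'

# Before the next prompt is drawn, report that the previous command
# finished, along with its exit status (OSC 133;D).
PROMPT_COMMAND='printf "\e]133;D;%s\a" "$?"'
```

With those markers in the byte stream, the terminal can tell prompts, user input, and program output apart, which is what makes features like jump-to-previous-command and finished-command notifications possible.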
Additionally, I would expect my terminal to always have a "free input cell", as Matklad describes in A Better Shell , where the interactive process runs in the top half of the window and an input cell is available in the bottom half. Jupyter can do this today, but "add a cell" is manual, not automatic. "Suspending" a process is usually called " job control ". There's not too much to talk about here, except that I would expect a "modern" terminal to show me all suspended and background processes as a de-emphasized persistent visual, kinda like how Intellij will show you "indexing ..." in the bottom taskbar. There are roughly three existing approaches for disconnecting and reconnecting to a terminal session (Well, four if you count reptyr ). Tmux / Zellij / Screen These tools inject a whole extra terminal emulator between your terminal emulator and the program. They work by having a "server" which actually owns the PTY and renders the output, and a "client" that displays the output to your "real" terminal emulator. This model lets you detach clients, reattach them later, or even attach multiple clients at once. You can think of this as a "batteries-included" approach. It also has the benefit that you can program both the client and the server (although many modern terminals, like Kitty and Wezterm are programmable now); that you can organize your tabs and windows in the terminal (although many modern desktop environments have tiling and thorough keyboard shortcuts); and that you get street cred for looking like Hackerman. The downside is that, well, now you have an extra terminal emulator running in your terminal, with all the bugs that implies . iTerm actually avoids this by bypassing the tmux client altogether and acting as its own client that talks directly to the server. In this mode, "tmux tabs" are actually iTerm tabs, "tmux panes" are iTerm panes, and so on. This is a good model, and I would adopt it when writing a future terminal for integration with existing tmux setups. Mosh is a really interesting place in the design space. It is not a terminal emulator replacement; instead it is an ssh replacement. Its big draw is that it supports reconnecting to your terminal session after a network interruption. It does that by running a state machine on the server and replaying an incremental diff of the viewport to the client . This is a similar model to tmux, except that it doesn't support the "multiplexing" part (it expects your terminal emulator to handle that), nor scrollback (ditto). Because it has its own renderer, it has a similar class of bugs to tmux . One feature it does have, unlike tmux, is that the "client" is really running on your side of the network, so local line editing is instant. alden / shpool / dtach / abduco / diss These all occupy a similar place in the design space: they only handle session detach/resume with a client/server, not networking or scrollback, and do not include their own terminal emulator. Compared to tmux and mosh, they are highly decoupled. I'm going to treat these together because the solution is the same: dataflow tracking. Take as an example pluto.jl , which does this today by hooking into the Julia compiler. Note that this updates cells live in response to previous cells that they depend on. Not pictured is that it doesn't update cells if their dependencies haven't changed. You can think of this as a spreadsheet-like Jupyter, where code is only rerun when necessary. You may say this is hard to generalize. The trick here is orthogonal persistence . 
If you sandbox the processes, track all IO, and prevent things that are "too weird" unless they're talking to other processes in the sandbox (e.g. unix sockets and POST requests), you have really quite a lot of control over the process! This lets you treat it as a pure function of its inputs, where its inputs are "the whole file system, all environment variables, and all process attributes".

Once you have these primitives—Jupyter notebook frontends, undo/redo, automatic rerun, persistence, and shell integration—you can build really quite a lot on top:

- Runbooks (actually, you can build these just with Jupyter and a PTY primitive).
- Terminal customization that uses normal CSS, no weird custom languages or ANSI color codes.
- Search for commands by output/timestamp. Currently, you can search across output in the current session, or you can search across all command input history, but you don't have any kind of smart filters, and the output doesn't persist across sessions.
- Timestamps and execution duration for each command.
- Local line-editing, even across a network boundary.
- IntelliSense for shell commands, without having to hit tab and with rendering that's integrated into the terminal.
- "All the features from sandboxed tracing": collaborative terminals, querying files modified by a command, "asciinema but you can edit it at runtime", tracing build systems.
- Extend the smart search above to also search by disk state at the time the command was run.
- Extending undo/redo to a git-like branching model (something like this is already supported by emacs undo-tree), where you have multiple "views" of the process tree.
- Given the undo-tree model, and since we have sandboxing, we can give an LLM access to your project, and run many of them in parallel at the same time without overwriting each other's state, and in such a way that you can see what they're doing, edit it, and save it into a runbook for later use.
- A terminal in a prod environment that can't affect the state of the machine, only inspect the existing state.

And you can build it incrementally, piece-by-piece. jyn, you may say, you can't build vertical integration in open source. you can't make money off open source projects. the switching costs are too high. All these things are true. To talk about how this is possible, we have to talk about incremental adoption. if I were building this, I would do it in stages, such that at each stage the thing is an improvement over its alternatives. This is how works and it works extremely well: it doesn't require everyone on a team to switch at once because individual people can use , even for single commands, without a large impact on everyone else.

When people think of redesigning the terminal, they always think of redesigning the terminal emulator. This is exactly the wrong place to start. People are attached to their emulators. They configure them, they make them look nice, they use their keybindings. There is a high switching cost to switching emulators because everything affects everything else. It's not so terribly high, because it's still individual and not shared across a team, but still high. What I would do instead is start at the CLI layer. CLI programs are great because they're easy to install and run and have very low switching costs: you can use them one-off without changing your whole workflow. So, I would write a CLI that implements transactional semantics for the terminal. You can imagine an interface something like , where everything run after is undoable. There is a lot you can do with this alone; I think you could build a whole business off this.

Once I had transactional semantics, I would try to decouple persistence from tmux and mosh. To get PTY persistence, you have to introduce a client/server model, because the kernel really really expects both sides of a PTY to always be connected. Using commands like alden, or a library like it (it's not that complicated), lets you do this simply, without affecting the terminal emulator or the programs running inside the PTY session. To get scrollback, the server could save input and output indefinitely and replay them when the client reconnects. This gets you "native" scrollback—the terminal emulator you're already using handles it exactly like any other output, because it looks exactly like any other output—while still being replayable and resumable from an arbitrary starting point. This requires some amount of parsing ANSI escape codes 2 , but it's doable with enough work. To get network resumption like mosh, my custom server could use Eternal TCP (possibly built on top of QUIC for efficiency). Notably, the persistence for the PTY is separate from the persistence for the network connection. Eternal TCP here is strictly an optimization: you could build this on top of a bash script that runs in a loop, it's just not as nice an experience because of network delay and packet loss. Again, composable parts allow for incremental adoption.
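None of these building blocks are exotic; existing tools already demonstrate the PTY-persistence and replay pieces in isolation, which gives a feel for what the proposed server would do (these are standard commands, not part of any hypothetical terminal):

    # PTY persistence via a tiny client/server: dtach or abduco own the PTY
    # and let clients detach and reattach without disturbing the process.
    dtach -A /tmp/build.sock bash   # attach to the socket, creating it if needed
    abduco -A build bash            # same idea, with a named session "build"

    # Scrollback-as-replay: util-linux's script/scriptreplay save the raw
    # byte stream (escape codes and all) and play it back into any terminal.
    script -t 2>build.tm build.log
    scriptreplay build.tm build.log

A persistence server along the lines described above would effectively combine the two, recording continuously and replaying only from the point where the client left off.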
At this point, you're already able to connect multiple clients to a single terminal session, like tmux, but window management is still done by your terminal emulator, not by the client/server. If you wanted to have window management integrated, the terminal emulator could speak the tmux -CC protocol, like iTerm. All parts of this stage can be done independently of and in parallel with the transactional semantics, but I don't think you can build a business off them; it's not enough of an improvement over the existing tools.

This bit depends on the client/server model. Once you have a server interposed between the terminal emulator and the client, you can start doing really funny things like tagging I/O with metadata. This lets all data be timestamped 3 and lets you distinguish input from output. xterm.js works something like this. When combined with shell integration, this even lets you distinguish shell prompts from program output, at the data layer. Now you can start doing really funny things, because you have a structured log of your terminal session. You can replay the log as a recording, like asciinema 4 ; you can transform the shell prompt without rerunning all the commands; you can import it into a Jupyter Notebook or Atuin Desktop; you can save the commands and rerun them later as a script. Your terminal is data.

This is the very first time that we touch the terminal emulator, and it's intentionally the last step because it has the highest switching costs. This makes use of all the nice features we've built to give you a nice UI. You don't need our CLI anymore unless you want nested transactions, because your whole terminal session starts in a transaction by default. You get all the features I mention above, because we've put all the pieces together.

This is bold and ambitious and I think building the whole thing would take about a decade. That's ok. I'm patient. You can help me by spreading the word :) Perhaps this post will inspire someone to start building this themselves.
Gary Bernhardt, "A Whole New World"
Alex Kladov, "A Better Shell"
jyn, "how i use my terminal"
jyn, "Complected and Orthogonal Persistence"
jyn, "you are in a box"
jyn, "there's two costs to making money off an open source project…"
Rebecca Turner, "Vertical Integration is the Only Thing That Matters"
Julia Evans, "New zine: The Secret Rules of the Terminal"
Julia Evans, "meet the terminal emulator"
Julia Evans, "What happens when you press a key in your terminal?"
Julia Evans, "What's involved in getting a "modern" terminal setup?"
Julia Evans, "Bash scripting quirks & safety tips"
Julia Evans, "Some terminal frustrations"
Julia Evans, "Reasons to use your shell's job control"
"signal(7) - Miscellaneous Information Manual"
Christian Petersen, "ANSI Escape Codes"
saoirse, "withoutboats/notty: A new kind of terminal"
Jupyter Team, "Project Jupyter Documentation"
"Warp: The Agentic Development Environment"
"Warp: How Warp Works"
"Warp: Completions"
George Nachman, "iTerm2: Proprietary Escape Codes"
George Nachman, "iTerm2: Shell Integration"
George Nachman, "iTerm2: tmux Integration"
Project Jupyter, "Jupyter Widgets"
Nelson Elhage, "nelhage/reptyr: Reparent a running program to a new terminal"
Kovid Goyal, "kitty"
Kovid Goyal, "kitty - Frequently Asked Questions"
Wez Furlong, "Wezterm"
Keith Winstein, "Mosh: the mobile shell"
Keith Winstein, "Display errors with certain characters"
Matthew Skala, "alden: detachable terminal sessions without breaking scrollback"
Ethan Pailes, "shell-pool/shpool: Think tmux, then aim... lower"
Ned T. Crigler, "crigler/dtach: A simple program that emulates the detach feature of screen"
Marc André Tanner, "martanne/abduco: abduco provides session management"
yazgoo, "yazgoo/diss: dtach-like program / crate in rust"
Fons van der Plas, "Pluto.jl — interactive Julia programming environment"
Ellie Huxtable, "Atuin Desktop: Runbooks that Run"
Toby Cubitt, "undo-tree"
"SIGHUP - Wikipedia"
Jason Gauci, "How Eternal Terminal Works"
Marcin Kulik, "Record and share your terminal sessions, the simple way - asciinema.org"
"Alternate Screen | Ratatui"

1. there are a lot of complications here around alternate mode, but I'm just going to skip over those for now. A simple way to handle alternate mode (that doesn't get you nice things) is just to embed a raw terminal in the output cell.
2. otherwise you could start replaying output from inside an escape, which is not good. I had a detailed email exchange about this with the alden author which I have not yet had time to write up into a blog post; most of the complication comes when you want to avoid replaying the entire history and only want to replay the visible viewport.
3. hey, this seems awfully like asciinema!
4. oh, that's why it seemed like asciinema.

devansh 2 months ago

Hitchhiker's Guide to Attack Surface Management

I first heard the term "ASM" (i.e., Attack Surface Management) probably in late 2018, and I thought it must be some complex infrastructure for tracking assets of an organization. Looking back, I realize I almost had a similar stack for discovering, tracking, and detecting obscure assets of organizations, and I was using it for my bug hunting adventures. I feel my stack was kinda goated, as I was able to find obscure assets of Apple, Facebook, Shopify, Twitter, and many other Fortune 100 companies, and reported hundreds of bugs, all through automation. Back in the day, projects like ProjectDiscovery were not around, so if I wanted an effective port scanner, I had to write it from scratch. (Masscan and nmap existed, but I had my fair share of issues using them; that's a story for another time.) I used to write DNS resolvers (massdns had a high error rate), port scanners, web scrapers, directory brute-force utilities, wordlists, lots of JavaScript parsing logic using regex, and a hell of a lot of other things. I had 50+ self-developed tools for bug-bounty recon and another 60-something helper scripts written in bash. I used to orchestrate these (gluing them together with duct tape is a better way to put it) into workflows and save the output in text files. Whenever I dealt with a large number of domains, I would distribute the load over multiple servers (server spin-up + SSH into it + SCP for pushing and pulling files). The setup was very fragile and error-prone, and I spent countless nights trying to debug errors in the workflows. But it was all worth it. I learned the art of Attack Surface Management without even trying to learn about it. I was just a teenager trying to make quick bucks through bug hunting, and this fragile, duct-taped system was my edge. Fast forward to today, I have now spent almost a decade in the bug bounty scene. I joined HackerOne in 2020 (to present) as a vulnerability triager, where I have triaged and reviewed tens of thousands of vulnerability submissions. Fair to say, I have seen a lot of things: doomsday-level 0-days; reports of leaked credentials that could have led to entire infrastructure compromise, just because some dev pushed an AWS secret key into git logs; and organizations that were not even aware they were running Jenkins servers on some obscure subdomain, which could have allowed RCE and then lateral movement to other layers of infrastructure. A lot of these issues were totally avoidable, if only the organizations had followed some basic attack surface management techniques. If I search "Guide to ASM" on the Internet, almost none of the supposed guides are real resources. They funnel you to their own ASM solution; the guide is just there to provide some surface-level information and is mostly a marketing gimmick. This is precisely why I decided to write something where I try to cover everything I have learned and know about ASM, and how to protect your organization's assets before bad actors get to them. This is going to be a rough and raw guide, and it will not lead you to a funnel where I am trying to sell my own ASM SaaS to you. I have nothing to sell, other than offering what I know. But in case you are an organization that needs help implementing the things I mention below, you can reach out to me via X or email (both available on the homepage of this blog).
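For flavor, the duct-taped distribution pattern described above usually boils down to something like this; this is a generic reconstruction, not the author's actual scripts, and the worker hostnames and scan.sh are made up:

    # Split the target list into four chunks (without splitting lines) and
    # fan them out to worker servers over SSH/SCP. scan.sh stands in for
    # whatever resolver/port-scanner script runs on each box.
    split -n l/4 domains.txt chunk_
    hosts=(worker1 worker2 worker3 worker4)
    i=0
    for f in chunk_*; do
        scp "$f" "${hosts[$i]}:targets.txt"
        ssh "${hosts[$i]}" 'nohup ./scan.sh targets.txt > results.txt 2>&1 &'
        i=$((i + 1))
    done

    # Later, pull the results back and merge them.
    for h in "${hosts[@]}"; do scp "$h:results.txt" "results.$h.txt"; done
    cat results.*.txt | sort -u > all-results.txt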
This guide will provide you with insights into exactly how big your attack surface really is. CISOs can look at it and see if their organizations have all of these covered, security researchers and bug hunters can look at this and maybe find new ideas related to where to look during recon. Devs can look at it and see if they are unintentionally leaving any door open for hackers. If you are into security, it has something to offer you. Attack surface is one of those terms getting thrown around in security circles so much that it's become almost meaningless noise. In theory, it sounds simple enough, right. Attack surface is every single potential entry point, interaction vector, or exploitable interface an attacker could use to compromise your systems, steal your data, or generally wreck your day. But here's the thing, it's the sum total of everything you've exposed to the internet. Every API endpoint you forgot about, every subdomain some dev spun up for "testing purposes" five years ago and then abandoned, every IoT device plugged into your network, every employee laptop connecting from a coffee shop, every third-party vendor with a backdoor into your environment, every cloud storage bucket with permissions that make no sense, every Slack channel, every git commit leaking credentials, every paste on Pastebin containing your database passwords. Most organizations think about attack surface in incredibly narrow terms. They think if they have a website, an email server, and maybe some VPN endpoints, they've got "good visibility" into their assets. That's just plain wrong. Straight up wrong. Your actual attack surface would terrify you if you actually understood it. You run , and is your main domain. You probably know about , , maybe . But what about that your intern from 2015 spun up and just never bothered to delete. It's not documented anywhere. Nobody remembers it exists. Domain attack surface goes way beyond what's sitting in your asset management system. Every subdomain is a potential entry point. Most of these subdomains are completely forgotten. Subdomain enumeration is reconnaissance 101 for attackers and bug hunters. It's not rocket science. Setting up a tool that actively monitors through active and passive sources for new subdomains and generates alerts is honestly an hour's worth of work. You can use tools like Subfinder, Amass, or just mine Certificate Transparency logs to discover every single subdomain connected to your domain. Certificate Transparency logs were designed to increase security by making certificate issuance public, and they've become an absolute reconnaissance goldmine. Every time you get an SSL certificate for , that information is sitting in public logs for anyone to find. Attackers systematically enumerate these subdomains using Certificate Transparency log searches, DNS brute-forcing with massive wordlists, reverse DNS lookups to map IP ranges back to domains, historical DNS data from services like SecurityTrails, and zone transfer exploitation if your DNS is misconfigured. Attackers are looking for old development environments still running vulnerable software, staging servers with production data sitting on them, forgotten admin panels, API endpoints without authentication, internal tools accidentally exposed, and test environments with default credentials nobody changed. Every subdomain is an asset. Every asset is a potential vulnerability. Every vulnerability is an entry point. Domains and subdomains are just the starting point though. 
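As an illustration of how low that bar is, the whole discovery loop fits in a few commands; example.com is a placeholder, and subfinder and dnsx are ProjectDiscovery tools:

    # Passive: mine Certificate Transparency logs via crt.sh.
    curl -s 'https://crt.sh/?q=%25.example.com&output=json' \
        | jq -r '.[].name_value' | sed 's/^\*\.//' | sort -u > ct_subs.txt

    # Passive + active: subfinder aggregates dozens of sources.
    subfinder -d example.com -silent | sort -u > subfinder_subs.txt

    # Resolve which names are actually alive.
    cat ct_subs.txt subfinder_subs.txt | sort -u | dnsx -silent > alive.txt

Point a scheduled job at the diff between today's and yesterday's output and you have the alerting described above.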
Once you've figured out all the subdomains belonging to your organization, the next step is to take a hard look at IP address space, which is another absolutely massive component of your attack surface. Organizations own, sometimes lease, IP ranges, sometimes small /24 blocks, sometimes massive /16 ranges, and every single IP address in those blocks and ranges that responds to external traffic is part of your attack surface. And attackers enumerate them all if you won't. They use WHOIS lookups to identify your IP ranges, port scanning to find what services are running where, service fingerprinting to identify exact software versions, and banner grabbing to extract configuration information. If you have a /24 network with 256 IP addresses and even 10% of those IPs are running services, you've got 25 potential attack vectors. Scale that to a /20 or /16 and you're looking at thousands of potential entry points. And attackers aren't just looking at the IPs you know about. They're looking at adjacent IP ranges you might have acquired through mergers, historical IP allocations that haven't been properly decommissioned, and shared IP ranges where your servers coexist with others. Traditional infrastructure was complicated enough, and now we have cloud. It's literally exploded organizations' attack surfaces in ways that are genuinely difficult to even comprehend. Every cloud service you spin up, be it an EC2 instance, S3 bucket, Lambda function, or API Gateway endpoint, all of this is a new attack vector. In my opinion and experience so far, I think the main issue with cloud infrastructure is that it's ephemeral and distributed. Resources get spun up and torn down constantly. Developers create instances for testing and forget about them. Auto-scaling groups generate new resources dynamically. Containerized workloads spin up massive Kubernetes clusters you have minimal visibility into. Your cloud attack surface could be literally anything. Examples are countless, but I'd categorize them into 8 different categories. Compute instances like EC2, Azure VMs, GCP Compute Engine instances exposed to the internet. Storage buckets like S3, Azure Blob Storage, GCP Cloud Storage with misconfigured permissions. Serverless stuff like Lambda functions with public URLs or overly permissive IAM roles. API endpoints like API Gateway, Azure API Management endpoints without proper authentication. Container registries like Docker images with embedded secrets or vulnerabilities. Kubernetes clusters with exposed API servers, misconfigured network policies, vulnerable ingress controllers. Managed databases like RDS, CosmosDB, Cloud SQL instances with weak access controls. IAM roles and service accounts with overly permissive identities that enable privilege escalation. I've seen instances in the past where a single misconfigured S3 bucket policy exposed terabytes of data. An overly permissive Lambda IAM role enabled lateral movement across an entire AWS account. A publicly accessible Kubernetes API server gave an attacker full cluster control. Honestly, cloud kinda scares me as well. And to top it off, multi-cloud infrastructure makes everything worse. If you're running AWS, Azure, and GCP together, you've just tripled your attack surface management complexity. Each cloud provider has different security models, different configuration profiles, and different attack vectors. Every application now uses APIs, and all applications nowadays are like a constellation of APIs talking to each other. 
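Before moving on to APIs, the IP-space sweep described above is just as mechanical; 203.0.113.0/24 is a documentation range standing in for your block:

    # Who owns this block, and what else is registered around it?
    whois 203.0.113.5 | grep -iE 'netrange|cidr|orgname'

    # What's listening? Version-detect the common ports across the whole /24.
    nmap -sV --top-ports 100 -oA sweep 203.0.113.0/24

    # Or trade accuracy for speed with masscan, then revisit hits with nmap.
    masscan 203.0.113.0/24 -p1-65535 --rate 10000 -oL masscan.out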
Every API you use in your organization is your attack surface. The problem with APIs is that they're often deployed without the same security scrutiny as traditional web applications. Developers spin up API endpoints for specific features and those endpoints accumulate over time. Some of them are shadow APIs, meaning API endpoints which aren't documented anywhere. These endpoints are the equivalent of forgotten subdomains, and attackers can find them through analyzing JavaScript files for API endpoint references, fuzzing common API path patterns, examining mobile app traffic to discover backend APIs, and mining old documentation or code repositories for deprecated endpoints. Your API attack surface includes REST APIs exposed to the internet, GraphQL endpoints with overly broad query capabilities, WebSocket connections for real-time functionality, gRPC services for inter-service communication, and legacy SOAP APIs that never got decommissioned. If your organization has mobile apps, be it iOS, Android, or both, this is a direct window to your infrastructure and should be part of your attack surface management strategy. Mobile apps communicate with backend APIs and those API endpoints are discoverable by reversing the app. The reversed source of the app could reveal hard-coded API keys, tokens, and credentials. Using JADX plus APKTool plus Dex2jar is all a motivated attacker needs. Web servers often expose directories and files that weren't meant to be publicly accessible. Attackers systematically enumerate these using automated tools like ffuf, dirbuster, gobuster, and wfuzz with massive wordlists to discover hidden endpoints, configuration files, backup files, and administrative interfaces. Common exposed directories include admin panels, backup directories containing database dumps or source code, configuration files with database credentials and API keys, development directories with debug information, documentation directories revealing internal systems, upload directories for file storage, and old or forgotten directories from previous deployments. Your attack surface must include directories which are accidentally left accessible during deployments, staging servers with production data, backup directories with old source code versions, administrative interfaces without authentication, API documentation exposing endpoint details, and test directories with debug output enabled. Even if you've removed a directory from production, old cached versions may still be accessible through web caches or CDNs. Search engines also index these directories, making them discoverable through dorking techniques. If your organization is using IoT devices, and everyone uses these days, this should be part of your attack surface management strategy. They're invisible to traditional security tools. Your EDR solution doesn't protect IoT devices. Your vulnerability scanner can't inventory them. Your patch management system can't update them. Your IoT attack surface could include smart building systems like HVAC, lighting, access control. Security cameras and surveillance systems. Printers and copiers, which are computers with network access. Badge readers and physical access systems. Industrial control systems and SCADA devices. Medical devices in healthcare environments. Employee wearables and fitness trackers. Voice assistants and smart speakers. The problem with IoT devices is that they're often deployed without any security consideration. 
They have default credentials that never get changed, unpatched firmware with known vulnerabilities, no encryption for data in transit, weak authentication mechanisms, and insecure network configurations. Social media presence is an attack surface component that most organizations completely ignore. Attackers can use social media for reconnaissance by looking at employee profiles on LinkedIn to reveal organizational structure, technologies in use, and current projects. Twitter/X accounts can leak information about deployments, outages, and technology stack. Employee GitHub profiles expose email patterns and development practices. Company blogs can announce new features before security review. It could also be a direct attack vector. Attackers can use information from social media to craft convincing phishing attacks. Hijacked social media accounts can be used to spread malware or phishing links. Employees can accidentally share sensitive information. Fake accounts can impersonate your brand to defraud customers. Your employees' social media presence is part of your attack surface whether you like it or not. Third-party vendors, suppliers, contractors, or partners with access to your systems should be part of your attack surface. Supply chain attacks are becoming more and more common these days. Attackers can compromise a vendor with weaker security and then use that vendor's access to reach your environment. From there, they pivot from the vendor network to your systems. This isn't a hypothetical scenario, it has happened multiple times in the past. You might have heard about the SolarWinds attack, where attackers compromised SolarWinds' build system and distributed malware through software updates to thousands of customers. Another famous case study is the MOVEit vulnerability in MOVEit Transfer software, exploited by the Cl0p ransomware group, which affected over 2,700 organizations. These are examples of some high-profile supply chain security attacks. Your third-party attack surface could include things like VPNs, remote desktop connections, privileged access systems, third-party services with API keys to your systems, login credentials shared with vendors, SaaS applications storing your data, and external IT support with administrative access. It's obvious you can't directly control third-party security. You can audit them, have them pen-test their assets as part of your vendor compliance plan, and include security requirements in contracts, but ultimately their security posture is outside your control. And attackers know this. GitHub, GitLab, Bitbucket, they all are a massive attack surface. Attackers search through code repositories in hopes of finding hard-coded credentials like API keys, database passwords, and tokens. Private keys, SSH keys, TLS certificates, and encryption keys. Internal architecture documentation revealing infrastructure details in code comments. Configuration files with database connection strings and internal URLs. Deprecated code with vulnerabilities that's still in production. Even private repositories aren't safe. Attackers can compromise developer accounts to access private repositories, former employees retain access after leaving, and overly broad repository permissions grant access to too many people. Automated scanners continuously monitor public repositories for secrets. The moment a developer accidentally pushes credentials to a public repository, automated systems detect it within minutes. 
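Which is also why it's worth pointing the same kind of scanners at your own repositories before someone else does; two widely used open-source options, with illustrative invocations:

    # Scan a local checkout (including git history) for hard-coded secrets.
    gitleaks detect --source . --report-path gitleaks.json

    # Scan a remote repository and only report secrets that still verify
    # against the issuing service. The org/repo name is a placeholder.
    trufflehog git https://github.com/your-org/your-repo.git --only-verified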
Attackers have already extracted and weaponized those credentials before the developer realizes the mistake. CI/CD pipelines are another massive attack vector, especially in recent times, and not many organizations are giving attention to it. This should totally be part of your attack surface management. Attackers compromise GitHub Actions workflows with malicious code injection, Jenkins servers with weak authentication, GitLab CI/CD variables containing secrets, and build artifacts with embedded malware. The GitHub Actions supply chain attack, CVE-2025-30066, demonstrated this perfectly. Attackers compromised the Action used in over 23,000 repositories, injecting malicious code that leaked secrets from build logs. Jenkins specifically is a goldmine for attackers. An exposed Jenkins instance provides complete control over multiple critical servers, access to hardcoded AWS keys, Redis credentials, and BitBucket tokens, the ability to manipulate builds and inject malicious code, and exfiltration of production database credentials containing PII. Modern collaboration tools are massive attack surface components that most organizations underestimate. Slack has hidden security risks despite being invite-only. The Slack attack surface could include indefinite data retention, where every message, channel, and file is stored forever unless admins configure retention periods. Public channels accessible to all users, so one breached account opens the floodgates. Third-party integrations with excessive permissions accessing messages and user data. Former contractor access, where individuals retain access long after projects end. Phishing and impersonation, where it's easy to change names and pictures to impersonate senior personnel. In 2022, Slack disclosed it had been leaking hashed passwords for five years, affecting 0.5% of users. Slack channels commonly contain API keys, authentication tokens, database credentials, customer PII, financial data, internal system passwords, and confidential project information. The average cost of a breached record was $164 in 2022. When 1 in 166 messages in Slack contains confidential information, every new message adds another dollar to total risk exposure. With 5,000 employees sending 30 million Slack messages per year, that's substantial exposure. Trello board exposure is a significant attack surface. Trello attack vectors include public boards with sensitive information accidentally shared publicly, default public visibility where boards are created as public by default in some configurations, an unsecured REST API allowing unauthenticated access to user data, and scraping attacks where attackers use email lists to enumerate Trello accounts. The 2024 Trello data breach exposed 15 million users' personal information when a threat actor named "emo" exploited an unsecured REST API using 500 million email addresses to compile detailed user profiles. Security researcher David Shear documented hundreds of public Trello boards exposing passwords, credentials, IT support customer access details, website admin logins, and client server management credentials. IT companies were using Trello to troubleshoot client requests and manage infrastructure, storing all credentials on public Trello boards. Jira misconfiguration is a widespread attack surface issue.
Common misconfigurations include public dashboards and filters with "Everyone" access actually meaning public internet access, anonymous access enabled allowing unauthenticated users to browse, user picker functionality providing complete lists of usernames and email addresses, and project visibility allowing sensitive projects to be accessible without authentication. Confluence misconfiguration exposes internal documentation. Confluence attack surface components include anonymous access at site level allowing public access, public spaces where space admins grant anonymous permissions, inherited permissions where all content within a space inherits space-level access, and user profile visibility allowing anonymous users to view profiles of logged-in users. When anonymous access is enabled globally and space admins allow anonymous users to access their spaces, anyone on the internet can access that content. Confluence spaces often contain internal documentation with hardcoded credentials, financial information, project details, employee information, and API documentation with authentication details. Cloud storage misconfiguration is epidemic. Google Drive misconfiguration attack surface includes "Anyone with the link" sharing making files accessible without authentication, overly permissive sharing defaults making it easy to accidentally share publicly, inherited folder permissions exposing everything beneath, unmanaged third-party apps with excessive read/write/delete permissions, inactive user accounts where former employees retain access, and external ownership blind spots where externally-owned content is shared into the environment. Metomic's 2023 Google Scanner Report found that of 6.5 million Google Drive files analyzed, 40.2% contained sensitive information, 34.2% were shared externally, and 0.5% were publicly accessible, mostly unintentionally. In December 2023, Japanese game developer Ateam suffered a catastrophic Google Drive misconfiguration that exposed personal data of nearly 1 million people for over six years due to "Anyone with the link" settings. Based on Valence research, 22% of external data shares utilize open links, and 94% of these open link shares are inactive, forgotten files with public URLs floating around the internet. Dropbox, OneDrive, and Box share similar attack surface components including misconfigured sharing permissions, weak or missing password protection, overly broad access grants, third-party app integrations with excessive permissions, and lack of visibility into external sharing. Features that make file sharing convenient create data leakage risks when misconfigured. Pastebin and similar paste sites are both reconnaissance sources and attack vectors. Paste site attack surface includes public data dumps of stolen credentials, API keys, and database dumps posted publicly, malware hosting of obfuscated payloads, C2 communications where malware uses Pastebin for command and control, credential leakage from developers accidentally posting secrets, and bypassing security filters since Pastebin is legitimate so security tools don't block it. For organizations, leaked API keys or database credentials on Pastebin lead to unauthorized access, data exfiltration, and service disruption. Attackers continuously scan Pastebin for mentions of target organizations using automated tools. Security teams must actively monitor Pastebin and similar paste sites for company name mentions, email domain references, and specific keywords related to the organization. 
Because paste sites don't require registration or authentication and content is rarely removed, they've become permanent archives of leaked secrets. Container registries expose significant attack surface. Container registry attack surface includes secrets embedded in image layers where 30,000 unique secrets were found in 19,000 images, with 10% of scanned Docker images containing secrets, and 1,200 secrets, 4%, being active and valid. Immutable cached layers contain 85% of embedded secrets that can't be removed, exposed registries with 117 Docker registries accessible without authentication, unsecured registries allowing pull, push, and delete operations, and source code exposure where full application code is accessible by pulling images. GitGuardian's analysis of 200,000 publicly available Docker images revealed a staggering secret exposure problem. Even more alarming, 99% of images containing active secrets were pulled in 2024, demonstrating real-world exploitation. Unit 42's research identified 941 Docker registries exposed to the internet, with 117 accessible without authentication containing 2,956 repositories, 15,887 tags, and full source code and historical versions. Out of 117 unsecured registries, 80 allow pull operations to download images, 92 allow push operations to upload malicious images, and 7 allow delete operations for ransomware potential. Sysdig's analysis of over 250,000 Linux images on Docker Hub found 1,652 malicious images including cryptominers, most common, embedded secrets, second most prevalent, SSH keys and public keys for backdoor implants, API keys and authentication tokens, and database credentials. The secrets found in container images included AWS access keys, database passwords, SSH private keys, API tokens for cloud services, GitHub personal access tokens, and TLS certificates. Shadow IT includes unapproved SaaS applications like Dropbox, Google Drive, and personal cloud storage used for work. Personal devices like BYOD laptops, tablets, and smartphones accessing corporate data. Rogue cloud deployments where developers spin up AWS instances without approval. Unauthorized messaging apps like WhatsApp, Telegram, and Signal used for business communication. Unapproved IoT devices like smart speakers, wireless cameras, and fitness trackers on the corporate network. Gartner estimates that shadow IT makes up 30-40% of IT spending in large companies, and 76% of organizations surveyed experienced cyberattacks due to exploitation of unknown, unmanaged, or poorly managed assets. Shadow IT expands your attack surface because it's not protected by your security controls, it's not monitored by your security team, it's not included in your vulnerability scans, it's not patched by your IT department, and it often has weak or default credentials. And you can't secure what you don't know exists. Bring Your Own Device, BYOD, policies sound great for employee flexibility and cost savings. For security teams, they're a nightmare. BYOD expands your attack surface by introducing unmanaged endpoints like personal devices without EDR, antivirus, or encryption. Mixing personal and business use where work data is stored alongside personal apps with unknown security. Connecting from untrusted networks like public Wi-Fi and home networks with compromised routers. Installing unapproved applications with malware or excessive permissions. Lacking consistent security updates with devices running outdated operating systems. 
Common BYOD security issues include data leakage through personal cloud backup services, malware infections from personal app downloads, lost or stolen devices containing corporate data, family members using devices that access work systems, and lack of IT visibility and control. The 60% of small and mid-sized businesses that close within six months of a major cyberattack often have BYOD-related security gaps as contributing factors. Remote access infrastructure like VPNs and Remote Desktop Protocol, RDP, are among the most exploited attack vectors. SSL VPN appliances from vendors like Fortinet, SonicWall, Check Point, and Palo Alto are under constant attack. VPN attack vectors include authentication bypass vulnerabilities with CVEs allowing attackers to hijack active sessions, credential stuffing through brute-forcing VPN logins with leaked credentials, exploitation of unpatched vulnerabilities with critical CVEs in VPN appliances, and configuration weaknesses like default credentials, weak passwords, and lack of MFA. Real-world attacks demonstrate the risk. Check Point SSL VPN CVE-2024-24919 allowed authentication bypass for session hijacking. Fortinet SSL-VPN vulnerabilities were leveraged for lateral movement and privilege escalation. SonicWall CVE-2024-53704 allowed remote authentication bypass for SSL VPN. Once inside via VPN, attackers conduct network reconnaissance, lateral movement, and privilege escalation. RDP is worse. Sophos found that cybercriminals abused RDP in 90% of attacks they investigated. External remote services like RDP were the initial access vector in 65% of incident response cases. RDP attack vectors include exposed RDP ports with port 3389 open to the internet, weak authentication with simple passwords vulnerable to brute force, lack of MFA with no second factor for authentication, and credential reuse from compromised passwords in data breaches. In one Darktrace case, attackers compromised an organization four times in six months, each time through exposed RDP ports. The attack chain went successful RDP login, internal reconnaissance via WMI, lateral movement via PsExec, and objective achievement. The Palo Alto Unit 42 Incident Response report found RDP was the initial attack vector in 50% of ransomware deployment cases. Email infrastructure remains a primary attack vector. Your email attack surface includes mail servers like Exchange, Office 365, and Gmail with configuration weaknesses, email authentication with misconfigured SPF, DKIM, and DMARC records, phishing-susceptible users targeted through social engineering, email attachments and links as malware delivery mechanisms, and compromised accounts through credential stuffing or password reuse. Email authentication misconfiguration is particularly insidious. If your SPF, DKIM, and DMARC records are wrong or missing, attackers can spoof emails from your domain, your legitimate emails get marked as spam, and phishing emails impersonating your organization succeed. Email servers themselves are also targets. The NSA released guidance on Microsoft Exchange Server security specifically because Exchange servers are so frequently compromised. Container orchestration platforms like Kubernetes introduce massive attack surface complexity. 
The Kubernetes attack surface includes the Kubernetes API server with exposed or misconfigured API endpoints, container images with vulnerabilities in base images or application layers, container registries like Docker Hub, ECR, and GCR with weak access controls, pod security policies with overly permissive container configurations, network policies with insufficient micro-segmentation between pods, secrets management with hardcoded secrets or weak secret storage, and RBAC misconfigurations with overly broad service account permissions. Container security issues include containers running as root with excessive privileges, exposed Docker daemon sockets allowing container escape, vulnerable dependencies in container images, and lack of runtime security monitoring. The Docker daemon attack surface is particularly concerning. Running containers with privileged access or allowing docker.sock access can enable container escape and host compromise. Serverless computing like AWS Lambda, Azure Functions, and Google Cloud Functions promised to eliminate infrastructure management. Instead, it just created new attack surfaces. Serverless attack surface components include function code vulnerabilities like injection flaws and insecure dependencies, IAM misconfigurations with overly permissive Lambda execution roles, environment variables storing secrets as plain text, function URLs with publicly accessible endpoints without authentication, and event source mappings with untrusted input from various cloud services. The overabundance of event sources expands the attack surface. Lambda functions can be triggered by S3 events, API Gateway requests, DynamoDB streams, SNS topics, EventBridge schedules, IoT events, and dozens more. Each event source is a potential injection point. If function input validation is insufficient, attackers can manipulate event data to exploit the function. Real-world Lambda attacks include credential theft by exfiltrating IAM credentials from environment variables, lateral movement using over-permissioned roles to access other AWS resources, and data exfiltration by invoking functions to query and extract database contents. The Scarlet Eel adversary specifically targeted AWS Lambda for credential theft and lateral movement. Microservices architecture multiplies attack surface by decomposing monolithic applications into dozens or hundreds of independent services. Each microservice has its own attack surface including authentication mechanisms where each service needs to verify requests, authorization rules where each service enforces access controls, API endpoints for service-to-service communication channels, data stores where each service may have its own database, and network interfaces where each service exposes network ports. Microservices security challenges include east-west traffic vulnerabilities with service-to-service communication without encryption or authentication, authentication and authorization complexity from managing auth across 40 plus services multiplied by 3 environments equaling 240 configurations, service-to-service trust where services blindly trust internal traffic, network segmentation failures with flat networks allowing unrestricted pod-to-pod communication, and inconsistent security policies with different services having different security standards. One compromised microservice can enable lateral movement across the entire application. Without proper network segmentation and zero trust architecture, attackers pivot from service to service. 
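A quick, low-effort check for the scariest item on that list, an exposed API server, is simply to ask it what an anonymous caller can see; the host and port below are placeholders:

    # An API server that answers /version to unauthenticated callers is
    # exposed to the internet; many also allow anonymous discovery or worse.
    curl -sk https://k8s.example.com:6443/version

    # With cluster-admin credentials, check what the anonymous user can do.
    kubectl auth can-i --list --as=system:anonymous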
How do you measure something this large, right. Attack surface measurement is complex. Attack surface metrics include the total number of assets with all discovered systems, applications, and devices, newly discovered assets found through continuous discovery, the number of exposed assets accessible from the internet, open ports and services with network services listening for connections, vulnerabilities by severity including critical, high, medium, and low CVEs, mean time to detect, MTTD, measuring how quickly new assets are discovered, mean time to remediate, MTTR, measuring how quickly vulnerabilities are fixed, shadow IT assets that are unknown or unmanaged, third-party exposure from vendor and partner access points, and attack surface change rate showing how rapidly the attack surface evolves. Academic research has produced formal attack surface measurement methods. Pratyusa Manadhata's foundational work defines attack surface as a three-tuple, System Attackability, Channel Attackability, Data Attackability. But in practice, most organizations struggle with basic attack surface visibility, let alone quantitative measurement. Your attack surface isn't static. It changes constantly. Changes happen because developers deploy new services and APIs, cloud auto-scaling spins up new instances, shadow IT appears as employees adopt unapproved tools, acquisitions bring new infrastructure into your environment, IoT devices get plugged into your network, and subdomains get created for new projects. Static, point-in-time assessments are obsolete. You need continuous asset discovery and monitoring. Continuous discovery methods include automated network scanning for regular scans to detect new devices, cloud API polling to query cloud provider APIs for resource changes, DNS monitoring to track new subdomains via Certificate Transparency logs, passive traffic analysis to observe network traffic and identify assets, integration with CMDB or ITSM to sync with configuration management databases, and cloud inventory automation using Infrastructure as Code to track deployments. Understanding your attack surface is step one. Reducing it is the goal. Attack surface reduction begins with asset elimination by removing unnecessary assets entirely. This includes decommissioning unused servers and applications, deleting abandoned subdomains and DNS records, shutting down forgotten development environments, disabling unused network services and ports, and removing unused user accounts and service identities. Access control hardening implements least privilege everywhere by enforcing multi-factor authentication, MFA, for all remote access, using role-based access control, RBAC, for cloud resources, implementing zero trust network architecture, restricting network access with micro-segmentation, and applying the principle of least privilege to IAM roles. Exposure minimization reduces what's visible to attackers by moving services behind VPNs or bastion hosts, using private IP ranges for internal services, implementing network address translation, NAT, for outbound access, restricting API endpoints to authorized sources only, and disabling unnecessary features and functionalities. Security hardening strengthens what remains by applying security patches promptly, using security configuration baselines, enabling encryption for data in transit and at rest, implementing Web Application Firewalls, WAF, for web apps, and deploying endpoint detection and response, EDR, on all devices. 
Monitoring and detection watch for attacks in progress by implementing real-time threat detection, enabling comprehensive logging and SIEM integration, deploying intrusion detection and prevention systems, IDS/IPS, monitoring for anomalous behavior patterns, and using threat intelligence feeds to identify known bad actors. Your attack surface is exponentially larger than you think it is. Every asset you know about probably has three you don't. Every known vulnerability probably has ten undiscovered ones. Every third-party integration probably grants more access than you realize. Every collaboration tool is leaking more data than you imagine. Every paste site contains more of your secrets than you want to admit. And attackers know this. They're not just looking at what you think you've secured. They're systematically enumerating every possible entry point. They're mining Certificate Transparency logs for forgotten subdomains. They're scanning every IP in your address space. They're reverse-engineering your mobile apps. They're buying employee credentials from data breach databases. They're compromising your vendors to reach you. They're scraping Pastebin for your leaked secrets. They're pulling your public Docker images and extracting the embedded credentials. They're accessing your misconfigured S3 buckets and exfiltrating terabytes of data. They're exploiting your exposed Jenkins instances to compromise your entire infrastructure. They're manipulating your AI agents to exfiltrate private Notion data. The asymmetry is brutal. You have to defend every single attack vector. They only need to find one that works. So what do you do. Start by accepting that you don't have complete visibility. Nobody does. But you can work toward better visibility through continuous discovery, automated asset management, and integration of security tools that help map your actual attack surface. Implement attack surface reduction aggressively. Every asset you eliminate is one less thing to defend. Every service you shut down is one less potential vulnerability. Every piece of shadow IT you discover and bring under management is one less blind spot. Every misconfigured cloud storage bucket you fix is terabytes of data no longer exposed. Every leaked secret you rotate is one less credential floating around the internet. Adopt zero trust architecture. Stop assuming that anything, internal services, microservices, authenticated users, collaboration tools, is inherently trustworthy. Verify everything. Monitor paste sites and code repositories. Your secrets are out there. Find them before attackers weaponize them. Secure your collaboration tools. Slack, Trello, Jira, Confluence, Notion, Google Drive, and Airtable are all leaking data. Lock them down. Fix your container security. Scan images for secrets. Use secret managers instead of environment variables. Secure your registries. Harden your CI/CD pipelines. Jenkins, GitHub Actions, and GitLab CI are high-value targets. Protect them. And test your assumptions with red team exercises and continuous security testing. Your attack surface is what an attacker can reach, not what you think you've secured. The attack surface problem isn't getting better. Cloud adoption, DevOps practices, remote work, IoT proliferation, supply chain complexity, collaboration tool sprawl, and container adoption are all expanding organizational attack surfaces faster than security teams can keep up. But understanding the problem is the first step toward managing it. 
And now you understand exactly how catastrophically large your attack surface actually is.

Simon Willison 2 months ago

Video + notes on upgrading a Datasette plugin for the latest 1.0 alpha, with help from uv and OpenAI Codex CLI

I'm upgrading various plugins for compatibility with the new Datasette 1.0a20 alpha release and I decided to record a video of the process. This post accompanies that video with detailed additional notes.

I picked a very simple plugin to illustrate the upgrade process (possibly too simple). datasette-checkbox adds just one feature to Datasette: if you are viewing a table with boolean columns (detected as integer columns with names like or or ) and your current user has permission to update rows in that table, it adds an inline checkbox UI that looks like this: I built the first version with the help of Claude back in August 2024 - details in this issue comment . Most of the implementation is JavaScript that makes calls to Datasette 1.0's JSON write API . The Python code just checks that the user has the necessary permissions before including the extra JavaScript.

The first step in upgrading any plugin is to run its tests against the latest Datasette version. Thankfully makes it easy to run code in scratch virtual environments that include the different code versions you want to test against. I have a test utility called (for "test against development Datasette") which I use for that purpose. I can run it in any plugin directory like this: And it will run the existing plugin tests against whatever version of Datasette I have checked out in my directory. You can see the full implementation of (and its friend described below) in this TIL - the basic version looks like this:

I started by running in the directory, and got my first failure... but it wasn't due to permissions, it was because the for the plugin was pinned to a specific mismatched version of Datasette: I fixed this problem by swapping to and ran the tests again... and they passed! Which was a problem, because I was expecting permission-related failures. It turns out when I first wrote the plugin I was lazy with the tests - they weren't actually confirming that the table page loaded without errors. I needed to actually run the code myself to see the expected bug.

First I created myself a demo database using sqlite-utils create-table : Then I ran it with Datasette against the plugin's code like so: Sure enough, visiting produced a 500 error about the missing method. The next step was to update the test to also trigger this error: And now fails as expected.

At this point I could have manually fixed the plugin itself - which would likely have been faster given the small size of the fix - but instead I demonstrated a bash one-liner I've been using to apply these kinds of changes automatically: runs OpenAI Codex in non-interactive mode - it will loop until it has finished the prompt you give it. I tell it to consult the subset of the Datasette upgrade documentation that talks about Datasette permissions and then get the command to pass its tests. This is an example of what I call designing agentic loops - I gave Codex the tools it needed ( ) and a clear goal and let it get to work on my behalf.

The remainder of the video covers finishing up the work - testing the fix manually, committing my work using: Then shipping a 0.1a4 release to PyPI using the pattern described in this TIL . Finally, I demonstrated that the shipped plugin worked in a fresh environment using like this: Executing this command installs and runs a fresh Datasette instance with a fresh copy of the new alpha plugin ( ). It's a neat way of confirming that freshly released software works as expected.
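For readers who want to reproduce the general pattern, a rough sketch looks like this; these are not Simon's exact scripts, and the package names, paths, and flags are illustrative:

    # Run the plugin's tests (from its directory) against a local Datasette
    # checkout, in a throwaway environment managed by uv.
    uv run --with-editable . --with-editable ../datasette --with pytest pytest

    # Create a tiny demo database with a boolean-ish column.
    uvx sqlite-utils create-table demo.db todos id integer title text done integer --pk id

    # After release: confirm the published alpha works in a fresh environment.
    uvx --with datasette-checkbox datasette demo.db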
This video was shot in a single take using Descript , with no rehearsal and perilously little preparation in advance. I recorded through my AirPods and applied the "Studio Sound" filter to clean up the audio. I pasted in a closing slide from my previous video and exported it locally at 1080p, then uploaded it to YouTube. Something I learned from the Software Carpentry instructor training course is that making mistakes in front of an audience is actively helpful - it helps them see a realistic version of how software development works and they can learn from watching you recover. I see this as a great excuse for not editing out all of my mistakes! I'm trying to build new habits around video content that let me produce useful videos while minimizing the amount of time I spend on production. I plan to iterate more on the format as I get more comfortable with the process. I'm hoping I can find the right balance between production time and value to viewers.

0 views
Luke Hsiao 2 months ago

Switching from GPG to age

It’s been several years since I went through all the trouble of setting up my own GPG keys and securing them in YubiKeys following drduh’s guide. With that approach, you generate one key securely offline and store it on multiple YubiKeys, along with a backup. It has worked well for me for years, and as the Lindy effect suggests, it would almost certainly continue to. But as my sub-keys were nearing expiration, I was faced with either renewing them (more convenient, no forward secrecy) or rotating them (rather painful, but potentially more secure). However, I’ve realized that I essentially only use these keys for encryption, and almost never for signing. So, instead of doing either of the usual options, I’m going to let my keys expire entirely. I’m now experimenting with age, which touts itself as “simple, modern, and secure encryption”. If I ever need signatures, I’ll use a dedicated signing tool instead.

This required changing a couple of things in my typical workflow. First and foremost, I needed to switch from pass to passage, a fork of pass that uses age as the backend. This was actually surprisingly easy, because passage includes a simple bash script to do the migration. There is no installer for passage, and no Arch packages, but it’s easy enough to install because it’s just a shell script you can throw on your PATH. Note that on Arch I also needed to install a dependency it assumes you already have. I also name passage as pass on my machine; the benefit of this is that everything with pass integration has continued to “just work”. For example, my email client of choice behaved exactly the same after the migration.

I would occasionally use gpg-agent as my SSH agent on my machines. It was convenient. However, I also like the idea of having a dedicated SSH key per machine: it makes monitoring their usage and revoking them much finer-grained. The absence of gpg-agent forced me to set up new keys on all my machines and add them to various servers and services.

Easy encryption with chezmoi

While I was tending to the encryption area of my personal tech “garden”, I also started leveraging chezmoi’s encryption features. I already use chezmoi for my configuration files, but with encryption I could also easily add “secrets” to my public dotfiles repo. In my case so far, this just means my copies of my favorite paid font: Berkeley Mono. The chezmoi documentation also has a nice guide for configuring encryption so that you are only asked for a passphrase once.

I was also very pleasantly surprised with how easy it was to switch my YubiKeys over to age! Last time I set up the GPG keys on my YubiKeys, I spent several hours. This time, with the help of age-plugin-yubikey and embracing the idea of having unique keys on each YubiKey while encrypting everything for multiple recipients, setting up my keys was surprisingly trivial. It also generates the keys securely on the hardware key itself, which is nice. The whole process probably took 30 minutes. It was so easy that I’m very much not intimidated by the thought of rotating keys in the future.

Did I need to switch to age? Of course not. However, over my career, I have repeatedly found that exploring new tools for your core workflows (part of investing in interfaces) is just plain fun. I often learn new ways of thinking about problems. Sometimes you walk away with a new default that brings some fresh ideas and some delight to your life. Other times you walk away with your trusty old tool, and a greater appreciation for its history and the hard-earned approach it has established. For my uses at the moment, age definitely falls into the former camp.
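The dedicated per-machine SSH keys are standard ssh-keygen territory; something like this on each machine, with the resulting public key added to the relevant servers and services (the comment string is just one possible convention):

```bash
# One dedicated key per machine; ed25519 is a sensible modern default.
ssh-keygen -t ed25519 -C "$(whoami)@$(hostname)" -f ~/.ssh/id_ed25519
```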
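As a concrete but hypothetical sketch of the chezmoi side, assuming an age identity file and a made-up recipient string; chezmoi's encryption documentation is the authoritative reference:

```bash
# Illustrative only: writes a fresh chezmoi config. The identity path and
# recipient string below are made up, so merge with any existing settings.
cat > ~/.config/chezmoi/chezmoi.toml <<'EOF'
encryption = "age"
[age]
    identity = "~/.config/age/key.txt"
    recipient = "age1examplerecipientxyz..."
EOF

# Files added with --encrypt are stored age-encrypted in the public repo.
chezmoi add --encrypt ~/.local/share/fonts/BerkeleyMono-Regular.otf
```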
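And a rough sketch of the per-YubiKey, multiple-recipient approach, assuming age and age-plugin-yubikey are installed; the key name, touch policy and recipient strings are placeholders:

```bash
# Generate an identity on the hardware key itself (run once per YubiKey);
# the name and touch policy here are placeholders.
age-plugin-yubikey --generate --name "yubikey-a" --touch-policy cached

# Encrypt to every YubiKey's recipient so any one of them can decrypt.
# The age1yubikey1... strings are fake placeholders.
age --encrypt \
    -r age1yubikey1qexamplerecipienta... \
    -r age1yubikey1qexamplerecipientb... \
    -o backup.tar.gz.age backup.tar.gz
```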

0 views