Posts in Bash (20 found)
Simon Willison 4 days ago

Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast, titled Data Journalism Unleashed with Simon Willison. I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links.

- What is data journalism and why it's the most interesting application of data analytics [02:03]: "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist."
- The origin story of Django at a small Kansas newspaper [02:31]: "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty. And at the time we thought we were building a content management system."
- Building the "Downloads Page" - a dynamic radio player of local bands [03:24]: "Adrian built a feature of the site called the Downloads Page. And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week."
- Working at The Guardian on data-driven reporting projects [04:44]: "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process."
- Washington Post's opioid crisis data project and sharing with local newspapers [05:22]: "Something the Washington Post did that I thought was extremely forward thinking is that they shared [the opioid files] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'"
- NICAR conference and the collaborative, non-competitive nature of data journalism [07:00]: "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." (See: NICAR 2026)
- ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02]: "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet], which is astonishing."
- The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31]: "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'"
- Datasette's plugin ecosystem and the vision of solving data publishing [12:36]: "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal."
- Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59]: "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York."
- Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40]: "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" (See: Bellingcat: Food Delivery Leak Unmasks Russian Security Agents)
- The frustration of open source: no feedback on how people use your software [16:14]: "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it."
- Open office hours on Fridays to learn how people use Datasette [16:49]: "I have an open office hours Calendly, where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people."
- Data cleaning as the universal complaint - 95% of time spent cleaning [17:34]: "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'"
- Version control problems in data teams - Python scripts on laptops without Git [17:43]: "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly."
- The Carpentries organization teaching scientists Git and software fundamentals [18:12]: "There's an organization called The Carpentries. Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that."
- Data documentation as an API contract problem [21:11]: "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it."
- The importance of "view source" on business reports [23:21]: "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%."
- Fact-checking process for data reporting [24:16]: "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them."
- Queries as first-class citizens with version history and comments [27:16]: "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it."
- Two types of documentation: official docs vs. temporal/timestamped notes [29:46]: "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them."
- Starting an internal blog without permission - instant credibility [30:24]: "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it."
- Building a search engine across seven documentation systems [31:35]: "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company."
- The TIL (Today I Learned) blog approach - celebrating learning basics [33:05]: "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash."
- Coding agents like Claude Code and their unexpected general-purpose power [34:53]: "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything."
- Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16]: "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." (See: Claude Skills are awesome, maybe a bigger deal than MCP)
- The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22]: "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025."
- Cursor for data? Generic agent loops vs. data-specific IDEs [38:18]: "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts."
- Future of BI tools: prompt-driven, instant dashboard creation [39:54]: "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box."
- Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06]: "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff."
- LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36]: "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it."
- Data enrichment: running cheap models in loops against thousands of records [44:36]: "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." (See: datasette-enrichments)
- Multimodal LLMs for images, audio transcription, and video processing [45:42]: "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." (Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents)
- First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54]: "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something."
- Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46]: "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." (See: Crowdsourced document analysis and MP expenses)
- Favorite test dataset: San Francisco's tree list, updated several times a week [48:44]: "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who."
- Showrunning TV shows as a management model - transferring vision to lieutenants [50:07]: "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." (See: The Eleven Laws of Showrunning by Javier Grillo-Marxuach)
- Hot take: all executable code with business value must be in version control [52:21]: "I think it's inexcusable to have executable code that has business value that is not in version control somewhere."
- Hacker News automation: GitHub Actions scraping for notifications [52:45]: "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire."
- Dream project: whale detection camera with Gemini AI [53:47]: "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale."
- Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23]: "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." (See: Mark Steel's in Town available episodes)
- Favorite fiction genre: British wizards caught up in bureaucracy [55:06]: "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." (See: The Laundry Files, Rivers of London, The Rook)

I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included elements. The project uses the following custom instructions:

"You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine."

I then added a follow-up prompt saying:

"Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end. Then suggest a very comprehensive list of supporting links I could find."

Here's the full Claude transcript of the analysis.


Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds. Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu. SOSP'25.

Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn't that what an OS is for? My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to harden a large OS to the point where a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and introduce the hypervisor instead.

This kind of decision comes up in other circumstances too. There are often two ways to implement something: one involves widespread changes to legacy code, and the other involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions.

The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data plane (software running on an ARM core is invoked for each packet). These cores are also used for the control plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run their own OS (e.g., Linux), separate from the host OS.

The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling: some cores are reserved for data-plane tasks, while others are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs.

[Figure. Source: https://dl.acm.org/doi/10.1145/3731569.3764851]

Dynamic Scheduling Headaches

Dynamic scheduling seems like a natural solution to this problem. The OS running on the SmartNIC could schedule a mix of data-plane and control-plane threads: data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren't many packets flowing. Section 3.2 says this is a no-go. It would be great if there were more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured so that it routes packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock and then synchronously communicate with the accelerator.

The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done asynchronously, but that would likely have ramifications for the entire control-plane software stack. This quote is telling:

"Furthermore, the CP ecosystem comprises 300-500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification."

Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU); the other two are physical CPUs (pCPUs). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality a Linux kernel module (5,800 lines of code) only lets the vCPU run at times of low data-plane activity.

[Figure. Source: https://dl.acm.org/doi/10.1145/3731569.3764851]
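Tai Chi needs a bespoke kernel module because the vCPU has to look permanently online to the scheduler while being gated underneath. You can get a rough feel for how Linux reacts to cores appearing and disappearing with the stock CPU hotplug interface; this is only an analogy for the underlying idea, not the paper's mechanism, and it assumes a Linux box where cpu3 is hotpluggable (cpu0 often is not):

    # Take core 3 away from the scheduler; runnable tasks migrate off it
    echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online
    nproc    # now reports one fewer online CPU

    # Give it back when there is spare capacity
    echo 1 | sudo tee /sys/devices/system/cpu/cpu3/online

The catch, and presumably one reason Tai Chi doesn't just do this, is that hotplug transitions are slow and disruptive, whereas the vCPU gate has to flip within microseconds of packets arriving.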
One neat trick the paper describes is the hardware workload probe. This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parse packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does, which lets the system hide the latency of the context switch from vCPU to pCPU.

Think of it like a group of students in a classroom with no teachers present (the teachers being network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees a teacher (a network packet) returning, they shout "act responsible", and everyone returns to their schoolwork (running data-plane code).

[Figure. Source: https://dl.acm.org/doi/10.1145/3731569.3764851]

Results

Section 6 of the paper has lots of data showing that data-plane throughput is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server.

[Figure. Source: https://dl.acm.org/doi/10.1145/3731569.3764851]

Dangling Pointers

To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase become tractable. If that happens, Jiu-Jitsu moves like this one will be less important.

devansh 3 weeks ago

Hitchhiker's Guide to Attack Surface Management

I first heard the term "ASM" (Attack Surface Management) probably in late 2018, and I thought it must be some complex infrastructure for tracking an organization's assets. Looking back, I realize I already had a similar stack for discovering, tracking, and monitoring obscure assets of organizations, and I was using it for my bug hunting adventures. I feel my stack was kinda goated: I was able to find obscure assets of Apple, Facebook, Shopify, Twitter, and many other Fortune 100 companies, and reported hundreds of bugs, all through automation.

Back in the day, projects like ProjectDiscovery didn't exist, so if I wanted an effective port scanner, I had to write it from scratch. (Masscan and nmap existed, but I had my fair share of issues using them; that's a story for another time.) I wrote DNS resolvers (massdns had a high error rate), port scanners, web scrapers, directory brute-force utilities, wordlists, lots of JavaScript parsing logic using regex, and a hell of a lot of other things. At one point I had 50+ self-developed tools for bug-bounty recon and another 60-something helper scripts written in Bash. I orchestrated them ("glued together with duct tape" is more accurate) into workflows and saved the output in text files. Whenever I dealt with a large number of domains, I distributed the load over multiple servers (spin up a server, SSH into it, SCP to push and pull files). The setup was very fragile and error-prone, and I spent countless nights debugging errors in the workflows.

But it was all worth it. I learned the art of Attack Surface Management without even trying to learn it. I was just a teenager trying to make quick bucks through bug hunting, and this fragile, duct-taped system was my edge.

Fast forward to today: I have now spent almost a decade in the bug bounty scene. I joined HackerOne in 2020 (to present) as a vulnerability triager, where I have triaged and reviewed tens of thousands of vulnerability submissions. Fair to say, I have seen a lot: from doomsday-level 0-days, to leaked credentials that could have led to entire-infrastructure compromise because some dev pushed an AWS secret key into git logs, to organizations that were not even aware they were running Jenkins servers on some obscure subdomain, which could have allowed RCE and then lateral movement into other layers of infrastructure. A lot of these issues were totally avoidable, if only the organizations had followed some basic attack surface management practices.

If you search "Guide to ASM" on the Internet, almost none of the supposed guides are real resources. They funnel you toward their own ASM solution; the "guide" is there to provide some surface-level information and is mostly a marketing gimmick. This is precisely why I decided to write something that covers everything I have learned and know about ASM, and how to protect your organization's assets before bad actors get to them. This is going to be a rough and raw guide, and it will not lead you into a funnel where I try to sell you my own ASM SaaS. I have nothing to sell, other than offering what I know. But if you are an organization that needs help implementing the things mentioned below, you can reach out to me via X or email (both available on the homepage of this blog).
This guide will give you a sense of exactly how big your attack surface really is. CISOs can use it to check whether their organizations have all of these areas covered, security researchers and bug hunters might find new ideas for where to look during recon, and devs can check whether they are unintentionally leaving any doors open for hackers. If you are into security, it has something to offer you.

"Attack surface" is one of those terms thrown around in security circles so much that it has become almost meaningless noise. In theory, it sounds simple enough: your attack surface is every single potential entry point, interaction vector, or exploitable interface an attacker could use to compromise your systems, steal your data, or generally wreck your day. It is the sum total of everything you have exposed to the internet. Every API endpoint you forgot about, every subdomain some dev spun up for "testing purposes" five years ago and then abandoned, every IoT device plugged into your network, every employee laptop connecting from a coffee shop, every third-party vendor with a backdoor into your environment, every cloud storage bucket with permissions that make no sense, every Slack channel, every git commit leaking credentials, every paste on Pastebin containing your database passwords.

Most organizations think about attack surface in incredibly narrow terms. They have a website, an email server, and maybe some VPN endpoints, and they believe that means "good visibility" into their assets. That's just plain wrong. Straight up wrong. Your actual attack surface would terrify you if you actually understood it.

Say your main domain is example.com (a placeholder for the rest of this guide). You probably know about the handful of subdomains in your asset management system: the www host, the mail server, maybe the API host. But what about the staging box your intern from 2015 spun up and never bothered to delete? It's not documented anywhere. Nobody remembers it exists.

Domain attack surface goes way beyond what's sitting in your asset management system. Every subdomain is a potential entry point, and most of these subdomains are completely forgotten. Subdomain enumeration is reconnaissance 101 for attackers and bug hunters; it's not rocket science. Setting up a tool that monitors active and passive sources for new subdomains and generates alerts is honestly an hour's worth of work (a minimal version is sketched below). You can use tools like Subfinder or Amass, or just mine Certificate Transparency logs, to discover every single subdomain connected to your domain.

Certificate Transparency logs were designed to increase security by making certificate issuance public, and they have become an absolute reconnaissance goldmine. Every time you get an SSL certificate for a subdomain, that information sits in public logs for anyone to find. Attackers systematically enumerate subdomains using Certificate Transparency log searches, DNS brute-forcing with massive wordlists, reverse DNS lookups to map IP ranges back to domains, historical DNS data from services like SecurityTrails, and zone transfer exploitation if your DNS is misconfigured. They are looking for old development environments still running vulnerable software, staging servers with production data sitting on them, forgotten admin panels, API endpoints without authentication, internal tools accidentally exposed, and test environments with default credentials nobody changed.

Every subdomain is an asset. Every asset is a potential vulnerability. Every vulnerability is an entry point.
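A minimal sketch of that monitoring setup, assuming subfinder and dnsx (both from ProjectDiscovery), jq, and curl are installed, and using the example.com placeholder; flags drift between releases, so check them against --help:

    #!/usr/bin/env bash
    # Sketch: daily passive subdomain sweep with a diff against yesterday's results
    domain="example.com"

    # Passive sources aggregated by subfinder
    subfinder -d "$domain" -all -silent > subs-subfinder.txt

    # Certificate Transparency logs via crt.sh
    curl -s "https://crt.sh/?q=%25.${domain}&output=json" \
      | jq -r '.[].name_value' | sed 's/^\*\.//' > subs-crtsh.txt

    # Merge, dedupe, keep only names that actually resolve
    sort -u subs-subfinder.txt subs-crtsh.txt | dnsx -silent | sort -u > subs-today.txt

    # Anything new since the last run is worth an alert
    comm -13 subs-yesterday.txt subs-today.txt

Cron this, mail yourself the diff, and you have the core of the "discover" half of ASM before lunch.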
Domains and subdomains are just the starting point, though. Once you have figured out all the subdomains belonging to your organization, the next step is to take a hard look at your IP address space, another absolutely massive component of your attack surface. Organizations own (or sometimes lease) IP ranges, sometimes small /24 blocks, sometimes massive /16 ranges, and every single IP address in those ranges that responds to external traffic is part of your attack surface. Attackers will enumerate them all if you won't. They use WHOIS lookups to identify your IP ranges, port scanning to find what services are running where, service fingerprinting to identify exact software versions, and banner grabbing to extract configuration information.

If you have a /24 network with 256 IP addresses and even 10% of those IPs are running services, you've got roughly 25 potential attack vectors. Scale that to a /20 or /16 and you're looking at thousands of potential entry points. And attackers aren't just looking at the IPs you know about. They're looking at adjacent IP ranges you might have acquired through mergers, historical IP allocations that haven't been properly decommissioned, and shared IP ranges where your servers coexist with others.

Traditional infrastructure was complicated enough, and now we have cloud, which has exploded organizations' attack surfaces in ways that are genuinely difficult to comprehend. Every cloud service you spin up, be it an EC2 instance, S3 bucket, Lambda function, or API Gateway endpoint, is a new attack vector. In my opinion and experience so far, the main issue with cloud infrastructure is that it's ephemeral and distributed. Resources get spun up and torn down constantly. Developers create instances for testing and forget about them. Auto-scaling groups generate new resources dynamically. Containerized workloads spin up massive Kubernetes clusters you have minimal visibility into.

Your cloud attack surface could be almost anything. Examples are countless, but I'd group them into 8 categories:

- Compute instances: EC2, Azure VMs, GCP Compute Engine instances exposed to the internet.
- Storage buckets: S3, Azure Blob Storage, GCP Cloud Storage with misconfigured permissions (a quick spot check is sketched below).
- Serverless: Lambda functions with public URLs or overly permissive IAM roles.
- API endpoints: API Gateway or Azure API Management endpoints without proper authentication.
- Container registries: Docker images with embedded secrets or vulnerabilities.
- Kubernetes clusters: exposed API servers, misconfigured network policies, vulnerable ingress controllers.
- Managed databases: RDS, CosmosDB, Cloud SQL instances with weak access controls.
- IAM roles and service accounts: overly permissive identities that enable privilege escalation.

I've seen cases where a single misconfigured S3 bucket policy exposed terabytes of data, where an overly permissive Lambda IAM role enabled lateral movement across an entire AWS account, and where a publicly accessible Kubernetes API server gave an attacker full cluster control. Honestly, cloud kinda scares me as well. And to top it off, multi-cloud infrastructure makes everything worse. If you're running AWS, Azure, and GCP together, you've just tripled your attack surface management complexity: each cloud provider has a different security model, different configuration profiles, and different attack vectors.
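That S3 spot check, as a hedged sketch; my-company-backups is a placeholder bucket name, and --no-sign-request makes the AWS CLI call anonymously, which is exactly the view an attacker has:

    bucket="my-company-backups"

    # If an anonymous listing succeeds, the bucket is world-readable
    aws s3 ls "s3://${bucket}" --no-sign-request && echo "PUBLIC LISTING: ${bucket}"

    # No AWS tooling needed: 200 means listable, 403 means it exists but is locked
    # down, 404 means no such bucket
    curl -s -o /dev/null -w '%{http_code}\n' "https://${bucket}.s3.amazonaws.com/"

The same anonymous-probe pattern works against Azure Blob and GCS endpoints; only the URL shapes differ.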
Every application now uses APIs; modern applications are effectively a constellation of APIs talking to each other. Every API your organization exposes is part of your attack surface. The problem with APIs is that they're often deployed without the same security scrutiny as traditional web applications. Developers spin up API endpoints for specific features, and those endpoints accumulate over time. Some of them are shadow APIs: endpoints that aren't documented anywhere. These endpoints are the equivalent of forgotten subdomains, and attackers find them by analyzing JavaScript files for API endpoint references, fuzzing common API path patterns, examining mobile app traffic to discover backend APIs, and mining old documentation or code repositories for deprecated endpoints. Your API attack surface includes REST APIs exposed to the internet, GraphQL endpoints with overly broad query capabilities, WebSocket connections for real-time functionality, gRPC services for inter-service communication, and legacy SOAP APIs that never got decommissioned.

If your organization ships mobile apps, iOS, Android, or both, they are a direct window into your infrastructure and should be part of your attack surface management strategy. Mobile apps communicate with backend APIs, and those endpoints are discoverable by reversing the app. The reversed source can reveal hard-coded API keys, tokens, and credentials. JADX plus APKTool plus Dex2jar is all a motivated attacker needs.

Web servers often expose directories and files that weren't meant to be publicly accessible. Attackers systematically enumerate these using automated tools like ffuf, dirbuster, gobuster, and wfuzz with massive wordlists, discovering hidden endpoints, configuration files, backup files, and administrative interfaces (a minimal ffuf invocation is sketched below). Commonly exposed directories include admin panels, backup directories containing database dumps or source code, configuration files with database credentials and API keys, development directories with debug information, documentation directories revealing internal systems, upload directories for file storage, and old or forgotten directories from previous deployments. Your attack surface must therefore include directories accidentally left accessible during deployments, staging servers with production data, backup directories with old source code versions, administrative interfaces without authentication, API documentation exposing endpoint details, and test directories with debug output enabled. Even if you've removed a directory from production, old cached versions may still be accessible through web caches or CDNs, and search engines index these directories, making them discoverable through dorking techniques.
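The ffuf invocation, as a sketch; target.example.com is a placeholder and wordlist.txt is whatever wordlist you favor (SecLists' directory lists are the usual choice):

    # Brute-force paths, keeping status codes that usually mean something real is there
    ffuf -u "https://target.example.com/FUZZ" \
         -w wordlist.txt \
         -mc 200,204,301,302,307,401,403 \
         -t 40 \
         -o ffuf-results.json -of json

A 403 is worth keeping: it often marks an admin panel or internal tool that exists but is access-controlled, which tells you where to dig next.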
If your organization uses IoT devices, and everyone does these days, they should be part of your attack surface management strategy, because they're invisible to traditional security tools. Your EDR solution doesn't protect IoT devices. Your vulnerability scanner can't inventory them. Your patch management system can't update them. Your IoT attack surface could include smart building systems (HVAC, lighting, access control), security cameras and surveillance systems, printers and copiers (which are computers with network access), badge readers and physical access systems, industrial control systems and SCADA devices, medical devices in healthcare environments, employee wearables and fitness trackers, and voice assistants and smart speakers. The problem with IoT devices is that they're often deployed without any security consideration: default credentials that never get changed, unpatched firmware with known vulnerabilities, no encryption for data in transit, weak authentication mechanisms, and insecure network configurations.

Social media presence is an attack surface component that most organizations completely ignore. Attackers use social media for reconnaissance: employee profiles on LinkedIn reveal organizational structure, technologies in use, and current projects; Twitter/X accounts leak information about deployments, outages, and technology stacks; employee GitHub profiles expose email patterns and development practices; company blogs announce new features before security review. It can also be a direct attack vector. Attackers use information from social media to craft convincing phishing attacks, hijacked social media accounts can spread malware or phishing links, employees can accidentally share sensitive information, and fake accounts can impersonate your brand to defraud customers. Your employees' social media presence is part of your attack surface whether you like it or not.

Third-party vendors, suppliers, contractors, and partners with access to your systems should be part of your attack surface too. Supply chain attacks are becoming more and more common. Attackers compromise a vendor with weaker security and then use that vendor's access to reach your environment, pivoting from the vendor network into your systems. This isn't a hypothetical scenario; it has happened multiple times. You might have heard about the SolarWinds attack, where attackers compromised SolarWinds' build system and distributed malware through software updates to thousands of customers. Another famous case study is the MOVEit Transfer vulnerability, exploited by the Cl0p ransomware group, which affected over 2,700 organizations. Your third-party attack surface includes VPNs, remote desktop connections, privileged access systems, third-party services holding API keys to your systems, login credentials shared with vendors, SaaS applications storing your data, and external IT support with administrative access. Obviously you can't directly control third-party security. You can audit vendors, have them pen-test their assets as part of your vendor compliance plan, and include security requirements in contracts, but ultimately their security posture is outside your control. And attackers know this.

GitHub, GitLab, Bitbucket: they are all a massive attack surface. Attackers search code repositories for hard-coded credentials (API keys, database passwords, tokens), private keys (SSH keys, TLS certificates, encryption keys), internal architecture details revealed in code comments, configuration files with database connection strings and internal URLs, and deprecated code with vulnerabilities that's still in production. Even private repositories aren't safe: attackers compromise developer accounts to access them, former employees retain access after leaving, and overly broad repository permissions grant access to far too many people. Automated scanners continuously monitor public repositories for secrets. The moment a developer accidentally pushes credentials to a public repository, automated systems detect it within minutes, and attackers have extracted and weaponized those credentials before the developer even realizes the mistake. Running the same kind of scan on your own repositories is cheap (see the sketch below).
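A minimal sketch using gitleaks (trufflehog is a popular alternative); the flags below are from recent gitleaks v8 releases, so verify them against your installed version:

    # Scan a local checkout, including full git history, for secret-shaped strings
    gitleaks detect --source /path/to/repo \
      --report-format json --report-path gitleaks-findings.json

    # Non-zero exit when leaks are found, so the same command drops straight into CI
    echo "gitleaks exit code: $?"

Wiring this into a pre-commit hook or CI job means a leaked key gets caught before it ever reaches the public timeline those scanners watch.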
CI/CD pipelines are another massive attack vector, especially in recent times, and not many organizations pay attention to it. This should absolutely be part of your attack surface management. Attackers compromise GitHub Actions workflows with malicious code injection, Jenkins servers with weak authentication, GitLab CI/CD variables containing secrets, and build artifacts with embedded malware. The GitHub Actions supply chain attack tracked as CVE-2025-30066 demonstrated this perfectly: attackers compromised a GitHub Action used in over 23,000 repositories, injecting malicious code that leaked secrets from build logs.

Jenkins specifically is a goldmine for attackers. An exposed Jenkins instance can provide complete control over multiple critical servers, access to hardcoded AWS keys, Redis credentials, and Bitbucket tokens, the ability to manipulate builds and inject malicious code, and exfiltration of production database credentials containing PII. Checking from the outside whether your Jenkins answers anonymously takes about 30 seconds (see the sketch below).
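A hedged sketch; jenkins.example.com is a placeholder for a host you are authorized to test, and the paths are standard Jenkins routes (/api/json is the REST root, /script is the Groovy script console):

    for path in /api/json /script /asynchPeople/; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "https://jenkins.example.com${path}")
      echo "${path} -> ${code}"   # 200 = anonymous access, 403 = auth required
    done

An anonymous 200 on /script is the worst case: the script console executes arbitrary Groovy on the controller, which is effectively remote code execution.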
Jira misconfiguration is a widespread attack surface issue. Common misconfigurations include public dashboards and filters where "Everyone" access actually means the public internet, anonymous access enabled so unauthenticated users can browse, user picker functionality that hands out complete lists of usernames and email addresses, and project visibility settings that leave sensitive projects accessible without authentication. Confluence misconfiguration exposes internal documentation. Its attack surface includes anonymous access at the site level allowing public access, public spaces where space admins grant anonymous permissions, inherited permissions where all content within a space inherits space-level access, and user profile visibility allowing anonymous users to view profiles of logged-in users. When anonymous access is enabled globally and space admins allow anonymous users into their spaces, anyone on the internet can read that content. Confluence spaces often contain internal documentation with hardcoded credentials, financial information, project details, employee information, and API documentation complete with authentication details. Cloud storage misconfiguration is epidemic. Google Drive's misconfiguration attack surface includes "Anyone with the link" sharing that makes files accessible without authentication, overly permissive sharing defaults that make it easy to share publicly by accident, inherited folder permissions that expose everything beneath a folder, unmanaged third-party apps with excessive read/write/delete permissions, inactive accounts through which former employees retain access, and external-ownership blind spots where externally owned content is shared into the environment. Metomic's 2023 Google Scanner Report found that of 6.5 million Google Drive files analyzed, 40.2% contained sensitive information, 34.2% were shared externally, and 0.5% were publicly accessible, mostly unintentionally. In December 2023, Japanese game developer Ateam suffered a catastrophic Google Drive misconfiguration that exposed the personal data of nearly 1 million people for over six years due to "Anyone with the link" settings. Valence research found that 22% of external data shares use open links, and 94% of those open-link shares are inactive: forgotten files with public URLs floating around the internet. Dropbox, OneDrive, and Box share similar attack surface components, including misconfigured sharing permissions, weak or missing password protection, overly broad access grants, third-party app integrations with excessive permissions, and lack of visibility into external sharing. The very features that make file sharing convenient create data leakage risks when misconfigured. Pastebin and similar paste sites are both reconnaissance sources and attack vectors. The paste site attack surface includes public dumps of stolen credentials, API keys, and databases; hosting of obfuscated malware payloads; C2 communications, where malware uses Pastebin for command and control; credential leakage from developers accidentally posting secrets; and security filter bypass, since Pastebin is a legitimate site that security tools don't block. For organizations, leaked API keys or database credentials on Pastebin lead to unauthorized access, data exfiltration, and service disruption. Attackers continuously scan Pastebin for mentions of target organizations using automated tools. Security teams must do the same: actively monitor Pastebin and similar paste sites for company name mentions, email domain references, and specific keywords related to the organization.
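A naive version of that monitoring can be done with nothing but curl and grep. The sketch below assumes Pastebin's public archive page links pastes by their eight-character IDs and that raw pastes are served at /raw/&lt;id&gt;; treat it purely as illustration, mind the site's rate limits and terms, and use the official scraping API for anything real:

```bash
#!/usr/bin/env bash
# Naive Pastebin keyword monitor (illustrative; respect rate limits and TOS).
set -uo pipefail

keywords='example\.com|ExampleCorp'   # placeholders: your domains and brand names

# Pull recent paste IDs from the public archive page (HTML layout assumed).
ids=$(curl -s https://pastebin.com/archive \
      | grep -oE 'href="/[A-Za-z0-9]{8}"' \
      | cut -d'"' -f2 | tr -d '/' | sort -u || true)

for id in $ids; do
  if curl -s "https://pastebin.com/raw/${id}" | grep -qiE "$keywords"; then
    echo "possible hit: https://pastebin.com/${id}"
  fi
  sleep 2   # be polite
done
```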
Because paste sites don't require registration or authentication and content is rarely removed, they've become permanent archives of leaked secrets. Container registries expose significant attack surface too. GitGuardian's analysis of 200,000 publicly available Docker images revealed a staggering secret exposure problem: 30,000 unique secrets across 19,000 images, meaning roughly 10% of scanned images contained secrets, and about 1,200 of those secrets (4%) were active and valid. Around 85% of the embedded secrets sat in immutable cached layers, where they can't simply be removed. Even more alarming, 99% of images containing active secrets were pulled in 2024, demonstrating real-world exploitation. Unit 42's research identified 941 Docker registries exposed to the internet, 117 of them accessible without authentication, together containing 2,956 repositories, 15,887 tags, and full source code with historical versions. Of those 117 unsecured registries, 80 allowed pull operations to download images, 92 allowed push operations to upload malicious images, and 7 allowed delete operations, opening the door to ransomware. Sysdig's analysis of over 250,000 Linux images on Docker Hub found 1,652 malicious images; cryptominers were the most common payload, followed by embedded secrets: SSH keys and public keys for backdoor implants, API keys and authentication tokens, and database credentials. The secrets found in container images included AWS access keys, database passwords, SSH private keys, API tokens for cloud services, GitHub personal access tokens, and TLS certificates. Shadow IT is its own sprawling category. It includes unapproved SaaS applications like Dropbox, Google Drive, and personal cloud storage used for work; personal BYOD laptops, tablets, and smartphones accessing corporate data; rogue cloud deployments where developers spin up AWS instances without approval; unauthorized messaging apps like WhatsApp, Telegram, and Signal used for business communication; and unapproved IoT devices like smart speakers, wireless cameras, and fitness trackers on the corporate network. Gartner estimates that shadow IT makes up 30-40% of IT spending in large companies, and 76% of organizations surveyed experienced cyberattacks due to exploitation of unknown, unmanaged, or poorly managed assets. Shadow IT expands your attack surface because it isn't protected by your security controls, monitored by your security team, included in your vulnerability scans, or patched by your IT department, and it often runs with weak or default credentials. You can't secure what you don't know exists.
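Closing that visibility gap starts with discovery you run yourself. A minimal sketch of a scheduled sweep that flags hosts which appeared on a subnet since the last run; the CIDR range is a placeholder, nmap must be installed, GNU date is assumed, and you should only ever scan networks you are authorized to scan:

```bash
#!/usr/bin/env bash
# Diff today's live hosts against yesterday's to surface unknown devices.
# 10.0.0.0/24 is a placeholder; scan only networks you own.
set -euo pipefail

subnet="10.0.0.0/24"
today="/var/tmp/hosts-$(date +%F).txt"
yesterday="/var/tmp/hosts-$(date -d yesterday +%F).txt"   # GNU date syntax

# Ping sweep in grepable output mode; keep just the IPs that responded.
nmap -sn "$subnet" -oG - | awk '/Status: Up/{print $2}' | sort > "$today"

if [ -f "$yesterday" ]; then
  echo "New hosts since yesterday:"
  comm -13 "$yesterday" "$today"
fi
```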
Bring Your Own Device (BYOD) policies sound great for employee flexibility and cost savings. For security teams, they're a nightmare. BYOD expands your attack surface by introducing unmanaged endpoints (personal devices without EDR, antivirus, or encryption), mixing personal and business use so that work data sits alongside personal apps of unknown security, connecting from untrusted networks like public Wi-Fi and home networks with compromised routers, installing unapproved applications with malware or excessive permissions, and lacking consistent security updates, with devices running outdated operating systems. Common BYOD security issues include data leakage through personal cloud backup services, malware infections from personal app downloads, lost or stolen devices containing corporate data, family members using devices that access work systems, and a general lack of IT visibility and control. Roughly 60% of small and mid-sized businesses close within six months of a major cyberattack, and BYOD-related security gaps are often among the contributing factors. Remote access infrastructure, VPNs and Remote Desktop Protocol (RDP) in particular, is among the most exploited attack vectors. SSL VPN appliances from vendors like Fortinet, SonicWall, Check Point, and Palo Alto are under constant attack. VPN attack vectors include authentication bypass vulnerabilities (CVEs allowing attackers to hijack active sessions), credential stuffing that brute-forces VPN logins with leaked credentials, exploitation of unpatched critical CVEs in VPN appliances, and configuration weaknesses such as default credentials, weak passwords, and missing MFA. Real-world attacks demonstrate the risk: Check Point SSL VPN CVE-2024-24919 allowed authentication bypass for session hijacking, Fortinet SSL-VPN vulnerabilities were leveraged for lateral movement and privilege escalation, and SonicWall CVE-2024-53704 allowed remote authentication bypass for SSL VPN. Once inside via VPN, attackers conduct network reconnaissance, lateral movement, and privilege escalation. RDP is worse. Sophos found that cybercriminals abused RDP in 90% of the attacks they investigated, and external remote services like RDP were the initial access vector in 65% of incident response cases. RDP attack vectors include exposed RDP ports (port 3389 open to the internet), weak authentication with simple passwords vulnerable to brute force, lack of MFA, and credential reuse from passwords compromised in data breaches. In one Darktrace case, attackers compromised an organization four times in six months, each time through exposed RDP ports. The attack chain ran from successful RDP login to internal reconnaissance via WMI, lateral movement via PsExec, and finally objective achievement. The Palo Alto Unit 42 Incident Response report found RDP was the initial attack vector in 50% of ransomware deployment cases. Email infrastructure remains a primary attack vector as well. Your email attack surface includes mail servers (Exchange, Office 365, Gmail) with configuration weaknesses, misconfigured SPF, DKIM, and DMARC records, phishing-susceptible users targeted through social engineering, email attachments and links as malware delivery mechanisms, and accounts compromised through credential stuffing or password reuse. Email authentication misconfiguration is particularly insidious: if your SPF, DKIM, and DMARC records are wrong or missing, attackers can spoof emails from your domain, your legitimate emails get marked as spam, and phishing emails impersonating your organization succeed. Email servers themselves are also targets; the NSA released guidance on Microsoft Exchange Server security specifically because Exchange servers are so frequently compromised.
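Checking the DNS half of that is trivial and worth automating. A minimal sketch using dig (example.com is a placeholder; DKIM is omitted because its record lives under a per-provider selector name):

```bash
#!/usr/bin/env bash
# Quick SPF/DMARC presence check for a domain. example.com is a placeholder.
domain="${1:-example.com}"

spf=$(dig +short TXT "$domain" | grep -i 'v=spf1' || true)
dmarc=$(dig +short TXT "_dmarc.$domain" | grep -i 'v=DMARC1' || true)

[ -n "$spf" ]   && echo "SPF:   $spf"   || echo "SPF:   MISSING"
[ -n "$dmarc" ] && echo "DMARC: $dmarc" || echo "DMARC: MISSING"
```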
Container orchestration platforms like Kubernetes introduce massive attack surface complexity. The Kubernetes attack surface includes the API server (exposed or misconfigured endpoints), container images with vulnerabilities in base images or application layers, container registries like Docker Hub, ECR, and GCR with weak access controls, overly permissive pod security policies, network policies with insufficient micro-segmentation between pods, secrets management with hardcoded secrets or weak secret storage, and RBAC misconfigurations granting overly broad service account permissions. Container security issues include containers running as root with excessive privileges, exposed Docker daemon sockets allowing container escape, vulnerable dependencies in container images, and a lack of runtime security monitoring. The Docker daemon attack surface is particularly concerning: running containers with privileged access or allowing access to docker.sock can enable container escape and host compromise. Serverless computing (AWS Lambda, Azure Functions, Google Cloud Functions) promised to eliminate infrastructure management. Instead, it just created new attack surfaces. Serverless attack surface components include function code vulnerabilities such as injection flaws and insecure dependencies, IAM misconfigurations with overly permissive Lambda execution roles, environment variables storing secrets as plain text, function URLs publicly accessible without authentication, and event source mappings that accept untrusted input from various cloud services. The sheer abundance of event sources expands the attack surface: Lambda functions can be triggered by S3 events, API Gateway requests, DynamoDB streams, SNS topics, EventBridge schedules, IoT events, and dozens more, and each event source is a potential injection point. If function input validation is insufficient, attackers can manipulate event data to exploit the function. Real-world Lambda attacks include credential theft (exfiltrating IAM credentials from environment variables), lateral movement using over-permissioned roles to access other AWS resources, and data exfiltration by invoking functions to query and extract database contents. The SCARLETEEL adversary specifically targeted AWS Lambda for credential theft and lateral movement. Microservices architecture multiplies attack surface by decomposing monolithic applications into dozens or hundreds of independent services. Each microservice has its own attack surface: authentication mechanisms (each service needs to verify requests), authorization rules (each service enforces access controls), API endpoints for service-to-service communication, data stores (each service may have its own database), and network interfaces (each service exposes network ports). Microservices security challenges include east-west traffic vulnerabilities where service-to-service communication lacks encryption or authentication; the complexity of managing authentication and authorization (40-plus services, times three environments, times both authn and authz rules, is on the order of 240 distinct configurations); service-to-service trust where services blindly trust internal traffic; network segmentation failures with flat networks allowing unrestricted pod-to-pod communication; and inconsistent security policies, with different services held to different security standards. One compromised microservice can enable lateral movement across the entire application. Without proper network segmentation and a zero trust architecture, attackers pivot from service to service.
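One concrete, low-effort audit on the Kubernetes side: ask the cluster what each namespace's default service account is actually allowed to do, since workloads that don't specify a service account run as it. A minimal sketch (requires read access to the cluster; kubectl's impersonation flag does the work):

```bash
#!/usr/bin/env bash
# List effective permissions of the default service account per namespace.
# Broad wildcard entries in the output are a red flag for lateral movement.
set -euo pipefail

for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  echo "== namespace: $ns =="
  kubectl auth can-i --list \
    --as="system:serviceaccount:${ns}:default" -n "$ns"
done
```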
How do you measure something this large? Attack surface measurement is genuinely hard. Useful metrics include the total number of discovered assets (systems, applications, and devices), newly discovered assets found through continuous discovery, the number of assets exposed to the internet, open ports and network services listening for connections, vulnerabilities by severity (critical, high, medium, and low CVEs), mean time to detect (MTTD) for how quickly new assets are discovered, mean time to remediate (MTTR) for how quickly vulnerabilities are fixed, shadow IT assets that are unknown or unmanaged, third-party exposure from vendor and partner access points, and the attack surface change rate, showing how rapidly the attack surface evolves. Academic research has produced formal attack surface measurement methods; Pratyusa Manadhata's foundational work defines attack surface as a three-tuple of system attackability, channel attackability, and data attackability. In practice, though, most organizations struggle with basic attack surface visibility, let alone quantitative measurement. Your attack surface isn't static; it changes constantly. Developers deploy new services and APIs, cloud auto-scaling spins up new instances, shadow IT appears as employees adopt unapproved tools, acquisitions bring new infrastructure into your environment, IoT devices get plugged into your network, and subdomains get created for new projects. Static, point-in-time assessments are obsolete; you need continuous asset discovery and monitoring. Continuous discovery methods include automated network scanning (regular scans to detect new devices), cloud API polling to query cloud provider APIs for resource changes, DNS monitoring to track new subdomains via Certificate Transparency logs (a minimal example follows below), passive traffic analysis to observe network traffic and identify assets, integration with CMDB or ITSM systems to sync with configuration management databases, and cloud inventory automation using Infrastructure as Code to track deployments. Understanding your attack surface is step one; reducing it is the goal. Attack surface reduction begins with asset elimination, removing unnecessary assets entirely: decommissioning unused servers and applications, deleting abandoned subdomains and DNS records, shutting down forgotten development environments, disabling unused network services and ports, and removing unused user accounts and service identities. Access control hardening implements least privilege everywhere: enforcing multi-factor authentication (MFA) for all remote access, using role-based access control (RBAC) for cloud resources, implementing zero trust network architecture, restricting network access with micro-segmentation, and applying the principle of least privilege to IAM roles. Exposure minimization reduces what's visible to attackers: moving services behind VPNs or bastion hosts, using private IP ranges for internal services, implementing network address translation (NAT) for outbound access, restricting API endpoints to authorized sources only, and disabling unnecessary features and functionality. Security hardening strengthens what remains: applying security patches promptly, using security configuration baselines, enabling encryption for data in transit and at rest, implementing Web Application Firewalls (WAF) for web apps, and deploying endpoint detection and response (EDR) on all devices. Finally, monitoring and detection watch for attacks in progress: implementing real-time threat detection, enabling comprehensive logging and SIEM integration, deploying intrusion detection and prevention systems (IDS/IPS), monitoring for anomalous behavior patterns, and using threat intelligence feeds to identify known bad actors.
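The Certificate Transparency angle mentioned above is the easiest place to start, because CT logs are public. A minimal sketch that pulls every certificate name ever logged for a domain from crt.sh (example.com is a placeholder; jq must be installed, and crt.sh rate-limits aggressive use):

```bash
#!/usr/bin/env bash
# Enumerate subdomains of a domain from Certificate Transparency logs.
# Attackers run exactly this query; you should run it first.
set -euo pipefail

domain="${1:-example.com}"

curl -s "https://crt.sh/?q=%25.${domain}&output=json" \
  | jq -r '.[].name_value' \
  | tr '[:upper:]' '[:lower:]' \
  | sed 's/^\*\.//' \
  | sort -u
```

Diff the output against your asset inventory on a schedule; names that appear in CT but not in the inventory are exactly the forgotten subdomains the article warns about.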
Your attack surface is exponentially larger than you think it is. Every asset you know about probably has three you don't. Every known vulnerability probably has ten undiscovered ones. Every third-party integration probably grants more access than you realize. Every collaboration tool is leaking more data than you imagine. Every paste site contains more of your secrets than you want to admit. And attackers know this. They're not just looking at what you think you've secured. They're systematically enumerating every possible entry point. They're mining Certificate Transparency logs for forgotten subdomains. They're scanning every IP in your address space. They're reverse-engineering your mobile apps. They're buying employee credentials from data breach databases. They're compromising your vendors to reach you. They're scraping Pastebin for your leaked secrets. They're pulling your public Docker images and extracting the embedded credentials. They're accessing your misconfigured S3 buckets and exfiltrating terabytes of data. They're exploiting your exposed Jenkins instances to compromise your entire infrastructure. They're manipulating your AI agents to exfiltrate private Notion data. The asymmetry is brutal: you have to defend every single attack vector, while they only need to find one that works. So what do you do? Start by accepting that you don't have complete visibility. Nobody does. But you can work toward better visibility through continuous discovery, automated asset management, and integration of security tools that help map your actual attack surface. Implement attack surface reduction aggressively. Every asset you eliminate is one less thing to defend. Every service you shut down is one less potential vulnerability. Every piece of shadow IT you discover and bring under management is one less blind spot. Every misconfigured cloud storage bucket you fix is terabytes of data no longer exposed. Every leaked secret you rotate is one less credential floating around the internet. Adopt zero trust architecture: stop assuming that anything (internal services, microservices, authenticated users, collaboration tools) is inherently trustworthy, and verify everything. Monitor paste sites and code repositories; your secrets are out there, so find them before attackers weaponize them. Secure your collaboration tools: Slack, Trello, Jira, Confluence, Notion, Google Drive, and Airtable are all leaking data, so lock them down. Fix your container security: scan images for secrets, use secret managers instead of environment variables, and secure your registries. Harden your CI/CD pipelines: Jenkins, GitHub Actions, and GitLab CI are high-value targets, so protect them. And test your assumptions with red team exercises and continuous security testing. Your attack surface is what an attacker can reach, not what you think you've secured. The attack surface problem isn't getting better. Cloud adoption, DevOps practices, remote work, IoT proliferation, supply chain complexity, collaboration tool sprawl, and container adoption are all expanding organizational attack surfaces faster than security teams can keep up. But understanding the problem is the first step toward managing it.
And now you understand exactly how catastrophically large your attack surface actually is.

1 views
Simon Willison 3 weeks ago

Video + notes on upgrading a Datasette plugin for the latest 1.0 alpha, with help from uv and OpenAI Codex CLI

I'm upgrading various plugins for compatibility with the new Datasette 1.0a20 alpha release and I decided to record a video of the process. This post accompanies that video with detailed additional notes. I picked a very simple plugin to illustrate the upgrade process (possibly too simple). datasette-checkbox adds just one feature to Datasette: if you are viewing a table with boolean columns (detected as integer columns with particular name patterns) and your current user has permission to update rows in that table, it adds an inline checkbox UI. I built the first version with the help of Claude back in August 2024 - details in this issue comment . Most of the implementation is JavaScript that makes calls to Datasette 1.0's JSON write API . The Python code just checks that the user has the necessary permissions before including the extra JavaScript. The first step in upgrading any plugin is to run its tests against the latest Datasette version. Thankfully uv makes it easy to run code in scratch virtual environments that include the different code versions you want to test against. I have a test utility, short for "test against development Datasette", which I use for that purpose. I can run it in any plugin directory and it will run the existing plugin tests against whatever version of Datasette I have checked out locally. You can see the full implementation (and its companion described below) in this TIL . I started by running it in the plugin directory, and got my first failure... but it wasn't due to permissions, it was because the plugin's dependency was pinned to a specific mismatched version of Datasette. I fixed this problem by loosening that pin and ran the tests again... and they passed! Which was a problem, because I was expecting permission-related failures. It turns out when I first wrote the plugin I was lazy with the tests - they weren't actually confirming that the table page loaded without errors. I needed to actually run the code myself to see the expected bug. First I created myself a demo database using sqlite-utils create-table , then ran Datasette against the plugin's code. Sure enough, visiting the table page produced a 500 error about a missing method. The next step was to update the test to also trigger this error, and now it fails as expected. At this point I could have manually fixed the plugin itself - which would likely have been faster given the small size of the fix - but instead I demonstrated a bash one-liner I've been using to apply these kinds of changes automatically: it runs OpenAI Codex in non-interactive mode, looping until it has finished the prompt you give it. I told it to consult the subset of the Datasette upgrade documentation that talks about Datasette permissions and then get the test command to pass. This is an example of what I call designing agentic loops - I gave Codex the tools it needed and a clear goal and let it get to work on my behalf. The remainder of the video covers finishing up the work - testing the fix manually, committing my work, then shipping a 0.1a4 release to PyPI using the pattern described in this TIL . Finally, I demonstrated that the shipped plugin worked in a fresh environment, installing and running a fresh Datasette instance with a fresh copy of the new alpha plugin. It's a neat way of confirming that freshly released software works as expected.
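The extract above lost the actual command, but the shape of that final check is worth showing. A hypothetical reconstruction using uvx, uv's ephemeral-environment runner (the version pins and database name are assumptions, not the author's verbatim command):

```bash
# Run a throwaway Datasette with the freshly released alpha plugin installed.
# Version pins are illustrative; demo.db is the demo database from earlier.
uvx --from 'datasette==1.0a20' \
    --with 'datasette-checkbox==0.1a4' \
    datasette demo.db
```

Because uvx builds a fresh environment on each invocation, nothing from the development checkout can leak in, which is the whole point of the check.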
This video was shot in a single take using Descript , with no rehearsal and perilously little preparation in advance. I recorded through my AirPods and applied the "Studio Sound" filter to clean up the audio. I pasted in a closing slide from my previous video and exported it locally at 1080p, then uploaded it to YouTube. Something I learned from the Software Carpentry instructor training course is that making mistakes in front of an audience is actively helpful - it helps them see a realistic version of how software development works and they can learn from watching you recover. I see this as a great excuse for not editing out all of my mistakes! I'm trying to build new habits around video content that let me produce useful videos while minimizing the amount of time I spend on production. I plan to iterate more on the format as I get more comfortable with the process. I'm hoping I can find the right balance between production time and value to viewers. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Luke Hsiao 3 weeks ago

Switching from GPG to age

It’s been several years since I went through all the trouble of setting up my own GPG keys and securing them in YubiKeys following drduh’s guide . With that approach, you generate one key securely offline and store it on multiple YubiKeys, along with a backup. It has worked well for me for years, and as the Lindy effect suggests, it would almost certainly continue to. But as my sub-keys were nearing expiration, I was faced with either renewing them (more convenient, no forward secrecy ) or rotating them (rather painful, but potentially more secure). However, I’ve realized that I essentially only use these keys for encryption, and almost never for signing. So, instead of doing either of the usual options, I’m going to let my keys expire entirely. I’m now experimenting with age , which touts itself as “simple, modern, and secure encryption”. If needed, I will use a separate tool for signatures. This required changing a couple of things in my typical workflow. First, and foremost, I needed to switch from pass to passage , a fork of pass that uses age as the backend. This was actually surprisingly easy because passage includes a simple bash script to do the migration. There is no installer for passage , and no Arch packages. But it’s easy enough to install because it’s just a shell script you can throw on your PATH. Note that for Arch, I also needed to install one dependency that it assumes you have. I also alias it to pass on my machine. The benefit of this is that everything that had pass integration has continued to “just work”. For example, my email client of choice behaved exactly the same after the migration. I would occasionally use gpg-agent as my SSH agent on my machines. It was convenient. However, I also like the idea of having a dedicated SSH key per machine: it makes monitoring their usage and revoking them much finer-grained. Dropping gpg-agent forced me to set up new keys on all my machines and add them to various servers/services. Easy encryption with chezmoi While I was tending to the encryption area of my personal tech “garden”, I also started leveraging chezmoi’s encryption features . I already use chezmoi for my configuration files, but with encryption, I could also easily add “secrets” to my public dotfiles repo . In my case so far, this just means my copies of my favorite paid font: Berkeley Mono . There is also a nice guide for configuring chezmoi to encrypt while asking for a passphrase only once. I was also very pleasantly surprised with how easy it was to switch my YubiKeys over! Last time I set up the GPG keys on my YubiKeys, I spent several hours. This time, with the help of age’s YubiKey plugin and by embracing the idea of having unique keys on each YubiKey, but encrypting everything for multiple recipients, setting up my keys was surprisingly trivial. It also generates the keys securely on the hardware key itself, which is nice. The whole process probably took 30 minutes. It was so easy that in the future, I’m very much not intimidated by the thought of rotating keys. Did I need to switch to age ? Of course not. However, over my career, I repeatedly find that exploring new tools for your core workflows (part of investing in interfaces ) is just plain fun. I often learn new ways of thinking about problems. Sometimes, you walk away with a new default that brings some fresh ideas and some delight to your life. Other times, you walk away with your trusty old tool, with greater appreciation for its history and the hard-earned approach it has established. For my uses at the moment, age definitely falls into the former camp.
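For reference, the multiple-recipients pattern described above looks like this with the age CLI (the recipient strings and file names are placeholders):

```bash
# Encrypt one file to several recipients, e.g. one identity per YubiKey,
# so that any single key can decrypt it later.
age -r age1example1recipient... \
    -r age1example2recipient... \
    -o secrets.tar.age secrets.tar

# Decrypt with whichever identity file is available on this machine.
age -d -i identity.txt -o secrets.tar secrets.tar.age
```

Encrypting to every key up front is what makes per-device keys painless: losing or rotating one YubiKey never locks you out of the data.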

0 views
xenodium 3 weeks ago

agent-shell 0.17 improvements + MELPA

While it's only been a few weeks since the last agent-shell post , there are plenty of new updates to share. What's agent-shell again? A native Emacs shell to interact with any LLM agent powered by ACP ( Agent Client Protocol ). Before getting to the latest and greatest, I'd like to say thank you to new and existing sponsors backing my projects. While the work going in remains largely unsustainable, your contributions are indeed helping me get closer to sustainability. Thank you! If you benefit from my content and projects, please consider sponsoring to make the work sustainable. Work paying for your LLM tokens and other tools? Why not get your employer to sponsor agent-shell also? Now on to the very first update… Both agent-shell and acp.el are now available on MELPA. As such, installation now boils down to installing the packages straight from MELPA. OpenCode and Qwen Code are two of the latest agents to join agent-shell . Both are accessible through the agent picker, as well as directly via their own dedicated commands. Adding files as context has seen quite a few improvements in different shapes. Thank you Ian Davidson for contributing embedded context support. You can now take a screenshot and automatically send it over to the agent. A little side-note: did you notice the activity indicator in the header bar? Yep. That's new too. While file completion remains experimental, it can be enabled via a user option. From any file you can now send the current file to the agent; if a region is selected, region information is sent too. Fancy sending a different file other than the current one? Invoke with a prefix argument. This also operates on files (selection or region), DWIM style ;-) You may have noticed paths in section titles are no longer displayed as absolute paths. We're shortening those relative to project roots. While you can invoke agent-shell with a prefix to create new shells, a dedicated (and more discoverable) command for that is now available. Cancelling prompt sessions is much more reliable now. If you experienced a shell getting stuck after cancelling a session, that's because we were missing part of the protocol implementation. This is now implemented. A new command automatically inserts shell (i.e. bash) command output. Initial work for automatically saving markdown transcripts is now in place. We're still iterating on it, but if you're keen to try things out, you can enable it via a user option. Applied changes are now displayed inline. New commands can now be used to change the session mode. You can now find out what capabilities and session modes are supported by your agent by expanding either of the two sections. Tired of accepting or rejecting changes hunk by hunk from the diff buffer? A single key from the diff viewer now accepts all hunks. Same goes for rejecting. We also get a new basic transient menu. We got lots of awesome pull requests from wonderful folks. Thank you for your contributions! Beyond what's been showcased here, much love and effort's been poured into polishing the experience. Interested in the nitty-gritty? Have a look through the 173 commits since the last blog post. If agent-shell or acp.el are useful to you, please consider sponsoring its development. LLM tokens aren't free, and neither is the time dedicated to building this stuff ;-) Arthur Heymans : Add a Package-Requires header ( PR ). Elle Najt : Execute commands in devcontainer ( PR ). Elle Najt : Fix Write tool diff preview for new files ( PR ). Elle Najt : Inline display of historical changes ( PR ). Elle Najt : Live Markdown transcripts ( PR ).
Elle Najt : Prompt session mode cycling and modeline display ( PR ). Fritz Grabo : Devcontainer fallback workspace ( PR ). Guilherme Pires : Codex subscription auth ( PR ). Hordur Freyr Yngvason : Make qwen authentication optional ( PR ). Ian Davidson : Embedded context support ( PR ). Julian Hirn : Fix quick-diff window restoration for full-screen ( PR ). Ruslan Kamashev : Hide header line altogether ( PR ). festive-onion : Show Planning mode more reliably ( PR ).

0 views

How I Use Every Claude Code Feature

I use Claude Code. A lot. As a hobbyist, I run it in a VM several times a week on side projects, often to vibe code whatever idea is on my mind. Professionally, part of my team builds the AI-IDE rules and tooling for our engineering team, which consumes several billion tokens per month just for codegen. The CLI agent space is getting crowded, and between Claude Code, Gemini CLI, Cursor, and Codex CLI, it feels like the real race is between Anthropic and OpenAI. But TBH when I talk to other developers, their choice often comes down to what feels like superficials: a "lucky" feature implementation or a system prompt "vibe" they just prefer. At this point these tools are all pretty good. I also feel like folks often over-index on the output style or UI. To me the "you're absolutely right!" sycophancy isn't a notable bug; it's a signal that you're too in-the-loop. Generally my goal is to "shoot and forget": to delegate, set the context, and let it work, judging the tool by the final PR and not how it gets there. Having stuck to Claude Code for the last few months, this post is my set of reflections on Claude Code's entire ecosystem. We'll cover nearly every feature I use (and, just as importantly, the ones I don't), from the foundational CLAUDE.md file and custom slash commands to the powerful world of Subagents, Hooks, and GitHub Actions. This post ended up a bit long and I'd recommend it as more of a reference than something to read in its entirety. The single most important file in your codebase for using Claude Code effectively is the root CLAUDE.md . This file is the agent's "constitution," its primary source of truth for how your specific repository works. How you treat this file depends on the context. For my hobby projects, I let Claude dump whatever it wants in there. For my professional work, our monorepo's CLAUDE.md is strictly maintained and currently sits at 13KB (I could easily see it growing to 25KB). It only documents tools and APIs used by 30% (arbitrary) or more of our engineers (other tools are documented in product- or library-specific markdown files). We've even started allocating what is effectively a max token count for each internal tool's documentation, almost like selling "ad space" to teams. If you can't explain your tool concisely, it's not ready for the CLAUDE.md . Over time, we've developed a strong, opinionated philosophy for writing an effective CLAUDE.md . Start with Guardrails, Not a Manual. Your CLAUDE.md should start small, documenting based on what Claude is getting wrong. Don't @-File Docs. If you have extensive documentation elsewhere, it's tempting to @-mention those files in your CLAUDE.md . This bloats the context window by embedding the entire file on every run. But if you just mention the path, Claude will often ignore it. You have to pitch the agent on why and when to read the file: "For complex usage, or if you encounter a particular error, see this file for advanced troubleshooting steps." Don't Just Say "Never." Avoid negative-only constraints like "Never use this flag." The agent will get stuck when it thinks it must use that flag. Always provide an alternative. Use CLAUDE.md as a Forcing Function. If your CLI commands are complex and verbose, don't write paragraphs of documentation to explain them. That's patching a human problem. Instead, write a simple bash wrapper with a clear, intuitive API and document that instead, as in the sketch below. Keeping your CLAUDE.md as short as possible is a fantastic forcing function for simplifying your codebase and internal tooling.
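A hypothetical illustration of that last point (not the author's actual tooling): rather than documenting a gnarly test invocation in CLAUDE.md, hide it behind a wrapper and document only the wrapper:

```bash
#!/usr/bin/env bash
# run-tests: the one test entry point agents and humans need to know.
# Usage: run-tests [path]
# All the flags, env vars, and reporter setup live here, not in CLAUDE.md.
set -euo pipefail

export TEST_ENV=local   # hypothetical env var the suite expects
exec pytest "${1:-.}" -q --maxfail=1
```

CLAUDE.md then needs exactly one line: "Run tests with run-tests [path]."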
Finally, we keep this file synced with an equivalent rules file to maintain compatibility with other AI IDEs that our engineers might be using. If you are looking for more tips on writing markdown for coding agents, see "AI Can't Read Your Docs", "AI-powered Software Engineering", and "How Cursor (AI IDE) Works". The Takeaway: Treat your CLAUDE.md as a high-level, curated set of guardrails and pointers. Use it to guide where you need to invest in more AI- (and human-) friendly tools, rather than trying to make it a comprehensive manual. Thanks for reading Shrivu's Substack! Subscribe for free to receive new posts and support my work. I recommend running /context mid coding session at least once to understand how you are using your 200k token context window (even with Sonnet-1M, I don't trust that the full context window is actually used effectively). For us, a fresh session in our monorepo costs a baseline ~20k tokens (10%), with the remaining 180k for making your change, which can fill up quite fast. Picture /context output from one of my recent side projects: you can almost think of it like disk space that fills up as you work on a feature. After a few minutes or hours you'll need to clear the messages to make space to continue. I have three main workflows. /compact (Avoid): I avoid this as much as possible; the automatic compaction is opaque, error-prone, and not well-optimized. /clear plus a catch-up command (Simple Restart): my default reboot. I /clear the state, then run a custom command that makes Claude read all changed files in my git branch. "Document & Clear" (Complex Restart): for large tasks, I have Claude dump its plan and progress into a markdown file, /clear the state, then start a new session by telling it to read that file and continue. The Takeaway: Don't trust auto-compaction. Use /clear for simple reboots and the "Document & Clear" method to create durable, external "memory" for complex tasks. I think of slash commands as simple shortcuts for frequently used prompts, nothing more. My setup is minimal: the catch-up command I mentioned earlier, which just prompts Claude to read all changed files in my current git branch, and a simple helper to clean up my code, stage it, and prepare a pull request. IMHO if you have a long list of complex, custom slash commands, you've created an anti-pattern. To me the entire point of an agent like Claude is that you can type almost whatever you want and get a useful, mergable result. The moment you force an engineer (or non-engineer) to learn a new, documented-somewhere list of essential magic commands just to get work done, you've failed. The Takeaway: Use slash commands as simple, personal shortcuts, not as a replacement for building a more intuitive and better-tooled agent. On paper, custom subagents are Claude Code's most powerful feature for context management. The pitch is simple: a complex task requires a pile of input context (e.g., how to run tests), accumulates a pile of working context, and produces a small final answer, so farming the work out to specialized agents, which only return those small final answers, keeps your main context clean. I find that in practice, custom subagents create two new problems. They Gatekeep Context: if I make a testing subagent, I've now hidden all testing context from my main agent. It can no longer reason holistically about a change; it's forced to invoke the subagent just to know how to validate its own code. They Force Human Workflows: worse, they force Claude into a rigid, human-defined workflow.
I'm now dictating how it must delegate, which is the very problem I'm trying to get the agent to solve for me. My preferred alternative is to use Claude's built-in Task feature to spawn clones of the general agent. I put all my key context in the CLAUDE.md . Then I let the main agent decide when and how to delegate work to copies of itself. This gives me all the context-saving benefits of subagents without the drawbacks: the agent manages its own orchestration dynamically. In my "Building Multi-Agent Systems (Part 2)" post, I called this the "Master-Clone" architecture, and I strongly prefer it over the "Lead-Specialist" model that custom subagents encourage. The Takeaway: Custom subagents are a brittle solution. Give your main agent the context (in CLAUDE.md ) and let it use its own Task feature to manage delegation. On a simple level, I use the resume options ( --resume and --continue ) frequently. They're great for restarting a bugged terminal or quickly rebooting an older session. I'll often resume a session from days ago just to ask the agent to summarize how it overcame a specific error, which I then use to improve our CLAUDE.md and internal tooling. More in the weeds, Claude Code stores all session history under ~/.claude , and you can tap into that raw historical session data. I have scripts that run meta-analysis on these logs, looking for common exceptions, permission requests, and error patterns to help improve agent-facing context. The Takeaway: Use the resume options to restart sessions and uncover buried historical context. Hooks are huge. I don't use them for hobby projects, but they are critical for steering Claude in a complex enterprise repo. They are the deterministic "must-do" rules that complement the "should-do" suggestions in CLAUDE.md . We use two types. Block-at-Submit Hooks: this is our primary strategy. We have a hook that wraps any commit command. It checks for a marker file, which our test script only creates if all tests pass. If the file is missing, the hook blocks the commit, forcing Claude into a "test-and-fix" loop until the build is green. Hint Hooks: these are simple, non-blocking hooks that provide "fire-and-forget" feedback if the agent is doing something suboptimal. We intentionally do not use "block-at-write" hooks (e.g., on file edits or writes). Blocking an agent mid-plan confuses or even "frustrates" it. It's far more effective to let it finish its work and then check the final, completed result at the commit stage. The Takeaway: Use hooks to enforce state validation at commit time. Avoid blocking at write time; let the agent finish its plan, then check the final result. Planning is essential for any "large" feature change with an AI IDE. For my hobby projects, I exclusively use the built-in planning mode. It's a way to align with Claude before it starts, defining both how to build something and the "inspection checkpoints" where it needs to stop and show me its work. Using this regularly builds a strong intuition for what minimal context is needed to get a good plan without Claude botching the implementation. In our work monorepo, we've started rolling out a custom planning tool built on the Claude Code SDK. It's similar to native plan mode but heavily prompted to align its outputs with our existing technical design format. It also enforces our internal best practices, from code structure to data privacy and security, out of the box. This lets our engineers "vibe plan" a new feature as if they were a senior architect (or at least that's the pitch). The Takeaway: Always use the built-in planning mode for complex changes to align on a plan before the agent starts working.
I agree with Simon Willison's take : Skills are (maybe) a bigger deal than MCP. If you've been following my posts, you'll know I've drifted away from MCP for most dev workflows, preferring to build simple CLIs instead (as I argued in "AI Can't Read Your Docs" ). My mental model for agent autonomy has evolved into three stages. Single Prompt: giving the agent all context in one massive prompt (brittle, doesn't scale). Tool Calling: the "classic" agent model, where we hand-craft tools and abstract away reality for the agent (better, but creates new abstractions and context bottlenecks). Scripting: we give the agent access to the raw environment (binaries, scripts, and docs) and it writes code on the fly to interact with them. With this model in mind, Agent Skills are the obvious next feature. They are the formal productization of the "Scripting" layer. If, like me, you've already been favoring CLIs over MCP, you've been implicitly getting the benefit of Skills all along. The SKILL.md file is just a more organized, shareable, and discoverable way to document these CLIs and scripts and expose them to the agent. The Takeaway: Skills are the right abstraction. They formalize the "scripting"-based agent model, which is more robust and flexible than the rigid, API-like model that MCP represents. Skills don't mean MCP is dead (see also "Everything Wrong with MCP" ). Previously, many built awful, context-heavy MCPs with dozens of tools that just mirrored a REST API. The "Scripting" model (now formalized by Skills) is better, but it needs a secure way to access the environment. This, to me, is the new, more focused role for MCP. Instead of a bloated API, an MCP should be a simple, secure gateway that provides a few powerful, high-level tools. In this model, MCP's job isn't to abstract reality for the agent; its job is to manage the auth, networking, and security boundaries and then get out of the way. It provides the entry point for the agent, which then uses its scripting and context to do the actual work. The only MCP I still use is for Playwright , which makes sense: it's a complex, stateful environment. All my stateless tools (like Jira, AWS, GitHub) have been migrated to simple CLIs. The Takeaway: Use MCPs that act as data gateways. Give the agent one or two high-level tools (like a raw data dump API) that it can then script against. Claude Code isn't just an interactive CLI; it's also a powerful SDK for building entirely new agents, for both coding and non-coding tasks. I've started using it as my default agent framework over tools like LangChain/CrewAI for most new hobby projects. I use it in three main ways. Massive Parallel Scripting: for large-scale refactors, bug fixes, or migrations, I don't use the interactive chat. I write simple bash scripts that call the CLI's non-interactive mode in parallel (see the sketch at the end of this post). This is far more scalable and controllable than trying to get the main agent to manage dozens of subagent tasks. Building Internal Chat Tools: the SDK is perfect for wrapping complex processes in a simple chat interface for non-technical users. Like an installer that, on error, falls back to the Claude Code SDK to just fix the problem for the user. Or an in-house " v0-at-home " tool that lets our design team vibe-code mock frontends in our in-house UI framework, ensuring their ideas are high-fidelity and the code is more directly usable in frontend production code. Rapid Agent Prototyping: this is my most common use. It's not just for coding.
If I have an idea for any agentic task (e.g., a "threat investigation agent" that uses custom CLIs or MCPs), I use the Claude Code SDK to quickly build and test the prototype before committing to a full, deployed scaffolding. The Takeaway: The Claude Code SDK is a powerful, general-purpose agent framework. Use it for batch-processing code, building internal tools, and rapidly prototyping new agents before you reach for more complex frameworks. The Claude Code GitHub Action (GHA) is probably one of my favorite and most slept-on features. It's a simple concept: just run Claude Code in a GHA. But this simplicity is what makes it so powerful. It's similar to Cursor's background agents or the Codex managed web UI but far more customizable. You control the entire container and environment, giving you more access to data and, crucially, much stronger sandboxing and audit controls than any other product provides. Plus, it supports all the advanced features like Hooks and MCP. We've used it to build custom "PR-from-anywhere" tooling. Users can trigger a PR from Slack, Jira, or even a CloudWatch alert, and the GHA will fix the bug or add the feature and return a fully tested PR 1 . Since the GHA logs are the full agent logs, we have an ops process to regularly review these logs at a company level for common mistakes, bash errors, or unaligned engineering practices. This creates a data-driven flywheel: Bugs -> Improved CLAUDE.md / CLIs -> Better Agent. The Takeaway: The GHA is the ultimate way to operationalize Claude Code. It turns it from a personal tool into a core, auditable, and self-improving part of your engineering system. Finally, I have a few specific configurations that I've found essential for both hobby and professional work. Proxy settings: these are great for debugging. I'll use them to inspect the raw traffic and see exactly what prompts Claude is sending. For background agents, they're also a powerful tool for fine-grained network sandboxing. Bash timeouts: I bump these. I like running long, complex commands, and the default timeouts are often too conservative. I'm honestly not sure this is still needed now that bash background tasks are a thing, but I keep it just in case. Enterprise billing: at work, we use our enterprise API keys (via apiKeyHelper ). It shifts us from a "per-seat" license to "usage-based" pricing, which is a much better model for how we work. It accounts for the massive variance in developer usage (we've seen 1:100x differences between engineers), and it lets engineers tinker with non-Claude-Code LLM scripts, all under our single enterprise account. Allowed commands: I'll occasionally self-audit the list of commands I've allowed Claude to auto-run. The Takeaway: Your settings file is a powerful place for advanced customization. That was a lot, but hopefully you find it useful. If you're not already using a CLI-based agent like Claude Code or Codex CLI, you probably should be. There are rarely good guides for these advanced features, so the only way to learn is to dive in. Thanks for reading Shrivu's Substack! Subscribe for free to receive new posts and support my work. To me, a fairly interesting philosophical question is how many reviewers a PR should get when it was generated directly from a customer request (no internal human prompter). We've settled on two human approvals for any AI-initiated PR for now, but it's kind of a weird paradigm shift (for me at least) when it's no longer a human making something for another human to review.
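As an illustration of the parallel-scripting pattern referenced above, a minimal sketch. It assumes the CLI's non-interactive print mode ( claude -p ); the prompt, file glob, and worker count are placeholders, not the author's actual script:

```bash
#!/usr/bin/env bash
# Fan one prompt out over many files with a fixed number of workers.
# `claude -p` runs a single non-interactive turn and prints the result.
set -euo pipefail

prompt='Migrate this file off the deprecated logging API, then summarize the edit.'

git ls-files '*.py' | xargs -P 4 -I{} \
  claude -p "${prompt} File: {}"
```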

0 views
マリウス 1 month ago

A Word on Omarchy

Pro tip: If you’ve arrived here via a link aggregator, feel free to skip ahead to the Summary for a conveniently digestible tl;dr that spares you all the tedious details, yet still provides enough ammunition to trash-talk this post in the comments of whatever platform you stumbled upon it. In recent months, there has been a noticeable shift away from the Windows desktop, as well as from macOS , to Linux, driven by various frustrations, such as the Windows 11 Recall feature. While there have historically been more than enough Linux distributions to choose from, for each skill level and amount of desired pain, a recent Arch -based configuration has seemingly made strides across the Linux landscape: Omarchy . This pre-configured Arch system is the brainchild of David Heinemeier Hansson , a Danish web developer and entrepreneur known as one of the co-founders of 37signals and for developing the Ruby on Rails framework. The name Omarchy appears to be a portmanteau of Arch , the Linux distribution that Hansson ’s configuration is based upon, and お任せ, which transliterates to omakase and means to leave something up to someone else (任せる, makaseru, to entrust ). When ordering omakase in a restaurant, you’re leaving it up to the chef to serve you whatever they think is best. Oma(kase) + (A)rch + y is supposedly where the name comes from. It’s important to note that, contrary to what Hansson says in the introduction video , Omarchy is not an actual Linux distribution . Instead, it’s an opinionated installation of Arch Linux that aims to make it easy to set up and run an Arch desktop, seemingly with as much TUI-hacker-esque aesthetic as possible. Omarchy comes bundled with Hyprland , a tiling window manager that focuses on customizability and graphic effects, but apparently not as much on code quality and safety . However, the sudden hype around Omarchy , which at this point has attracted attention and seemingly even funding from companies like Framework (Computer Inc.) ( attention ) and Cloudflare ( attention and seemingly funding ), made me want to take a closer look at the supposed cool kid on the block to understand what it was all about. Omarchy is a pre-configured installation of the Arch distribution that comes with a TUI installer on a 6.2GB ISO. It ships with a collection of shell scripts that use existing FOSS software (e.g. walker ) to implement individual features. The project is based on the work that the FOSS community, especially the Arch Linux maintainers, have done over the years, and ties together individual components to offer a supposedly ready-to-use desktop experience. Omarchy also adds some links to different websites, disguised as “Apps” , but more on that later. This, however, seems to be enough to spark an avalanche of attention and, more importantly, financial support for the project. Anyway, let’s give Omarchy an actual try, and see what chef Hansson recommends. The Omarchy installer is a simple text user interface that tries to replicate what Charm has pioneered with their TUI libraries: a smooth command-line interface that preserves the simplicity of the good old days , yet enhances the experience with playful colors, emojis, and animations for the younger, future generation of users. Unlike mature installers, Omarchy ’s installer script doesn’t allow for much customization, which is probably to be expected with an “Opinionated Arch/Hyprland Setup” . Info: Omarchy uses gum , a Charm tool, under the hood.
One of the first things that struck me as unexpected was the fact that the installer accepted an easy-to-guess word as my user password, a password that Omarchy will also use for the drive encryption, without any resistance. Most modern Linux distributions actively prevent users from setting easily guessable or brute-forceable passwords. Moreover, taking into account that the system relies heavily on sudo (instead of the more modern doas), and also considering that the default installation configures the maximum number of password retries to 10 (instead of the more cautious limit of three), it raises an important question: Does Omarchy care about security? Let's take a look at the Omarchy manual to find out:

Omarchy takes security extremely seriously. This is meant to be an operating system that you can use to do Real Work in the Real World. Where losing a laptop can't lead to a security emergency.

According to the manual, taking security extremely seriously means enabling full-disk encryption (but without rejecting simple keys), blocking all ports except for 22 (SSH, on a desktop) and 53317 (LocalSend), continuously staying on bleeding-edge updates (even though staying bleeding-edge has repeatedly proven to be an insufficient security measure in the past), and maintaining a Cloudflare-protected package mirror. That's seemingly all. Hm.

Proceeding with the installation, the TUI prompts for an email address, which makes the whole process feel a bit like the Windows setup routine. While one might assume Omarchy is simply trying to accommodate its new user base, the actual reason appears to be much simpler: the address ends up in the Git configuration. If, however, you expect Omarchy to set up GPG with proper defaults, configure SSH with equally secure defaults, and perhaps offer an option to create new GPG/SSH keys or import existing ones, in order to enable proper commit and push signing for Git, you will be left disappointed. Unfortunately, none of this is the case. The Git config doesn't enable commit or push signing, neither the GPG nor the SSH client configurations set secure defaults, and the user isn't offered a way to import existing keys or create new ones. Given that Hansson himself usually does not sign his commits, it seems that these aspects are not particularly high on the project's list of priorities.

The rest of the installer routine is fairly straightforward and offers little customization, so I won't bore you with the details, but you can check the screenshots below. After initially downloading the official ISO file, the first boot of the system greets you with a terminal window informing you that it needs to update a few packages. And by "a few" it means another 1.8GB. I'm still not entirely sure why the v3.0.2 ISO is a hefty 6.2GB, or why it requires downloading an additional 1.8GB after installation on a system with internet access. For comparison, the official Arch installer image is just 1.4GB in size.

While downloading the updates (which took over an hour for me), and with over 15GB of storage consumed on my hard drive, I set out to experience the full Omarchy goodness! After hovering over a few icons on the Waybar, I discovered the menu button on the very left. It's not a traditional menu, but rather a shortcut to the aforementioned walker launcher tool, which contains a few submenus. The menu reads: Apps, Learn, Trigger, Style, Setup, Install, Remove, Update, About, System. It feels like a random assortment of categories, settings, package manager subcommands, and actions.
From a UX perspective, this main menu doesn't make much sense to me. But I'm feeling lucky, so let's just go ahead and type "Browser"! Hm, nothing. "Firefox", maybe? Nope. "Chrome"? Nah. "Chromium"? No. Unfortunately, the search in the menu is not universal and requires you to first click into the Apps category. The Apps category seems to list all available GUI (and some TUI) applications. Let's take a look at the default apps that Omarchy comes with.

The bundled "apps" are: 1Password, Alacritty, Basecamp, Bluetooth, Calculator, ChatGPT, Chromium, Discord, Disk Usage, Docker, Document Viewer, Electron 37, Figma, Files, GitHub, Google Contacts, Google Messages, Google Photos, HEY, Image Viewer, Kdenlive, LibreOffice, LibreOffice Base, LibreOffice Calc, LibreOffice Draw, LibreOffice Impress, LibreOffice Math, LibreOffice Writer, Limine-snapper-restore, LocalSend, Media Player, Neovim, OBS Studio, Obsidian, OpenJDK Java 25 Console, OpenJDK Java 25 Shell, Pinta, Print Settings, Signal, Spotify, Typora, WhatsApp, X, Xournal++, YouTube, Zoom.

Aside from the fact that nearly a third of the apps are essentially just browser windows pointing to websites, which leaves me wondering where the 15GB of used storage went, the selection of apps is also… well, let's call it opinionated, for now at least.

Starting with the browser: Omarchy comes with Chromium by default, specifically version 141.0.7390.107 in my case, which, unlike for example ungoogled-chromium, has disabled support for Manifest V2 and thus doesn't include extensions like uBlock Origin or any other advanced add-ons. In fact, the browser is completely vanilla, with no decent configuration. The only extension it includes is the copy-url extension, which serves a rather obscure purpose: providing a non-intuitive way to copy the current page's URL to your clipboard using an even less intuitive keyboard shortcut while using any of the "Apps" that are essentially just browser windows without browser controls. Other than that, it's pretty much stock Chromium. It allows all third-party cookies, doesn't send "Do Not Track" requests, sends browsing data to Google Safe Browsing, but doesn't enforce HTTPS. It has JavaScript optimization enabled for all websites, which increases the attack surface, and it uses Google as the default search engine. There's not a single opinionated setting in the configuration of the default browser on Omarchy, let alone in the choice of browser itself. And the fact that the only extension installed and active by default is an obscure workaround for the lack of URL bars in "App" windows doesn't exactly make this first impression of what is likely one of the most important components for the typical Omarchy user very appealing.

Alright, let's have a look at what is probably the second most important app after the browser for many people in the target audience: Basecamp! Just kidding. Obviously, it's the terminal. Omarchy comes with Alacritty by default, which is a bit of an odd choice in 2025, especially for a desktop that seemingly prioritizes form over function, given the ultra-conservative approach the Alacritty developers take toward anything related to form and sometimes even function. I would have rather expected Kitty, WezTerm, or Ghostty. That said, Alacritty works and is fairly configurable. Unfortunately, like the browser and various other tools such as Git, there's little to no opinionated configuration happening, especially none that would enhance integration with the Omarchy ecosystem.
Omarchy seemingly highlights the availability of NeoVim by default, yet doesn't explicitly configure Alacritty's vi mode, leaving it at its factory defaults. In fact, aside from the keybinding for full-screen mode, which is a less-than-ideal shortcut for anyone with a keyboard smaller than 100% (unless specifically mapped), the Alacritty config doesn't define any other shortcuts to integrate the terminal more seamlessly into the supposedly opinionated workflow. Not even the desktop's key-repeat rate is configured to a reasonable value, as it takes about a second to kick in.

Fun fact: When you leave your computer idling on your desk, the screensaver you'll encounter isn't an actual hyprlock that locks your desktop and uses PAM authentication to prevent unauthorized access. Instead, it's a shell script that launches a full-screen Alacritty window to display a CPU-intensive ASCII animation. While Omarchy does use hyprlock, its timeout is set longer than that of the screensaver. Because you can't dismiss the screensaver with your mouse (only with your keyboard), it might give inexperienced users a false sense of security. This is yet another example of prioritizing gimmicky animations over actual functionality and, to some degree, security.

Like the browser and the terminal emulator, the default shell configuration is a pretty basic B….ash, and useful extensions like Starship are barely configured. For example, I cd'd into a boilerplate Python project directory, activated its venv, and expected Starship to display some useful information, like the virtual environment name or the Python version. However, none of these details appeared in my prompt. "Surely if I do the same in a Ruby on Rails project, Starship will show me some useful info!" I thought, and cd'd into a Rails boilerplate project. Nope. In fact… Omarchy doesn't come with Rails pre-installed.

I assume Hansson's target audience doesn't primarily consist of Rails developers, despite the unconditionally defined rails alias, but let's not get ahead of ourselves. It is nevertheless puzzling that Omarchy doesn't come with at least Ruby pre-installed. I find it a bit odd that the person who literally built the most successful Ruby framework on earth is pre-installing "Apps" like HEY, Spotify, and X, but not his own FOSS creation or even just the Ruby interpreter. If you want Rails, you have to navigate through the menu to "Install", then "Development", and finally select "Ruby on Rails" to make RoR available on your system. Not just Ruby, though. And even going the extra mile to do so still won't make Starship display any additional useful info inside a Rails project folder.

PS: The script that installs these development tools bypasses the system's default package manager and repository, opting instead to use mise to install interpreters and compilers. This is yet another example of security not being taken quite as seriously as it should be. At the very least, the script should inform the user that this is about to happen and offer the option to use the package manager instead, if the distributed version meets the user's needs. Fun fact: At the time of writing, mise installed Ruby 3.4.7. The latest package available through the package manager is – you guessed it – 3.4.7.

As mentioned earlier, Omarchy is built entirely out of Bash scripts, and there's nothing inherently wrong with that. When done correctly and kept at a sane limit, Bash scripts are powerful and relatively easy to maintain.
However, the scripts in Omarchy are unfortunately riddled with little oversights that can cause issues, and they are also used in places where a proper software implementation would have made more sense.

Take the theme scripts, for example. If you create a new theme whose directory name contains characters that trigger globbing or word splitting, and then cycle through themes until the tool hits your new theme, you can see one effect of these oversights. Nothing catastrophic happens, except that theme switching won't work anymore. If you wanted to annoy an unsuspecting Omarchy user, that's all it would take. While this is a tiny detail to complain about, it is an equally low-hanging fruit to write scripts in a way in which this can't happen.

Apart from the numerous places where globbing and word splitting can occur, there are other instances of code that could have been written a little more elegantly. Take one line, for example: to drop a prefix and a suffix from a value, you don't have to call an external tool and pipe its output into another one; you can simply use Bash's built-in regex matching. Similarly, in another line there's no need to test for a successful exit code with a dedicated check when you can simply make the call from within the condition. (A sketch of both idioms follows below.) And frankly, there is one line I simply cannot make sense of at all: What are you doing, Hansson? Are you alright?

Make no mistake: the remarks above are not the only issues with Hansson's scripts in Omarchy. While these specific examples are nitpicks, they paint a picture that only gets less colorful the more we look into the details. We can continue to gauge the quality of the scripts by looking beyond just syntax. Take, for example, one migration that runs five commands in sequence within a single if condition. While this might work as expected "on a sunny day", the first command could fail for various reasons. If it does, the subsequent commands may encounter issues that the script doesn't account for, and the outcome of the migration will be different from what the author anticipated. For experienced users, the impact in such a case may be minimal, but for others, it may present a more significant hurdle. Furthermore, the invoking process cannot detect that only one of the five commands failed. As a result, the entire migration might be marked as skipped, despite changes having been made to the system. We'll look into the migrations specifically in just a moment.

The real concern here, however, is the widespread absence of error handling, either through status-code checks for previously executed commands or via dependent execution (e.g., chaining commands with &&). In most scripts, there is no validation to ensure that actions had the desired effect and that the current state actually represents the desired outcome. Almost all sequentially executed commands depend upon one another, yet the author doesn't make sure that, if one command fails, the script won't just blindly run the next.

Note: Although the installer sets Bash's exit-on-error options, which would cause a script like the ones described above to fail as soon as the first command fails, the migrations are invoked by sourcing a script that, in turn, runs each migration through a helper function as a separate process. In that case, the shell options are not inherited by the actual migration, meaning it won't stop immediately when an error occurs. This behavior makes sense, as abruptly stopping the installation would leave the system in an undefined state.
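To make those nitpicks concrete, here is a small, hypothetical sketch (not Omarchy's actual code; all file names and values are invented) of the three patterns in question: Bash's built-in regex matching instead of external pipelines, testing a command directly in the condition instead of inspecting $? afterwards, and the way exit-on-error options are not inherited by a script executed as a child process:

```bash
#!/usr/bin/env bash

# 1) Strip a known prefix/suffix without spawning external processes.
file="theme-catppuccin.conf"
if [[ $file =~ ^theme-(.+)\.conf$ ]]; then
  name="${BASH_REMATCH[1]}"   # -> "catppuccin"
fi

# 2) Test the command itself instead of checking $? afterwards.
#    Clunky:   grep -q pattern file; if [ $? -eq 0 ]; then ...
if grep -q "pattern" ./some.conf; then
  echo "found"
fi

# 3) `set -e` in a parent shell is NOT inherited by a child process.
set -e
cat > /tmp/migration.sh <<'EOF'
false             # this command fails...
echo "still ran"  # ...but this line executes anyway
EOF
bash /tmp/migration.sh   # the child shell starts with default options
```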
But even if we ignored that and assumed that migrations would stop when the first command failed, it still wouldn't actually handle the error; it would merely stop the following commands from acting on an unexpected state. To understand the broader issue and its impact on security, we need to dive deeper into how the system works, and especially into migrations. This helps illustrate how the fragile nature of Omarchy could take a dangerous turn, especially considering the lack of tests, let alone any dedicated testing infrastructure.

Let's start by adding some context and examining how configurations are applied in Omarchy. Inspired by his work as a web developer, Hansson has attempted to bring concepts from his web projects into the scripts that shape his Linux setup. In Omarchy, configuration changes are handled through migration scripts, as we just saw, which are in principle similar to the database migrations you might recall from Rails projects. However, unlike SQL or the Ruby DSL used in Active Record Migrations, these Bash scripts do not merely contain a structured query language; they execute actual system commands during installation. More importantly: they are not idempotent by default! While the idea of migrations isn't inherently problematic, in this case it can (and did) introduce issues that went unnoticed by the Omarchy maintainers for extended periods, but more on that in a second.

The migration files in Omarchy are a collection of ambiguously named scripts, each containing a set of changes to the system. These changes aren't confined to specific configuration files or components; they can be entirely arbitrary, depending on what the migration is attempting to implement at the time it is written. To modify a configuration file, these migrations typically rely on in-place text substitutions. For instance, the first migration intended to change a setting from one value to another might rewrite the old string to the new one; the next migration then has to account for the previous change. Another common approach involves removing a specific line and appending the new settings at the end of the file. (A sketch of why this pattern is fragile follows below.) However, since multiple migrations are executed sequentially, often touching the same files and running the same commands, determining the final state of a configuration file can become a tedious process. There is no clear indication of which migration modifies which file, nor any specific keywords to grep for that would help identify the relevant migration(s) when searching through the code. Moreover, because migrations rely on fixed paths and vary in their commands, it's impossible to test them against mock files/folders to predict their outcome.

These scripts can do anything from sourcing other scripts to running arbitrary commands, with no restrictions on what they can or cannot do. There's no "framework" or API within which these scripts operate. To understand what I mean by that, let's take a quick look at a fairly widely used pile of scripts that is of similar importance to a system's functionality: OpenRC. While the init.d scripts in OpenRC are also just that, namely scripts, they follow a relatively well-defined API. Note: I'm not claiming that OpenRC's implementation is flawless or the ultimate solution, far from it. However, given the current state of the Omarchy project, it's fair to say that OpenRC is significantly better within its existing constraints. Omarchy, however, does not use any sort of API for that matter.
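To illustrate the idempotency problem, here is a hypothetical sketch (the file name, the setting, and the use of sed are all invented for the example) of a migration-style edit that breaks when run twice, next to a guarded version that converges to the same state no matter how often it runs:

```bash
#!/usr/bin/env bash
# Hypothetical config migration; /etc/example.conf and the
# "editor" setting are invented for this sketch.
CONF=/etc/example.conf

# Non-idempotent: appends a second "editor=" line on every run.
echo "editor=nvim" >> "$CONF"

# Idempotent: replace the setting if present, append it exactly once.
if grep -q '^editor=' "$CONF"; then
  sed -i 's/^editor=.*/editor=nvim/' "$CONF"
else
  echo "editor=nvim" >> "$CONF"
fi
```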
Instead, scripts can basically do whatever they want, in whichever way they deem adequate. Without such well-defined interfaces, it is hard to understand the effects that migrations will have, especially when changes to individual services are split across a number of different migration scripts. Here's a fun challenge: Try to figure out how your config folder looks after installation by only inspecting the migration files. To make matters worse, other scripts (outside the migration folder) may also modify, at runtime, configurations that were previously altered by migrations.

Note: To the disappointment of every NixOS user, unlike database migrations in Rails, the migrations in Omarchy don't support rollbacks and, judging by their current structure, are unlikely to do so moving forward. The only chance Omarchy users have, in case a migration should ever brick their existing system, is to make use of the available snapshots.

All of this (the lack of interfaces, the missing error handling and checks for desired outcomes, the overlapping modifications, etc.) creates a chaotic environment that is hard to oversee and maintain, which can severely compromise system integrity and, by extension, security. Want an example? On my fresh installation, I wanted to validate the following claim from the manual:

Firewall is enabled by default: All incoming traffic [is blocked] by default except for port 22 for ssh and port 53317 for LocalSend. We even lock down Docker access using the ufw-docker setup to prevent that your containers are accidentally exposed to the world.

What I discovered upon closer inspection, however, is that Omarchy's firewall doesn't actually run, despite its pre-configured ruleset. Yes, you read that right: everyone installing the v3.0.2 ISO (and presumably earlier versions) of Omarchy is left with a system that doesn't block any of the ports that individual software might open during runtime. Please bear in mind that, apart from the full-disk encryption, the firewall is the only security measure that Omarchy puts in place. And it's off by default. Only once I manually enabled and started the ufw service did the rules mentioned in the handbook become active (see the sketch below).

As highlighted in the original issue, it appears that, within the chaos that is the migration, preflight, and first-run scripts, no one ever realized that you need to explicitly tell systemd to enable a service for it to actually run. And because it's all made up of Bash scripts that can do whatever they want, you cannot easily test these things to notice that the expected state for a specific service was never reached. Unlike in Rails, where you can initialize your (test) database and run each migration manually if necessary to make sure that the schema reaches the desired state and that the database is seeded correctly, this agglomeration of Bash scripts is not structured data. Hence, applying the same principle to something as arbitrary as a Bash script is not easily possible, at least not without clearly defined structures and interfaces.

As a user who trusted Omarchy to secure their installation, I would be upset, to say the least. The system failed to keep users safe, and more importantly, nobody noticed for a long time. There was no hotfix ISO issued, nor even a heads-up to existing users alongside the implemented fix. While mistakes happen, simply brushing them under the rug feels like rather negligent behavior.
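For reference, bringing the firewall up on my installation came down to enabling and starting the service; roughly the following, assuming the stock ufw package and its systemd unit on Arch:

```bash
# Enable the ufw unit at boot and start it right away,
# then verify that the pre-configured ruleset is actually active.
sudo systemctl enable --now ufw.service
sudo ufw status verbose   # should list 22 (SSH) and 53317 (LocalSend)
```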
Looking into the future, the mess that is the Bash scripts certainly won't decrease in complexity, which makes it hard to believe that things like this won't happen again. Note: The firewall fix was listed in v2.1.1. However, on my installation of v3.0.2 the firewall would still not come up automatically. I double-checked this by running the installation of v3.0.2 twice, and both times the firewall would not autostart after the second reboot. While writing this post, v3.1.0 (update: v3.1.1) was released and I checked the issue there as well. v3.1.0 appears to have finally fixed the firewall issue. That said, it shows how much of a mess the whole system is when things that were identified and supposedly fixed multiple versions ago still don't work in newer releases weeks later. Tl;dr: v3.1.0 appears to be the first release to actually fix the firewall issue, even though it was identified and presumably fixed in v2.1.1, according to the changelog.

With the firewall active, it becomes apparent that Omarchy's configuration does indeed leave port 22 (SSH) open, even though the SSH daemon is not running by default. While I couldn't find a clear explanation for why this port is left open on a desktop system without an active SSH server, my assumption is that it's intended to allow users to remotely access their workstation should they ever need to. It's important to note that the SSH daemon's configuration file in Omarchy, like many other system files, remains unchanged. Users might reasonably assume that, since Omarchy intentionally leaves the SSH port open, it must have also configured the SSH server with sensible defaults. Unfortunately, this is not the case. In a typical Arch installation, users would eventually come across the "Protection" section of the OpenSSH wiki page, where they would learn about the crucial settings that should be adjusted for security reasons. However, when using a system like Omarchy, which is marketed as an opinionated setup that takes security seriously, users might expect these considerations to be handled for them, making it all the more troubling that no sensible configuration is in place despite the deliberate decision to leave the SSH port open for future use.

Hansson seemingly struggles to get even the basics right. The fact that there's so little oversight, that users are allowed to set weak passwords for both their account and drive encryption, and that the only other security measure put in place, the firewall, simply hasn't been working, does not speak in favor of Omarchy. Info: ufw is an abstraction layer that simplifies managing the powerful iptables/nftables firewall; the name stands for "uncomplicated firewall".

Going into this review I wasn't expecting a hardened Linux installation with SELinux, intrusion detection mechanisms, and all these things. But Hansson repeatedly addresses users of Windows and macOS (operating systems with working firewalls and notably more security measures in place) who are frustrated with their OS as a target audience. At this point, however, Omarchy is a significantly worse option for those users. Not only does Omarchy give a hard pass on Linux Security Modules, linux-hardened, musl, hardened_malloc, or tools like OpenSnitch, and fails to properly address security-related topics like SSH, GPG, or maybe even AGE and AGE/Yubikey; it in fact weakens the system's security with changes like the increased sudo and login password retries and the decreased faillock timeouts.
Omarchy appears to be undoing security measures that were put in place by the software and Arch developers, while the basis it uses for building the system does not appear reliable enough to protect its users from future mishaps.

Then there is the big picture of Omarchy that Hansson tries to curate: that of a TUI-centered, hacker-esque desktop that promises productivity and so on. He even goes as far as calling it "a pro system". However, as we clearly see from the implementation, the configuration, and the project's approach to security, this is unlike anything you would expect from a pro system. The entire image of a TUI-centered productivity environment is further contradicted in many different places, primarily by the lack of opinions and configuration. If the focus is supposed to be on "pro" usage, and especially the command line, then the configuration doesn't live up to that sales pitch, and there are many aspects that either don't make sense or aren't truly opinionated, meaning they're no different from a standard Arch Linux installation. In fact, I would go as far as to say that Omarchy is barely a ready-to-use system at all out of the box and requires a lot of in-depth configuration of the underlying Arch distribution to become actually useful. Let's look at only a few details. There are some fairly basic things you'll miss on the "lightweight" 15GB installation of Omarchy.

With the attention Omarchy is receiving, particularly from Framework (Computer Inc.), it is surprising that there is no option to install the system on RAID1 hardware. I would argue that RAID1 is a fairly common use case, especially with Framework 16" laptops, which support a secondary storage device. Considering that Omarchy is positioning itself to compete against e.g. macOS with Time Machine, yet does not include an automated off-drive backup solution for user data by default (which, by the way, is just another notable shortcoming we could discuss), and given that configuring a RAID1 root with encryption is notoriously tedious on Linux, even for advanced users, the absence of this option is especially disappointing for the intended audience. Even more so when neither the installer nor the post-installation process provides any means to utilize the additional storage device, leaving inexperienced users with no obvious way to put it to use.

Omarchy does not come with a dedicated swap partition, leaving me even more puzzled about its use of 15GB of disk space. I won't talk through why having a dedicated swap partition, ideally encrypted using the same mechanisms already in place, is a good idea; this topic has been thoroughly discussed and written about countless times. However, if you, like seemingly the Omarchy author, are unfamiliar with the benefits of having swap on Linux, I highly recommend reading this insightful write-up to get a better understanding. What I will note, however, is that the current configuration does not appear to support hibernation through the use of a dynamic swap file. This leads me to believe that hibernation may not function on Omarchy. Given the ongoing battery drain issues with especially Framework (Computer Inc.) laptops while in suspend mode, it's clear that hibernation is an essential feature for many Linux laptop users.
Additionally, it's hard to believe that Hansson, a former Apple evangelist, wouldn't be accustomed to the simple act of closing the lid on his laptop and expecting it to enter a light sleep mode, eventually transitioning into deep sleep to preserve battery life. If he had ever used Omarchy day-to-day on a laptop in the same way most people use their MacBooks, he would almost certainly have noticed the absence of these features. This further reinforces the impression that Omarchy is a project designed to appear robust at first glance, but reveals a surprisingly hollow foundation upon closer inspection.

Let's keep our focus on laptop use. We've seen Hansson showcasing his Framework (Computer Inc.) laptop on camera, so it's reasonable to assume he's using Omarchy on a laptop. It's also safe to say that many users who might genuinely want to try Omarchy will likely do so on a laptop as well. That said, as we've established before, closing the laptop lid doesn't seem to trigger hibernate mode in Omarchy. But if you close the lid and slip the laptop into your backpack, surely it would activate some power-saving measures, right? At the very least, it should blank the screen, switch the CPU governor to powersave, or perhaps even initiate suspend to RAM? Well… Of course, I can't test these scenarios firsthand, as I'm evaluating Omarchy within a securely confined virtual machine, where any unintended consequences are contained. Still, based on the system's configuration, or more accurately the lack thereof, it seems unlikely that an Omarchy laptop will behave as expected. The system might switch power profiles via power-profiles-daemon when not plugged in, yet its functionality is not comparable to that of a properly configured power-management tool. It seems improbable that it will enter suspend-to-RAM or hibernate mode, and it's doubtful any other power-saving measures (like temporarily halting non-essential background processes) will be employed to conserve battery life.

Although the configuration comes with an "app" for mail, namely HEY, that platform does not support standard mail protocols. I don't think it's a hot take to say that probably 99% of Omarchy's potential users will need to work with an email system that does support IMAP and SMTP, however. Yet the base system offers zero tools for that. I'm not even asking for anything "fancy" like a full terminal mail client; Omarchy unfortunately doesn't even ship the most basic tools, like the mail command, out of the box. Whether you want to send email through your provider, get a simple summary for a scheduled Cron job delivered to your local mailbox, or just debug some mail-related issue, the mail command is relatively essential, even on a desktop system, but it is nowhere to be found on Omarchy.

Speaking of which: Cron jobs? Not a thing on Omarchy. Want to automate backing up some files to remote storage? Get ready to dive into the wonderful world of systemd timers, where you'll spend hours figuring out where to create the necessary files, what they need to contain, and how to activate them (a minimal sketch follows below). Omarchy could've easily included a Cron daemon for the sake of convenience. But I guess this is a pro system, and if the user needs periodic jobs, they will have to figure out systemd timers. Omarchy is, after all, systemd-based…

…and that's why it would make perfect sense for it to use rootless Podman containers instead of Docker. That way, users could take advantage of quadlets and all the glorious systemd integration. Unfortunately, Omarchy doesn't actually use Podman. It uses plain ol' Docker instead.
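For readers who do end up down that road, here is a minimal, hypothetical sketch of the two systemd user units a periodic backup would need (the unit names, paths, and the rsync destination are all invented for the example):

```bash
# ~/.config/systemd/user/backup.service (hypothetical)
cat > ~/.config/systemd/user/backup.service <<'EOF'
[Unit]
Description=Back up documents to remote storage

[Service]
Type=oneshot
ExecStart=/usr/bin/rsync -a %h/Documents/ backup-host:documents/
EOF

# ~/.config/systemd/user/backup.timer (hypothetical)
cat > ~/.config/systemd/user/backup.timer <<'EOF'
[Unit]
Description=Run the documents backup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now backup.timer
systemctl --user list-timers   # verify the schedule
```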
Like most things in Omarchy, power monitoring and alerting are handled through a script, which is executed every 30 seconds via a systemd timer. That's your crash course on timers right there, Omarchy users! This script queries upower and then parses out the battery percentage and state. It's almost comical how hacky the implementation is. Given that the system is already using UPower, which transmits power data via D-Bus, there's a much cleaner and more efficient way to handle this: You could simply use a piece of software that connects to D-Bus and continuously monitors the power info UPower sends. Since it's already dealing with D-Bus, it can also send a desktop notification directly to whatever notification service you're using (like the notification daemon in Omarchy's case). No need for polling, parsing, or a periodic Bash script triggered by a timer (see the sketch below). "But where could I possibly find such a piece of software?", you might ask. Worry not, Hr. Hansson, I have just the thing you need! That said, I can understand that you, Hr. Hansson, might be somewhat reluctant to place your trust in software created by someone who is actively delving into the intricacies of your project, rather than merely offering a superficial YouTube interview and casually navigating the Hyprland UI for half an hour. Of course, Hr. Hansson, you could have always taken the initiative to develop a more robust solution yourself, in a proper, lower-level language, and neatly integrated it into your Omarchy repository. But we will explore why this likely hasn't been a priority for you, Hr. Hansson, in just a moment.

While the author's previous attempt at a developer setup still came with Zellij, this time his opinions have seemingly changed: Omarchy doesn't include Zellij, or Tmux, or even screen anymore. And nope, picocom isn't there either, so good luck reading that Arduino output from a serial port. That moment when you realize that you've spent hours figuring out systemd timers, only to find out that you can't actually back up those files to remote storage because the usual transfer tools aren't installed either. At least the most basic one is there, right? :-) Unfortunately not, though a couple of other tools do ship by default.

I could go on and on, and scavenge through the rest of the unconfigured system and the scripts, like, for example, the one where Omarchy once again prefers curl-piping random scripts from the internet (or from anyone man-in-the-middle-ing the connection) rather than using the system package manager to install Tailscale. But, for the sake of both your sanity and mine, I'll stop here.

As we've seen, Omarchy is more unconfigured than it is opinionated. Can you simply install all the missing bits and pieces and configure them yourself? Sure! But then what is the point of this supposed "perfect developer setup" or "pro system" to begin with? In terms of the "opinionated" buzzword, most actual opinions I've come across so far are mainly about colors, themes, and security measures. I won't dare to judge the former two, but as for the latter, well, unfortunately they're the wrong opinions. In terms of implementation: Omarchy is just scripts, scripts, and more scripts, with no proper structure or (CI) tests. BTW: A quick shout-out to your favorite tech influencer, who probably has at least one video reviewing the Omarchy project without mentioning anything along these lines.
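As a rough illustration of the event-driven alternative: a single long-running process can subscribe to UPower's change events instead of waking up every 30 seconds. A hypothetical sketch (the 20% threshold and the use of notify-send are my own choices, not Omarchy's):

```bash
#!/usr/bin/env bash
# Listen to UPower's D-Bus-backed change events (no polling loop)
# and raise a desktop notification when the battery runs low.
upower --monitor-detail | while read -r line; do
  case $line in
    *percentage:*)
      pct=${line##*:}          # e.g. "          42%"
      pct=${pct//[!0-9.]/}     # -> "42"
      if (( ${pct%.*} < 20 )); then
        notify-send -u critical "Battery low" "${pct}% remaining"
      fi
      ;;
  esac
done
```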
It is unfortunate that these influential people barely scratch the surface of a topic like this, and it is even more saddening that recording a 30-minute video of someone clicking around a UI seemingly counts as a legitimate "review" these days. The primary focus for many of these people is seemingly on pumping out content and generating hype for views and attention rather than providing a thoughtful, thorough analysis. (Alright, we're almost there. Stick with me, we're in the home stretch.)

The Omarchy manual: The ultimate repository of Omarchy wisdom, all packed into 33 pages, clocking in at a little over 10,000 words. For context, this post on Omarchy alone is almost 10,000 words long. As is the case with the rest of the system, the documentation also adheres to Hansson's form-over-function approach. I've mentioned this before, but it bears repeating: Omarchy doesn't offer any built-in help output for its scripts, let alone auto-completion, nor does it come with traditional man pages. The documentation is tucked away in yet another SaaS product from Hansson's company (Writebook), and its focus is predominantly on themes, more themes, creating your own themes, and of course, the ever-evolving hotkeys. Beyond that, the manual mostly covers how to locate configuration files for individual UI components and offers guidance on configuring Hyprland for a range of what feels like outrageously expensive peripherals. For the truly informative content, look no further than the shell function guide, with gems such as the entry that reads "Format an entire disk with a single ext4 partition. Be careful!" Wow, thanks, Professor Oak, I will be! :-)

On a more serious note, though, the documentation leaves much to be desired, as evidenced by the user questions over on the GitHub discussions page. Take this question, which unintentionally sums up the Omarchy experience for probably many inexperienced users:

I installed this from github without knowing what I was getting into (the page is very minimal for a project of this size, and I forgot there was a link in the footnotes). Please tell me there's a way to remove Omarchy without wiping my entire computer. I lost my flashdrive, and don't have a way to back up all my important files anymore.

While this may seem comical on the surface, it's a sad testament to how Omarchy appears to have a knack for luring in unsuspecting users with flashy visuals and so-called "reviews" on YouTube, only to leave them stranded without adequate documentation. The only recourse? Relying on the solid Arch docs, which is an abrupt plunge into the deep end, given that Arch assumes you're at least familiar with its very basics and that you know how you set up your own system. Maybe GitHub isn't the most representative forum for the project's support; I haven't tried Discord, for example. But no matter where the community is, users should be able to fend for themselves with proper documentation, turning to others only as a last resort.

It's difficult to compile a list of things that could have made Omarchy a reasonable setup for people to consider, mainly because, in my opinion, the core of the setup – scripts doing things they shouldn't or that should have been handled by other means (e.g., the package manager) – is fundamentally flawed. That said, I do think it's worth mentioning a few improvements that, if implemented, could have made Omarchy a less bad option. Configuration files should not be altered through loose migration scripts.
Instead, updated configuration files should be provided directly (ideally via packages, see below) and applied as patches using a mechanism similar to etc-update or dpkg's conffile handling. This approach ensures clarity, reduces confusion, preserves user modifications, and aligns with established best practices.

Improve on the user experience where necessary, and maybe even contribute improvements back. Use proper software implementations where appropriate. Want a fancy screensaver? Extend hyprlock instead of awkwardly repurposing a fullscreen terminal window to mimic one. Need to display power status notifications without relying on GNOME or KDE components? Develop a lightweight solution that integrates cleanly with the desktop environment, or extend the existing Waybar battery widget to send notifications. Don't like existing Linux "App Store" options? Build your own, rather than diverting a launcher from its intended use only to run Bash scripts that install packages from third-party sources on a system that has a perfectly good package manager in place.

Arguably the most crucial improvement: Package the required software and install it via the system's package manager. Avoid relying on brittle scripts, third-party tools like mise, or worse, piping scripts from the internet directly into a shell. I understand that the author is coming from an operating system where it's sort of fine to curl-pipe installers and use version managers to juggle individual Ruby versions. However, we have to take into consideration that macOS specifically has a significantly more advanced security architecture in place than (unfortunately) most out-of-the-box Linux installations have, let alone Omarchy. On Hansson's setup the approach is neither sensible nor advisable, especially given that it's ultimately a system built around a proper package manager. If you want multiple versions of Ruby, package them and use slotting (or its equivalent on the distribution you're using, e.g. installation to version-specific directories on Arch); a sketch of the idea follows below.

Much of what the migrations and other scripts attempt to do could, and should, have been achieved through well-maintained packages and the proven mechanisms of a package manager. Whether it's Gentoo, NixOS, or Ubuntu, each distribution operates in its own unique way, offering users a distinct set of tools and defaults. Yet they all share one common trait: A set of strong, well-defined opinions that shape the system. Omarchy, in contrast, feels like little more than a glorified collection of Hyprland configurations atop an unopinionated, barebones foundation. If you're going to have opinions, don't limit them to nice colors and cute little wallpapers. Form opinions on the tools that truly matter, on how those tools should be configured, and on the more intricate, challenging aspects of the system, not just the surface-level, easy choices. Have opinions on the really sticky and complicated stuff, like power-saving modes, redundant storage, critical system functionality, and security. Above all, cultivate reasonable opinions, ones that others can get behind, and build a system that reflects them.

Comprehensive documentation is essential to help users understand how the system works. Currently, there's no clear explanation of the myriad Bash scripts, nor is there any user-facing guidance on how global system updates affect individual configuration files.
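To make the packaging point concrete: on Arch, a version-pinned Ruby could be shipped as a small PKGBUILD (which is itself just Bash) that installs into a version-specific prefix, letting pacman track, verify, and cleanly remove it. A hypothetical, abbreviated sketch (checksum and build options omitted for brevity):

```bash
# PKGBUILD: hypothetical sketch of a slotted Ruby package
pkgname=ruby3.4
pkgver=3.4.7
pkgrel=1
pkgdesc="Ruby $pkgver installed to a version-specific prefix"
arch=('x86_64')
url="https://www.ruby-lang.org/"
license=('BSD-2-Clause')
source=("https://cache.ruby-lang.org/pub/ruby/3.4/ruby-$pkgver.tar.xz")
sha256sums=('SKIP')  # pin the real checksum in practice

build() {
  cd "ruby-$pkgver"
  ./configure --prefix="/opt/ruby/$pkgver" --disable-install-doc
  make
}

package() {
  cd "ruby-$pkgver"
  make DESTDIR="$pkgdir" install
}
```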
(finally…) Omarchy feels like a project created by a Linux newcomer, utterly captivated by all the cool things that Linux can do, but lacking the architectural knowledge to get the basics right and the experience to give each tool a thoughtful review. Instead of carefully selecting software and ensuring that everything works as promised, the approach seems to be more about throwing everything that somehow looks cool into a pile. There's no attention to sensible defaults, no real quality control, and certainly no verification that the setup won't end up causing harm or, at the very least, frustration for the user. The primary focus seems to be on creating a visually appealing but otherwise hollow product.

Moreover, the entire Omarchy ecosystem is held together by often poorly written Bash scripts that lack any structure, let alone properly defined interfaces. Software packages are installed via curl-piped scripts or similar mechanisms rather than provided as properly packaged solutions via a package manager. Hansson is quick to label Omarchy a Linux distribution, yet he seems reluctant to engage with the foundational work that defines a true distribution: the development and proper packaging ("distribution") of software. Whenever Hansson needs a piece of software (or a software version) that is unavailable in the Arch package repositories, he bypasses the proper process of packaging it for the system. Instead, he resorts to running arbitrary scripts or tools that download the required software from third-party sources, rather than offering the desired versions through a more standardized package repository.

Hansson also appears to avoid, at all costs, using lower-level programming languages to implement features in a more robust and maintainable manner, often opting instead for makeshift solutions, such as executing "hacky" Bash scripts through timers. A closer look at his GitHub profile and Basecamp's repositories reveals that Hansson has seemingly worked exclusively with Ruby and JavaScript, with most contributions to more complex projects coming from other developers. This observation is not meant to diminish the author's profession and accomplishments as a web developer, but it highlights a lack of experience in areas such as systems programming, which are crucial for the type of work required to build and maintain a proper Linux distribution.

Speaking of packages, the system gobbles up 15GB of storage on a basic install, yet fails to deliver truly useful or high-quality software. It includes a hodgepodge of packages, like OpenJDK and websites of paid services in "App" disguise, but lacks any real optimization for specific use cases. Despite Omarchy claiming to be opinionated, most of the included software is left at its default settings, straight from the developers. Given Hansson's famously strong opinions on everything, it makes me wonder if the Omarchy author simply hasn't yet gained the experience necessary to develop clear, informed stances on individual configurations. Moreover, his prioritization of his paid products like Basecamp and HEY over his own free software like Rails leaves a distinctly bitter aftertaste when considering Omarchy. What's even more baffling is that seemingly no one at Framework (Computer Inc.) or Cloudflare appears to have properly vetted the project they're directing attention (and sometimes financial support) to.
I find it hard to believe that knowledgeable people at either company have looked at Omarchy and thought, "Out of all the Linux distributions out there, this barely configured stack of poorly written Bash scripts on top of Arch is clearly the best choice for us to support!" In fact, I would go as far as to call it a slap in the face of each and every proper distro maintainer and FOSS developer.

Furthermore, I fail to see the supposed gap Omarchy is trying to fill. A fresh installation of Arch Linux, or any of its established derivatives like Manjaro, is by no means more complicated or time-consuming than Omarchy. In fact, it is Omarchy that complicates things further down the line, by including a number of unnecessary components and workarounds, especially when it comes to its chosen desktop environment. The moment an inexperienced user wants or needs to change anything, they'll be confronted with a jumbled mess that's difficult to understand and even harder to manage. If you want Arch but are too lazy to read through its fantastic Wiki, look at Manjaro; it'll take care of you. If that's still not to your liking, maybe explore something completely different. On the other hand, if you're just looking to tweak your existing desktop, check out other people's dotfiles and dive into the unixporn communities for inspiration. As boring as Fedora Workstation or Ubuntu Desktop might sound, these are solid choices for anyone who doesn't want to waste time endlessly configuring their OS and, more importantly, wants something that works right out of the box and actually keeps them safe. Fedora Workstation comes with SELinux enabled in "enforcing" mode by default, and Ubuntu Desktop utilizes AppArmor out of the box. Note: Yes, I hear you loud and clear, SuSE fans. The moment your favorite distro gets its things together with regard to the AppArmor-to-SELinux transition and actually enables SELinux in enforcing mode across all its different products and versions, I will include it here as well.

Omarchy is essentially an installation routine for someone else's dotfiles slapped on top of an otherwise barebones Linux desktop. Although you could simply run its installation scripts on your existing, fully configured Arch system, that doesn't seem to make much sense, and it's definitely not the author's primary objective. If this were just Hansson's personal laptop setup, nobody, including myself, would care about the oversights or eccentricities, but it is not. In fact, this project is clearly marketed to the broader, less experienced user base, with Hansson repeatedly misrepresenting Omarchy as being "for developers or anyone interested in a pro system". I emphasize marketed here, because Hansson is using his reach and influence in every possible way to advertise and seemingly monetize Omarchy; apart from the corporate financial support, the project even has its own merch that people can spend money on. Given that numerous YouTubers have been heavily promoting the project over the past few weeks, often in the same breath as Framework (Computer Inc.), it wouldn't be surprising to see the company soon offering it as a pre-installation option on their hardware. If you're serious about Linux, you're unlikely to fall for the Omarchy sales pitch.
However, if you're an inexperienced user who's heard about Omarchy from a tech influencer raving about it, I strongly recommend starting your Linux journey elsewhere, with a distribution that actually prioritizes your security and system integrity, and is built and maintained by people who live and breathe systems, and especially Linux. Alright, that's it. A few parting questions:

- Why don't any of the Bash scripts and functions provide a help flag, or maybe even autocompletions? Why are there no Omarchy-related man pages?
- Why does the system come with GNOME Files, which requires several gvfs processes running in the background, yet lack basic command-line file managers?
- Why would you define a rails alias unconditionally, but not install Rails by default?
- Why bother shipping modern command-line tools but fail to provide the aliases that would make use of them by default? Why wouldn't you set up the O.G. aliases in your defaults (a sketch follows below)?
- Why ship the GNOME Calculator but not include a single command-line calculator, forcing users to fall back to the shell's arithmetic basics?
- Why ship the full suite of LibreOffice, but not a single comparable terminal tool?
- Why define a disk-formatting function that creates a single unencrypted ext4 partition, with no option to enable encryption, when the rest of the system is built on encrypted storage? And if it's intended for inexperienced users, primarily for things like USB sticks, why not make it exFAT instead of ext4 so the drive works across most operating systems? Why not define actually useful functions instead?
- Why doesn't your Bash configuration include history- and command-flag-based auto-suggestions? Or a terminal-independent vi mode? Or at least more consistent Emacs-style shortcuts?
- Why don't you include some quality-of-life tools or other command-line community favorites? If you had to squeeze in ChatGPT, why not have Crush available by default?
- Why does the base install with a single running Alacritty window occupy over 2.2GB of RAM right after booting? For comparison: my Gentoo system with a single instance of Ghostty ends up at around half of that.
- Why set up NeoVim but not define vim as an alias for nvim, or even create a symlink? And speaking of NeoVim, why does the supposedly opinionated config make NeoVim feel slower than VSCode?
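For what it's worth, the kind of quality-of-life defaults hinted at above fits in a handful of bashrc lines; a hypothetical sketch (all choices here are mine, not anything Omarchy ships):

```bash
# Hypothetical quality-of-life defaults for an opinionated ~/.bashrc
alias ll='ls -lAh'   # the O.G. long-listing alias
alias vim='nvim'     # NeoVim ships by default, so point vim at it
set -o vi            # shell-level (terminal-independent) vi mode
```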

Evan Hahn 1 month ago

Scripts I wrote that I use all the time

In my decade-plus of maintaining my dotfiles, I've written a lot of little shell scripts. Here's a big list of my personal favorites.

Two are simple wrappers around system clipboard managers, like pbcopy and pbpaste on macOS and their equivalents on Linux. I use these all the time. Another prints the current state of your clipboard to stdout, and then whenever the clipboard changes, it prints the new version. I use this once a week or so. One copies the current directory to the clipboard. I often use this when I'm in a directory and I want to use that directory in another terminal tab; I copy it in one tab and cd to it in another. I use this once a day or so.

One makes a directory and cds inside (a sketch of the idea follows below). I use this all the time; almost every time I make a directory, I want to go in there. Another changes to a temporary directory. I use this all the time to hop into a sandbox directory, and it saves me from having to manually clean up my work.

A trash script moves files and directories to the trash, and supports macOS and Linux. I use this every day; I definitely run it more than rm, and it saves me from accidentally deleting files.

One makes it quick to create shell scripts: it creates the file, makes it executable, adds some nice Bash prefixes, and opens it with my editor (Vim in my case). I use this every few days. Many of the scripts in this post were made with this helper!

One starts a static file server in the current directory. It's basically Python's http.server, but it handles cases where Python isn't installed, falling back to other programs. I use this a few times a week. Probably less useful if you're not a web developer.

One uses yt-dlp to download songs, often from YouTube or SoundCloud, in the highest available quality. I use this a few times a week… typically to grab video game soundtracks… Another similarly uses yt-dlp to download something for a podcast player. There are a lot of videos that I'd rather listen to like a podcast. I use this a few times a month. One downloads the English subtitles for a video. (There's some fanciness to look for "official" subtitles, falling back to auto-generated subtitles.) Sometimes I read the subtitles manually, sometimes I just want them as a backup of a video I don't want to save on my computer. I use this every few days.

A trio of scripts is useful for controlling my system's wifi. The one I use most often restarts the wifi when I'm having network trouble; I use it about once a month.

One parses a URL into its parts. I use this about once a month to pull data out of a URL, often because I don't want to click a nasty tracking link.

One prints line 10 from stdin; for example, you can pipe a file through it to get its tenth line. This feels like one of those things that should be built in, like head and tail. I use this about once a month.

One opens a temporary Vim buffer. I use this about once a day for quick text manipulation tasks, or to take a little throwaway note.

One converts "smart quotes" to "straight quotes" (sometimes called "dumb quotes"). I don't care much about these in general, but they sometimes weasel their way into code I'm working on. It can also make the file size smaller, which is occasionally useful. I use this at least once a week.

One adds a quote prefix before every line. I use it in Vim a lot; I select a region and then run it to quote the selection. I use this about once a week. Another is a tiny one-liner I should probably replace with the built-in it shadows. One takes JSON at stdin and pretty-prints it to stdout; I use it a few times a year. Two convert strings to upper and lowercase; I use these about once a week.
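As promised above, here's a minimal sketch of the make-a-directory-and-enter-it idea (the name mkcd is hypothetical; it has to be a shell function rather than a standalone script, because a child process can't change the parent shell's working directory):

```bash
# Hypothetical "mkcd": create a directory (and any parents) and enter it.
# Define it in ~/.bashrc so the cd affects the current shell.
mkcd() {
  mkdir -p -- "$1" && cd -- "$1"
}
```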
One spells a string out phonetically, character by character. I use this most often when talking to customer service and need to read out a long alphanumeric string, which has only happened a couple of times in my whole life. But it's sometimes useful!

One does a quick lookup of a Unicode string. I don't use this one that often… probably about once a month.

A handful just cat canned text: a quick "not interested" response to job recruiters, a "Lorem ipsum" block, and a few others. I probably use one or two of these a week.

Inspired by Ruby's built-in REPL, I've made launchers that start a Clojure REPL, a Deno REPL (or a Node REPL when Deno is missing), a PHP REPL, a Python REPL, and a SQLite shell.

One prints the current date in ISO format. I use this all the time because I like to prefix files with the current date. One starts a timer for 10 minutes, then (1) plays an audible ring sound and (2) sends an OS notification (see below). I often start a 5-minute timer in the background instead. I use this almost every day as a useful way to keep on track of time. One prints the current time and date in a couple of formats at once; I probably use it once a week.

One extracts text from an image and prints it to stdout. It only works on macOS, unfortunately, but I want to fix that. (I wrote a post about this script.)

One (an alias, not a shell script) makes a happy sound if the previous command succeeded and a sad sound otherwise. I append it to things like test runs, which will tell me, audibly, whether the tests succeed. It's also helpful for long-running commands, because you get a little alert when they're done. I use this all the time. Another basically just plays a ring sound; it's used by the timer and sound-status helpers above.

One plays audio from a file; I use this all the time. One shows a picture; I use it a few times a week to look at photos. One is a little wrapper around some of my favorite internet radio stations; I use it a few times a month.

One reads from stdin, removes all Markdown formatting, and pipes it to a text-to-speech system (say on macOS, an equivalent on Linux). I like using text-to-speech when I can't proofread out loud. I use this a few times a month.

One is an ffmpeg wrapper that compresses a video a bit. I use this about once a month. One removes EXIF data from JPEGs. I don't use this much, in part because it doesn't remove EXIF data from other file formats like PNGs… but I keep it around because I hope to expand it one day. One I almost never use, but you can use it to watch videos in the terminal. It's cursed and I love it, even if I never use it.

One is my answer to xargs and find's -exec, which I find hard to use. For example, it can run a command on every file in a directory. I use this infrequently, but I always mess up xargs, so this is a nice alternative. One is like ps but much easier (for me) to read: just the PID (highlighted in purple) and the command.

One is a wrapper around kill that starts with a gentle signal, waits a little, escalates step by step, and only sends SIGKILL at the very end. If I want a program to stop, I want to ask it nicely before getting more aggressive. I use this a few times a month. One waits for a PID to exit before continuing, and also keeps the system from going to sleep while waiting; I use it about once a month. One runs a command really, truly in the background; you'll never hear from that program again. It's useful when you want to start a daemon or long-running process you genuinely don't care about. I use this about once a day.

One prints $PATH with newlines separating entries, which makes it much easier to read. I use this pretty rarely, mostly just when I'm debugging a PATH issue, which is unusual, but I'm glad I have it when I do.

Finally, a pair of retry helpers: one runs a command until it succeeds, the other until it fails (a sketch follows below).
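Here's a minimal sketch of the run-until-it-succeeds idea (the name retry and the two-second pause are hypothetical choices of mine):

```bash
#!/usr/bin/env bash
# Hypothetical "retry": run the given command until it succeeds.
# Usage: retry curl -fsSLO https://example.com/big-file.tar.gz
until "$@"; do
  echo "failed; retrying in 2s..." >&2
  sleep 2
done
```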
runs until it succeeds. runs until it fails. I don’t use this much, but it’s useful for various things. will keep trying to download something. will stop once my tests start failing. (Both are sketched at the end of this post.)
is my emoji lookup helper. For example, prints the following:
prints all HTTP statuses. prints . As a web developer, I use this a few times a month, instead of looking it up online.
just prints the English alphabet in upper and lowercase. I use this surprisingly often (probably about once a month). It literally just prints this:
changes my whole system to dark mode. changes it to light mode. It doesn’t just change the OS theme—it also changes my Vim, Tmux, and terminal themes. I use this at least once a day.
puts my system to sleep, and works on macOS and Linux. I use this a few times a week.
recursively deletes all .DS_Store files in a directory. I hate that macOS clutters directories with these files! I don’t use this often, but I’m glad I have it when I need it.
is basically . Useful for seeing the source code of a file in your path (I used it for writing up this post, for example!). I use this a few times a month.
sends an OS notification. It’s used in several of my other scripts (see above). I also do something like this about once a month:
prints a v4 UUID. I use this about once a month.
Inspired by Ruby’s built-in REPL, I’ve made:
to start a Clojure REPL
to start a Deno REPL (or a Node REPL when Deno is missing)
to start a PHP REPL
to start a Python REPL
to start a SQLite shell (an alias for )
These are just scripts I use a lot. I hope some of them are useful to you! If you liked this post, you might like “Why ‘alias’ is my last resort for aliases” and “A decade of dotfiles”. Oh, and contact me if you have any scripts you think I’d like.
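The retry-until-success and run-until-failure helpers mentioned above are tiny. A minimal sketch, with assumed names retry and untilfail:

```bash
# Re-run a command until it exits 0, e.g.: retry curl -fO https://example.com/big-file
retry() {
  until "$@"; do
    sleep 1   # brief pause between attempts
  done
}

# Re-run a command as long as it keeps succeeding, e.g.: untilfail ./run-tests.sh
untilfail() {
  while "$@"; do
    :   # loop body intentionally empty; the test is the command itself
  done
}
```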

Simon Willison 1 month ago

Claude Code for web - a new asynchronous coding agent from Anthropic

Anthropic launched Claude Code for web this morning. It's an asynchronous coding agent - their answer to OpenAI's Codex Cloud and Google's Jules , and has a very similar shape. I had preview access over the weekend and I've already seen some very promising results from it. It's available online at claude.ai/code and shows up as a tab in the Claude iPhone app as well: As far as I can tell it's their latest Claude Code CLI app wrapped in a container (Anthropic are getting really good at containers these days) and configured to . It appears to behave exactly the same as the CLI tool, and includes a neat "teleport" feature which can copy both the chat transcript and the edited files down to your local Claude Code CLI tool if you want to take over locally. It's very straightforward to use. You point Claude Code for web at a GitHub repository, select an environment (fully locked down, restricted to an allow-list of domains or configured to access domains of your choosing, including "*" for everything) and kick it off with a prompt. While it's running you can send it additional prompts which are queued up and executed after it completes its current step. Once it's done it opens a branch on your repo with its work and can optionally open a pull request. Claude Code for web's PRs are indistinguishable from Claude Code CLI's, so Anthropic told me it was OK to submit those against public repos even during the private preview. Here are some examples from this weekend:
Add query-string-stripper.html tool against my simonw/tools repo - a very simple task that created (and deployed via GitHub Pages) this query-string-stripper tool.
minijinja vs jinja2 Performance Benchmark - I ran this against a private repo and then copied the results here, so no PR. Here's the prompt I used.
Update deepseek-ocr README to reflect successful project completion - I noticed that the README produced by Claude Code CLI for this project was misleadingly out of date, so I had Claude Code for web fix the problem.
That second example is the most interesting. I saw a tweet from Armin about his MiniJinja Rust template language adding support for Python 3.14 free threading. I hadn't realized that project had Python bindings, so I decided it would be interesting to see a quick performance comparison between MiniJinja and Jinja2. I ran Claude Code for web against a private repository with a completely open environment ("*" in the allow-list) and prompted: I’m interested in benchmarking the Python bindings for https://github.com/mitsuhiko/minijinja against the equivalente template using Python jinja2 Design and implement a benchmark for this. It should use the latest main checkout of minijinja and the latest stable release of jinja2. The benchmark should use the uv version of Python 3.14 and should test both the regular 3.14 and the 3.14t free threaded version - so four scenarios total The benchmark should run against a reasonably complicated example of a template, using template inheritance and loops and such like In the PR include a shell script to run the entire benchmark, plus benchmark implantation, plus markdown file describing the benchmark and the results in detail, plus some illustrative charts created using matplotlib I entered this into the Claude iPhone app on my mobile keyboard, hence the typos. It churned away for a few minutes and gave me exactly what I asked for. Here's one of the four charts it created: (I was surprised to see MiniJinja out-performed by Jinja2, but I guess Jinja2 has had a decade of clever performance optimizations and doesn't need to deal with any extra overhead of calling out to Rust.) Note that I would likely have got the exact same result running this prompt against Claude CLI on my laptop. The benefit of Claude Code for web is entirely in its convenience as a way of running these tasks in a hosted container managed by Anthropic, with a pleasant web and mobile UI layered over the top.
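As an aside, the four-scenario matrix from that prompt maps to a simple shell loop. This is a sketch, not what Claude actually generated: benchmark.py is a hypothetical script name, while --python and --with are uv's documented options (with "3.14t" selecting the free-threaded build):

```bash
# Run the benchmark under regular and free-threaded Python 3.14,
# once per template engine - four scenarios total.
for py in 3.14 3.14t; do
  for engine in minijinja jinja2; do
    uv run --python "$py" --with "$engine" benchmark.py --engine "$engine"
  done
done
```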
It's interesting how Anthropic chose to announce this new feature: the product launch is buried half way down their new engineering blog post Beyond permission prompts: making Claude Code more secure and autonomous , which starts like this: Claude Code's new sandboxing features, a bash tool and Claude Code on the web, reduce permission prompts and increase user safety by enabling two boundaries: filesystem and network isolation. I'm very excited to hear that Claude Code CLI is taking sandboxing more seriously. I've not yet dug into the details of that - it looks like it's using seatbelt on macOS and Bubblewrap on Linux. Anthropic released a new open source (Apache 2) library, anthropic-experimental/sandbox-runtime , with their implementation of this so far. Filesystem sandboxing is relatively easy. The harder problem is network isolation, which they describe like this: Network isolation , by only allowing internet access through a unix domain socket connected to a proxy server running outside the sandbox. This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains. And if you’d like further-increased security, we also support customizing this proxy to enforce arbitrary rules on outgoing traffic. This is crucial to protecting against both prompt injection and lethal trifecta attacks. The best way to prevent lethal trifecta attacks is to cut off one of the three legs, and network isolation is how you remove the data exfiltration leg that allows successful attackers to steal your data. If you run Claude Code for web in "No network access" mode you have nothing to worry about. I'm a little bit nervous about their "Trusted network access" environment. It's intended to only allow access to domains relating to dependency installation, but the default domain list has dozens of entries which makes me nervous about unintended exfiltration vectors sneaking through. You can also configure a custom environment with your own allow-list. I have one called "Everything" which allow-lists "*", because for projects like my MiniJinja/Jinja2 comparison above there are no secrets or source code involved that need protecting. I see Anthropic's focus on sandboxes as an acknowledgment that coding agents run in YOLO mode ( and the like) are enormously more valuable and productive than agents where you have to approve their every step. The challenge is making it convenient and easy to run them safely. This kind of sandboxing is the only approach to safety that feels credible to me. Update: A note on cost: I'm currently using a Claude "Max" plan that Anthropic gave me in order to test some of their features, so I don't have a good feeling for how much Claude Code would cost for these kinds of projects. From running (an unofficial cost estimate tool) it looks like I'm using between $1 and $5 worth of daily Claude CLI invocations at the moment.


When it comes to MCPs, everything we know about API design is wrong

TL;DR: I built a lightweight Chrome MCP. Scroll to the end to learn how to install it. Read the whole post to learn a little bit about the Zen of MCP design. Claude Code has built-in tools to fetch web pages and to search the web – they actually run through Anthropic's servers, if I recall correctly. They do clever things to carefully manage context and to return information in a format that's easy for Claude to digest. These tools work really well. Right up to the point where they completely fall apart. An uncoached testimonial from the only customer who matters. Last week, I somehow got it into my head that I should update my custom blogging client to use Apple's new Liquid Glass look and feel. The first issue I ran into was that Claude was absolutely sure that macOS 26 wasn't out yet. (Amusingly, when asked to review a draft of this post, one of the things it flagged was: 'Inconsistent model naming - You refer to "macOS 26" but I believe you mean "macOS 15" (Sequoia). macOS 26 would be way in the future.') Claude was, however, happy to speculate about what a "Liquid Glass" UI might look like. Once I reminded the model that it had memory issues and Apple had indeed released the new version of their operating system, it was ready to get to work. I told it to go read Apple's Human Interface Guidelines and make a plan. This is what Claude saw: It turns out that Apple no longer offers a downloadable version of the HIG. And the online version requires JavaScript. After a bit of flailing, Claude reached for the industry-standard Playwright MCP from Microsoft. The Playwright MCP is a collection of 21 tools covering all aspects of driving a browser and debugging webapps, from to to . Just having the Playwright MCP available costs 13,678 tokens (7% of the whole context window) in every single session, even if you never use it. (Yes, the Google Chrome team has their own Chrome MCP. Its API surface is even bigger.) And once you do start using it, things get worse. Some of its tools return the entire DOM of the webpage you're working with. This means that simple requests fail because they return more tokens than Claude can handle in a response: It's frustrating to see a coding agent trying over and over to use a tool the way it's supposed to and having that tool just fail to return useful data. After hearing me complain about this a few times, Dan Grigsby commented that he'd had success just asking Claude to teach itself a skill: using the raw Dev Tools remote control protocol to drive Chrome. This seemed like a neat trick, so I asked my Claude to take a swing at it. Claude was only too happy to try to speak raw JSON-RPC to Chrome on port 9292. It...just worked. But it was also very clunky and wasteful feeling. Claude was writing raw JSON-RPC command lines for each and every interaction. It was very verbose and required the LLM to get a whole bunch of details right on every single command invocation. It was time to make a proper Skill. After thinking about it for a moment, I asked Claude to write a little zero-dependency command-line tool called that it could run with the Bash tool to control Chrome, as well as a new file explaining how to use that script. encapsulated the complexity and made Chrome easily scriptable from the command line. The skill sets up the basics of web browsing with its tools and uses progressive disclosure to tell Claude how to get more information, but only when it has a need to know. For example, see these examples of how to use the tool.
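For a feel of the raw protocol Claude was speaking, here's a sketch of poking Chrome's DevTools endpoint from the shell. It assumes Chrome was launched with --remote-debugging-port=9292 (the port from the post); /json/list is part of Chrome's standard remote-debugging HTTP interface:

```bash
# List open tabs; each entry includes a webSocketDebuggerUrl for that tab.
curl -s http://localhost:9292/json/list | jq -r '.[0].webSocketDebuggerUrl'

# Raw JSON-RPC commands go over that WebSocket (e.g. via websocat):
#   {"id":1,"method":"Page.navigate","params":{"url":"https://example.com"}}
```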
Claude didn't always reach for the skill, so it wasn't aware of its new command-line tool, but once I pointed it in the right direction, it worked surprisingly well. This setup was incredibly token efficient – nothing in the context window at startup other than a skill and in the system prompt. What was a little frustrating for me was that any time Claude wanted to do anything with the browser, it had to run a custom Bash command that I had to approve. Every click. Every navigation. Every JavaScript expression. It got old really, really fast. There's no real way to fix that without creating a custom MCP. But that would put us right back where we were with the official Playwright MCP, right? Nearly two dozen tools and 13k tokens spilled on the floor every time we started a session. Even trimming things down to only the dozen most important commands is still a bunch of tools, most of which Claude won't use in a given session. If you've ever done API design, you probably know how important it is to name your methods well. You know that every method should do one thing and only one thing. You know that you really need to type (and validate) all your parameters to make sure your callers can tell what they're supposed to be passing in and to make bad method calls fail as soon as possible. It would be absolutely unhinged to have a method called that took a parameter called that was itself a method dispatcher, a parameter called , and a parameter called . You'd have to be crazy to think that it's acceptable API design to have the optional, untyped field just have a description like And yet. That is exactly how I designed it. And it's just great. The high-level tool description reads: At session startup, the whole MCP config weighs in at just 947 tokens. I'm pretty sure I can shave at least 30-40 more. It's optimized to make Claude's life as easy as possible. Rather than having a method to start the browser, the MCP...just does it when it needs to. Same with opening a new tab if there wasn't one waiting. The tool description tells Claude what to do and where to read up when it needs more help. At least so far, it works just great for me. One of the mistakes I made while developing the MCP was to instruct Claude to cut down the API surface by only accepting CSS selectors, rather than accepting CSS or XPath. It seemed natural to me that a smaller, simpler API would be easier for Claude to work with and reason about. Right up until I saw the MCP tool description containing multiple admonitions like . The whole thing just...worked better when I let the selector fields accept either CSS or XPath. Another thing that Claude got not-quite-right when it first implemented the MCP was that it included detailed human-readable text for all the method parameters. Because LLMs that are using MCPs can see both the and the actual JSON schema, you don't need to repeat things like lists of values for an enum or type validations. One trick you can use is to ask your agent to tell you exactly what it can see about how to use an API. One of the weirdest realizations I had while building is this: I have no doubt that there are a dozen similar tools out there, but it was literally faster and easier to build the tool that I thought should exist than to test out a dozen tools to see if any of them work the way I think they should. Over the last couple of decades, the common wisdom has become that Postel's Law (aka the robustness principle) is dated and wrong and that APIs should be rigid and rigorous.
That's the wrong choice when you're designing for use by LLMs. This might be a hard lesson to hear, but tools you build for LLMs are going to work much, much better if you think of your end-user as a "person" rather than a computer. Build your tools like they're a set of scripts you're handing to that undertrained kid who just got hired in the NOC. They are going to page you at 2AM when they can't figure out what's going on or when they misuse the tools in a way they can't unwind. Names and method descriptions matter far more than they ever have before. Automatic recovery is hugely important. Designing for error recovery rather than failing fast will make the whole system more reliable and less expensive to operate. When errors are unavoidable, your error messages should tell the user how to fix or work around the problem in plain English. If you can't give the user exactly what they asked for, but you can give them a partial answer or related information, do that. Claude absolutely does not care about the architectural purity of your API. It just wants to help you get work done with the limited resources at its disposal. This new MCP and skill for Claude Code is called superpowers-chrome . You can install it like this: If you're already using Superpowers , you can just type /plugin, navigate to 'Install plugins', pick 'superpowers-marketplace' and then you should see . I'd love to hear from you if you find it helpful. I'd also love patches and pull requests.

allvpv’s space 1 month ago

Environment variables are a legacy mess: Let's dive deep into them

Programming languages have rapidly evolved in recent years. But in software development, the new often meets the old, and the scaffolding the OS gives for running new processes hasn’t changed much since Unix. If you need to parametrize your application at runtime by passing a few ad-hoc variables (without special files or a custom solution involving IPC or networking), you’re doomed to a pretty awkward, outdated interface: there are no namespaces for them, no types. Just a flat, embarrassingly global dictionary of strings. But what exactly are these envvars? Is it some kind of special dictionary inside the OS? If not, who owns them and how do they propagate? In a nutshell: they’re passed from parent to child. On Linux, a program must use the syscall to execute another program. Whether you type in Bash, call in Python, or launch a code editor, it ultimately comes down to , preceded by a / . The family of C functions also relies on . This system call takes three arguments: , , . The first is the executable path; the second is the array of command line arguments – the implicit first (“zero”) argument is usually the executable name; the third is the array of envvars (typically much longer). For example, for an invocation: By default, all envvars are passed from the parent to the child. However, nothing prevents a parent process from passing a completely different or even empty environment when calling ! In practice, most tooling passes the environment down: Bash, Python’s , the C library , and so on. And this is what you expect – variables are inherited by child processes. That’s the point – to track the environment. Which tools do not pass the parent’s environment? For example, the executable, used when signing into a system, sets up a fresh environment for its children. After launching the new program, the kernel dumps the variables on the stack as a sequence of null-terminated strings which contain the envvar definitions. Here is a hex view: This static layout can’t easily be modified or extended; the program must copy those variables into its own data structure. Let’s look at how Bash, C, and Python store envvars internally. I analyzed their source code and here is a summary.
Bash stores the variables in a hashmap . Or, more precisely, in a stack of hashmaps . When you spawn a new process using Bash, it traverses the stack of hashmaps to find variables marked as exported and copies them into the environment array passed to the child. Side note: Why is traversing the stack needed? Each function invocation in Bash creates a new local scope – a new entry on the stack. If you declare your variable with , it ends up in this locally-scoped hashmap. What’s interesting is that you can export a local variable too! I wouldn’t have learned this without diving into Bash source. My intuitive (wrong) assumption was that exporting automatically makes the variable global – like ! Super interesting stuff.
The C library exposes a dynamic array, managed via and library functions. It uses an array, so the time complexity of and is linear in the number of envvars. Remember – envvars are not a high-performance dictionary and you should not abuse them.
Python couples its environment to the C library, which can cause surprising inconsistencies. If you’ve programmed some Python, you’ve probably used the dictionary. On startup, is built from the C library’s array. But those dictionary values are NOT the “ground truth” for child processes. Rather, each change to invokes the native function, which in turn calls the C library’s . Note that the propagation is one-directional: modifying will call , but not the other way around. Call , and won’t be updated.
The Linux kernel is very liberal about the format of environment variables, and so is .
For example, your C program can manipulate the environment – the global array – such that several variables share the same name but have different values. And when you execute a child process, it will inherit this “broken” setup. You don’t even need an equals sign separating name from value! The usual entry is , but nothing prevents you from adding to the array. The kernel happily accepts any null-terminated string as an “environment variable” definition. It just imposes a size limitation:
Single variable: 128 KiB on a typical x64 Intel CPU. This is for the whole definition – name + equals sign + value. It’s computed as . No modern hardware uses pages smaller than 4 KiB, so you can treat it as a lower bound, unless you need to deal with some legacy embedded systems.
Total: 2 MiB on a typical machine. This limit is shared by envvars and the command line arguments. The calculation is a bit more complicated (see the man page): on a typical system, the limiting factor is the . Remember, initially the envvars are dumped on the stack! To prevent unpredictable crashes, the system allows only 1/4 of the stack for the envvars.
But the fact that you can do something does not mean that you should. For example, if you start Bash with the “broken” environment – duplicated names and entries without – it deduplicates the variables and drops the nonsense. One interesting edge case is a space inside the variable name . My beloved shell – Nushell – has no problem with the following assignment: Python is fine with it, too. Bash, on the other hand, can’t reference it because whitespace isn’t allowed in variable names. Fortunately, the variable isn’t lost – Bash keeps such entries in a special hashmap called and still passes them to child processes. So what name and value can you safely use for your envvar? A popular misconception, repeated on StackOverflow and by ChatGPT, is that POSIX permits only uppercase envvars, and everything else is undefined behavior. But this is seriously NOT what the standard says: These strings have the form name=value; names shall not contain the character ‘=’. For values to be portable across systems conforming to POSIX.1-2017, the value shall be composed of characters from the portable character set (except NUL and as indicated below). There is no meaning associated with the order of strings in the environment. If more than one string in an environment of a process has the same name, the consequences are undefined. Environment variable names used by the utilities in the Shell and Utilities volume of POSIX.1-2017 consist solely of uppercase letters, digits, and the <underscore> ( ‘_’ ) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities. Yes, POSIX-specified utilities use uppercase envvars, but that’s not prescriptive for your programs. Quite the contrary: you’re encouraged to use lowercase for your envvars so they don’t collide with the standard tools. The only strict rule is that a variable name cannot contain an equals sign.
POSIX requires compliant applications to preserve all variables that conform to this rule. But in reality, not many applications use lowercase, and the de facto etiquette in software development is to use uppercase anyway. My advice is to use portable characters for names, and UTF-8 for values. You shouldn’t hit problems on Linux. If you want to be super safe: instead of UTF-8, use the POSIX-mandated Portable Character Set (PCS) – essentially ASCII without control characters. …and I hope it wasn’t a boring read.
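A few of the behaviors described above can be demonstrated straight from the shell, using only env(1) and /proc (the last line is Linux-specific):

```bash
# Start a child with a fresh environment, the way login does; note that
# bash itself re-adds a few variables like PWD and SHLVL.
env -i PATH=/usr/bin:/bin bash -c 'env'

# Pass an entry whose name contains a space - the kernel doesn't care,
# and bash still forwards it to its own children even though it can't
# reference it as a normal variable.
env 'my var=hello' bash -c 'env | grep "^my var="'

# Inspect the raw NUL-separated block the kernel handed this shell at
# exec time (the "dumped on the stack" layout, frozen at startup).
tr '\0' '\n' < "/proc/$$/environ"
```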

André Arko 1 month ago

Announcing rv 0.2

With the help of many new contributors, and after many late nights wrestling with make, we are happy to (slightly belatedly) announce the 0.2 release of rv ! This version dramatically expands support for Rubies, shells, and architectures.
Rubies: we have added Ruby 3.3, as well as re-compiled all Ruby 3.3 and 3.4 versions with YJIT. On Linux, YJIT increases our glibc minimum version to 2.35 or higher. That means most distro releases from 2022 or later should work, but please let us know if you run into any problems.
Shells: we have added support for bash, fish, and nushell in addition to zsh.
Architectures: we have added Ruby compiled for macOS on x86, in addition to Apple Silicon, and added Ruby compiled for Linux on ARM, in addition to x86.
Special thanks to the newest member of the maintainers’ team @adamchalmers for improving code and tests, adding code coverage and fuzzing, heroic amounts of issue triage, and nushell support. Additional thanks are due to all the new contributors in version 0.2, including @Thomascountz , @lgarron , @coezbek , and @renatolond . To upgrade, run , or check the release notes for other options.

flak 1 month ago

backporting go on openbsd

The OpenBSD ports tree generally tracks current, but sometimes backports (and stable packages) are made for more serious issues. As was the case for git 2.50.1. However, the go port has not seen a backport in quite some time. The OpenBSD release schedule aligns with the go schedule such that we always get the latest release, but not minor revisions. For instance, OpenBSD 7.7 shipped with go 1.24.1, but there’s a few minor revisions after that. We maybe don’t care about many of these backports, but issue 73570 is a backported fix for a bug specific to OpenBSD, so let’s say we want that. I always forget the procedures for building ports from scratch and waste a bunch of time running and cancelling and rerunning commands. So here’s a recipe that worked. If we don’t have the ports tree, we need to get that. If we don’t have bash, we need to install that. (There’s a magic formula to have ports install packages, but this is simpler.) We need to update the go port to a suitable revision. The port is currently on 1.25, but I’d rather stick with 1.24, so we go back a little ways. The OpenBSD port was never updated for 1.24.7, but those changes don’t look very exciting. Maybe next time I’ll try a custom update to a new version. We build the port. The bootstrap flavor is important, or we’ll end up building it twice. Tick, tock, ding, ding. Running will build and install a package. Check. Looks good.
redux
What if we want a version that’s not in ports? I figured this post would be pretty boring, but go just released 1.24.8, which includes security fixes I’d like, so now we definitely need to try building a new version. Let’s edit the Makefile . Now tell the ports system to download the new version and update the checksum. This downloads the new version and prints its checksum. Okay? Well, the go downloads page shows checksums in hex, but we can redo it to check. Looks good. Now run and again. Uh oh. Fucking FIPS, every fucking time. I don’t want to think too much about what this is doing, but the file has been renamed, so we need to update the pkg/PLIST file. Hopefully this is an aberration, as the go team is usually conservative about backporting changes, but one never knows what to expect. Now the package builds and installs correctly. And then rebuild everything that uses go.
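For reference, the rough shape of that recipe as shell commands. This is a sketch under assumptions: lang/go and the bootstrap flavor come from the post, makesum and update-plist are standard ports(7) targets, but the exact invocations may differ depending on your ports-tree setup:

```bash
cd /usr/ports/lang/go
vi Makefile                      # bump the version, e.g. 1.24.7 -> 1.24.8
make makesum                     # fetch the new distfile, regenerate distinfo
make FLAVOR=bootstrap install    # build the bootstrap flavor first
make install                     # then build and install the go package itself
make update-plist                # if the build complains about PLIST changes (hello, FIPS)
```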


LLMs Eat Scaffolding for Breakfast

We just deleted thousands of lines of code. Again. Each time a new model comes out, it’s the same story. LLMs have limitations, so we build scaffolding around them. Each model introduces new capabilities, so old scaffolding must be deleted and new scaffolding added. But as we move closer to super intelligence, less scaffolding is needed. This post is about what it takes to build successfully in AI today. Every line of scaffolding is a confession: the model wasn’t good enough.
LLMs can’t read PDF? Let’s build a complex system to convert PDF to markdown
LLMs can’t do math? Let’s build a compute engine to return accurate numbers
LLMs can’t handle structured output? Let’s build complex JSON validators and regex parsers
LLMs can’t read images? Let’s use a specialized image-to-text model to describe the image to the LLM
LLMs can’t read more than 3 pages? Let’s build a complex retrieval pipeline with a search engine to feed the best content to the LLM
LLMs can’t reason? Let’s build chain-of-thought logic with forced step-by-step breakdowns, verification loops, and self-consistency checks
etc, etc... millions of lines of code to add external capabilities to the model. But look at models today: GPT-5 is solving frontier mathematics, Grok-4 Fast can read 3000+ pages with its 2M context window, Claude 4.5 Sonnet can ingest images or PDFs, all models have native reasoning capabilities and support structured outputs. The once-essential scaffolding is now obsolete. Those capabilities are baked into the models. It’s nearly impossible to predict what scaffolding will become obsolete and when. What appears to be essential infrastructure and industry best practice today can transform into legacy technical debt within months. The best way to grasp how fast LLMs are eating scaffolding is to look at their system prompt (the top-level instruction that tells the AI how to behave). Comparing the prompt used in Codex, OpenAI’s coding agent, from the GPT-o3 model to GPT-5 is mind-blowing.
GPT-o3 prompt: 310 lines
GPT-5 prompt: 104 lines
The new prompt removed 206 lines. A 66% reduction. GPT-5 needs way less handholding. The old prompt had complex instructions on how to behave as a coding agent (personality, preambles, when to plan, how to validate). The new prompt assumes GPT-5 already knows this and only specifies the Codex-specific technical requirements (sandboxing, tool usage, output formatting). The new prompt removed all the detailed guidance about autonomously resolving queries, coding guidelines, git usage. It’s also less prescriptive. Instead of “do this and this” it says “here are the tools at your disposal.” As we move closer to super intelligence, the models require more freedom and leeway (scary, lol!). Advanced models require simple instructions and tooling. Claude Code, the most sophisticated agent today, relies on a simple filesystem instead of a complex index and uses bash commands (find, read, grep, glob) instead of complex tools. It moves so fast. Each model introduces a new paradigm shift. If you miss a paradigm shift, you’re dead. Having an edge in building AI applications requires deep technical understanding, insatiable curiosity, and low ego. By the way, because everything changes, it’s good to focus on what won’t change. Context window is how much text you can feed the model in a single conversation. Early models could only handle a couple of pages. Now it’s thousands of pages and it’s growing fast.
Dario Amodei, the founder of Anthropic, expects 100M+ context windows while Sam Altman hinted at billions of context tokens . It means the LLMs can see more context, so you need less scaffolding like retrieval augmented generation.
November 2022: GPT-3.5 could handle 4K context
November 2023: GPT-4 Turbo with 128K context
June 2024: Claude 3.5 Sonnet with 200K context
June 2025: Gemini 2.5 Pro with 1M context
September 2025: Grok-4 Fast with 2M context
Models used to stream at 30-40 tokens per second. Today’s fastest models like Gemini 2.5 Flash and Grok-4 Fast hit 200+ tokens per second. A 5x improvement. On specialized AI chips (LPUs), providers like Cerebras push open-source models to 2,000 tokens per second. We’re approaching real-time LLMs: full responses on complex tasks in under a second. LLMs are becoming exponentially smarter. With every new model, benchmarks get saturated. On the path to AGI, every benchmark will get saturated. Every job can be done and will be done by AI. As with humans, a key factor in intelligence is the ability to use tools to accomplish an objective. That is the current frontier: how well a model can use tools such as reading, writing, and searching to accomplish a task over a long period of time. This is important to grasp. Models will not improve their language translation skills (they are already at 100%), but they will improve how they chain translation tasks over time to accomplish a goal. For example, you can say, “Translate this blog post into every language on Earth,” and the model will work for a couple of hours on its own to make it happen. Tool use and long-horizon tasks are the new frontier. The uncomfortable truth: most engineers are maintaining infrastructure that shouldn’t exist. Models will make it obsolete and the survival of AI apps depends on how fast you can adapt to the new paradigm. That’s where startups have an edge over big companies. Bigcorps are late by at least two paradigms. Some examples of scaffolding that are on the decline:
Vector databases : Companies paying thousands/month for when they could now just put docs in the prompt or use agentic-search instead of RAG ( my article on the topic )
LLM frameworks : These frameworks solved real problems in 2023. In 2025? They’re abstraction layers that slow you down. The best practice is now to use the model API directly.
Prompt engineering teams : Companies hiring “prompt engineers” to craft perfect prompts when current models just need clear instructions with open tools
Model fine-tuning : Teams spending months fine-tuning models only for the next generation of out-of-the-box models to outperform their fine-tune (cf my 2024 article on that )
Custom caching layers : Building Redis-backed semantic caches that add latency and complexity when prompt caching is built into the API.
This cycle accelerates with every model release. The best AI teams master four critical skills:
Deep model awareness : They understand exactly what today’s models can and cannot do, building only the minimal scaffolding needed to bridge capability gaps.
Strategic foresight : They distinguish between infrastructure that solves today’s problems versus infrastructure that will survive the next model generation.
Frontier vigilance : They treat model releases like breaking news. Missing a single capability announcement from OpenAI, Anthropic, or Google can render months of work obsolete.
Ruthless iteration : They celebrate deleting code.
When a new model makes their infrastructure redundant, they pivot in days, not months. It’s not easy. Teams are fighting powerful forces:
Lack of awareness : Teams don’t realize models have improved enough to eliminate scaffolding (this is massive btw)
Sunk cost fallacy : “We spent 3 years building this RAG pipeline!”
Fear of regression : “What if the new approach is simple but doesn’t work as well on certain edge cases?”
Organizational inertia : Getting approval to delete infrastructure is harder than building it
Resume-driven development : “RAG pipeline with vector DB and reranking” looks better on a resume than “put files in prompt”
In AI, the best teams build for fast obsolescence and stay at the edge. Software engineering sits on top of a complex stack. More layers, more abstractions, more frameworks. Complexity passed for sophistication. A simple web form in 2024? React for UI, Redux for state, TypeScript for types, Webpack for bundling, Jest for testing, ESLint for linting, Prettier for formatting, Docker for deployment… AI is inverting this. The best AI code is simple and close to the model. Experienced engineers look at modern AI codebases and think: “This can’t be right. Where’s the architecture? Where’s the abstraction? Where’s the framework?” The answer: The model ate it bro, get over it. The worst AI codebases are the ones that were best practices 12 months ago. As models improve, the scaffolding becomes technical debt. The sophisticated architecture becomes the liability. The framework becomes the bottleneck. LLMs eat scaffolding for breakfast and the trend is accelerating. Thanks for reading!

baby steps 1 month ago

SymmACP: extending Zed's ACP to support Composable Agents

This post describes SymmACP – a proposed extension to Zed’s Agent Client Protocol that lets you build AI tools like Unix pipes or browser extensions. Want a better TUI? Found some cool slash commands on GitHub? Prefer a different backend? With SymmACP, you can mix and match these pieces and have them all work together without knowing about each other. This is pretty different from how AI tools work today, where everything is a monolith – if you want to change one piece, you’re stuck rebuilding the whole thing from scratch. SymmACP allows you to build out new features and modes of interactions in a layered, interoperable way. This post explains how SymmACP would work by walking through a series of examples. Right now, SymmACP is just a thought experiment. I’ve sketched these ideas to the Zed folks, and they seemed interested, but we still have to discuss the details in this post. My plan is to start prototyping in Symposium – if you think the ideas I’m discussing here are exciting, please join the Symposium Zulip and let’s talk! I’m going to explain the idea of “composable agents” by walking through a series of features. We’ll start with a basic CLI agent 1 tool – basically a chat loop with access to some MCP servers so that it can read/write files and execute bash commands. Then we’ll show how you could add several features on top:
Addressing time-blindness by helping the agent know what time it is.
Injecting context and “personality” into the agent.
Spawning long-running, asynchronous tasks.
A copy of Q CLI’s tangent mode that lets you do a bit of “off the books” work that gets removed from your history later.
Implementing Symposium’s interactive walkthroughs , which give the agent a richer vocabulary for communicating with you than just text.
Smarter tool delegation.
The magic trick is that each of these features will be developed as separate repositories. What’s more, they could be applied to any base tool you want, so long as it speaks SymmACP. And you could also combine them with different front-ends, such as a TUI, a web front-end, builtin support from Zed or IntelliJ , etc. Pretty neat. My hope is that if we can centralize on SymmACP, or something like it, then we could move from everybody developing their own bespoke tools to an interoperable ecosystem of ideas that can build off of one another. SymmACP begins with ACP, so let’s explain what ACP is. ACP is a wonderfully simple protocol that lets you abstract over CLI agents. Imagine if you were using an agentic CLI tool except that, instead of communicating over the terminal, the CLI tool communicates with a front-end over JSON-RPC messages, currently sent via stdin/stdout. When you type something into the GUI, the editor sends a JSON-RPC message to the agent with what you typed. The agent responds with a stream of messages containing text and images. If the agent decides to invoke a tool, it can request permission by sending a JSON-RPC message back to the editor. And when the agent has completed, it responds to the editor with an “end turn” message that says “I’m ready for you to type something else now”. OK, let’s tackle our first feature. If you’ve used a CLI agent, you may have noticed that they don’t know what time it is – or even what year it is. This may sound trivial, but it can lead to some real mistakes. For example, they may not realize that some information is outdated. Or when they do web searches for information, they can search for the wrong thing: I’ve seen CLI agents search the web for “API updates in 2024” for example, even though it is 2025. To fix this, many CLI agents will inject some extra text along with your prompt, something like . This gives the LLM the context it needs. So how could we use ACP to build that? The idea is to create a proxy . This proxy would wrap the original ACP server: This proxy will take every “prompt” message it receives and decorate it with the date and time: Simple, right?
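To make the proxy idea concrete, here is a minimal sketch in shell. The "session/prompt" method name, the prompt field, and the newline-delimited JSON-RPC framing are all assumptions for illustration, not the real ACP schema:

```bash
#!/usr/bin/env bash
# Usage: acp-time-proxy.sh real-agent --with-its-args
# Sits between the editor (our stdin/stdout) and the real ACP agent.
coproc AGENT { exec "$@"; }

# Editor -> agent: prepend the current timestamp to every prompt message.
while IFS= read -r msg; do
  jq -c --arg now "$(date -u +%FT%TZ)" '
    if .method == "session/prompt"
    then .params.prompt |= "The current date and time is \($now).\n\n" + .
    else . end' <<<"$msg" >&"${AGENT[1]}"
done &

# Agent -> editor: forward responses through untouched.
cat <&"${AGENT[0]}"
```

The point of the shape, rather than the details: the proxy speaks the same protocol on both sides, so proxies compose by chaining.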
And of course this can be used with any editor and any ACP-speaking tool. Let’s look at another feature that basically “falls out” from ACP: injecting personality. Most agents give you the ability to configure “context” in various ways – or what Claude Code calls memory . This is useful, but I and others have noticed that if what you want is to change how Claude “behaves” – i.e., to make it more collaborative – it’s not really enough. You really need to kick off the conversation by reinforcing that pattern. In Symposium, the “yiasou” prompt (also available as “hi”, for those of you who don’t speak Greek 😛) is meant to be run as the first thing in the conversation. But there’s nothing an MCP server can do to ensure that the user kicks off the conversation with or something similar. Of course, if Symposium were implemented as an ACP Server, we absolutely could do that: Some of you may be saying, “hmm, isn’t that what hooks are for?” And yes, you could do this with hooks, but there’s two problems with that. First, hooks are non-standard, so you have to do it differently for every agent. The second problem with hooks is that they’re fundamentally limited to what the hook designer envisioned you might want. You only get hooks at the places in the workflow that the tool gives you, and you can only control what the tool lets you control. The next feature starts to show what I mean: as far as I know, it cannot readily be implemented with hooks the way I would want it to work. Let’s move on to our next feature, long-running asynchronous tasks. This feature is going to have to go beyond the current capabilities of ACP into the expanded “SymmACP” feature set. Right now, when the server invokes an MCP tool, it executes in a blocking way. But sometimes the task it is performing might be long and complicated. What you would really like is a way to “start” the task and then go back to working. When the task is complete, you (and the agent) could be notified. This comes up for me a lot with “deep research”. A big part of my workflow is that, when I get stuck on something I don’t understand, I deploy a research agent to scour the web for information. Usually what I will do is ask the agent I’m collaborating with to prepare a research prompt summarizing the things we tried, what obstacles we hit, and other details that seem relevant. Then I’ll pop over to claude.ai or Gemini Deep Research and paste in the prompt. This will run for 5-10 minutes and generate a markdown report in response. I’ll download that and give it to my agent. Very often this lets us solve the problem. 2 This research flow works well but it is tedious and requires me to copy-and-paste. What I would ideally want is an MCP tool that does the search for me and, when the results are done, hands them off to the agent so it can start processing immediately. But in the meantime, I’d like to be able to continue working with the agent while we wait. Unfortunately, the protocol for tools provides no mechanism for asynchronous notifications like this, from what I can tell. So how would I do it with SymmACP? Well, I would want to extend the ACP protocol as it is today in two ways:
I’d like the ACP proxy to be able to provide tools that the proxy will execute. Today, the agent is responsible for executing all tools; the ACP protocol only comes into play when requesting permission . But it’d be trivial to have MCP tools where, to execute the tool, the agent sends back a message over ACP instead.
I’d like to have a way for the agent to initiate responses to the editor . Right now, the editor always initiates each communication session with a prompt; but, in this case, the agent might want to send messages back unprompted.
In that case, we could implement our Research Proxy like so: What’s cool about this is that the proxy encapsulates the entire flow: it knows how to do the research, and it manages notifying the various participants when the research completes. (Also, this leans on one detail I left out, which is that ) Let’s explore our next feature, Q CLI’s tangent mode.
This feature is interesting because it’s a simple (but useful!) example of history editing. The way tangent mode works is that, when you first enter it, Q CLI saves your current state. You can then continue as normal, but when you next leave the tangent, your state is restored to where you were. This, as the name suggests, lets you explore a side conversation without polluting your main context. The basic idea for supporting tangent in SymmACP is that the proxy is going to (a) intercept the tangent prompt and remember where it began; (b) allow the conversation to continue as normal; and then (c) when it’s time to end the tangent, create a new session and replay the history up until the point of the tangent 3 . You can almost implement “tangent” in ACP as it is, but not quite. In ACP, the agent always owns the session history. The editor can create a new session or load an older one; when loading an older one, the agent “replays” the events so that the editor can reconstruct the GUI. But there is no way for the editor to “replay” or construct a session to the agent . Instead, the editor can only send prompts, which will cause the agent to reply. In this case, what we want is to be able to say “create a new chat in which I said this and you responded that” so that we can setup the initial state. This way we could easily create a new session that contains the messages from the old one. So here's how this would work: One of the nicer features of Symposium is the ability to do interactive walkthroughs . These consist of an HTML sidebar as well as inline comments in the code: Right now, this is implemented by a kind of hacky dance:
The agent invokes an MCP tool and sends it the walkthrough in markdown. This markdown includes commands meant to be placed on particular lines, identified not by line number (agents are bad at line numbers) but by symbol names or search strings.
The MCP tool parses the markdown, determines the line numbers for comments, and creates HTML.
It sends that HTML over IPC to the VSCode extension.
The VSCode extension receives the IPC message, displays the HTML in the sidebar, and creates the comments in the code.
It works, but it’s a giant Rube Goldberg machine. With SymmACP, we would structure the passthrough mechanism as a proxy. Just as today, it would provide an MCP tool to the agent to receive the walkthrough markdown. It would then convert that into the HTML to display on the side along with the various comments to embed in the code. But this is where things are different. Instead of sending that content over IPC, what I would want to do is to make it possible for proxies to deliver extra information along with the chat. This is relatively easy to do in ACP as is, since it provides for various capabilities, but I think I’d want to go one step further. I would have a proxy layer that manages walkthroughs. As we saw before, it would provide a tool. But there’d be one additional thing, which is that, beyond just a chat history, it would be able to convey additional state. I think the basic conversation structure is like:
Conversation
  Turn
    User prompt(s) – could be zero or more
    Response(s) – could be zero or more
    Tool use(s) – could be zero or more
but I think it’d be useful to (a) be able to attach metadata to any of those things, e.g., to add extra context about the conversation or about a specific turn (or even a specific prompt ), but also additional kinds of events. For example, tool approvals are an event . And presenting a walkthrough and adding annotations are an event too. The way I imagine it, one of the core things in SymmACP would be the ability to serialize your state to JSON. You’d be able to ask a SymmACP participant to summarize a session. They would in turn ask any delegates to summarize and then add their own metadata along the way. You could also send the request in the other direction – e.g., the agent might present its state to the editor and ask it to augment it. This would mean a walkthrough proxy could add extra metadata into the chat transcript like “the current walkthrough” and “the current comments that are in place”. Then the editor would either know about that metadata or not.
If it doesn’t, you wouldn’t see it in your chat. Oh well – or perhaps we do something HTML-like, where there’s a way to “degrade gracefully” (e.g., the walkthrough could be presented as a regular “response” but with some metadata that, if you know to look, tells you to interpret it differently). But if the editor DOES know about the metadata, it interprets it specially, throwing the walkthrough up in a panel and adding the comments into the code. With enriched histories, I think we can even say that in SymmACP, the ability to load, save, and persist sessions itself becomes an extension, something that can be implemented by a proxy; the base protocol only needs the ability to conduct and serialize a conversation. Let me sketch out another feature that I’ve been noodling on that I think would be pretty cool. It’s well known that there’s a problem that LLMs get confused when there are too many MCP tools available. They get distracted. And that’s sensible, so would I, if I were given a phonebook-size list of possible things I could do and asked to figure something out. I’d probably just ignore it. But how do humans deal with this? Well, we don’t take the whole phonebook – we get a shorter list of categories of options and then we drill down. So I go to the File Menu and then I get a list of options, not a flat list of commands. I wanted to try building an MCP tool for IDE capabilities that was similar. There’s a bajillion set of things that a modern IDE can “do”. It can find references. It can find definitions. It can get type hints. It can do renames. It can extract methods. In fact, the list is even open-ended, since extensions can provide their own commands. I don’t know what all those things are but I have a sense for the kinds of things an IDE can do – and I suspect models do too. What if you gave them a single tool, “IDE operation”, and they could use plain English to describe what they want? e.g., . Hmm, this is sounding a lot like a delegate, or a sub-agent. Because now you need to use a second LLM to interpret that request – you probably want to do something like, give it a list of suggested IDE capabilities and the ability to find out full details and ask it to come up with a plan (or maybe directly execute the tools) to find the answer. As it happens, MCP has a capability to enable tools to do this – it’s called (somewhat oddly, in my opinion) “sampling”. It allows for “callbacks” from the MCP tool to the LLM. But literally nobody implements it, from what I can tell. 4 But sampling is kind of limited anyway. With SymmACP, I think you could do much more interesting things. The key is that ACP already permits a single agent to “serve up” many simultaneous sessions. So that means that if I have a proxy, perhaps one supplying an MCP tool definition, I could use it to start fresh sessions – combine that with the “history replay” capability I mentioned above, and the tool can control exactly what context to bring over into that session to start from, as well, which is very cool (that’s a challenge for MCP servers today, they don’t get access to the conversation history). Ok, this post sketched a variant on ACP that I call SymmACP. SymmACP extends ACP with:
the ability for either side to provide the initial state of a conversation, not just the server
the ability for an “editor” to provide an MCP tool to the “agent”
the ability for agents to respond without an initial prompt
the ability to serialize conversations and attach extra state (already kind of present)
Most of these are modest extensions to ACP, in my opinion, and easily doable in a backwards-compatible fashion just by adding new capabilities. But together they unlock the ability for anyone to craft extensions to agents and deploy them in a composable way. I am super excited about this. This is exactly what I wanted Symposium to be all about.
It’s worth noting the old adage: “with great power, comes great responsibility”. These proxies and ACP layers I’ve been talking about are really like IDE extensions. They can effectively do anything you could do. There are obvious security concerns. Though I think that approaches like Microsoft’s Wassette are key here – it’d be awesome to have a “capability-based” notion of what a “proxy layer” is, where everything compiles to WASM, and where users can tune what a given proxy can actually do . I plan to start sketching a plan to drive this work in Symposium and elsewhere. My goal is to have a completely open and interoperable client, one that can be based on any agent (including local ones) and where you can pick and choose which parts you want to use. I expect to build out lots of custom functionality to support Rust development (e.g., explaining and diagnosing trait errors using the new trait solver is high on my list…and macro errors…) but also to have other features like walkthroughs, collaborative interaction style, etc that are all language independent – and I’d love to see language-focused features for other languages, especially Python and TypeScript (because “the new trifecta” ) and Swift and Kotlin (because mobile). If that vision excites you, come join the Symposium Zulip and let’s chat!
One question I’ve gotten when discussing this is how it compares to the host of other protocols out there. Let me give a brief overview of the related work and how I understand its pros and cons:
Right now, the editor always initiates each communication session with a prompt; but, in this case, the agent might want to send messages back unprompted.

For the interactive walkthroughs, the flow works like this:

1. The agent invokes an MCP tool and sends it the walkthrough in markdown. This markdown includes comments meant to be placed on particular lines, identified not by line number (agents are bad at line numbers) but by symbol names or search strings.
2. The MCP tool parses the markdown, determines the line numbers for comments, and creates HTML.
3. It sends that HTML over IPC to the VSCode extension.
4. The VSCode extension receives the IPC message, displays the HTML in the sidebar, and creates the comments in the code.

A conversation turn, in this model, consists of:

- user prompt(s) – could be zero or more
- response(s) – could be zero or more
- tool use(s) – could be zero or more

One question I’ve gotten when discussing this is how it compares to the other host of protocols out there. Let me give a brief overview of the related work and how I understand its pros and cons:

Model Context Protocol (MCP): The queen of them all. A protocol that provides a set of tools, prompts, and resources up to the agent. Agents can invoke tools by supplying appropriate parameters, which are JSON. Prompts are shorthands that users can invoke using special commands; they are essentially macros that expand “as if the user typed it” (but they can also have parameters and be dynamically constructed). Resources are just data that can be requested. MCP servers can either be local or hosted remotely. Remote MCP has only recently become an option and auth in particular is limited. Comparison to SymmACP: MCP provides tools that the agent can invoke. SymmACP builds on it by allowing those tools to be provided by outer layers in the proxy chain. SymmACP is oriented at controlling the whole chat “experience”.

Zed’s Agent Client Protocol (ACP): The basis for SymmACP. Allows editors to create and manage sessions. Focused only on local sessions, since your editor runs locally. Comparison to SymmACP: That’s what this post is all about! SymmACP extends ACP with new capabilities that let intermediate layers manipulate history, provide tools, and provide extended data upstream to support richer interaction patterns than just chat. PS: I expect we may want to support more remote capabilities, but it’s kinda orthogonal in my opinion (e.g., I’d like to be able to work with an agent running over in a cloud-hosted workstation, but I’d probably piggyback on ssh for that).

Google’s Agent-to-Agent Protocol (A2A) and IBM’s Agent Communication Protocol (ACP) 5: From what I can tell, Google’s “agent-to-agent” protocol is kinda like a mix of MCP and OpenAPI. You can ping agents that are running remotely and get them to send you “agent cards”, which describe what operations they can perform, how you authenticate, and other stuff like that. It looks to me quite similar to MCP except that it has richer support for remote execution and in particular supports things like long-running communication, where an agent may need to go off and work for a while and then ping you back on a webhook. Comparison to MCP: To me, A2A looks like a variant of MCP that is more geared to remote execution. MCP has a method for tool discovery where you ping the server to get a list of tools; A2A has a similar mechanism with Agent Cards. MCP can run locally, which A2A cannot afaik, but A2A has more options about auth.
MCP can only be invoked synchronously, whereas A2A supports long-running operations, progress updates, and callbacks. It seems like the two could be merged to make a single whole. Comparison to SymmACP: I think A2A is orthogonal to SymmACP. A2A is geared towards agents that provide services to one another. SymmACP is geared towards building new development tools for interacting with agents. It’s possible you could build something like SymmACP on A2A, but I don’t know what you would really gain by it (and I think it’d be easy to do later).

Footnotes:

1. Everybody uses agents in various ways. I like Simon Willison’s “agents are models using tools in a loop” definition; I feel that an “agentic CLI tool” fits that definition, it’s just that part of the loop is reading input from the user. I think “fully autonomous” agents are a subset of all agents – many agent processes interact with the outside world via tools etc. From a certain POV, you can view the agent “ending the turn” as invoking a tool for “gimme the next prompt”. ↩︎
2. Research reports are a major part of how I avoid hallucination. You can see an example of one such report I commissioned on the details of the Language Server Protocol here; if we were about to embark on something that required detailed knowledge of LSP, I would ask the agent to read that report first. ↩︎
3. Alternatively: clear the session history and rebuild it, but I kind of prefer the functional view of the world, where a given session never changes. ↩︎
4. I started an implementation for Q CLI but got distracted – and, for reasons that should be obvious, I’ve started to lose interest. ↩︎
5. Yes, you read that right. There is another ACP. Just a mite confusing when you google search. =) ↩︎

Robin Moffatt 1 month ago

Stumbling into AI: Part 5—Agents

A short series of notes for myself as I learn more about the AI ecosystem as of Autumn [Fall] 2025. The driver for all this is understanding more about Apache Flink’s Flink Agents project, and Confluent’s Streaming Agents.

I started off this series—somewhat randomly, with hindsight—looking at Model Context Protocol (MCP). It’s a helper technology to make things easier to use and provide a richer experience. Next I tried to wrap my head around Models—mostly LLMs, but also with an addendum discussing other types of model too. Along the lines of MCP, Retrieval Augmented Generation (RAG) is another helper technology that on its own doesn’t do anything, but combined with an LLM gives it added smarts. I took a brief moment in part 4 to try and build a clearer understanding of the difference between ML and AI.

So whilst RAG and MCP combined make for a bunch of nice capabilities beyond models such as LLMs alone, what I’m really circling around here is what we can do when we combine all these things: Agents! But…what is an Agent, both conceptually and in practice? Let’s try and figure it out.

Let’s begin with Wikipedia’s definition:

In computer science, a software agent is a computer program that acts for a user or another program in a relationship of agency.

We can get more specialised if we look at Wikipedia’s entry for an Intelligent Agent:

In artificial intelligence, an intelligent agent is an entity that perceives its environment, takes actions autonomously to achieve goals, and may improve its performance through machine learning or by acquiring knowledge.

Citing Wikipedia is perhaps the laziest ever blog author’s trick, but I offer no apologies 😜. Behind all the noise and fuss, this is what we’re talking about: a bit of software that’s going to go and do something for you (or your company) autonomously.

LangChain have their own definition of an Agent, explicitly identifying the use of an LLM:

An AI agent is a system that uses an LLM to decide the control flow of an application.

The blog post from LangChain as a whole gives more useful grounding in this area and is worth a read. In fact, if you want to really get into it, the LangChain Academy is free and the Introduction to LangGraph course gives a really good primer on Agents and more.

Meanwhile, the Anthropic team have a chat about their definition of an Agent. In a blog post, Anthropic differentiates between Workflows (that use LLMs) and Agents:

Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Independent researcher Simon Willison also uses the LLM word in his definition:

An LLM agent runs tools in a loop to achieve a goal.

He explores the definition in a recent blog post, “I think ‘agent’ may finally have a widely enough agreed upon definition to be useful jargon now”, in which Josh Bickett’s meme demonstrates how much of a journey this definition has been on. That there’s still discussion and ambiguity nearly two years after this meme was created is telling.

My colleague Sean Falconer knows a lot more about this than I do. He was a guest on a recent podcast episode in which he spells things out:

[Agentic AI] involves AI systems that can reason, dynamically choose tasks, gather information, and perform actions as a more complete software system.
[1]

[Agents] are software that can dynamically decide its own control flow: choosing tasks, workflows, and gathering context as needed. Realistically, current enterprise agents have limited agency […]. They’re mostly workflow automations rather than fully autonomous systems. [2]

In many ways […] an agent [is] just a microservice. [3]

A straightforward software Agent might do something like: order more biscuits when there are only two left. The pseudo-code looks something like the first sketch at the end of this section. We take this code, stick it on a server and leave it to run. One happy Agent, done. An AI Agent could look more like the second sketch below.

Other examples of AI Agents include:

Coding Agents. Everyone’s favourite tool (when used right). It can reason about code, it can write code, it can review PRs. One of the trends that I’ve noticed recently (October 2025) is the use of Agents to help with some of the up-front jobs in software engineering (such as data modelling and writing tests), rather than full-blown code that’s going to ship to production. That’s not to say that coding Agents aren’t being used for that, but by using AI to accelerate certain tasks whilst retaining human oversight (a.k.a. HITL) it makes it easier to review the output rather than just trusting to luck that reams and reams of code are correct. There’s a good talk from Uber on how they’re using AI in the development process, including code conversion, and testing.

Travel booking. Perhaps you tell it when you want to go, the kind of vacation you like, and what your budget is; it then goes and finds where it’s nice at that time of year, figures out travel plans within your budget, and either proposes an itinerary or even books it for you. Another variation could be you tell it where, and then it integrates with your calendar to figure out the when. This is a canonical example that is oft-cited; I’d be interested if anyone can point me to an actual implementation of it, even if just a toy one.

I saw this in a blog post from Simon Willison that made me wince, but am leaving the above in anyway, just to serve as an example of the confusion/hype that exists in this space.

“Agentic” comes from “Agent” plus “-ic”, the latter meaning of, relating to, or characterised by. So “Agentic AI” is simply AI that is characterised by an Agent, or Agency. Contrast that to AI that’s you sat at the ChatGPT prompt asking it to draw pictures of a duck dressed as a clown. Nothing Agentic about that—just a human-led and human-driven interaction. “AI Agents” becomes a bit of a mouthful with the qualifier, so much of the current industry noise is simply around “Agents”. That said, “Agentic AI” sounds cool, so it gets used as the marketing term in place of “AI” alone.

So we’ve muddled our way through to some kind of understanding of what an Agent is, and what we mean by Agentic AI. But how do we actually build one? All we need is an LLM (such as access to the API for OpenAI or Claude), something to call that API (there are worse choices!), and a way to call external services (e.g. MCP servers) if the LLM determines that it needs to use them. So in theory we could build an Agent with some lines of bash, some API calls, and a bunch of sticky-backed plastic. This is a grossly oversimplified example (and is missing elements such as memory)—but it hopefully illustrates what we’re building at the core of an Agent.
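The post’s pseudo-code didn’t survive the formatting here, so below is a minimal sketch of what it describes, in the “some lines of bash” spirit the post itself suggests. `biscuit_count` and `order_biscuits` are hypothetical stand-ins for whatever inventory and ordering APIs you would really call.

```bash
#!/usr/bin/env bash
# Plain software agent: re-order biscuits when only two are left.
# biscuit_count and order_biscuits are hypothetical helpers.
while true; do
  if [ "$(biscuit_count)" -le 2 ]; then
    order_biscuits 1   # order one more packet
  fi
  sleep 3600           # check again in an hour
done
```

And a hedged sketch of the AI-Agent variant, where an LLM decides the control flow instead of a hard-coded rule. Here `llm` and `pantry_inventory` are placeholders for a chat-completion API call and a tool, respectively:

```bash
#!/usr/bin/env bash
# AI agent: the LLM, not an if-statement, decides what to do next.
while true; do
  inventory=$(pantry_inventory)   # hypothetical tool call
  action=$(llm "Pantry inventory: ${inventory}.
Decide what to do. Reply with exactly one of:
  order_biscuits <n>
  noop")
  case "$action" in
    order_biscuits*) $action ;;   # execute the tool the model chose
    *) : ;;                       # noop or anything else: do nothing
  esac
  sleep 3600
done
```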
On top of this goes all the general software engineering requirements of any system that gets built (suitable programming language and framework, error handling, LLM output validation, guard rails, observability, tests, etc. etc.).

The other nuance that I’ve noticed is that whilst the above simplistic sketch is 100% driven by an LLM (it decides what tools to call, it decides when to iterate), there are plenty of cases where an Agent is to some degree rules-driven. So perhaps the LLM does some of the autonomous work, but then there’s a bunch of good ol’ if statements in there too. This is also borne out by the notion of “Workflows” when people talk about Agents. An Agent doesn’t wake up in the morning and set out on its day serving only to fulfill its own goals and enrichment. More often than not an Agent is going to be tightly bound into a pre-defined path with a limited range of autonomy.

What if you want to actually build this kind of thing for real? That’s where tools like LangGraph and LangChain come in. Here’s a notebook with an example of an actual Agent built with these tools. LlamaIndex is another framework, with details of building an Agent in their docs.

As we build up from the so-simple-it-is-laughable strawman example of an Agent above, one of the features we’ll soon encounter is the concept of memory. The difference between a crappy response and a holy-shit-that’s-magic response from an LLM is often down to context. The richer the context, the better a chance it has at generating a more accurate output. So if an Agent can look back on what it did previously, determining what worked well and what didn’t, perhaps even taking into account human feedback, it can then generate a more successful response the next time. You can read a lot more about memory in this chapter of Agentic Design Patterns by Antonio Gulli. This blog post from “The BIG DATA guy” is also useful: Agentic AI, Agent Memory, & Context Engineering.

This diagram from Generative Agents: Interactive Simulacra of Human Behavior (J.S. Park, J.C. O’Brien, C.J. Cai, M.R. Morris, P. Liang, M.S. Bernstein) gives a good overview of a much richer definition of an Agent’s implementation. The additional concepts include memory (discussed briefly above), planning, and reflection. Also check out Paul Iusztin’s talk from QCon London 2025 on The Data Backbone of LLM Systems. Around the 35-minute mark he goes into some depth around Agent architectures.

Just as you can build computer systems as monoliths (everything done in one place) or microservices (multiple programs, each responsible for a discrete operation or domain), you can also have one big Agent trying to do everything (probably not such a good idea), or individual Agents, each good at their particular thing, that are then hooked together into what’s known as a Multi-Agent System (MAS). Sean Falconer’s family meal planning demo is a good example of a MAS. One Agent plans the kids’ meals, one the adults’ meals, another combines the two into a single plan, and so on.

Human-in-the-loop (HITL) is a term you’ll come across, referring to the fact that Agents might be pretty good, but they’re not infallible. In the travel booking example above, do we really trust the Agent to book the best holiday for us? Almost certainly we’d want—at a minimum—the option to sign off on the booking before it goes ahead and sinks £10k on an all-inclusive trip to Bognor Regis.
Then again, we’re probably happy enough for an Agent to access our calendars without asking permission, and whether it needs permission or not to create a meeting is up to us and how much we trust it. When it comes to coding, having an Agent write code, test it, fix the broken tests, compare it to a spec, and iterate is really neat. On the other hand, letting it decide to run …less so 😅.

Every time an Agent requires HITL, it reduces its autonomy and/or responsiveness to situations. As well as simply using smarter models that make fewer mistakes, there are other things that an Agent can do to reduce the need for HITL, such as using guardrails to define acceptable parameters. For example, an Agent is allowed to book travel but only up to a defined threshold. That way the user gets to trade off convenience (no HITL) with risk (unintended first-class flight to Hawaii).

📃 Generative Agents: Interactive Simulacra of Human Behavior
🎥 Paul Iusztin - The Data Backbone of LLM Systems - QCon London 2025
📖 Antonio Gulli - Agentic Design Patterns
📖 Sean Falconer - https://seanfalconer.medium.com/

Dayvster 1 month ago

Is Odin Just a More Boring C?

## Why I Tried Odin

### Background

My recent posts have been diving deep into Zig and C, a shift from my earlier focus on React and JavaScript. This isn’t a pivot but a return to my roots. I started programming at 13 with C and C++, and over the years I’ve built a wide range of projects in systems programming languages like C, C++, Rust, and now Zig. From hobby experiments and custom Linux utilities to professional embedded systems work (think vehicle infotainment, tracking solutions, and low-level components), I’ve always been drawn to the power and precision of systems programming. Alongside this, I’ve crafted tools for my own environment and tackled plenty of backend engineering, blending my full-stack expertise with a passion for low-level control.

### Why Odin Caught My Eye

I, like many others, initially dismissed Odin as that language that was primarily intended for game development. It took me a moment, or should I say many moments, to realize just how stupid that notion was. Because let’s analyze what game development actually means: it means building complex systems that need to be efficient, performant and reliable. It means working with graphics, physics, input handling, networking and more. It means dealing with concurrency, memory management and low-level optimizations. In other words, game development is a perfect fit for a systems programming language like Odin.

So basically, if it’s intended for game development, then it should be a great fit for general systems programming and desktop applications, and since game dev usually means manual memory management without a garbage collector, it should also be possible to some extent to use it for embedded systems.

So after I gave myself a good slap on the forehead for being a bit of an idiot, I decided why not give Odin a fair shot and build something useful with it.

## The Project

Now, I may have been a bit liberal with the word useful there; what I actually decided to build was something that I usually like to build whenever I wanna try out a new language, namely a tiny key-value store with a pub/sub system. It won’t win any awards for originality, and I’m pretty sure the folks over at Redis aren’t exactly shaking in their boots. It is the most basic, most barebones implementation of both, lacking any real useful features that would make it usable in a production environment. But it is a good exercise in understanding the language and its capabilities, mainly because it involves a few different aspects of programming that are relevant to systems programming: data structures, memory management, concurrency and networking. And even if you create something as basic and lacking as I have in this example, you still have room for experimentation and exploration to add more features.

### Building a Tiny KV Store With Pub/Sub

My initial minimal proof of concept was simple and straightforward.
```odin
package main

import "core:fmt"
import "core:time"

KVStore :: struct {
	store: map[string]string,
}

kvstore_init :: proc() -> KVStore {
	return KVStore{store = map[string]string{}}
}

kv_put :: proc(kv: ^KVStore, key: string, value: string) {
	kv.store[key] = value
}

kv_get :: proc(kv: ^KVStore, key: string) -> string {
	if value, ok := kv.store[key]; ok {
		return value
	}
	return ""
}

PubSub :: struct {
	subscribers: map[string][]proc(msg: string),
}

pubsub_init :: proc() -> PubSub {
	return PubSub{subscribers = map[string][]proc(msg: string){}}
}

subscribe :: proc(ps: ^PubSub, topic: string, handler: proc(msg: string)) {
	if arr, ok := ps.subscribers[topic]; ok {
		new_arr := make([]proc(msg: string), len(arr)+1);
		for i in 0..<len(arr) {
			new_arr[i] = arr[i];
		}
		new_arr[len(arr)] = handler;
		ps.subscribers[topic] = new_arr;
	} else {
		ps.subscribers[topic] = []proc(msg: string){handler};
	}
}

publish :: proc(ps: ^PubSub, topic: string, msg: string) {
	if handlers, ok := ps.subscribers[topic]; ok {
		for handler in handlers {
			handler(msg);
		}
	}
}

kv: KVStore;

main :: proc() {
	kv = kvstore_init();
	ps := pubsub_init();

	handler1 :: proc(msg: string) {
		fmt.println("Sub1 got:", msg);
		kv_put(&kv, "last_msg", msg);
	}
	handler2 :: proc(msg: string) {
		fmt.println("Sub2 got:", msg);
	}
	handler3 :: proc(msg: string) {
		fmt.println("Sub3 got:", msg);
	}

	subscribe(&ps, "demo", handler1);
	subscribe(&ps, "demo", handler2);
	subscribe(&ps, "demo", handler3);

	publish(&ps, "demo", "Welcome to dayvster.com");
	time.sleep(2 * time.Second);
	publish(&ps, "demo", "Here's another message after 2 seconds");

	last := kv_get(&kv, "last_msg");
	fmt.println("Last in kvstore:", last);
}
```

As you can see, it currently lacks any real error handling, concurrency and persistence, but it does demonstrate the basic functionality of a key-value store with pub/sub capabilities.

What I have done is create two main structures, `KVStore` and `PubSub`. The `KVStore` structure contains a map to store key-value pairs and provides functions to put and get values. The `PubSub` structure contains a map of subscribers for different topics and provides functions to subscribe to topics and publish messages. The `main` function initializes the key-value store and pub/sub system, defines a few handlers for incoming messages, subscribes them to a topic, and then publishes some messages to demonstrate the functionality.

From this basic example we’ve explored how to handle memory management in Odin, how to work with data structures like maps and slices, and how to define and use procedures.

### Memory Management

Like C and Zig, Odin employs manual memory management, but it offers user-friendly utilities to streamline the process, much like Zig, in contrast to C’s more rudimentary approach. For instance, the `make` function in Odin enables the creation of slices with a defined length and capacity, akin to Zig’s slice allocation. In the code above, `make([]proc(msg: string), len(arr)+1)` generates a slice of procedure pointers with a length of `len(arr)+1`. Essentially, it allocates memory on the heap and returns a slice header, which includes a pointer to the allocated memory, along with the length and capacity of the slice.

**But how and when is that memory freed?**

In this code, memory allocated by `make` (e.g., for the slice in `subscribe`) and for maps (e.g., `kv.store` and `ps.subscribers`) is not explicitly freed. Since this is a short-lived program, the memory is reclaimed by the operating system when the program exits.
However, in a long-running application, you’d need to use Odin’s `delete` procedure to free slices and maps explicitly. For example:

```odin
kvstore_deinit :: proc(kv: ^KVStore) {
	delete(kv.store);
}

pubsub_deinit :: proc(ps: ^PubSub) {
	for topic, handlers in ps.subscribers {
		delete(handlers);
	}
	delete(ps.subscribers);
}
```

So let’s add that in the `main` function before it exits to ensure we clean up properly:

```odin
// ... existing code ...
main :: proc() {
	// ... existing code ...
	pubsub_deinit(&ps);
	kvstore_deinit(&kv);
} // end of main
```

Well, would you look at that: we just added proper memory management to our tiny KV store with pub/sub system, and all it took was a few lines of code. I’m still a huge fan of C, but this does feel nice and clean, not to mention really readable and easy to understand. Is our code now perfect and fully memory safe? Not quite; it still needs error handling and thread safety (way later) for production use, but it’s a solid step toward responsible memory management.

### Adding Concurrency

To make our pub/sub system more realistic, we’ve introduced concurrency to the `publish` procedure using Odin’s `core:thread` library. This allows subscribers to process messages simultaneously, mimicking real-world pub/sub behavior. Since `handler1` modifies `kv.store` via `kv_put`, we’ve added a mutex to `KVStore` to ensure thread-safe access to the shared map. Here’s how it works:

- **Concurrent Execution with Threads**: The `publish` procedure now runs each handler in a separate thread created with `thread.create`. Each thread receives the handler and message via `t.user_args`, and `thread.start` kicks off execution. Threads are collected in a dynamic array (`threads`), which is cleaned up using `defer delete(threads)`. The `thread.join` call ensures the program waits for all threads to finish, and `thread.destroy` frees thread resources. This setup enables `handler1`, `handler2`, and `handler3` to process messages concurrently, with output order varying based on thread scheduling.
- **Thread Safety with Mutex**: Since `handler1` updates `kv.store` via `kv_put`, concurrent access could lead to race conditions, as Odin’s maps aren’t inherently thread-safe. To address this, a `sync.Mutex` is added to `KVStore`. The `kv_put` and `kv_get` procedures lock the mutex during map access, ensuring only one thread modifies or reads `kv.store` at a time. The mutex is initialized in `kvstore_init` and destroyed in `kvstore_deinit`.

```odin
publish :: proc(ps: ^PubSub, topic: string, msg: string) {
	if handlers, ok := ps.subscribers[topic]; ok {
		threads := make([dynamic]^thread.Thread, 0, len(handlers))
		defer delete(threads)

		// Allocate ThreadArgs for each handler
		thread_args := make([dynamic]^ThreadArgs, 0, len(handlers))
		defer {
			for args in thread_args {
				free(args)
			}
			delete(thread_args)
		}

		for handler in handlers {
			msg_ptr := new(string)
			msg_ptr^ = msg
			t := thread.create(proc(t: ^thread.Thread) {
				handler := cast(proc(msg: string)) t.user_args[0]
				msg_ptr := cast(^string) t.user_args[1]
				handler(msg_ptr^)
				free(msg_ptr)
			})
			t.user_args[0] = rawptr(handler)
			t.user_args[1] = rawptr(msg_ptr)
			thread.start(t)
			append(&threads, t)
		}

		for t in threads {
			thread.join(t)
			thread.destroy(t)
		}
	}
}
```

This implementation adds concurrency by running each handler in its own thread, allowing parallel message processing. The mutex ensures thread safety for `kv.store` updates in `handler1`, preventing race conditions. Odin’s `core:thread` library simplifies thread management, offering a clean, pthread-like experience.
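The mutex-guarded `KVStore` described in the bullet points above isn’t shown in the post as captured here, so here is a minimal sketch of what those changes could look like, assuming `core:sync`’s `Mutex` with `mutex_lock`/`mutex_unlock`. Treat it as a reconstruction, not the author’s exact code:

```odin
import "core:sync"

// KVStore with a mutex guarding the map (reconstruction).
KVStore :: struct {
	store: map[string]string,
	mutex: sync.Mutex,
}

kv_put :: proc(kv: ^KVStore, key: string, value: string) {
	sync.mutex_lock(&kv.mutex)
	defer sync.mutex_unlock(&kv.mutex)
	kv.store[key] = value
}

kv_get :: proc(kv: ^KVStore, key: string) -> string {
	sync.mutex_lock(&kv.mutex)
	defer sync.mutex_unlock(&kv.mutex)
	if value, ok := kv.store[key]; ok {
		return value
	}
	return ""
}
```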
Odin’s threading feels a bit like C’s pthreads but without the usual headache, and it’s honestly a breeze to read and write. For this demo, the mutex version keeps everything nice and tidy. However, in a real application you’d still want to consider more robust error handling, possibly a thread pool for efficiency, and also some way to handle thread lifecycle and errors and so on…

## Adding Persistence

I haven’t added persistence to this code-block personally, because I feel that would quickly spiral the demo that I wanted to keep simple and focused into something much more complex. But if you wanted to add persistence, you could use Odin’s `core:file` library to read and write the `kv.store` map to a file. You would need to serialize the map to a string format (like `JSON` or `CSV`) when saving and deserialize it when loading. Luckily, Odin has `core:encoding/json` and `core:encoding/csv` libraries that can help with this, which should at the very least make that step fairly trivial. So if you feel like it, give it a shot and let me know how it goes. Do note that this step is a lot harder than it may seem, especially if you want to do it properly and performantly.

## Now to Compile and Run

Now here’s the thing: the first time I ran `odin build .` I thought I messed up somewhere, because it basically took a split second and produced no output, no warnings, no nothing. But I did see that a binary was produced, named after the folder I was in. So I ran it:

```bash
❯ ./kvpub
Sub1 got: Welcome to dayvster.com
Sub2 got: Welcome to dayvster.com
Sub3 got: Welcome to dayvster.com
Sub1 got: Here's another message after 2 seconds
Sub2 got: Here's another message after 2 seconds
Sub3 got: Here's another message after 2 seconds
Last in kvstore: Here's another message after 2 seconds
```

And there you have it: a tiny key-value store with pub/sub capabilities built in Odin. That compiled bizarrely fast; in fact, I used a util ([pulse](https://github.com/dayvster/pulse)) I wrote to benchmark processes and their execution time, and it clocked in at a blazing 0.4 seconds to compile:

```bash
❯ pulse --benchmark --cmd 'odin build .' --runs 3
┌──────────────┬──────┬─────────┬─────────┬─────────┬──────────┬────────────┐
│ Command      ┆ Runs ┆ Avg (s) ┆ Min (s) ┆ Max (s) ┆ Max CPU% ┆ Max RAM MB │
╞══════════════╪══════╪═════════╪═════════╪═════════╪══════════╪════════════╡
│ odin build . ┆ 3    ┆ 0.401   ┆ 0.401   ┆ 0.401   ┆ 0.00     ┆ 0.00       │
└──────────────┴──────┴─────────┴─────────┴─────────┴──────────┴────────────┘
```

Well, I couldn’t believe that, so I ran it again, this time with `--runs 16` to get a better average, and it still came in at a very respectable `0.45` (max) seconds. **OK, that is pretty impressive.** But that consistent? Maybe my tool is broken? I’m not infallible after all. So I re-confirmed it with `hyperfine`, and it came out at:

```bash
❯ hyperfine "odin build ."
Benchmark 1: odin build .
  Time (mean ± σ):     385.1 ms ±  12.5 ms    [User: 847.1 ms, System: 354.6 ms]
  Range (min … max):   357.3 ms … 400.1 ms    10 runs
```

God damn, that is fast. Now, I know the program is tiny and simple, but still, that is impressive, and it makes me wonder how it would handle a larger codebase. Please, if you have any feedback or insights on this, let me know, I am really curious. Just for sanity’s sake I also ran `time odin build .` and it came out at, you’ve guessed it, `0.4` seconds.

### Right, so it’s fast, but how’s the experience?

Well, I have to say it was pretty smooth overall.
The compiler is fast, and the error messages are generally clear and helpful, if perhaps a bit… verbose for my taste.

**For example**, I’ve intentionally introduced a simple typo in the `map` keyword and named it `masp` to showcase what I mean:

```bash
❯ odin build .
/home/dave/Workspace/TMP/odinest/main.odin(44:31) Error: Expected an operand, got ]
	subscribers: masp[string][]proc(msg: string),
	                            ^
/home/dave/Workspace/TMP/odinest/main.odin(44:32) Syntax Error: Expected '}', got 'proc'
	subscribers: masp[string][]proc(msg: string),
	                             ^
/home/dave/Workspace/TMP/odinest/main.odin(44:40) Syntax Error: Expected ')', got ':'
	subscribers: masp[string][]proc(msg: string),
	                                     ^
/home/dave/Workspace/TMP/odinest/main.odin(44:41) Syntax Error: Expected ';', got identifier
	subscribers: masp[string][]proc(msg: string),
	                                      ^
```

I chose this specific typo because I wanted to showcase how Odin handles errors when you try to build. It could simply say `Error: Unknown type 'masp'`, but instead it goes on to produce 4 separate errors that all stem from the same root cause. This is obviously because the parser gets confused and can’t make sense of the code anymore, so essentially you get every single error that results from the initial mistake, even if they are on the same line. Now, would I love to see them condensed into a single error message, because they stem from the same line and the same root cause? Yes, I would. But that’s just my personal preference.

## Where Odin Shines

### Simplicity and Readability

Odin kinda feels like a modernized, somehow even more boring C, but in the best way possible. It’s simple, straightforward and easy to read. It does not try to have some sort of clever syntax or fancy features; it really feels like a no-nonsense, no-frills language that wants you to start coding and being productive as quickly as possible. In fact, this brings me to my next point.

### The Built-in Libraries Galore

I was frankly blown away with just how much is included in the standard and vendored (more on that later) libraries. I mean, it has everything you’d expect from a modern systems programming language, but it also comes with a ton of complete data structures, algorithms and utilities that you would usually have to rely on third-party libraries for in C or even Zig. For more info just look at [Odin's Core Library](https://pkg.odin-lang.org/core/), and I mean really look at it and read it, do not just skim it. Here’s an example: [flags](https://pkg.odin-lang.org/core/flags/), which is a complete command-line argument parser, or even [rbtree](https://pkg.odin-lang.org/core/container/rbtree/), which is a complete implementation of a red-black tree data structure that you can just import and use right away.

But what really blew me away was…

### The Built-in Vendor Libraries / Packages

Odin comes with a set of vendor libraries that basically give you useful bindings to stuff like `SDL2/3`, `OpenGL`, `Vulkan`, `Raylib`, `DirectX` and more. This is really impressive because it means you can start building games or graphics applications right away without having to worry about setting up bindings or dealing with C interop. Now, I’m not super sure if these vendor bindings are all maintained and created by the Odin team; from what I could gather so far it would certainly seem so, but I could be wrong. If you know more about this, please let me know. But all that aside, these bindings are really well done and easy to use.
For example, here’s how you can create a simple window with SDL2 in Odin:

```odin
package main

import sdl "vendor:sdl2"

main :: proc() {
	sdl.Init(sdl.INIT_VIDEO)
	defer sdl.Quit()

	window := sdl.CreateWindow(
		"Odin SDL2 Black Window",
		sdl.WINDOWPOS_CENTERED,
		sdl.WINDOWPOS_CENTERED,
		800, 600,
		sdl.WINDOW_SHOWN,
	)
	defer sdl.DestroyWindow(window)

	renderer := sdl.CreateRenderer(window, -1, sdl.RENDERER_ACCELERATED)
	defer sdl.DestroyRenderer(renderer)

	event: sdl.Event
	running := true
	for running {
		for sdl.PollEvent(&event) {
			if event.type == sdl.EventType.QUIT {
				running = false
			}
		}
		sdl.SetRenderDrawColor(renderer, 0, 0, 0, 255)
		sdl.RenderClear(renderer)
		sdl.RenderPresent(renderer)
	}
}
```

This code creates a simple window with a black background using SDL2. It’s pretty straightforward and easy to understand, especially if you’re already familiar with SDL2 or SDL3.

### C Interop

Odin makes it trivially easy to interop with C libraries, as long as you have the library file to link against. This is done via `foreign import`, where you create an import name and link it to the library file, and `foreign` blocks, where you declare the individual functions or types. I could explain it with examples here, but Odin’s own documentation does a way better job, and skipping it will keep this post from getting even longer than it already is. So please check out [Odin's C interop](https://odin-lang.org/news/binding-to-c/) documentation for more info.

## Where Odin Feels Awkward

### Standard Library Gaps

While Odin’s standard library is quite comprehensive, there are still some gaps and missing features that can make certain tasks more cumbersome. For example, while it has basic file I/O capabilities, it lacks more advanced features like file watching or asynchronous I/O. Additionally, while it has a decent set of data structures, it lacks some more specialized ones like tries or bloom filters; I’d also love to see a B+ tree implementation in the core library. But those are at most nitpicks, and finding third-party libraries or writing your own implementations is usually straightforward. However…

### No Package Manager

I really like languages that come with their own package manager; it makes it so much easier to discover, install and manage third-party libraries / dependencies. Odin currently lacks a built-in package manager, which means you have to manually download and include third-party libraries in your projects. This can be a bit of a hassle, especially, I’d imagine, for larger projects with multiple dependencies.

### Smaller Nitpicks

- **dir inconsistencies**: I love how it auto-named my binary after the folder I was in, but I wish it did the same whenever I ran `odin run` and `odin build`. I had to explicitly specify `odin run .` and `odin build .`, which felt a bit inconsistent to me: if it knows the folder we are in, why not just use that as the default when we want to tell it to run or build in the current directory?
- **Error messages**: As mentioned earlier, while Odin’s error messages are generally clear, they can sometimes be overly verbose, especially when multiple errors stem from a single root cause. It would be nice to see more concise error reporting in such cases. To fix this, I’d love to either see error messages collapsed into a single message with an array of messages from the same line, or somehow grouped together into blocks.

### Pointers are ^ and not *

I’m on a German keyboard, and the `^` character is a bit of a pain to type, especially when compared to the `*` character, which is right next to the `Enter` key on my keyboard.
I get that Odin wants to differentiate itself from C and C++, but this small change feels unnecessary and adds a bit of friction to the coding experience.

These are, as the title says, just minor nitpicks, and they in no way detract from the overall experience of using Odin; just minor annoyances that I personally had while using the language. Your experience may differ vastly, and none of these may even bother you.

## So is Odin just a More Boring C?

In a way, yes, kind of. I mean, it’s very similar in approach and philosophy, but with more “guard rails” and helpful utilities to make the experience smoother and more enjoyable, and what I so far assume are first-party bindings to popular libraries via the vendor packages really make it stand out in a great way: you get a lot more consistency and predictability than you would if you were to use C with those same libraries. And I guess that’s the strength of Odin: it’s so boring that it just lets you be a productive programmer without getting in your way or trying to be too clever or fancy.

I use boring here in an affectionate way. If you’ve ever read any of my other posts, you’ll know that I do not appreciate complexity and unnecessary cleverness in programming, which is why, I suppose, I’m often quite critical of Rust, even though I do like it for certain use cases. In this case, I’d say Odin is very similar to Go: both are fantastic boring languages that let you get stuff done without much fuss or hassle. The only difference is that Go decided to ship with a garbage collector and Odin did not, which, honestly, for me personally makes Odin vastly more appealing.

### Syntax and Ergonomics

Odin’s syntax is like C with a modern makeover: clean, readable, and less prone to boilerplate. It did take me quite a while to get used to replacing my muscle memory for `*` with `^` for pointers, and `func`, `fn`, `fun`, `function` with `proc` for functions. But once I got over that initial hump, it felt pretty natural. Also, `::` for type declarations is a bit unusual and took me longer than I care to admit, as I’m fairly used to `::` being used for scope resolution in languages like C++ and Rust. But again, once I got used to it, it felt fine. Everything else about the syntax felt pretty intuitive and straightforward.

## Who Odin Might Be Right For

### Ideal Use Cases

- **Game Development**: Honestly, I totally see where people are coming from when they say Odin is great for game development. The built-in vendor libraries for SDL2/3, OpenGL, Vulkan, Raylib and more make it super easy to get started with game development. Plus, the language’s performance and low-level capabilities are a great fit for the demands of game programming.
- **Systems Programming**: Odin’s manual memory management, low-level access, and performance make it a solid choice for systems programming tasks like writing operating systems, device drivers, or embedded systems. I will absolutely be writing some utilities for my Linux setup in Odin in the near future.
- **Desktop Applications**: Again, this is where those vendor libraries shine, making it easy to build cross-platform desktop applications with graphical interfaces, as long as you’re fine with doing some manual drawing of components. I’d love to see a binding for something like `GTK` or `Qt` in the vendor packages in the future.
- **General Purpose Programming**: This brings me back to my intro, where I said that it took me a while to realize that if Odin is good for game development, then realistically it should basically be good for anything and everything you wish to create with it. So yeah, give it a shot and make something cool with it.

### Where It’s Not a Good Fit Yet

- **Web Development**: The net library is pretty darn nice and very extensive; however, it does seem like it’s maybe a bit more fit for general networking tasks than for simplifying your life as a web backend developer. I’m sure there are already a bunch of third-party libraries for this, but if you’re a web dev, you are almost spoiled for choice at the moment by languages that support web development out of the box with all the fixings and doodads.

## Final Thoughts

### Would I Use It Again?

Absolutely; in fact, I will. I’ve already started planning some small utilities for my Linux setup in Odin. I really like the simplicity and readability of the language, as well as the comprehensive standard and vendor libraries. The performance is also impressive, especially the fast compile times.

### Source Code and Feedback

You can find the complete source code for the tiny key-value store with pub/sub capabilities on my GitHub: [dayvster/odin-kvpub](https://github.com/dayvster/odin-kvpubsub)

If you create anything cool with it, I’d love to see it, so do hit me up on any of my socials. I’d love to hear your thoughts and experiences with Odin, whether you’ve used it before or are considering giving it a try. Feel free to leave a comment or reach out to me on Twitter [@dayvster](https://twitter.com/dayvsterdev). Appreciate the time you took to read this post, and happy coding!

Nick Khami 2 months ago

Use the Accept Header to serve Markdown instead of HTML to LLMs

Agents don't need to see websites with markup and styling; anything other than plain Markdown is just wasted money spent on context tokens. I decided to make my Astro sites more accessible to LLMs by having them return Markdown versions of pages when the Accept header asks for Markdown ahead of HTML. This was very heavily inspired by this post on X from bunjavascript. Hopefully this helps SEO too, since agents are a big chunk of my traffic. The Bun team reported a 10x token drop for Markdown, and frontier labs pay per token, so cheaper pages should get scraped more, be more likely to end up in training data, and give me a little extra lift from assistants and search.

Note: You can check out the feature live by requesting this page with a Markdown-preferring Accept header in your terminal.

Static site generators like Astro and Gatsby already generate a big folder of HTML files, typically in a `dist/` or `public/` folder through a build command. The only thing missing is a way to convert those HTML files to Markdown. It turns out there's a great CLI tool for this called html-to-markdown that can be installed and run during a build step. An LLM wrote me a quick Bash script to convert all the HTML files in the build output to Markdown files in a parallel folder, preserving the directory structure (a reconstructed sketch appears at the end of this section).

Once you have the conversion script in place, the next step is to make it run as a post-build action by modifying the scripts section of your `package.json` (also sketched below).

Moving all HTML files to a separate directory first is only necessary if you're using Cloudflare Workers, which will serve existing static assets before falling back to your Worker. If you're using a traditional reverse proxy, you can skip that step and just convert directly from the HTML directory to the Markdown one. Note: I learned after I finished the project that I could have added `run_worker_first` to my Wrangler config so I didn't have to move any files around. That field forces the worker to always run first. Shoutout to the kind folks on Reddit for telling me.

I pushed myself to go out of my comfort zone and learn Cloudflare Workers for this project since my company uses them extensively. If you're using a traditional reverse proxy like Nginx or Caddy, you can skip this section (and honestly, you'll have a much easier time). If you're coming from traditional reverse proxy servers, Cloudflare Workers force you into a different paradigm. What would normally be a simple Nginx or Caddy rule becomes custom configuration, moving your entire site to a shadow directory so Cloudflare doesn't serve static assets by default, and writing JavaScript to manually check headers and serve files. SO MANY STEPS TO MAKE A SIMPLE FILE SERVER!

This experience finally made Next.js 'middleware' click for me. It's not actually middleware in the traditional sense of a REST API; it's more like 'use this where you would normally have a real reverse proxy.' Both Cloudflare Workers and Next.js Middleware are essentially JavaScript-based reverse proxies that intercept requests before they hit your application. While I'd personally prefer Terraform with a hyperscaler or a VPS for a more traditional setup, new startups love this pattern, so it's worth understanding.

The Wrangler config needs to refer to the new worker script and also bind your build output directory as a static asset namespace, and the worker script itself inspects the Accept header and serves Markdown when requested, otherwise falling back to HTML (hedged reconstructions of both follow at the end of this section).

Pro tip: make the root path serve your sitemap.xml instead of Markdown content for your homepage, such that an agent visiting your root URL can see all the links on your site.
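The code blocks referenced above didn't survive into this page, so here are hedged reconstructions. First, the conversion script: this assumes the `html2markdown` CLI from the html-to-markdown project reads HTML on stdin and writes Markdown to stdout, and assumes `dist/` and `dist-md/` as folder names (both are my assumptions, not the post's exact code):

```bash
#!/usr/bin/env bash
# Convert every HTML file under dist/ into a Markdown file under dist-md/,
# preserving the directory structure.
set -euo pipefail

src="dist"
dest="dist-md"

find "$src" -type f -name '*.html' | while read -r file; do
  rel="${file#"$src"/}"          # path relative to dist/
  out="$dest/${rel%.html}.md"    # swap the .html suffix for .md
  mkdir -p "$(dirname "$out")"
  html2markdown < "$file" > "$out"
done
```

Hooking it into the build could look like this in `package.json` (npm runs a `post<name>` script automatically after `<name>`; the script filename is made up):

```json
{
  "scripts": {
    "build": "astro build",
    "postbuild": "bash ./scripts/html-to-md.sh"
  }
}
```

A Wrangler config along the lines the post describes, binding the build output as a static asset namespace (names, paths, and dates are illustrative):

```toml
# wrangler.toml (sketch)
name = "personal-site"
main = "src/worker.js"
compatibility_date = "2025-01-01"

[assets]
directory = "./dist"
binding = "ASSETS"
# run_worker_first = true  # the field the post mentions, which avoids
#                          # moving files out of the asset directory
```

And a minimal worker sketch: if the Accept header prefers Markdown, rewrite the path into the Markdown asset tree and serve that; otherwise fall back to HTML. The `/md/` prefix and path layout are assumptions:

```js
// Sketch of the worker, not the post's exact code.
export default {
  async fetch(request, env) {
    const accept = request.headers.get("Accept") || "";
    const url = new URL(request.url);

    if (accept.includes("text/markdown") || accept.includes("text/plain")) {
      // Map /foo/ or /foo.html onto the converted Markdown tree.
      let path = url.pathname.endsWith("/")
        ? url.pathname + "index"
        : url.pathname.replace(/\.html$/, "");
      const mdUrl = new URL("/md" + path + ".md", url);
      const mdResponse = await env.ASSETS.fetch(new Request(mdUrl, request));
      if (mdResponse.ok) {
        return new Response(mdResponse.body, {
          headers: { "content-type": "text/markdown; charset=utf-8" },
        });
      }
    }
    return env.ASSETS.fetch(request); // fall back to the HTML assets
  },
};
```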
It's likely much easier to set this system up with a traditional reverse-proxy file server like Caddy or Nginx. A simple Caddyfile can do the same thing with a matcher on the Accept header (a sketch appears at the end of this post). I will leave the Nginx configuration as an exercise for the reader, or perhaps the reader's LLM of choice.

By serving lean, semantic Markdown to LLM agents, you can achieve a 10x reduction in token usage while making your content more accessible and efficient for the AI systems that increasingly browse the web. This optimization isn't just about saving money; it's about GEO (Generative Engine Optimization) for a changed world where millions of users discover content through AI assistants. Astro's flexibility made this implementation surprisingly straightforward. It only took me a couple of hours to get both the personal blog you're reading now and patron.com to support this feature.

If you're ready to make your site agent-friendly, I encourage you to try this out. For a fun exercise, copy this article's URL and ask your favorite LLM to "Use the blog post to write a Cloudflare Worker for my own site." See how it does! You can also check out the source code for this feature at github.com/skeptrunedev/personal-site to get started. I'm excited to see the impact of this change on my site's analytics and hope it inspires others. If you implement this on your own site, I'd love to hear about your experience! Connect with me on X or LinkedIn.
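Here is the Caddyfile sketch promised above (my reconstruction, not the author's config). It matches requests whose Accept header mentions text/markdown and serves them from a converted-Markdown directory; the site address and paths are placeholders:

```
example.com {
	# Requests whose Accept header mentions text/markdown
	@wantsMarkdown header Accept *text/markdown*

	handle @wantsMarkdown {
		root * /srv/site-md
		try_files {path} {path}.md {path}/index.md
		file_server
	}

	handle {
		root * /srv/site
		file_server
	}
}
```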


NixCon 2025 Trip Report 🐝

I liked the NixOS meetup earlier this year, and at the end of the meetup they told everyone about NixCon 2025, which would be happening in Switzerland this year, at the very same location, the University Of Applied Sciences OST in Rapperswil, so I decided to go! In this trip report, I want to give you a rough impression of how I experienced this awesome conference :) The bee in the title is a NixCon inside joke ;)

I arrived at about 09:30 on a rainy Friday morning, meaning I hurried from the train station into OST building 1 to show my ticket QR code and pick up my conference badge and custom name tag that I pre-ordered. The custom ones have your name engraved and come with a strong magnet to attach them to your clothes.

After grabbing a bite to eat, I headed to the main lecture hall for the opening session. Prof. Dr. Farhad Mehta from OST, as well as the entire NixCon orga team, welcomed the 450 registered attendees to the 10th NixCon! I recognized many familiar faces from the Nix meetup, but many hands went up when the audience was asked for whom it was the first time at NixCon, or in Switzerland in general. I want to thank Prof. Mehta in particular for making possible such meetups and events! 👏 If you work at a university, school or other organisation that has access to rooms, consider offering to host a meetup (on a regular basis, or even just once)! Locations are always hard to find, so offering a space is a great contribution to Open Source.

The first technical talk of the day was “What if GitHub Actions were local-first and built using Nix?” by Domen Kožar, the person behind cachix.org, which is a hosted Nix cache. The talk pitched cloud.devenv.sh, which is a Nix-based CI solution (like GitHub Actions) using devenv. By using this solution, you solve the problem that you can’t easily / completely run GitHub Actions locally (yes, we all know about act), and you get to (?) write Nix configs instead of YAML configs. The solution seems nice, but I found the talk a little unstructured because the presenter jumped around between slides so much. One crucial question was left unanswered: How do you integrate this custom solution with your GitHub projects? To me, diverging from the default way of configuring GitHub Actions does not seem worth it for my projects. YMMV. → watch the recording (46 minutes) on media.ccc.de

Next up: “Rewriting the Hydra Queue Runner in Rust” by Simon Hauser from Helsinki Systems, a small German software company. Hydra is the component in the NixOS infrastructure which schedules builds: when nixpkgs changes, this is the component that runs the build whose result ends up on cache.nixos.org (the Debian equivalent is buildd). Simon explained that bottlenecks in the current queue runner result in stranding of infrastructure: the project has machines available that it cannot use fully. He outlined how they replaced a crufty SSH-based automation with a well-designed gRPC protocol. I got the impression that a group of people was involved in developing and reviewing this design, which is a great sign for a healthy project. One thing that was unfortunately missing from the talk were metrics. It would have been great to see a few graphs that illustrate just how much better the rewritten queue runner is. Currently, the new queue runner is already used for Nix Community builds, but not yet in production for NixOS itself. Hopefully soon, though!
→ watch the recording (27 minutes) on media.ccc.de

This talk was presented by Zach Mitchell from Flox, which is a Nix-based dev environment solution. Thus far, I use standard development shells (see Development shells with Nix: four quick examples), so I was curious what I’d learn from this talk. Zach explained that both of those tools were originally written to debug Nix package builds, not to provide general-purpose development environments. For users, this manifests as not being able to use your favorite shell — only Bash is supported. One might read about workarounds, but they don’t really work, because then the shell’s RC files run after Nix setup, possibly destroying parts of the setup. One interesting thing I learnt is that the Nix garbage collector scans running processes to avoid removing Nix store paths that they still need. Zach mentioned https://github.com/zmitchell/proctrace, which is a bpftrace-based profiler that tracks forks/execs and generates gantt chart syntax of the timing. Sounds cool, but is unfortunately broken right now…? Too bad. → watch the recording (45 minutes) on media.ccc.de

In this fireside chat, Tarus Balog shared how he ended up at AWS after 20 years of Open Source, and how his team wants to give back to the community. One specific way in which they’re doing that is by hosting cache.nixos.org. → watch the recording (24 minutes) on media.ccc.de

Josh Heinrichs from Shopify shared how they adopted Nix (again!), and I think real-world enterprise adoption stories like these are very interesting. In summary, Shopify had an internal command (since 2016), which offered declarative configuration and then dispatched to platform-specific tooling on Linux or macOS. In the first attempt to move to Nix, the effort didn’t reach stable footing (some folks couldn’t use it yet), and then a company-wide shift to cloud development happened, where the easier solution was to “just use ubuntu”. A few years in, folks are apparently not so happy with the cloud development environments, and one day Shopify CEO Tobias Lütke finds devenv, which is a Nix-based solution that is remarkably similar to Shopify’s own tool. So Tobi adopts devenv for one of their services and becomes supportive of using Nix. This time around, they spend a lot more time on a successful rollout within the organization, meaning incremental adoption, getting all stakeholders on board, etc. The takeaway is that one specific, well-supported use-case can be the adoption driver. And once you have your development environments on a Nix-based solution, you can more easily adopt other parts of the ecosystem as well. → watch the recording (19 minutes) on media.ccc.de

In a similar spirit to the Shopify talk, Kavisha Kumar from ASML shared how she got into Nix after seeing a colleague use Nix to obtain a clean development shell. Kavisha spent a lot of time at ASML to teach others about why and how to use Nix. She shared a number of nice metaphors that explained Nix concepts through the subject area of video gaming. I think many people are excited about Nix, but have trouble conveying that excitement to others. Kavisha showed us a good way that worked for her. → watch the recording (19 minutes) on media.ccc.de

The rest of the day was filled with lightning talks. Cole Mickens from Determinate Systems explained what features they are currently shipping in their downstream distribution “Determinate Nix” (features will be upstreamed): lazy trees (a performance optimization for evaluating Flakes), parallel evaluation (brings evaluation times down from 16s to 7s) and a native Linux builder for mac.
Next up are Flake Schemas, which I haven’t read about yet.

Yvan Sraka from Numtide, a Nix and DevOps consultancy, showed how he manages Linux machines for friends and family with NixOS. He has his own configuration layer on top of NixOS and only uses the system as a base. Most actual programs are used through AppImage, Flatpaks, envfs and nix-ld. The latter two are solutions to use FHS-based programs (those that expect the usual FHS locations to be present) on non-FHS systems like NixOS. I had heard of nix-ld before, but not of envfs.

Jacek Galowicz from Nixcademy showed how to use systemd-sysupdate and systemd-repart to implement A/B-style updates with NixOS and systemd. It’s great to see that this technique is more and more mainstream, as I am also using A/B-style updating successfully in gokrazy.

The weather on Saturday was a lot better, so I made sure to get a seat with a view of Lake Zürich.

In this talk, Silvan Mosberger from Tweag (and one of the main NixCon organizers!) explains how the official formatting tool for .nix files came to be. I was delighted to hear gofmt, the official Go formatter, being mentioned as a source of inspiration. Just like in other language ecosystems, introducing uniform formatting eliminates time-consuming back-and-forth in code review over adhering to coding style. Unfortunately, the formatting folks did not replicate one key aspect of gofmt’s success: gofmt has no options. As the famous Go proverb goes: Gofmt’s style is no one’s favorite, yet gofmt is everyone’s favorite! Meaning that it’s more important that everyone uses the same style, compared to everyone being able to express their personal style preferences. → watch the recording (20 minutes) on media.ccc.de

In this two-hour workshop, Jacek Galowicz from Nixcademy, who is not only a Nix teacher, but also happens to be the maintainer of the NixOS integration test driver, shows us how to write complex integration tests with a few lines of Nix and Python. Jacek showed an integration test example: a Bittorrent service, consisting of tracker, clients, firewalls and multiple networks! Nixpkgs contains over 1000 such integration tests, and running one on your laptop is easy. The various ways to debug your tests seem pretty cool: using vsock instead of port forwardings, and enabling a debug hook that will make a failed test hang and wait to be debugged. I thought this was a great overview, and Jacek is an engaging teacher. I would recommend booking his classes!

Ryota spoke about when to use Nix and when not to use Nix. For example, you could manage your dotfiles (config files) with Nix, or you could decide not to. Having recently migrated more and more machines and configurations to Nix, I found myself agreeing with this talk: It’s important to understand what you’ll get out of declaratively or statefully managed configs, and when which approach is better. → watch the recording (19 minutes) on media.ccc.de

The rest of the day I spent in lightning talks, some of which were sponsored talk slots. I learnt about, in no particular order:

- Cloud Hypervisor, a KVM-based hypervisor like qemu, but written in Rust.
- nixbuild.net, a pay-as-you-go offering for extra build capacity you can rent. On Sunday I heard someone say that their company is using nixbuild.net and it’s very smooth.
- NixCI, a Nix-based hosted CI. So, the cloud.devenv.sh service we heard about on Friday is a competitor to this service.
- Nix in the Wild, an effort by Flox where they do 45-60 minute interviews about Nix success stories. This might help you convince folks in your organization.
- clan, a fleet management solution.
- NovaCustom, a one-person laptop/PC company. The laptops come with coreboot and work with NixOS.
- ExpressVPN, which is migrating their internal server setup (TrustedServer) from Debian to NixOS! Deploying weekly in 105+ countries.
- Cyberus, a German company, which is offering NixOS LTS releases, compliant with the EU Cyber Resilience Act obligations.
- David’s styx project, a more bandwidth-efficient download mechanism for NixOS updates. This uses EROFS, which seems like an interesting alternative to SquashFS images.

After all the talks, we met outside for a group picture followed by barbecue at the lake. (Photo: NixCon 2025 by Arik Grahl. Licensed under CC BY-SA 4.0.)

Before the conference, I wasn’t sure if I would even bother showing up for Sunday (Hack day), but on Sunday I was like “of course!”, and it was a great decision! Many people were still around and were working on their projects.
It felt like the answer to any Nix question was just one chat message away — there was expertise and helping hands from many parts of the project. I ended up meeting a couple of people I only knew from online interactions before, and we also talked a lot about meetups. Now, I am invited to multiple meetups to give a talk :D

This was a wonderful conference! The orga team and all contributors did a great job! As always, the OST in Rapperswil is a great venue for Open Source events. Ticket sales and talk submission / scheduling were done using the Pretix and Pretalx Open Source systems, which makes me proud to have contributed to Pretix. The selection of talks was great: some deeply technical, some covering only the human side of things, and many somewhere in between. I got the impression that all the presenters I saw genuinely cared about their topic, so the overall energy was very good! (You can watch the talk recordings at media.ccc.de: NixCon 2025.) Also outside of the talks, I had many friendly interactions and interesting conversations. There is a lot of interest and adoption of Nix, which is great to see!

The production level of the conference was very high for such a volunteer-driven event. For example, the very cool-sounding break music between talks was created specifically for NixCon: “Lava” by tonstr.studio. Similarly, the welcome bag contained dark Swiss chocolate, specifically made for NixCon (see picture below). I don’t even like dark chocolate, but this one was delicious! Thanks again to all helpers, and I look forward to coming back soon!
