Posts in Bash (20 found)

AMD GPU Debugger

I’ve always wondered why we don’t have a GPU debugger similar to the one used for CPUs. A tool that allows pausing execution and examining the current state. This capability feels essential, especially since the GPU’s concurrent execution model is much harder to reason about. After searching for solutions, I came across rocgdb, a debugger for AMD’s ROCm environment. Unfortunately, its scope is limited to that environment. Still, this shows it’s technically possible. I then found a helpful series of blog posts by Marcell Kiss , detailing how he achieved this, which inspired me to try to recreate the process myself. The best place to start learning about this is RADV . By tracing what it does, we can find how to do it. Our goal here is to run the most basic shader without using Vulkan, aka RADV in our case. First of all, we need to open the DRM file to establish a connection with the KMD, using a simple open(“/dev/dri/cardX”), then we find that it’s calling , which is a function defined in , which is a library that acts as middleware between user mode drivers(UMD) like and and kernel mode drivers(KMD) like amdgpu driver, and then when we try to do some actual work we have to create a context which can be achieved by calling from again, next up we need to allocate 2 buffers one of them for our code and the other for writing our commands into, we do this by calling a couple of functions, here’s how I do it: Here we’re choosing the domain and assigning flags based on the params, some buffers we will need uncached, as we will see: Now we have the memory, we need to map it. I opt to map anything that can be CPU-mapped for ease of use. We have to map the memory to both the GPU and the CPU virtual space. The KMD creates the page table when we open the DRM file, as shown here . So map it to the GPU VM and, if possible, to the CPU VM as well. Here, at this point, there’s a libdrm function that does all of this setup for us and maps the memory, but I found that even when specifying , it doesn’t always tag the page as uncached, not quite sure if it’s a bug in my code or something in anyways, the function is , I opted to do it manually here and issue the IOCTL call myself: Now we have the context and 2 buffers. Next, fill those buffers and send our commands to the KMD, which will then forward them to the Command Processor (CP) in the GPU for processing. Let’s compile our code. We can use clang assembler for that, like this: The bash script compiles the code, and then we’re only interested in the actual machine code, so we use objdump to figure out the offset and the size of the section and copy it to a new file called asmc.bin, then we can just load the file and write its bytes to the CPU-mapped address of the code buffer. Next up, filling in the commands. This was extremely confusing for me because it’s not well documented. It was mostly learning how does things and trying to do similar things. Also, shout-out to the folks on the Graphics Programming Discord server for helping me, especially Picoduck. The commands are encoded in a special format called , which has multiple types. We only care about : each packet has an opcode and the number of bytes it contains. The first thing we need to do is program the GPU registers, then dispatch the shader. Some of those registers are ; those registers are responsible for a number of configurations, pgm_[lo/hi], which hold the pointer to the code buffer and ; those are responsible for the number of threads inside a work group. All of those are set using the packets, and here is how to encode them: {% mark %} It’s worth mentioning that we can set multiple registers in 1 packet if they’re consecutive.{% /mark %} Then we append the dispatch command: Now we want to write those commands into our buffer and send them to the KMD: {% mark %} Here is a good point to make a more complex shader that outputs something. For example, writing 1 to a buffer. {% /mark %} No GPU hangs ?! nothing happened ?! cool, cool, now we have a shader that runs on the GPU, what’s next? Let’s try to hang the GPU by pausing the execution, aka make the GPU trap. The RDNA3’s ISA manual does mention 2 registers, ; here’s how they describe them respectively: Holds the pointer to the current trap handler program address. Per-VMID register. Bit [63] indicates if the trap handler is present (1) or not (0) and is not considered part of the address (bit[62] is replicated into address bit[63]). Accessed via S_SENDMSG_RTN. Temporary register for shader operations. For example, it can hold a pointer to memory used by the trap handler. {%mark%} You can configure the GPU to enter the trap handler when encountering certain exceptions listed in the RDNA3 ISA manual. {%/mark%} We know from Marcell Kiss’s blog posts that we need to compile a trap handler, which is a normal shader the GPU switches to when encountering a . The TBA register has a special bit that indicates whether the trap handler is enabled. Since these are privileged registers, we cannot write to them from user space. To bridge this gap for debugging, we can utilize the debugfs interface. Luckily, we have UMR , which uses that debugfs interface, and it’s open source; we copy AMD’s homework here which is great. The amdgpu KMD has a couple of files in debugfs under ; one of them is , which is an interface to a in the kernel that writes to the registers. It works by simply opening the file, seeking the register’s offset, and then writing; it also performs some synchronisation and writes the value correctly. We need to provide more parameters about the register before writing to the file, tho and do that by using an ioctl call. Here are the ioctl arguments: The 2 structs are because there are 2 types of registers, GRBM and SRBM, each of which is banked by different constructs; you can learn more about some of them here in the Linux kernel documentation . Turns out our registers here are SBRM registers and banked by VMIDs, meaning each VMID has its own TBA and TMA registers. Cool, now we need to figure out the VMID of our process. As far as I understand, VMIDs are a way for the GPU to identify a specific process context, including the page table base address, so the address translation unit can translate a virtual memory address. The context is created when we open the DRM file. They get assigned dynamically at dispatch time, which is a problem for us; we want to write to those registers before dispatch. We can obtain the VMID of the dispatched process by querying the register with s_getreg_b32. I do a hack here, by enabling the trap handler in every VMID, and there are 16 of them, the first being special, and used by the KMD and the last 8 allocated to the amdkfd driver. We loop over the remaining VMIDs and write to those registers. This can cause issues to other processes using other VMIDs by enabling trap handlers in them and writing the virtual address of our trap handler, which is only valid within our virtual memory address space. It’s relatively safe tho since most other processes won’t cause a trap[^1]. [^1]: Other processes need to have a s_trap instruction or have trap on exception flags set, which is not true for most normal GPU processes. Now we can write to TMA and TBA, here’s the code: And here’s how we write to and : {%mark%} If you noticed, I’m using bitfields. I use them because working with them is much easier than macros, and while the byte order is not guaranteed by the C spec, it’s guaranteed by System V ABI, which Linux adheres to. {%/mark%} Anyway, now that we can write to those registers, if we enable the trap handler correctly, the GPU should hang when we launch our shader if we added instruction to it, or we enabled the bit in rsrc3[^2] register. [^2]: Available since RDNA3, if I’m not mistaken. Now, let’s try to write a trap handler. {%mark%} If you wrote a different shader that outputs to a buffer, u can try writing to that shader from the trap handler, which is nice to make sure it’s actually being run. {%/mark%} We need 2 things: our trap handler and some scratch memory to use when needed, which we will store the address of in the TMA register. The trap handler is just a normal program running in privileged state, meaning we have access to special registers like TTMP[0-15]. When we enter a trap handler, we need to first ensure that the state of the GPU registers is saved, just as the kernel does for CPU processes when context-switching, by saving a copy of the stable registers and the program counter, etc. The problem, tho, is that we don’t have a stable ABI for GPUs, or at least not one I’m aware of, and compilers use all the registers they can, so we need to save everything. AMD GPUs’ Command Processors (CPs) have context-switching functionality, and the amdkfd driver does implement some context-switching shaders . The problem is they’re not documented, and we have to figure them out from the amdkfd driver source and from other parts of the driver stack that interact with it, which is a pain in the ass. I kinda did a workaround here since I didn’t find luck understanding how it works, and some other reasons I’ll discuss later in the post. The workaround here is to use only TTMP registers and a combination of specific instructions to copy the values of some registers, allowing us to use more instructions to copy the remaining registers. The main idea is to make use of the instruction, which adds the index of the current thread within the wave to the writing address, aka $$ ID_{thread} * 4 + address $$ This allows us to write a unique value per thread using only TTMP registers, which are unique per wave, not per thread[^3], so we can save the context of a single wave. [^3]: VGPRs are unique per thread, and SGPRs are unique per wave The problem is that if we have more than 1 wave, they will overlap, and we will have a race condition. Here is the code: Now that we have those values in memory, we need to tell the CPU: Hey, we got the data, and pause the GPU’s execution until the CPU issues a command. Also, notice we can just modify those from the CPU. Before we tell the CPU, we need to write some values that might help the CPU. Here are they: Now the GPU should just wait for the CPU, and here’s the spin code it’s implemented as described by Marcell Kiss here : The main loop in the CPU is like enable trap handler, then dispatch shader, then wait for the GPU to write some specific value in a specific address to signal all data is there, then examine and display, and tell the GPU all clear, go ahead. Now that our uncached buffers are in play, we just keep looping and checking whether the GPU has written the register values. When it does, the first thing we do is halt the wave by writing into the register to allow us to do whatever with the wave without causing any issues, tho if we halt for too long, the GPU CP will reset the command queue and kill the process, but we can change that behaviour by adjusting lockup_timeout parameter of the amdgpu kernel module: From here on, we can do whatever with the data we have. All the data we need to build a proper debugger. We will come back to what to do with the data in a bit; let’s assume we did what was needed for now. Now that we’re done with the CPU, we need to write to the first byte in our TMA buffer, since the trap handler checks for that, then resume the wave, and the trap handler should pick it up. We can resume by writing to the register again: Then the GPU should continue. We need to restore everything and return the program counter to the original address. Based on whether it’s a hardware trap or not, the program counter may point to the instruction before or the instruction itself. The ISA manual and Marcell Kiss’s posts explain that well, so refer to them. Now we can run compiled code directly, but we don’t want people to compile their code manually, then extract the text section, and give it to us. The plan is to take SPIR-V code, compile it correctly, then run it, or, even better, integrate with RADV and let RADV give us more information to work with. My main plan was making like fork RADV and then add then make report for us the vulkan calls and then we can have a better view on the GPU work know the buffers/textures it’s using etc, This seems like a lot more work tho so I’ll keep it in mind but not doing that for now unless someone is willing to pay me for that ;). For now, let’s just use RADV’s compiler . Luckily, RADV has a mode, aka it will not do actual work or open DRM files, just a fake Vulkan device, which is perfect for our case here, since we care about nothing other than just compiling code. We can enable it by setting the env var , then we just call what we need like this: Now that we have a well-structured loop and communication between the GPU and the CPU, we can run SPIR-V binaries to some extent. Let’s see how we can make it an actual debugger. We talked earlier about CPs natively supporting context-switching, this appears to be compute spcific feature, which prevents from implementing it for other types of shaders, tho, it appears that mesh shaders and raytracing shaders are just compute shaders under the hood, which will allow us to use that functionality. For now debugging one wave feels enough, also we can moify the wave parameters to debug some specific indices. Here’s some of the features For stepping, we can use 2 bits: one in and the other in . They’re and , respectively. The former enters the trap handler after each instruction, and the latter enters before the first instruction. This means we can automatically enable instruction-level stepping. Regarding breakpoints, I haven’t implemented them, but they’re rather simple to implement here by us having the base address of the code buffer and knowing the size of each instruction; we can calculate the program counter location ahead and have a list of them available to the GPU, and we can do a binary search on the trap handler. The ACO shader compiler does generate instruction-level source code mapping, which is good enough for our purposes here. By taking the offset[^4] of the current program counter and indexing into the code buffer, we can retrieve the current instruction and disassemble it, as well as find the source code mapping from the debug info. [^4]: We can get that by subtracting the current program counter from the address of the code buffer. We can implement this by marking the GPU page as protected. On a GPU fault, we enter the trap handler, check whether it’s within the range of our buffers and textures, and then act accordingly. Also, looking at the registers, we can find these: which suggests that the hardware already supports this natively, so we don’t even need to do that dance. It needs more investigation on my part, tho, since I didn’t implement this. This needs some serious plumbing, since we need to make NIR(Mesa’s intermediate representation) optimisation passes propagate debug info correctly. I already started on this here . Then we need to make ACO track variables and store the information. This requires ditching our simple UMD we made earlier and using RADV, which is what should happen eventually, then we have our custom driver maybe pause on before a specific frame, or get triggered by a key, and then ask before each dispatch if to attach to it or not, or something similar, since we have a full proper Vulkan implementation we already have all the information we would need like buffers, textures, push constants, types, variable names, … etc, that would be a much better and more pleasant debugger to use. Finally, here’s some live footage: ::youtube{url=“ https://youtu.be/HDMC9GhaLyc ”} Here is an incomplete user-mode page walking code for gfx11, aka rx7900xtx

0 views
Kaushik Gopal 5 days ago

Combating AI coding atrophy with Rust

It’s no secret that I’ve fully embraced AI for my coding. A valid concern ( and one I’ve been thinking about deeply ) is the atrophying of the part of my brain that helps me code. To push back on that, I’ve been learning Rust on the side for the last few months. I am absolutely loving it. Kotlin remains my go-to language. It’s the language I know like the back of my hand. If someone sends me a swath of Kotlin code, whether handwritten or AI generated, I can quickly grok it and form a strong opinion on how to improve it. But Kotlin is a high-level language that runs on a JVM. There are structural limits to the performance you can eke out of it, and for most of my career 1 I’ve worked with garbage-collected languages. For a change, I wanted a systems-level language, one without the training wheels of a garbage collector. I also wanted a language with a different core philosophy, something that would force me to think in new ways. I picked up Go casually but it didn’t feel like a big enough departure from the languages I already knew. It just felt more useful to ask AI to generate Go code than to learn it myself. With Rust, I could get code translated, but then I’d stare at the generated code and realize I was missing some core concepts and fundamentals. I loved that! The first time I hit a lifetime error, I had no mental model for it. That confusion was exactly what I was looking for. Coming from a GC world, memory management is an afterthought — if it requires any thought at all. Rust really pushes you to think through the ownership and lifespan of your data, every step of the way. In a bizarre way, AI made this gap obvious. It showed me where I didn’t understand things and pointed me toward something worth learning. Here’s some software that’s either built entirely in Rust or uses it in fundamental ways: Many of the most important tools I use daily are built with Rust. Can’t hurt to know the language they’re written in. Rust is quite similar to Kotlin in many ways. Both use strict static typing with advanced type inference. Both support null safety and provide compile-time guarantees. The compile-time strictness and higher-level constructs made it fairly easy for me to pick up the basics. Syntactically, it feels very familiar. I started by rewriting a couple of small CLI tools I used to keep in Bash or Go. Even in these tiny programs, the borrow checker forced me to be clear about who owns what and when data goes away. It can be quite the mental workout at times, which is perfect for keeping that atrophy from setting in. After that, I started to graduate to slightly larger programs and small services. There are two main resources I keep coming back to: There are times when the book or course mentions a concept and I want to go deeper. Typically, I’d spend time googling, searching Stack Overflow, finding references, diving into code snippets, and trying to clear up small nuances. But that’s changed dramatically with AI. One of my early aha moments with AI was how easy it made ramping up on code. The same is true for learning a new language like Rust. For example, what’s the difference 2 between these two: Another thing I loved doing is asking AI: what are some idiomatic ways people use these concepts? Here’s a prompt I gave Gemini while learning: Here’s an abbreviated response (the full response was incredibly useful): It’s easy to be doom and gloom about AI in coding — the “we’ll all forget how to program” anxiety is real. But I hope this offers a more hopeful perspective. If you’re an experienced developer worried about skill atrophy, learn a language that forces you to think differently. AI can help you cross that gap faster. Use it as a tutor, not just a code generator. I did a little C/C++ in high school, but nowhere close to proficiency.  ↩︎ Think mutable var to a “shared reference” vs. immutable var to an “exclusive reference”.  ↩︎ fd (my tool of choice for finding files) ripgrep (my tool of choice for searching files) Fish shell (my shell of choice, recently rewrote in Rust) Zed (my text/code editor of choice) Firefox ( my browser of choice) Android?! That’s right: Rust now powers some of the internals of the OS, including the recent Quick Share feature. Fondly referred to as “ The Book ”. There’s also a convenient YouTube series following the book . Google’s Comprehensive Rust course, presumably created to ramp up their Android team. It even has a dedicated Android chapter . This worked beautifully for me. I did a little C/C++ in high school, but nowhere close to proficiency.  ↩︎ Think mutable var to a “shared reference” vs. immutable var to an “exclusive reference”.  ↩︎

0 views
neilzone 6 days ago

Adding a button to Home Assistant to run a shell command

Now that I have fixed my garage door controller , I wanted to see if I could use it from within Home Assistant, primarily so that I can then have a widget on my phone screen. In this case, the shell command is a simple invocation to a remote machine. (I am not aware of a way to add a button to an Android home screen to run a shell command or bash script.) Adding a button to run a shell command or bash script in Home Assistant was pretty straightforward, following the Home Assistant documentation for shell commands . To my configuration.yaml file, I added something like: The quirk here was that reloading the configuration.yaml file from within the Home Assistant UI was insufficient. I needed to completely restart Home Assistant to pick up the changes. Once I had restarted Home Assistant, the shell commands were available. To add buttons, I needed to create “helpers”. I did this from the Home Assistant UI, via Settings / Devices & services / Helpers / Create helper". One helper for each button. After I had created each helper, I went back into the helper’s settings, to add zone information, so that it appeared in the right place in the dashboard. Having created the button/helper, and the shell command, I used an automation to link the two together. I did this via the Home Assistant UI, Settings / Automations & scenes. For the button, it was linked to a change in state of the button, with no parameters specified. The “then do” action is the shell command for the door in question.

0 views
マリウス 1 weeks ago

disable-javascript.org

With several posts on this website attracting significant views in the last few months I had come across plenty of feedback on the tab gimmick implemented last quarter . While the replies that I came across on platforms like the Fediverse and Bluesky were lighthearted and oftentimes with humor, the visitors coming from traditional link aggregators sadly weren’t as amused about it. Obviously a large majority of people disagreeing with the core message behind this prank appear to be web developers, who’s very existence quite literally depends on JavaScript, and who didn’t hold back to express their anger in the comment sections as well as through direct emails. Unfortunately, most commenters are missing the point. This email exchange is just one example of feedback that completely misses the point: I just found it a bit hilarious that your site makes notes about ditching and disable Javascript, and yet Google explicitly requires it for the YouTube embeds. Feels weird. The email contained the following attachment: Given the lack of context I assume that the author was referring to the YouTube embeds on this website (e.g. on the keyboard page). Here is my reply: Simply click the link on the video box that says “Try watching this video on www.youtube.com” and you should be directed to YouTube (or a frontend of your choosing with LibRedirect [1]) where you can watch it. Sadly, I don’t have the influence to convince YouTube to make their video embeds working without JavaScript enabled. ;-) However, if more people would disable JavaScript by default, maybe there would be a higher incentive for server-side-rendering and video embeds would at the very least show a thumbnail of the video (which YouTube could easily do, from a technical point of view). Kind regards! [1]: https://libredirect.github.io It also appears that many of the people disliking the feature didn’t care to properly read the highlighted part of the popover that says “Turn JavaScript off, now, and only allow it on websites you trust!” : Indeed - and the author goes on to show a screenshot of Google Trends which, I’m sure, won’t work without JavaScript turned on. This comment perfectly encapsulates the flawed rhetoric. Google Trends (like YouTube in the previous example) is a website that is unlikely to exploit 0-days in your JavaScript engine, or at least that’s the general consensus. However, when you clicked on a link that looks like someone typed it in by putting their head on the keyboard , that led you to a website you obviously didn’t know beforehand, it’s a different story. What I’m advocating for is to have JavaScript disabled by default for everything unknown to you , and only enable it for websites that you know and trust . Not only is this approach going to protect you from jump-scares , regardless whether that’s a changing tab title, a popup, or an actual exploit, but it will hopefully pivot the thinking of particularly web developers back from “Let’s render the whole page using JavaScript and display nothing if it’s disabled” towards “Let’s make the page as functional as possible without the use of JavaScript and only sprinkle it on top as a way to make the experience better for anyone who choses to enable it” . It is mind boggling how this simple take is perceived as militant techno-minimalism and can provoke such salty feedback. I keep wondering whether these are the same people that consider to be a generally okay way to install software …? One of the many commenters that however did agree with the approach that I’m taking on this site had put it fairly nicely: About as annoying as your friend who bumped key’ed his way into your flat in 5 seconds waiting for you in the living room. Or the protest blocking the highway making you late for work. Many people don’t realize that JavaScript means running arbitrary untrusted code on your machine. […] Maybe the hacker ethos has changed, but I for one miss the days of small pranks and nudges to illustrate security flaws, instead of ransomware and exploits for cash. A gentle reminder that we can all do better, and the world isn’t always all that friendly. As the author of this comment correctly hints, the hacker ethos has in fact changed. My guess is that only a tiny fraction of the people that are actively commenting on platforms like Hacker News or Reddit these days know about, let’s say, cDc ’s Back Orifice , the BOFH stories, bash.org , and all the kickme.to/* links that would trigger a disconnect in AOL ’s dialup desktop software. Hence, the understanding about how far pranks in the 90s and early 2000s really went simply isn’t there. And with most things these days required to be politically correct , having the tab change to what looks like a Google image search for “sam bankman-fried nudes” is therefor frowned upon by many, even when the reason behind it is to inform. Frankly, it seems that conformism has eaten not only the internet, but to an extent the whole world, when an opinion that goes ever so slightly against the status quo is labelled as some sort of extreme view . To feel even just a “tiny bit violated by” something as mundane as a changing text and icon in the browser’s tab bar seems absurd, especially when it is you that allowed my website to run arbitrary code on your computer! Because I’m convinced that a principled stance against the insanity that is the modern web is necessary, I am doubling down on this effort by making it an actual initiative: disable-javascript.org disable-javascript.org is a website that informs the average user about some of the most severe issues affecting the JavaScript ecosystem and browsers/users all over the world, and explains in simple terms how to disable JavaScript in various browsers and only enable it for specific, trusted websites. The site is linked on the JavaScript popover that appears on this website, so that visitors aren’t only pranked into hopefully disabling JavaScript, but can also easily find out how to do so. disable-javascript.org offers a JavaScript-snippet that is almost identical to the one in use by this website, in case you would like to participate in the cause. Of course, you can as well simply link to disable-javascript.org from anywhere on your website to show your support. If you’d like to contribute to the initiative by extending the website with valuable info, you can do so through its Git repository . Feel free to open pull-requests with the updates that you would like to see on disable-javascript.org . :-)

0 views
Simon Willison 2 weeks ago

Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast titled Data Journalism Unleashed with Simon Willison . I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links. What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist." The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty . And at the time we thought we were building a content management system." Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page . And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week." Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process." Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [ the opioid files ] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'" NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet ], which is astonishing." The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'" Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal." Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York." Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it." Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly , where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people." Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'" Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly." The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries . Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that." Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it." The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%." Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them." Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it." Two types of documentation: official docs vs. temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them." Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it." Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company." The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash." Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything." Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025." Cursor for data? Generic agent loops vs. data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts." Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box." Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff." LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it." Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something." Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included elements. The project uses the following custom instructions You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine. I then added a follow-up prompt saying: Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end Then suggest a very comprehensive list of supporting links I could find Here's the full Claude transcript of the analysis. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist." The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty . And at the time we thought we were building a content management system." Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page . And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week." Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process." Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [ the opioid files ] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'" NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." NICAR 2026 ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet ], which is astonishing." The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'" Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal." Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York." Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it." Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly , where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people." Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'" Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly." The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries . Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that." Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it." The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%." Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them." Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it." Two types of documentation: official docs vs. temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them." Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it." Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company." The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash." Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything." Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025." Cursor for data? Generic agent loops vs. data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts." Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box." Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff." LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it." Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something." Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook

0 views

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds Bang Di, Yun Xu, Kaijie Guo, Yibin Shen, Yu Li, Sanchuan Cheng, Hao Zheng, Fudong Qiu, Xiaokang Hu, Naixuan Guan, Dongdong Huang, Jinhu Li, Yi Wang, Yifang Yang, Jintao Li, Hang Yang, Chen Liang, Yilong Lv, Zikang Chen, Zhenwei Lu, Xiaohan Ma, and Jiesheng Wu SOSP'25 Here is a contrarian view: the existence of hypervisors means that operating systems have fundamentally failed in some way. I remember thinking this a long time ago, and it still nags me from time to time. What does a hypervisor do? It virtualizes hardware so that it can be safely and fairly shared. But isn’t that what an OS is for? My conclusion is that this is a pragmatic engineering decision. It would simply be too much work to try to harden a large OS such that a cloud service provider would be comfortable allowing two competitors to share one server. It is a much safer bet to leave the legacy OS alone and instead introduce the hypervisor. This kind of decision comes up in other circumstances too. There are often two ways to go about implementing something. The first way involves widespread changes to legacy code, and the other way involves a low-level Jiu-Jitsu move which achieves the desired goal while leaving the legacy code untouched. Good managers have a reliable intuition about these decisions. The context here is a cloud service provider which virtualizes the network with a SmartNIC. The SmartNIC (e.g., NVIDIA BlueField-3 ) comprises ARM cores and programmable hardware accelerators. On many systems, the ARM cores are part of the data-plane (software running on an ARM core is invoked for each packet). These cores are also used as part of the control-plane (e.g., programming a hardware accelerator when a new VM is created). The ARM cores on the SmartNIC run an OS (e.g., Linux), which is separate from the host OS. The paper says that the traditional way to schedule work on SmartNIC cores is static scheduling. Some cores are reserved for data-plane tasks, while other cores are reserved for control-plane tasks. The trouble is, the number of VMs assigned to each server (and the size of each VM) changes dynamically. Fig. 2 illustrates a problem that arises from static scheduling: control-plane tasks take more time to execute on servers that host many small VMs. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dynamic Scheduling Headaches Dynamic scheduling seems to be a natural solution to this problem. The OS running on the SmartNIC could schedule a set of data-plane and control-plane threads. Data-plane threads would have higher priority, but control-plane threads could be scheduled onto all ARM cores when there aren’t many packets flowing. Section 3.2 says this is a no-go. It would be great if there was more detail here. The fundamental problem is that control-plane software on the SmartNIC calls kernel functions which hold spinlocks (which disable preemption) for relatively long periods of time. For example, during VM creation, a programmable hardware accelerator needs to be configured such that it will route packets related to that VM appropriately. Control-plane software running on an ARM core achieves this by calling kernel routines which acquire a spinlock, and then synchronously communicate with the accelerator. The authors take this design as immutable. It seems plausible that the communication with the accelerator could be done in an asynchronous manner, but that would likely have ramifications to the entire control-plane software stack. This quote is telling: Furthermore, the CP ecosystem comprises 300–500 heterogeneous tasks spanning C, Python, Java, Bash, and Rust, demanding non-intrusive deployment strategies to accommodate multi-language implementations without code modification. Here is the Jiu-Jitsu move: lie to the SmartNIC OS about how many ARM cores the SmartNIC has. Fig. 7(a) shows a simple example. The underlying hardware has 2 cores, but Linux thinks there are 3. One of the cores that the Linux scheduler sees is actually a virtual CPU (vCPU), the other two are physical CPUs (pCPU). Control-plane tasks run on vCPUs, while data-plane tasks run on pCPUs. From the point of view of Linux, all three CPUs may be running simultaneously, but in reality, a Linux kernel module (5,800 lines of code) is allowing the vCPU to run at times of low data-plane activity. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 One neat trick the paper describes is the hardware workload probe . This takes advantage of the fact that packets are first processed by a hardware accelerator (which can do things like parsing of packet headers) before they are processed by an ARM core. Fig. 10 shows that the hardware accelerator sees a packet at least 3 microseconds before an ARM core does. This enables this system to hide the latency of the context switch from vCPU to pCPU. Think of it like a group of students in a classroom without any teachers (e.g., network packets). The kids nominate one student to be on the lookout for an approaching adult. When the coast is clear, the students misbehave (i.e., execute control-plane tasks). When the lookout sees the teacher (a network packet) returning, they shout “act responsible”, and everyone returns to their schoolwork (running data-plane code). Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Results Section 6 of the paper has lots of data showing that throughput (data-plane) performance is not impacted by this technique. Fig. 17 shows the desired improvement for control-plane tasks: VM startup time is roughly constant no matter how many VMs are packed onto one server. Source: https://dl.acm.org/doi/10.1145/3731569.3764851 Dangling Pointers To jump on the AI bandwagon, I wonder if LLMs will eventually change the engineering equation. Maybe LLMs will get to the point where widespread changes across a legacy codebase will be tractable. If that happens, then Jiu-Jitsu moves like this one will be less important. Thanks for reading Dangling Pointers! Subscribe for free to receive new posts and support my work.

0 views
devansh 1 months ago

Hitchhiker's Guide to Attack Surface Management

I first heard about the word "ASM" (i.e., Attack Surface Management) probably in late 2018, and I thought it must be some complex infrastructure for tracking assets of an organization. Looking back, I realize I almost had a similar stack for discovering, tracking, and detecting obscure assets of organizations, and I was using it for my bug hunting adventures. I feel my stack was kinda goated, as I was able to find obscure assets of Apple, Facebook, Shopify, Twitter, and many other Fortune 100 companies, and reported hundreds of bugs, all through automation. Back in the day, projects like ProjectDiscovery were not present, so if I had to write an effective port scanner, I had to do it from scratch. (Masscan and nmap were present, but I had my fair share of issues using them, this is a story for another time). I used to write DNS resolvers (massdns had a high error rate), port scanners, web scrapers, directory brute-force utilities, wordlists, lots of JavaScript parsing logic using regex, and a hell of a lot of other things. I used to have up to 50+ self-developed tools for bug-bounty recon stuff and another 60-something helper scripts written in bash. I used to orchestrate (gluing together with duct tape is a better word) and slap together scripts like a workflow, and save the output in text files. Whenever I dealt with a large number of domains, I used to distribute the load over multiple servers (server spin-up + SSH into it + SCP for pushing and pulling files from it). The setup was very fragile and error-prone, and I spent countless nights trying to debug errors in the workflows. But it was all worth it. I learned the art of Attack Surface Management without even trying to learn about it. I was just a teenager trying to make quick bucks through bug hunting, and this fragile, duct-taped system was my edge. Fast forward to today, I have now spent almost a decade in the bug bounty scene. I joined HackerOne in 2020 (to present) as a vulnerability triager, where I have triaged and reviewed tens of thousands of vulnerability submissions. Fair to say, I have seen a lot of things, from doomsday level 0-days, to reports related to leaked credentials which could have led to entire infrastructure compromise, just because some dev pushed an AWS secret key in git logs, to things where some organizations were not even aware they were running Jenkins servers on some obscure subdomain which could have allowed RCE and then lateral movement to other layers of infrastructure. A lot of these issues I have seen were totally avoidable, only if organizations followed some basic attack surface management techniques. If I search "Guide to ASM" on Internet, almost none of the supposed guides are real resources. They funnel you to their own ASM solution, and the guide is just present there to provide you with some surface-level information, and is mostly a marketing gimmick. This is precisely why I decided to write something where I try to cover everything I learned and know about ASM, and how to protect your organization's assets before bad actors could get to them. This is going to be a rough and raw guide, and will not lead you to a funnel where I am trying to sell my own ASM SaaS to you. I have nothing to sell, other than offering what I know. But in case you are an organization who needs help implementing the things I am mentioning below, you can reach out to me via X or email (both available on the homepage of this blog). This guide will provide you with insights into exactly how big your attack surface really is. CISOs can look at it and see if their organizations have all of these covered, security researchers and bug hunters can look at this and maybe find new ideas related to where to look during recon. Devs can look at it and see if they are unintentionally leaving any door open for hackers. If you are into security, it has something to offer you. Attack surface is one of those terms getting thrown around in security circles so much that it's become almost meaningless noise. In theory, it sounds simple enough, right. Attack surface is every single potential entry point, interaction vector, or exploitable interface an attacker could use to compromise your systems, steal your data, or generally wreck your day. But here's the thing, it's the sum total of everything you've exposed to the internet. Every API endpoint you forgot about, every subdomain some dev spun up for "testing purposes" five years ago and then abandoned, every IoT device plugged into your network, every employee laptop connecting from a coffee shop, every third-party vendor with a backdoor into your environment, every cloud storage bucket with permissions that make no sense, every Slack channel, every git commit leaking credentials, every paste on Pastebin containing your database passwords. Most organizations think about attack surface in incredibly narrow terms. They think if they have a website, an email server, and maybe some VPN endpoints, they've got "good visibility" into their assets. That's just plain wrong. Straight up wrong. Your actual attack surface would terrify you if you actually understood it. You run , and is your main domain. You probably know about , , maybe . But what about that your intern from 2015 spun up and just never bothered to delete. It's not documented anywhere. Nobody remembers it exists. Domain attack surface goes way beyond what's sitting in your asset management system. Every subdomain is a potential entry point. Most of these subdomains are completely forgotten. Subdomain enumeration is reconnaissance 101 for attackers and bug hunters. It's not rocket science. Setting up a tool that actively monitors through active and passive sources for new subdomains and generates alerts is honestly an hour's worth of work. You can use tools like Subfinder, Amass, or just mine Certificate Transparency logs to discover every single subdomain connected to your domain. Certificate Transparency logs were designed to increase security by making certificate issuance public, and they've become an absolute reconnaissance goldmine. Every time you get an SSL certificate for , that information is sitting in public logs for anyone to find. Attackers systematically enumerate these subdomains using Certificate Transparency log searches, DNS brute-forcing with massive wordlists, reverse DNS lookups to map IP ranges back to domains, historical DNS data from services like SecurityTrails, and zone transfer exploitation if your DNS is misconfigured. Attackers are looking for old development environments still running vulnerable software, staging servers with production data sitting on them, forgotten admin panels, API endpoints without authentication, internal tools accidentally exposed, and test environments with default credentials nobody changed. Every subdomain is an asset. Every asset is a potential vulnerability. Every vulnerability is an entry point. Domains and subdomains are just the starting point though. Once you've figured out all the subdomains belonging to your organization, the next step is to take a hard look at IP address space, which is another absolutely massive component of your attack surface. Organizations own, sometimes lease, IP ranges, sometimes small /24 blocks, sometimes massive /16 ranges, and every single IP address in those blocks and ranges that responds to external traffic is part of your attack surface. And attackers enumerate them all if you won't. They use WHOIS lookups to identify your IP ranges, port scanning to find what services are running where, service fingerprinting to identify exact software versions, and banner grabbing to extract configuration information. If you have a /24 network with 256 IP addresses and even 10% of those IPs are running services, you've got 25 potential attack vectors. Scale that to a /20 or /16 and you're looking at thousands of potential entry points. And attackers aren't just looking at the IPs you know about. They're looking at adjacent IP ranges you might have acquired through mergers, historical IP allocations that haven't been properly decommissioned, and shared IP ranges where your servers coexist with others. Traditional infrastructure was complicated enough, and now we have cloud. It's literally exploded organizations' attack surfaces in ways that are genuinely difficult to even comprehend. Every cloud service you spin up, be it an EC2 instance, S3 bucket, Lambda function, or API Gateway endpoint, all of this is a new attack vector. In my opinion and experience so far, I think the main issue with cloud infrastructure is that it's ephemeral and distributed. Resources get spun up and torn down constantly. Developers create instances for testing and forget about them. Auto-scaling groups generate new resources dynamically. Containerized workloads spin up massive Kubernetes clusters you have minimal visibility into. Your cloud attack surface could be literally anything. Examples are countless, but I'd categorize them into 8 different categories. Compute instances like EC2, Azure VMs, GCP Compute Engine instances exposed to the internet. Storage buckets like S3, Azure Blob Storage, GCP Cloud Storage with misconfigured permissions. Serverless stuff like Lambda functions with public URLs or overly permissive IAM roles. API endpoints like API Gateway, Azure API Management endpoints without proper authentication. Container registries like Docker images with embedded secrets or vulnerabilities. Kubernetes clusters with exposed API servers, misconfigured network policies, vulnerable ingress controllers. Managed databases like RDS, CosmosDB, Cloud SQL instances with weak access controls. IAM roles and service accounts with overly permissive identities that enable privilege escalation. I've seen instances in the past where a single misconfigured S3 bucket policy exposed terabytes of data. An overly permissive Lambda IAM role enabled lateral movement across an entire AWS account. A publicly accessible Kubernetes API server gave an attacker full cluster control. Honestly, cloud kinda scares me as well. And to top it off, multi-cloud infrastructure makes everything worse. If you're running AWS, Azure, and GCP together, you've just tripled your attack surface management complexity. Each cloud provider has different security models, different configuration profiles, and different attack vectors. Every application now uses APIs, and all applications nowadays are like a constellation of APIs talking to each other. Every API you use in your organization is your attack surface. The problem with APIs is that they're often deployed without the same security scrutiny as traditional web applications. Developers spin up API endpoints for specific features and those endpoints accumulate over time. Some of them are shadow APIs, meaning API endpoints which aren't documented anywhere. These endpoints are the equivalent of forgotten subdomains, and attackers can find them through analyzing JavaScript files for API endpoint references, fuzzing common API path patterns, examining mobile app traffic to discover backend APIs, and mining old documentation or code repositories for deprecated endpoints. Your API attack surface includes REST APIs exposed to the internet, GraphQL endpoints with overly broad query capabilities, WebSocket connections for real-time functionality, gRPC services for inter-service communication, and legacy SOAP APIs that never got decommissioned. If your organization has mobile apps, be it iOS, Android, or both, this is a direct window to your infrastructure and should be part of your attack surface management strategy. Mobile apps communicate with backend APIs and those API endpoints are discoverable by reversing the app. The reversed source of the app could reveal hard-coded API keys, tokens, and credentials. Using JADX plus APKTool plus Dex2jar is all a motivated attacker needs. Web servers often expose directories and files that weren't meant to be publicly accessible. Attackers systematically enumerate these using automated tools like ffuf, dirbuster, gobuster, and wfuzz with massive wordlists to discover hidden endpoints, configuration files, backup files, and administrative interfaces. Common exposed directories include admin panels, backup directories containing database dumps or source code, configuration files with database credentials and API keys, development directories with debug information, documentation directories revealing internal systems, upload directories for file storage, and old or forgotten directories from previous deployments. Your attack surface must include directories which are accidentally left accessible during deployments, staging servers with production data, backup directories with old source code versions, administrative interfaces without authentication, API documentation exposing endpoint details, and test directories with debug output enabled. Even if you've removed a directory from production, old cached versions may still be accessible through web caches or CDNs. Search engines also index these directories, making them discoverable through dorking techniques. If your organization is using IoT devices, and everyone uses these days, this should be part of your attack surface management strategy. They're invisible to traditional security tools. Your EDR solution doesn't protect IoT devices. Your vulnerability scanner can't inventory them. Your patch management system can't update them. Your IoT attack surface could include smart building systems like HVAC, lighting, access control. Security cameras and surveillance systems. Printers and copiers, which are computers with network access. Badge readers and physical access systems. Industrial control systems and SCADA devices. Medical devices in healthcare environments. Employee wearables and fitness trackers. Voice assistants and smart speakers. The problem with IoT devices is that they're often deployed without any security consideration. They have default credentials that never get changed, unpatched firmware with known vulnerabilities, no encryption for data in transit, weak authentication mechanisms, and insecure network configurations. Social media presence is an attack surface component that most organizations completely ignore. Attackers can use social media for reconnaissance by looking at employee profiles on LinkedIn to reveal organizational structure, technologies in use, and current projects. Twitter/X accounts can leak information about deployments, outages, and technology stack. Employee GitHub profiles expose email patterns and development practices. Company blogs can announce new features before security review. It could also be a direct attack vector. Attackers can use information from social media to craft convincing phishing attacks. Hijacked social media accounts can be used to spread malware or phishing links. Employees can accidentally share sensitive information. Fake accounts can impersonate your brand to defraud customers. Your employees' social media presence is part of your attack surface whether you like it or not. Third-party vendors, suppliers, contractors, or partners with access to your systems should be part of your attack surface. Supply chain attacks are becoming more and more common these days. Attackers can compromise a vendor with weaker security and then use that vendor's access to reach your environment. From there, they pivot from the vendor network to your systems. This isn't a hypothetical scenario, it has happened multiple times in the past. You might have heard about the SolarWinds attack, where attackers compromised SolarWinds' build system and distributed malware through software updates to thousands of customers. Another famous case study is the MOVEit vulnerability in MOVEit Transfer software, exploited by the Cl0p ransomware group, which affected over 2,700 organizations. These are examples of some high-profile supply chain security attacks. Your third-party attack surface could include things like VPNs, remote desktop connections, privileged access systems, third-party services with API keys to your systems, login credentials shared with vendors, SaaS applications storing your data, and external IT support with administrative access. It's obvious you can't directly control third-party security. You can audit them, have them pen-test their assets as part of your vendor compliance plan, and include security requirements in contracts, but ultimately their security posture is outside your control. And attackers know this. GitHub, GitLab, Bitbucket, they all are a massive attack surface. Attackers search through code repositories in hopes of finding hard-coded credentials like API keys, database passwords, and tokens. Private keys, SSH keys, TLS certificates, and encryption keys. Internal architecture documentation revealing infrastructure details in code comments. Configuration files with database connection strings and internal URLs. Deprecated code with vulnerabilities that's still in production. Even private repositories aren't safe. Attackers can compromise developer accounts to access private repositories, former employees retain access after leaving, and overly broad repository permissions grant access to too many people. Automated scanners continuously monitor public repositories for secrets. The moment a developer accidentally pushes credentials to a public repository, automated systems detect it within minutes. Attackers have already extracted and weaponized those credentials before the developer realizes the mistake. CI/CD pipelines are massive another attack vector. Especially in recent times, and not many organizations are giving attention to this attack vector. This should totally be part of your attack surface management. Attackers compromise GitHub Actions workflows with malicious code injection, Jenkins servers with weak authentication, GitLab CI/CD variables containing secrets, and build artifacts with embedded malware. The GitHub Actions supply chain attack, CVE-2025-30066, demonstrated this perfectly. Attackers compromised the Action used in over 23,000 repositories, injecting malicious code that leaked secrets from build logs. Jenkins specifically is a goldmine for attackers. An exposed Jenkins instance provides complete control over multiple critical servers, access to hardcoded AWS keys, Redis credentials, and BitBucket tokens, ability to manipulate builds and inject malicious code, and exfiltration of production database credentials containing PII. Modern collaboration tools are massive attack surface components that most organizations underestimate. Slack has hidden security risks despite being invite-only. Slack attack surface could include indefinite data retention where every message, channel, and file is stored forever unless admins configure retention periods. Public channels accessible to all users so one breached account opens the floodgates. Third-party integrations with excessive permissions accessing messages and user data. Former contractor access where individuals retain access long after projects end. Phishing and impersonation where it's easy to change names and pictures to impersonate senior personnel. In 2022, Slack leaked hashed passwords for five years affecting 0.5% of users. Slack channels commonly contain API keys, authentication tokens, database credentials, customer PII, financial data, internal system passwords, and confidential project information. The average cost of a breached record was $164 in 2022. When 1 in 166 messages in Slack contains confidential information, every new message adds another dollar to total risk exposure. With 5,000 employees sending 30 million Slack messages per year, that's substantial exposure. Trello board exposure is a significant attack surface. Trello attack vectors include public boards with sensitive information accidentally shared publicly, default public visibility where boards are created as public by default in some configurations, unsecured REST API allowing unauthenticated access to user data, and scraping attacks where attackers use email lists to enumerate Trello accounts. The 2024 Trello data breach exposed 15 million users' personal information when a threat actor named "emo" exploited an unsecured REST API using 500 million email addresses to compile detailed user profiles. Security researcher David Shear documented hundreds of public Trello boards exposing passwords, credentials, IT support customer access details, website admin logins, and client server management credentials. IT companies were using Trello to troubleshoot client requests and manage infrastructure, storing all credentials on public Trello boards. Jira misconfiguration is a widespread attack surface issue. Common misconfigurations include public dashboards and filters with "Everyone" access actually meaning public internet access, anonymous access enabled allowing unauthenticated users to browse, user picker functionality providing complete lists of usernames and email addresses, and project visibility allowing sensitive projects to be accessible without authentication. Confluence misconfiguration exposes internal documentation. Confluence attack surface components include anonymous access at site level allowing public access, public spaces where space admins grant anonymous permissions, inherited permissions where all content within a space inherits space-level access, and user profile visibility allowing anonymous users to view profiles of logged-in users. When anonymous access is enabled globally and space admins allow anonymous users to access their spaces, anyone on the internet can access that content. Confluence spaces often contain internal documentation with hardcoded credentials, financial information, project details, employee information, and API documentation with authentication details. Cloud storage misconfiguration is epidemic. Google Drive misconfiguration attack surface includes "Anyone with the link" sharing making files accessible without authentication, overly permissive sharing defaults making it easy to accidentally share publicly, inherited folder permissions exposing everything beneath, unmanaged third-party apps with excessive read/write/delete permissions, inactive user accounts where former employees retain access, and external ownership blind spots where externally-owned content is shared into the environment. Metomic's 2023 Google Scanner Report found that of 6.5 million Google Drive files analyzed, 40.2% contained sensitive information, 34.2% were shared externally, and 0.5% were publicly accessible, mostly unintentionally. In December 2023, Japanese game developer Ateam suffered a catastrophic Google Drive misconfiguration that exposed personal data of nearly 1 million people for over six years due to "Anyone with the link" settings. Based on Valence research, 22% of external data shares utilize open links, and 94% of these open link shares are inactive, forgotten files with public URLs floating around the internet. Dropbox, OneDrive, and Box share similar attack surface components including misconfigured sharing permissions, weak or missing password protection, overly broad access grants, third-party app integrations with excessive permissions, and lack of visibility into external sharing. Features that make file sharing convenient create data leakage risks when misconfigured. Pastebin and similar paste sites are both reconnaissance sources and attack vectors. Paste site attack surface includes public data dumps of stolen credentials, API keys, and database dumps posted publicly, malware hosting of obfuscated payloads, C2 communications where malware uses Pastebin for command and control, credential leakage from developers accidentally posting secrets, and bypassing security filters since Pastebin is legitimate so security tools don't block it. For organizations, leaked API keys or database credentials on Pastebin lead to unauthorized access, data exfiltration, and service disruption. Attackers continuously scan Pastebin for mentions of target organizations using automated tools. Security teams must actively monitor Pastebin and similar paste sites for company name mentions, email domain references, and specific keywords related to the organization. Because paste sites don't require registration or authentication and content is rarely removed, they've become permanent archives of leaked secrets. Container registries expose significant attack surface. Container registry attack surface includes secrets embedded in image layers where 30,000 unique secrets were found in 19,000 images, with 10% of scanned Docker images containing secrets, and 1,200 secrets, 4%, being active and valid. Immutable cached layers contain 85% of embedded secrets that can't be removed, exposed registries with 117 Docker registries accessible without authentication, unsecured registries allowing pull, push, and delete operations, and source code exposure where full application code is accessible by pulling images. GitGuardian's analysis of 200,000 publicly available Docker images revealed a staggering secret exposure problem. Even more alarming, 99% of images containing active secrets were pulled in 2024, demonstrating real-world exploitation. Unit 42's research identified 941 Docker registries exposed to the internet, with 117 accessible without authentication containing 2,956 repositories, 15,887 tags, and full source code and historical versions. Out of 117 unsecured registries, 80 allow pull operations to download images, 92 allow push operations to upload malicious images, and 7 allow delete operations for ransomware potential. Sysdig's analysis of over 250,000 Linux images on Docker Hub found 1,652 malicious images including cryptominers, most common, embedded secrets, second most prevalent, SSH keys and public keys for backdoor implants, API keys and authentication tokens, and database credentials. The secrets found in container images included AWS access keys, database passwords, SSH private keys, API tokens for cloud services, GitHub personal access tokens, and TLS certificates. Shadow IT includes unapproved SaaS applications like Dropbox, Google Drive, and personal cloud storage used for work. Personal devices like BYOD laptops, tablets, and smartphones accessing corporate data. Rogue cloud deployments where developers spin up AWS instances without approval. Unauthorized messaging apps like WhatsApp, Telegram, and Signal used for business communication. Unapproved IoT devices like smart speakers, wireless cameras, and fitness trackers on the corporate network. Gartner estimates that shadow IT makes up 30-40% of IT spending in large companies, and 76% of organizations surveyed experienced cyberattacks due to exploitation of unknown, unmanaged, or poorly managed assets. Shadow IT expands your attack surface because it's not protected by your security controls, it's not monitored by your security team, it's not included in your vulnerability scans, it's not patched by your IT department, and it often has weak or default credentials. And you can't secure what you don't know exists. Bring Your Own Device, BYOD, policies sound great for employee flexibility and cost savings. For security teams, they're a nightmare. BYOD expands your attack surface by introducing unmanaged endpoints like personal devices without EDR, antivirus, or encryption. Mixing personal and business use where work data is stored alongside personal apps with unknown security. Connecting from untrusted networks like public Wi-Fi and home networks with compromised routers. Installing unapproved applications with malware or excessive permissions. Lacking consistent security updates with devices running outdated operating systems. Common BYOD security issues include data leakage through personal cloud backup services, malware infections from personal app downloads, lost or stolen devices containing corporate data, family members using devices that access work systems, and lack of IT visibility and control. The 60% of small and mid-sized businesses that close within six months of a major cyberattack often have BYOD-related security gaps as contributing factors. Remote access infrastructure like VPNs and Remote Desktop Protocol, RDP, are among the most exploited attack vectors. SSL VPN appliances from vendors like Fortinet, SonicWall, Check Point, and Palo Alto are under constant attack. VPN attack vectors include authentication bypass vulnerabilities with CVEs allowing attackers to hijack active sessions, credential stuffing through brute-forcing VPN logins with leaked credentials, exploitation of unpatched vulnerabilities with critical CVEs in VPN appliances, and configuration weaknesses like default credentials, weak passwords, and lack of MFA. Real-world attacks demonstrate the risk. Check Point SSL VPN CVE-2024-24919 allowed authentication bypass for session hijacking. Fortinet SSL-VPN vulnerabilities were leveraged for lateral movement and privilege escalation. SonicWall CVE-2024-53704 allowed remote authentication bypass for SSL VPN. Once inside via VPN, attackers conduct network reconnaissance, lateral movement, and privilege escalation. RDP is worse. Sophos found that cybercriminals abused RDP in 90% of attacks they investigated. External remote services like RDP were the initial access vector in 65% of incident response cases. RDP attack vectors include exposed RDP ports with port 3389 open to the internet, weak authentication with simple passwords vulnerable to brute force, lack of MFA with no second factor for authentication, and credential reuse from compromised passwords in data breaches. In one Darktrace case, attackers compromised an organization four times in six months, each time through exposed RDP ports. The attack chain went successful RDP login, internal reconnaissance via WMI, lateral movement via PsExec, and objective achievement. The Palo Alto Unit 42 Incident Response report found RDP was the initial attack vector in 50% of ransomware deployment cases. Email infrastructure remains a primary attack vector. Your email attack surface includes mail servers like Exchange, Office 365, and Gmail with configuration weaknesses, email authentication with misconfigured SPF, DKIM, and DMARC records, phishing-susceptible users targeted through social engineering, email attachments and links as malware delivery mechanisms, and compromised accounts through credential stuffing or password reuse. Email authentication misconfiguration is particularly insidious. If your SPF, DKIM, and DMARC records are wrong or missing, attackers can spoof emails from your domain, your legitimate emails get marked as spam, and phishing emails impersonating your organization succeed. Email servers themselves are also targets. The NSA released guidance on Microsoft Exchange Server security specifically because Exchange servers are so frequently compromised. Container orchestration platforms like Kubernetes introduce massive attack surface complexity. The Kubernetes attack surface includes the Kubernetes API server with exposed or misconfigured API endpoints, container images with vulnerabilities in base images or application layers, container registries like Docker Hub, ECR, and GCR with weak access controls, pod security policies with overly permissive container configurations, network policies with insufficient micro-segmentation between pods, secrets management with hardcoded secrets or weak secret storage, and RBAC misconfigurations with overly broad service account permissions. Container security issues include containers running as root with excessive privileges, exposed Docker daemon sockets allowing container escape, vulnerable dependencies in container images, and lack of runtime security monitoring. The Docker daemon attack surface is particularly concerning. Running containers with privileged access or allowing docker.sock access can enable container escape and host compromise. Serverless computing like AWS Lambda, Azure Functions, and Google Cloud Functions promised to eliminate infrastructure management. Instead, it just created new attack surfaces. Serverless attack surface components include function code vulnerabilities like injection flaws and insecure dependencies, IAM misconfigurations with overly permissive Lambda execution roles, environment variables storing secrets as plain text, function URLs with publicly accessible endpoints without authentication, and event source mappings with untrusted input from various cloud services. The overabundance of event sources expands the attack surface. Lambda functions can be triggered by S3 events, API Gateway requests, DynamoDB streams, SNS topics, EventBridge schedules, IoT events, and dozens more. Each event source is a potential injection point. If function input validation is insufficient, attackers can manipulate event data to exploit the function. Real-world Lambda attacks include credential theft by exfiltrating IAM credentials from environment variables, lateral movement using over-permissioned roles to access other AWS resources, and data exfiltration by invoking functions to query and extract database contents. The Scarlet Eel adversary specifically targeted AWS Lambda for credential theft and lateral movement. Microservices architecture multiplies attack surface by decomposing monolithic applications into dozens or hundreds of independent services. Each microservice has its own attack surface including authentication mechanisms where each service needs to verify requests, authorization rules where each service enforces access controls, API endpoints for service-to-service communication channels, data stores where each service may have its own database, and network interfaces where each service exposes network ports. Microservices security challenges include east-west traffic vulnerabilities with service-to-service communication without encryption or authentication, authentication and authorization complexity from managing auth across 40 plus services multiplied by 3 environments equaling 240 configurations, service-to-service trust where services blindly trust internal traffic, network segmentation failures with flat networks allowing unrestricted pod-to-pod communication, and inconsistent security policies with different services having different security standards. One compromised microservice can enable lateral movement across the entire application. Without proper network segmentation and zero trust architecture, attackers pivot from service to service. How do you measure something this large, right. Attack surface measurement is complex. Attack surface metrics include the total number of assets with all discovered systems, applications, and devices, newly discovered assets found through continuous discovery, the number of exposed assets accessible from the internet, open ports and services with network services listening for connections, vulnerabilities by severity including critical, high, medium, and low CVEs, mean time to detect, MTTD, measuring how quickly new assets are discovered, mean time to remediate, MTTR, measuring how quickly vulnerabilities are fixed, shadow IT assets that are unknown or unmanaged, third-party exposure from vendor and partner access points, and attack surface change rate showing how rapidly the attack surface evolves. Academic research has produced formal attack surface measurement methods. Pratyusa Manadhata's foundational work defines attack surface as a three-tuple, System Attackability, Channel Attackability, Data Attackability. But in practice, most organizations struggle with basic attack surface visibility, let alone quantitative measurement. Your attack surface isn't static. It changes constantly. Changes happen because developers deploy new services and APIs, cloud auto-scaling spins up new instances, shadow IT appears as employees adopt unapproved tools, acquisitions bring new infrastructure into your environment, IoT devices get plugged into your network, and subdomains get created for new projects. Static, point-in-time assessments are obsolete. You need continuous asset discovery and monitoring. Continuous discovery methods include automated network scanning for regular scans to detect new devices, cloud API polling to query cloud provider APIs for resource changes, DNS monitoring to track new subdomains via Certificate Transparency logs, passive traffic analysis to observe network traffic and identify assets, integration with CMDB or ITSM to sync with configuration management databases, and cloud inventory automation using Infrastructure as Code to track deployments. Understanding your attack surface is step one. Reducing it is the goal. Attack surface reduction begins with asset elimination by removing unnecessary assets entirely. This includes decommissioning unused servers and applications, deleting abandoned subdomains and DNS records, shutting down forgotten development environments, disabling unused network services and ports, and removing unused user accounts and service identities. Access control hardening implements least privilege everywhere by enforcing multi-factor authentication, MFA, for all remote access, using role-based access control, RBAC, for cloud resources, implementing zero trust network architecture, restricting network access with micro-segmentation, and applying the principle of least privilege to IAM roles. Exposure minimization reduces what's visible to attackers by moving services behind VPNs or bastion hosts, using private IP ranges for internal services, implementing network address translation, NAT, for outbound access, restricting API endpoints to authorized sources only, and disabling unnecessary features and functionalities. Security hardening strengthens what remains by applying security patches promptly, using security configuration baselines, enabling encryption for data in transit and at rest, implementing Web Application Firewalls, WAF, for web apps, and deploying endpoint detection and response, EDR, on all devices. Monitoring and detection watch for attacks in progress by implementing real-time threat detection, enabling comprehensive logging and SIEM integration, deploying intrusion detection and prevention systems, IDS/IPS, monitoring for anomalous behavior patterns, and using threat intelligence feeds to identify known bad actors. Your attack surface is exponentially larger than you think it is. Every asset you know about probably has three you don't. Every known vulnerability probably has ten undiscovered ones. Every third-party integration probably grants more access than you realize. Every collaboration tool is leaking more data than you imagine. Every paste site contains more of your secrets than you want to admit. And attackers know this. They're not just looking at what you think you've secured. They're systematically enumerating every possible entry point. They're mining Certificate Transparency logs for forgotten subdomains. They're scanning every IP in your address space. They're reverse-engineering your mobile apps. They're buying employee credentials from data breach databases. They're compromising your vendors to reach you. They're scraping Pastebin for your leaked secrets. They're pulling your public Docker images and extracting the embedded credentials. They're accessing your misconfigured S3 buckets and exfiltrating terabytes of data. They're exploiting your exposed Jenkins instances to compromise your entire infrastructure. They're manipulating your AI agents to exfiltrate private Notion data. The asymmetry is brutal. You have to defend every single attack vector. They only need to find one that works. So what do you do. Start by accepting that you don't have complete visibility. Nobody does. But you can work toward better visibility through continuous discovery, automated asset management, and integration of security tools that help map your actual attack surface. Implement attack surface reduction aggressively. Every asset you eliminate is one less thing to defend. Every service you shut down is one less potential vulnerability. Every piece of shadow IT you discover and bring under management is one less blind spot. Every misconfigured cloud storage bucket you fix is terabytes of data no longer exposed. Every leaked secret you rotate is one less credential floating around the internet. Adopt zero trust architecture. Stop assuming that anything, internal services, microservices, authenticated users, collaboration tools, is inherently trustworthy. Verify everything. Monitor paste sites and code repositories. Your secrets are out there. Find them before attackers weaponize them. Secure your collaboration tools. Slack, Trello, Jira, Confluence, Notion, Google Drive, and Airtable are all leaking data. Lock them down. Fix your container security. Scan images for secrets. Use secret managers instead of environment variables. Secure your registries. Harden your CI/CD pipelines. Jenkins, GitHub Actions, and GitLab CI are high-value targets. Protect them. And test your assumptions with red team exercises and continuous security testing. Your attack surface is what an attacker can reach, not what you think you've secured. The attack surface problem isn't getting better. Cloud adoption, DevOps practices, remote work, IoT proliferation, supply chain complexity, collaboration tool sprawl, and container adoption are all expanding organizational attack surfaces faster than security teams can keep up. But understanding the problem is the first step toward managing it. And now you understand exactly how catastrophically large your attack surface actually is.

1 views
Simon Willison 1 months ago

Video + notes on upgrading a Datasette plugin for the latest 1.0 alpha, with help from uv and OpenAI Codex CLI

I'm upgrading various plugins for compatibility with the new Datasette 1.0a20 alpha release and I decided to record a video of the process. This post accompanies that video with detailed additional notes. I picked a very simple plugin to illustrate the upgrade process (possibly too simple). datasette-checkbox adds just one feature to Datasette: if you are viewing a table with boolean columns (detected as integer columns with names like or or ) and your current user has permission to update rows in that table it adds an inline checkbox UI that looks like this: I built the first version with the help of Claude back in August 2024 - details in this issue comment . Most of the implementation is JavaScript that makes calls to Datasette 1.0's JSON write API . The Python code just checks that the user has the necessary permissions before including the extra JavaScript. The first step in upgrading any plugin is to run its tests against the latest Datasette version. Thankfully makes it easy to run code in scratch virtual environments that include the different code versions you want to test against. I have a test utility called (for "test against development Datasette") which I use for that purpose. I can run it in any plugin directory like this: And it will run the existing plugin tests against whatever version of Datasette I have checked out in my directory. You can see the full implementation of (and its friend described below) in this TIL - the basic version looks like this: I started by running in the directory, and got my first failure... but it wasn't due to permissions, it was because the for the plugin was pinned to a specific mismatched version of Datasette: I fixed this problem by swapping to and ran the tests again... and they passed! Which was a problem because I was expecting permission-related failures. It turns out when I first wrote the plugin I was lazy with the tests - they weren't actually confirming that the table page loaded without errors. I needed to actually run the code myself to see the expected bug. First I created myself a demo database using sqlite-utils create-table : Then I ran it with Datasette against the plugin's code like so: Sure enough, visiting produced a 500 error about the missing method. The next step was to update the test to also trigger this error: And now fails as expected. It this point I could have manually fixed the plugin itself - which would likely have been faster given the small size of the fix - but instead I demonstrated a bash one-liner I've been using to apply these kinds of changes automatically: runs OpenAI Codex in non-interactive mode - it will loop until it has finished the prompt you give it. I tell it to consult the subset of the Datasette upgrade documentation that talks about Datasette permissions and then get the command to pass its tests. This is an example of what I call designing agentic loops - I gave Codex the tools it needed ( ) and a clear goal and let it get to work on my behalf. The remainder of the video covers finishing up the work - testing the fix manually, commiting my work using: Then shipping a 0.1a4 release to PyPI using the pattern described in this TIL . Finally, I demonstrated that the shipped plugin worked in a fresh environment using like this: Executing this command installs and runs a fresh Datasette instance with a fresh copy of the new alpha plugin ( ). It's a neat way of confirming that freshly released software works as expected. This video was shot in a single take using Descript , with no rehearsal and perilously little preparation in advance. I recorded through my AirPods and applied the "Studio Sound" filter to clean up the audio. I pasted in a closing slide from my previous video and exported it locally at 1080p, then uploaded it to YouTube. Something I learned from the Software Carpentry instructor training course is that making mistakes in front of an audience is actively helpful - it helps them see a realistic version of how software development works and they can learn from watching you recover. I see this as a great excuse for not editing out all of my mistakes! I'm trying to build new habits around video content that let me produce useful videos while minimizing the amount of time I spend on production. I plan to iterate more on the format as I get more comfortable with the process. I'm hoping I can find the right balance between production time and value to viewers. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Luke Hsiao 1 months ago

Switching from GPG to age

It’s been several years since I went through all the trouble of setting up my own GPG keys and securing them in YubiKeys following drduh’s guide . With that approach, you generate one key securely offline and store it on multiple YubiKeys, along with a backup. It has worked well for me for years, and as the Lindy effect suggests, it would almost certainly continue to. But as my sub-keys were nearing expiration, I was faced with either renewing (more convenient, no forward secrecy ) or rotating them (rather painful, but potentially more secure). However, I’ve realized that I essentially only use these keys for encryption, and almost never for signing. So, instead of doing either of the usual options, I’m going to let my keys expire entirely. I’m now experimenting with , which touts itself as “simple, modern, and secure encryption”. If needed, I will use for signatures. This required changing a couple of things in my typical workflow. First, and foremost, I needed to switch from to , a fork of that uses as the backend. This was actually surprisingly easy because includes a simple bash script to do the migration. There is no installer for , and no Arch packages. But it’s easy enough to install because it’s just a shell script you can throw on your . Note that for Arch, I also needed to install , which it assumes you have. I also name as on my machine. The benefit of this is everything that had integration has continued to “just work”. For example, , my email client of choice, behaved exactly the same after the migration. I would occasionally use as my SSH agent on my machines. It was convenient. However, I also like the idea of having a dedicated SSH key per machine. It makes monitoring their usage and revoking them much finer-grained. The absence of forced me to set up new keys on all my machines and add them to various servers/services. Easy encryption with chezmoi While I was tending to the encryption area of my personal tech “garden”, I also started leveraging chezmoi’s encryption features . I already use for my configuration files, but with encryption, I could also easily add “secrets” to my public dotfiles repo . In my case so far, this just means my copies of my favorite paid font: Berkeley Mono . also has a nice guide for configuring chezmoi to encrypt while asking for a passphrase only once for . I was also very pleasantly surprised with how easy it was to switch to ! Last time I set up the GPG keys on my YubiKeys, I spent several hours. This time, with the help of and embracing the idea of having unique keys on each YubiKey, but encrypting everything for multiple recipients, setting up my keys was surprisingly trivial. It also generates the keys securely on the hardware key itself, which is nice. The whole process probably took 30 minutes. It was so easy that in the future, I’m very much not intimidated by the thought of rotating keys. Did I need to switch to ? Of course not. However, over my career, I repeatedly find that exploring new tools for your core workflows (part of investing in interfaces ) is just plain fun. I often learn new ways of thinking about problems. Sometimes, you walk away with a new default that brings some fresh ideas and some delight to your life. Other times, you walk away with your trusty old tool, with greater appreciation for its history and the hard-earned approach it has established. For my uses at the moment, definitely falls into the former camp.

0 views
xenodium 1 months ago

agent-shell 0.17 improvements + MELPA

While it's only been a few weeks since the last agent-shell post , there are plenty of new updates to share. What's agent-shell again? A native Emacs shell to interact with any LLM agent powered by ACP ( Agent Client Protocol ). Before getting to the latest and greatest, I'd like to say thank you to new and existing sponsors backing my projects. While the work going in remains largely unsustainable, your contributions are indeed helping me get closer to sustainability. Thank you! If you benefit from my content and projects, please consider sponsoring to make the work sustainable. Work paying for your LLM tokens and other tools? Why not get your employer to sponsor agent-shell also? Now on to the very first update… Both agent-shell and acp.el are now available on MELPA. As such, installation now boils down to: OpenCode and Qwen Code are two of the latest agents to join agent-shell . Both accessible via and through the agent picker, but also directly from and . Adding files as context has seen quite a few improvements in different shapes. Thank you Ian Davidson for contributing embedded context support. Invoke to take a screenshot and automatically send it over to . A little side-note, did you notice the activity indicator in the header bar? Yep. That's new too. While file completion remains experimental, you can enable via: From any file you can now invoke to send the current file to . If region is selected, region information is sent also. Fancy sending a different file other than current one? Invoke with , or just use . , also operates on files (selection or region), DWIM style ;-) You may have noticed paths in section titles are no longer displayed as absolute paths. We're shortening those relative to project roots. While you can invoke with prefix to create new shells, is now available (and more discoverable than ). Cancelling prompt sessions (via ) is much more reliable now. If you experienced a shell getting stuck after cancelling a session, that's because we were missing part of the protocol implementation. This is now implemented. Use the new to automatically insert shell (ie. bash) command output. Initial work for automatically saving markdown transcripts is now in place. We're still iterating on it, but if keen to try things out, you can enable as follows: Text header Applied changes are now displayed inline. The new and can now be used to change the session mode. You can now find out what capabilities and session modes are supported by your agent. Expand either of the two sections. Tired of pressing and to accept changes from the diff buffer? Now just press from the diff viewer to accept all hunks. Same goes for rejecting. No more and . Now just press from the diff buffer. We get a new basic transient menu. Currently available via . We got lots of awesome pull requests from wonderful folks. Thank you for your contributions! Beyond what's been showcased here, much love and effort's been poured into polishing the experience. Interested in the nitty-gritty? Have a look through the 173 commits since the last blog post. If agent-shell or acp.el are useful to you, please consider sponsoring its development. LLM tokens aren't free, and neither is the time dedicated to building this stuff ;-) Arthur Heymans : Add a Package-Requires header ( PR ). Elle Najt : Execute commands in devcontainer ( PR ). Elle Najt : Fix Write tool diff preview for new files ( PR ). Elle Najt : Inline display of historical changes ( PR ). Elle Najt : Live Markdown transcripts ( PR ). Elle Najt : Prompt session mode cycling and modeline display ( PR ). Fritz Grabo : Devcontainer fallback workspace ( PR ). Guilherme Pires : Codex subscription auth ( PR ). Hordur Freyr Yngvason : Make qwen authentication optional ( PR ). Ian Davidson : Embedded context support ( PR ). Julian Hirn : Fix quick-diff window restoration for full-screen ( PR ). Ruslan Kamashev : Hide header line altogether ( PR ). festive-onion : Show Planning mode more reliably ( PR ).

0 views

How I Use Every Claude Code Feature

I use Claude Code. A lot. As a hobbyist, I run it in a VM several times a week on side projects, often with to vibe code whatever idea is on my mind. Professionally, part of my team builds the AI-IDE rules and tooling for our engineering team that consumes several billion tokens per month just for codegen. The CLI agent space is getting crowded and between Claude Code, Gemini CLI, Cursor, and Codex CLI, it feels like the real race is between Anthropic and OpenAI. But TBH when I talk to other developers, their choice often comes down to what feels like superficials—a “lucky” feature implementation or a system prompt “vibe” they just prefer. At this point these tools are all pretty good. I also feel like folks often also over index on the output style or UI. Like to me the “you’re absolutely right!” sycophancy isn’t a notable bug; it’s a signal that you’re too in-the-loop. Generally my goal is to “shoot and forget”—to delegate, set the context, and let it work. Judging the tool by the final PR and not how it gets there. Having stuck to Claude Code for the last few months, this post is my set of reflections on Claude Code’s entire ecosystem. We’ll cover nearly every feature I use (and, just as importantly, the ones I don’t), from the foundational file and custom slash commands to the powerful world of Subagents, Hooks, and GitHub Actions. This post ended up a bit long and I’d recommend it as more of a reference than something to read in entirety. The single most important file in your codebase for using Claude Code effectively is the root . This file is the agent’s “constitution,” its primary source of truth for how your specific repository works. How you treat this file depends on the context. For my hobby projects, I let Claude dump whatever it wants in there. For my professional work, our monorepo’s is strictly maintained and currently sits at 13KB (I could easily see it growing to 25KB). It only documents tools and APIs used by 30% (arbitrary) or more of our engineers (else tools are documented in product or library specific markdown files) We’ve even started allocating effectively a max token count for each internal tool’s documentation, almost like selling “ad space” to teams. If you can’t explain your tool concisely, it’s not ready for the . Over time, we’ve developed a strong, opinionated philosophy for writing an effective . Start with Guardrails, Not a Manual. Your should start small, documenting based on what Claude is getting wrong. Don’t -File Docs. If you have extensive documentation elsewhere, it’s tempting to -mention those files in your . This bloats the context window by embedding the entire file on every run. But if you just mention the path, Claude will often ignore it. You have to pitch the agent on why and when to read the file. “For complex … usage or if you encounter a , see for advanced troubleshooting steps.” Don’t Just Say “Never.” Avoid negative-only constraints like “Never use the flag.” The agent will get stuck when it thinks it must use that flag. Always provide an alternative. Use as a Forcing Function. If your CLI commands are complex and verbose, don’t write paragraphs of documentation to explain them. That’s patching a human problem. Instead, write a simple bash wrapper with a clear, intuitive API and document that . Keeping your as short as possible is a fantastic forcing function for simplifying your codebase and internal tooling. Here’s a simplified snapshot: Finally, we keep this file synced with an file to maintain compatibility with other AI IDEs that our engineers might be using. If you are looking for more tips for writing markdown for coding agents see “AI Can’t Read Your Docs”, “AI-powered Software Engineering”, and “How Cursor (AI IDE) Works”. The Takeaway: Treat your as a high-level, curated set of guardrails and pointers. Use it to guide where you need to invest in more AI (and human) friendly tools, rather than trying to make it a comprehensive manual. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. I recommend running mid coding session at least once to understand how you are using your 200k token context window (even with Sonnet-1M, I don’t trust that the full context window is actually used effectively). For us a fresh session in our monorepo costs a baseline ~20k tokens (10%) with the remaining 180k for making your change — which can fill up quite fast. A screenshot of /context in one of my recent side projects. You can almost think of this like disk space that fills up as you work on a feature. After a few minutes or hours you’ll need to clear the messages (purple) to make space to continue. I have three main workflows: (Avoid): I avoid this as much as possible. The automatic compaction is opaque, error-prone, and not well-optimized. + (Simple Restart): My default reboot. I the state, then run a custom command to make Claude read all changed files in my git branch. “Document & Clear” (Complex Restart): For large tasks. I have Claude dump its plan and progress into a , the state, then start a new session by telling it to read the and continue. The Takeaway: Don’t trust auto-compaction. Use for simple reboots and the “Document & Clear” method to create durable, external “memory” for complex tasks. I think of slash commands as simple shortcuts for frequently used prompts, nothing more. My setup is minimal: : The command I mentioned earlier. It just prompts Claude to read all changed files in my current git branch. : A simple helper to clean up my code, stage it, and prepare a pull request. IMHO if you have a long list of complex, custom slash commands, you’ve created an anti-pattern. To me the entire point of an agent like Claude is that you can type almost whatever you want and get a useful, mergable result. The moment you force an engineer (or non-engineer) to learn a new, documented-somewhere list of essential magic commands just to get work done, you’ve failed. The Takeaway: Use slash commands as simple, personal shortcuts, not as a replacement for building a more intuitive and better-tooled agent. On paper, custom subagents are Claude Code’s most powerful feature for context management. The pitch is simple: a complex task requires tokens of input context (e.g., how to run tests), accumulates tokens of working context, and produces a token answer. Running tasks means tokens in your main window. The subagent solution is to farm out the work to specialized agents, which only return the final token answers, keeping your main context clean. I find they are a powerful idea that, in practice, custom subagents create two new problems: They Gatekeep Context: If I make a subagent, I’ve now hidden all testing context from my main agent. It can no longer reason holistically about a change. It’s now forced to invoke the subagent just to know how to validate its own code. They Force Human Workflows: Worse, they force Claude into a rigid, human-defined workflow. I’m now dictating how it must delegate, which is the very problem I’m trying to get the agent to solve for me. My preferred alternative is to use Claude’s built-in feature to spawn clones of the general agent. I put all my key context in the . Then, I let the main agent decide when and how to delegate work to copies of itself. This gives me all the context-saving benefits of subagents without the drawbacks. The agent manages its own orchestration dynamically. In my “Building Multi-Agent Systems (Part 2)” post, I called this the “Master-Clone” architecture, and I strongly prefer it over the “Lead-Specialist” model that custom subagents encourage. The Takeaway: Custom subagents are a brittle solution. Give your main agent the context (in ) and let it use its own feature to manage delegation. On a simple level, I use and frequently. They’re great for restarting a bugged terminal or quickly rebooting an older session. I’ll often a session from days ago just to ask the agent to summarize how it overcame a specific error, which I then use to improve our and internal tooling. More in the weeds, Claude Code stores all session history in to tap into the raw historical session data. I have scripts that run meta-analysis on these logs, looking for common exceptions, permission requests, and error patterns to help improve agent-facing context. The Takeaway: Use and to restart sessions and uncover buried historical context. Hooks are huge. I don’t use them for hobby projects, but they are critical for steering Claude in a complex enterprise repo. They are the deterministic “must-do” rules that complement the “should-do” suggestions in . We use two types: Block-at-Submit Hooks: This is our primary strategy. We have a hook that wraps any command. It checks for a file, which our test script only creates if all tests pass. If the file is missing, the hook blocks the commit, forcing Claude into a “test-and-fix” loop until the build is green. Hint Hooks: These are simple, non-blocking hooks that provide “fire-and-forget” feedback if the agent is doing something suboptimal. We intentionally do not use “block-at-write” hooks (e.g., on or ). Blocking an agent mid-plan confuses or even “frustrates” it. It’s far more effective to let it finish its work and then check the final, completed result at the commit stage. The Takeaway: Use hooks to enforce state validation at commit time ( ). Avoid blocking at write time—let the agent finish its plan, then check the final result. Planning is essential for any “large” feature change with an AI IDE. For my hobby projects, I exclusively use the built-in planning mode. It’s a way to align with Claude before it starts, defining both how to build something and the “inspection checkpoints” where it needs to stop and show me its work. Using this regularly builds a strong intuition for what minimal context is needed to get a good plan without Claude botching the implementation. In our work monorepo, we’ve started rolling out a custom planning tool built on the Claude Code SDK. Its similar to native plan mode but heavily prompted to align its outputs with our existing technical design format. It also enforces our internal best practices—from code structure to data privacy and security—out of the box. This lets our engineers “vibe plan” a new feature as if they were a senior architect (or at least that’s the pitch). The Takeaway: Always use the built-in planning mode for complex changes to align on a plan before the agent starts working. I agree with Simon Willison’s : Skills are (maybe) a bigger deal than MCP. If you’ve been following my posts, you’ll know I’ve drifted away from MCP for most dev workflows, preferring to build simple CLIs instead (as I argued in “AI Can’t Read Your Docs” ). My mental model for agent autonomy has evolved into three stages: Single Prompt: Giving the agent all context in one massive prompt. (Brittle, doesn’t scale). Tool Calling: The “classic” agent model. We hand-craft tools and abstract away reality for the agent. (Better, but creates new abstractions and context bottlenecks). Scripting : We give the agent access to the raw environment—binaries, scripts, and docs—and it writes code on the fly to interact with them. With this model in mind, Agent Skills are the obvious next feature. They are the formal productization of the “Scripting” layer. If, like me, you’ve already been favoring CLIs over MCP, you’ve been implicitly getting the benefit of Skills all along. The file is just a more organized, shareable, and discoverable way to document these CLIs and scripts and expose them to the agent. The Takeaway: Skills are the right abstraction. They formalize the “scripting”-based agent model, which is more robust and flexible than the rigid, API-like model that MCP represents. Skills don’t mean MCP is dead (see also “Everything Wrong with MCP” ). Previously, many built awful, context-heavy MCPs with dozens of tools that just mirrored a REST API ( , , ). The “Scripting” model (now formalized by Skills) is better, but it needs a secure way to access the environment. This to me is the new, more focused role for MCP. Instead of a bloated API, an MCP should be a simple, secure gateway that provides a few powerful, high-level tools: In this model, MCP’s job isn’t to abstract reality for the agent; its job is to manage the auth, networking, and security boundaries and then get out of the way. It provides the entry point for the agent, which then uses its scripting and context to do the actual work. The only MCP I still use is for Playwright , which makes sense—it’s a complex, stateful environment. All my stateless tools (like Jira, AWS, GitHub) have been migrated to simple CLIs. The Takeaway: Use MCPs that act as data gateways. Give the agent one or two high-level tools (like a raw data dump API) that it can then script against. Claude Code isn’t just an interactive CLI; it’s also a powerful SDK for building entirely new agents—for both coding and non-coding tasks. I’ve started using it as my default agent framework over tools like LangChain/CrewAI for most new hobby projects. I use it in three main ways: Massive Parallel Scripting: For large-scale refactors, bug fixes, or migrations, I don’t use the interactive chat. I write simple bash scripts that call in parallel. This is far more scalable and controllable than trying to get the main agent to manage dozens of subagent tasks. Building Internal Chat Tools: The SDK is perfect for wrapping complex processes in a simple chat interface for non-technical users. Like an installer that, on error, falls back to the Claude Code SDK to just fix the problem for the user. Or an in-house “ v0-at-home ” tool that lets our design team vibe-code mock frontends in our in-house UI framework, ensuring their ideas are high-fidelity and the code is more directly usable in frontend production code. Rapid Agent Prototyping: This is my most common use. It’s not just for coding. If I have an idea for any agentic task (e.g., a “threat investigation agent” that uses custom CLIs or MCPs), I use the Claude Code SDK to quickly build and test the prototype before committing to a full, deployed scaffolding. The Takeaway: The Claude Code SDK is a powerful, general-purpose agent framework. Use it for batch-processing code, building internal tools, and rapidly prototyping new agents before you reach for more complex frameworks. The Claude Code GitHub Action (GHA) is probably one of my favorite and most slept on features. It’s a simple concept: just run Claude Code in a GHA. But this simplicity is what makes it so powerful. It’s similar to Cursor’s background agents or the Codex managed web UI but is far more customizable. You control the entire container and environment, giving you more access to data and, crucially, much stronger sandboxing and audit controls than any other product provides. Plus, it supports all the advanced features like Hooks and MCP. We’ve used it to build custom “PR-from-anywhere” tooling. Users can trigger a PR from Slack, Jira, or even a CloudWatch alert, and the GHA will fix the bug or add the feature and return a fully tested PR 1 . Since the GHA logs are the full agent logs, we have an ops process to regularly review these logs at a company level for common mistakes, bash errors, or unaligned engineering practices. This creates a data-driven flywheel: Bugs -> Improved CLAUDE.md / CLIs -> Better Agent. The Takeaway: The GHA is the ultimate way to operationalize Claude Code. It turns it from a personal tool into a core, auditable, and self-improving part of your engineering system. Finally, I have a few specific configurations that I’ve found essential for both hobby and professional work. / : This is great for debugging. I’ll use it to inspect the raw traffic to see exactly what prompts Claude is sending. For background agents, it’s also a powerful tool for fine-grained network sandboxing. / : I bump these. I like running long, complex commands, and the default timeouts are often too conservative. I’m honestly not sure if this is still needed now that bash background tasks are a thing, but I keep it just in case. : At work, we use our enterprise API keys ( via apiKeyHelper ). It shifts us from a “per-seat” license to “usage-based” pricing, which is a much better model for how we work. It accounts for the massive variance in developer usage (We’ve seen 1:100x differences between engineers). It lets engineers to tinker with non-Claude-Code LLM scripts, all under our single enterprise account. : I’ll occasionally self-audit the list of commands I’ve allowed Claude to auto-run. The Takeaway: Your is a powerful place for advanced customization. That was a lot, but hopefully, you find it useful. If you’re not already using a CLI-based agent like Claude Code or Codex CLI, you probably should be. There are rarely good guides for these advanced features, so the only way to learn is to dive in. Thanks for reading Shrivu’s Substack! Subscribe for free to receive new posts and support my work. To me, a fairly interesting philosophical question is how many reviewers should a PR get that was generated directly from a customer request (no internal human prompter)? We’ve settled on 2 human approvals for any AI-initiated PR for now, but it is kind of a weird paradigm shift (for me at least) when it’s no longer a human making something for another human to review. It only documents tools and APIs used by 30% (arbitrary) or more of our engineers (else tools are documented in product or library specific markdown files) We’ve even started allocating effectively a max token count for each internal tool’s documentation, almost like selling “ad space” to teams. If you can’t explain your tool concisely, it’s not ready for the . Start with Guardrails, Not a Manual. Your should start small, documenting based on what Claude is getting wrong. Don’t -File Docs. If you have extensive documentation elsewhere, it’s tempting to -mention those files in your . This bloats the context window by embedding the entire file on every run. But if you just mention the path, Claude will often ignore it. You have to pitch the agent on why and when to read the file. “For complex … usage or if you encounter a , see for advanced troubleshooting steps.” Don’t Just Say “Never.” Avoid negative-only constraints like “Never use the flag.” The agent will get stuck when it thinks it must use that flag. Always provide an alternative. Use as a Forcing Function. If your CLI commands are complex and verbose, don’t write paragraphs of documentation to explain them. That’s patching a human problem. Instead, write a simple bash wrapper with a clear, intuitive API and document that . Keeping your as short as possible is a fantastic forcing function for simplifying your codebase and internal tooling. A screenshot of /context in one of my recent side projects. You can almost think of this like disk space that fills up as you work on a feature. After a few minutes or hours you’ll need to clear the messages (purple) to make space to continue. I have three main workflows: (Avoid): I avoid this as much as possible. The automatic compaction is opaque, error-prone, and not well-optimized. + (Simple Restart): My default reboot. I the state, then run a custom command to make Claude read all changed files in my git branch. “Document & Clear” (Complex Restart): For large tasks. I have Claude dump its plan and progress into a , the state, then start a new session by telling it to read the and continue. : The command I mentioned earlier. It just prompts Claude to read all changed files in my current git branch. : A simple helper to clean up my code, stage it, and prepare a pull request. They Gatekeep Context: If I make a subagent, I’ve now hidden all testing context from my main agent. It can no longer reason holistically about a change. It’s now forced to invoke the subagent just to know how to validate its own code. They Force Human Workflows: Worse, they force Claude into a rigid, human-defined workflow. I’m now dictating how it must delegate, which is the very problem I’m trying to get the agent to solve for me. Block-at-Submit Hooks: This is our primary strategy. We have a hook that wraps any command. It checks for a file, which our test script only creates if all tests pass. If the file is missing, the hook blocks the commit, forcing Claude into a “test-and-fix” loop until the build is green. Hint Hooks: These are simple, non-blocking hooks that provide “fire-and-forget” feedback if the agent is doing something suboptimal. Single Prompt: Giving the agent all context in one massive prompt. (Brittle, doesn’t scale). Tool Calling: The “classic” agent model. We hand-craft tools and abstract away reality for the agent. (Better, but creates new abstractions and context bottlenecks). Scripting : We give the agent access to the raw environment—binaries, scripts, and docs—and it writes code on the fly to interact with them. Massive Parallel Scripting: For large-scale refactors, bug fixes, or migrations, I don’t use the interactive chat. I write simple bash scripts that call in parallel. This is far more scalable and controllable than trying to get the main agent to manage dozens of subagent tasks. Building Internal Chat Tools: The SDK is perfect for wrapping complex processes in a simple chat interface for non-technical users. Like an installer that, on error, falls back to the Claude Code SDK to just fix the problem for the user. Or an in-house “ v0-at-home ” tool that lets our design team vibe-code mock frontends in our in-house UI framework, ensuring their ideas are high-fidelity and the code is more directly usable in frontend production code. Rapid Agent Prototyping: This is my most common use. It’s not just for coding. If I have an idea for any agentic task (e.g., a “threat investigation agent” that uses custom CLIs or MCPs), I use the Claude Code SDK to quickly build and test the prototype before committing to a full, deployed scaffolding. / : This is great for debugging. I’ll use it to inspect the raw traffic to see exactly what prompts Claude is sending. For background agents, it’s also a powerful tool for fine-grained network sandboxing. / : I bump these. I like running long, complex commands, and the default timeouts are often too conservative. I’m honestly not sure if this is still needed now that bash background tasks are a thing, but I keep it just in case. : At work, we use our enterprise API keys ( via apiKeyHelper ). It shifts us from a “per-seat” license to “usage-based” pricing, which is a much better model for how we work. It accounts for the massive variance in developer usage (We’ve seen 1:100x differences between engineers). It lets engineers to tinker with non-Claude-Code LLM scripts, all under our single enterprise account. : I’ll occasionally self-audit the list of commands I’ve allowed Claude to auto-run.

0 views
Jason Scheirer 1 months ago

I Don't Like that I Like Starship

I have my seed Starship config up as a Gist. Starship is a tool that frustrates me because it seems so bikesheddy and unneeded: a custom prompt manager. We already had shell prompt customization! I blindly install on new machines for prompt customization! And Starship is written in Rust. People just use Rust to be cute. Then I realize that it’s okay to have nice things. The command line environment from the 90s can change. doesn’t suck. is right up there with Perl in opaque tools other people do interesting things with and then I steal the interesting things for myself. The TUIs don’t suck. I mock Textual and Charm openly and unrepentantly and still hypocritically use and enjoy the tools built with them. The terminal is changing because the people using it have the agency and hubris to use it differently. I can have a growth mindset and accept that But its killer features: So here it is. All this talk for something that doesn’t look substantively different but compounds into feeling different over the course of the days and weeks I use it: The preferred command line tool to do a thing can change in my lifetime and The preferred workflow to accomplish a task can change in the face of new tools and The people who invented the old tools we’re replacing were not gods, they were just as fallible as us so Writing new tools that learn hard lessons from decades of using the old ones is fine. It’s not sacrilege. Limited/Constrained/Opinionated : Other prompt customization schemes I’ve used have let you do anything , but you had to know how to do anything . I can’t think of cleverness I want in my prompt, I just want to see which Git branch I’m on. We’re all doing that. We all want that. Starship has a way to do that which isn’t brittle bash I have to maintain myself. Multiplatform : I have Windows Bash, Windows PowerShell, Linux Bash/Zsh, and macOS Bash/Zsh all driven by the same seed config. I can have it show the same information everywhere. I can use little emblems to let me know if I’m on my Mac, on a Linux machine, if I’m on Windows and if that’s PowerShell or Bash all right there. The Nerd Font Dark Horse : This system takes advantage of the glyphs in Nerd Fonts and normalizes abusing them. This adds the additional “burden” of installing a Nerd Font enabled typeface on all my apps with terminal editors, but I’ve already normalized that.

0 views
マリウス 1 months ago

A Word on Omarchy

Pro tip: If you’ve arrived here via a link aggregator, feel free to skip ahead to the Summary for a conveniently digestible tl;dr that spares you all the tedious details, yet still provides enough ammunition to trash-talk this post in the comments of whatever platform you stumbled upon it. In the recent months, there has been a noticeable shift away from the Windows desktop, as well as from macOS , to Linux, driven by various frustrations, such as the Windows 11 Recall feature. While there have historically been more than enough Linux distributions to choose from, for each skill level and amount of desired pain, a recent Arch -based configuration has seemingly made strides across the Linux landscape: Omarchy . This pre-configured Arch system is the brainchild of David Heinemeier Hansson , a Danish web developer and entrepreneur known as one of the co-founders of 37signals and for developing the Ruby on Rails framework. The name Omarchy appears to be a portmanteau of Arch , the Linux distribution that Hansson ’s configuration is based upon, and お任せ, which translates to omakase and means to leave something up to someone else (任せる, makaseru, to entrust ). When ordering omakase in a restaurant, you’re leaving it up to the chef to serve you whatever they think is best. Oma(kase) + (A)rch + y is supposedly where the name comes from. It’s important to note that, contrary to what Hansson says in the introduction video , Omarchy is not an actual Linux distribution . Instead, it’s an opinionated installation of Arch Linux that aims to make it easy to set up and run an Arch desktop, seemingly with as much TUI-hacker-esque aesthetic as possible. Omarchy comes bundled with Hyprland , a tiling window manager that focuses on customizability and graphic effects, but apparently not as much on code quality and safety . However, the sudden hype around Omarchy , which at this point has attracted attention and seemingly even funding from companies like Framework (Computer Inc.) ( attention ) and Cloudflare ( attention and seemingly funding ), made me want to take a closer look at the supposed cool kid on the block to understand what it was all about. Omarchy is a pre-configured installation of the Arch distribution that comes with a TUI installer on a 6.2GB ISO. It ships with a collection of shell scripts that use existing FOSS software (e.g. walker ) to implement individual features. The project is based on the work that the FOSS community, especially the Arch Linux maintainers, have done over the years, and ties together individual components to offer a supposed ready-to-use desktop experience. Omarchy also adds some links to different websites, disguised as “Apps” , but more on that later. This, however, seems to be enough to spark an avalanche of attention and, more importantly, financial support for the project. Anyway, let’s give Omarchy an actual try, and see what chef Hansson recommended to us. The Omarchy installer is a simple text user interface that tries to replicate what Charm has pioneered with their TUI libraries: A smooth command-line interface that preserves the simplicity of the good old days , yet enhances the experience with playful colors, emojis, and animations for the younger, future generation of users. Unlike mature installers, Omarchy ’s installer script doesn’t allow for much customization, which is probably to be expected with an “Opinionated Arch/Hyprland Setup” . Info: Omarchy uses gum , a Charm tool, under the hood. One of the first things that struck me as unexpected was the fact that I was able to use as my user password, an easy-to-guess word that Omarchy will also use for the drive encryption, without any resistance from the installer. Most modern Linux distributions actively prevent users from setting easily guessable or brute-forceable passwords. Moreover, taking into account that the system relies heavily on sudo (instead of the more modern doas ), and also considering that the default installation configures the maximum number of password retries to 10 (instead of the more cautious limit of three), it raises an important question: Does Omarchy care about security? Let’s take a look at the Omarchy manual to find out: Omarchy takes security extremely seriously. This is meant to be an operating system that you can use to do Real Work in the Real World . Where losing a laptop can’t lead to a security emergency. According to the manual, taking security extremely seriously means enabling full-disk encryption (but without rejecting simple keys), blocking all ports except for 22 (SSH, on a desktop) and 53317 (LocalSend), continuously running (even though staying bleeding-edge has repeatedly proven to be in insufficient security measure in the past) and maintaining a Cloudflare protected package mirror. That’s seemingly all. Hm. Proceeding with the installation, the TUI prompts for an email address, which makes the whole process feel a bit like the Windows setup routine. While one might assume Omarchy is simply trying to accommodate its new user base, the actual reason appears to be much simpler: . If, however, you’d be expecting for Omarchy to set up GPG with proper defaults, configure SSH with equally secure defaults, and perhaps offer an option to create new GPG/SSH keys or import existing ones, in order to enable proper commit and push signing for Git, you will be left disappointed. Unfortunately, none of this is the case. The Git config doesn’t enable commit or push signing, neither the GPG nor the SSH client configurations set secure defaults, and the user isn’t offered a way to import existing keys or create new ones. Given that Hansson himself usually does not sign his commits, it seems that these aspects are not particularly high on the project’s list of priorities. The rest of the installer routine is fairly straightforward and offers little customization, so I won’t bore you with the details, but you can check the screenshots below. After initially downloading the official ISO file, the first boot of the system greets you with a terminal window informing you that it needs to update a few packages . And by “a few” it means another 1.8GB. I’m still not entirely sure why the v3.0.2 ISO is a hefty 6.2GB, or why it requires downloading an additional 1.8GB after installation on a system with internet access. For comparison, the official Arch installer image is just 1.4GB in size . While downloading the updates (which took over an hour for me), and with over 15GB of storage consumed on my hard drive, I set out to experience the full Omarchy goodness! After hovering over a few icons on the Waybar , I discovered the menu button on the very left. It’s not a traditional menu, but rather a shortcut to the aforementioned walker launcher tool, which contains a few submenus: The menu reads: Apps, Learn, Trigger, Style, Setup, Install, Remove, Update, About, System; It feels like a random assortment of categories, settings, package manager subcommands, and actions. From a UX perspective, this main menu doesn’t make much sense to me. But I’m feeling lucky, so let’s just go ahead and type “Browser” ! Hm, nothing. “Firefox” , maybe? Nope. “Chrome” ? Nah. “Chromium” ? No. Unfortunately the search in the menu is not universal and requires you to first click into the Apps category. The Apps category seems to list all available GUI (and some TUI) applications. Let’s take a look at the default apps that Omarchy comes with: The bundled “apps” are: 1Password, Alacritty, Basecamp, Bluetooth, Calculator, ChatGPT, Chromium, Discord, Disk Usage, Docker, Document Viewer, Electron 37, Figma, Files, GitHub, Google Contacts, Google Messages, Google Photos, HEY, Image Viewer, Kdenlive, LibreOffice, LibreOffice Base, LibreOffice Calc, LibreOffice Draw, LibreOffice Impress, LibreOffice Math, LibreOffice Writer, Limine-snapper-restore, LocalSend, Media Player, Neovim, OBS Studio, Obsidian, OpenJDK Java 25 Console, OpenJDK Java 25 Shell, Pinta, Print Settings, Signal, Spotify, Typora, WhatsApp, X, Xournal++, YouTube, Zoom; Aside from the fact that nearly a third of the apps are essentially just browser windows pointing to websites , which leaves me wondering where the 15GB of used storage went, the selection of apps is also… well, let’s call it opinionated , for now at least. Starting with the browser, Omarchy comes with Chromium by default, specifically version 141.0.7390.107 in my case, which, unlike, for example, ungoogled-chromium , has disabled support for manifest v2 and thus doesn’t include extensions like uBlock Origin or any other advanced add-ons. In fact, the browser is completely vanilla, with no decent configuration. The only extension it includes is the copy-url extension, which serves a rather obscure purpose: Providing a non-intuitive way to copy the current page’s URL to your clipboard using an even less intuitive shortcut ( ) while using any of the “Apps” that are essentially just browser windows without browser controls. Other than that, it’s pretty much stock Chromium. It allows all third-party cookies, doesn’t send “Do Not Track” requests, sends browsing data to Google Safe Browsing , but doesn’t enforce HTTPS. It has JavaScript optimization enabled for all websites, which increases the attack surface, and it uses Google as the default search engine. There’s not a single opinionated setting in the configuration of the default browser on Omarchy , let alone in the choice of browser itself. And the fact that the only extension installed and active by default is an obscure workaround for the lack of URL bars in “App” windows doesn’t exactly make this first impression of what is likely one of the most important components for the typical Omarchy user very appealing. Alright, let’s have a look at what is probably the second most important app after the browser for many people in the target audience: Basecamp ! Just kidding. Obviously, it’s the terminal. Omarchy comes with Alacritty by default, which is a bit of an odd choice in 2025, especially for a desktop that seemingly prioritizes form over function, given the ultra-conservative approach the Alacritty developers take toward anything related to form and sometimes even function. I would have rather expected Kitty , WezTerm , or Ghostty . That said, Alacritty works and is fairly configurable. Unfortunately, like the browser and various other tools such as Git, there’s little to no opinionated configuration happening, especially one that would enhance integration with the Omarchy ecosystem. Omarchy seemingly highlights the availability of NeoVim by default, yet doesn’t explicitly configure Alacritty’s vi mode , leaving it at its factory defaults . In fact, aside from the keybinding for full-screen mode, which is a less-than-ideal shortcut for anyone with a keyboard smaller than 100% (unless specifically mapped), the Alacritty config doesn’t define any other shortcuts to integrate the terminal more seamlessly into the supposed opinionated workflow. Not even the desktop’s key-repeat rate is configured to a reasonable value, as it takes about a second for it to kick in. Fun fact: When you leave your computer idling on your desk, the screensaver you’ll encounter isn’t an actual hyprlock that locks your desktop and uses PAM authentication to prevent unauthorized access. Instead, it’s a shell script that launches a full-screen Alacritty window to display a CPU-intensive ASCII animation. While Omarchy does use hyprlock , its timeout is set longer than that of the screensaver. Because you can’t dismiss the screensaver with your mouse (only with your keyboard) it might give inexperienced users a false sense of security. This is yet another example of prioritizing gimmicky animations over actual functionality and, to some degree, security. Like the browser and the terminal emulator, the default shell configuration is a pretty basic B….ash , and useful extensions like Starship are barely configured. For example, I ed into a boilerplate Python project directory, activated its venv , and expected Starship to display some useful information, like the virtual environment name or the Python version. However, none of these details appeared in my prompt. “Surely if I do the same in a Ruby on Rails project, Starship will show me some useful info!” I thought, and ed into a Rails boilerplate project. Nope. In fact… Omarchy doesn’t come with Rails pre-installed. I assume Hansson ’s target audience doesn’t primarily consist of Rails developers, despite the unconditional , but let’s not get ahead of ourselves. It is nevertheless puzzling that Omarchy doesn’t come with at least Ruby pre-installed. I find it a bit odd that the person who literally built the most successful Ruby framework on earth is pre-installing “Apps” like HEY , Spotify , and X , but not his own FOSS creation or even just the Ruby interpreter. If you want Rails , you have to navigate through the menu to “Install” , then “Development” , and finally select "‘Ruby on Rails" to make RoR available on your system. Not just Ruby , though. And even going the extra mile to do so still won’t make Starship display any additional useful info when inside a Rails project folder. PS: The script that installs these development tools bypasses the system’s default package manager and repository, opting instead to use mise to install interpreters and compilers. This is yet another example of security not being taken quite as seriously as it should be. At the very least, the script should inform the user that this is about to happen and offer the option to use the package manager instead, if the distributed version meets the user’s needs. Fun fact: At the time of writing, mise installed Ruby 3.4.7. The latest package available through the package manager is – you guessed it – 3.4.7. As mentioned earlier, Omarchy is built entirely using Bash scripts, and there’s nothing inherently wrong with that. When done correctly and kept at a sane limit, Bash scripts are powerful and relatively easy to maintain. However, the scripts in Omarchy are unfortunately riddled with little oversights that can cause issues. Those scripts are also used in places in which a proper software implementation would have made more sense. Take the theme scripts, for example. If you go ahead and create a new theme under and name it , and then run a couple of times until the tool hits your new theme, you can see one effect of these oversights. Nothing catastrophic happened, except now won’t work anymore. If you’d want to annoy an unsuspecting Omarchy user, you could do this: While this is such a tiny detail to complain about, it is an equally low-hanging fruit to write scripts in a way in which this won’t happen. Apart from the numerous places where globbing and word splitting can occur, there are other instances of code that could have also been written a little bit more elegantly. Take this line , for example: To drop and from the , you don’t have to call and pipe to . Instead, you can simply use Bash’s built-in regex matching to do so: Similarly, in this line there’s no need to test for a successful exit code with a dedicated check, when you can simply make the call from within the condition: And frankly, I have no idea what this line is supposed to be: What are you doing, Hansson? Are you alright? Make no mistake to believe that the remarks made above are the only issues with Hansson ’s scripts in Omarchy . While these specific examples are nitpicks, they paint a picture that is only getting less colorful the more we look into the details. We can continue to gauge the quality of the scripts by looking beyond just syntax. Take, for example, the migration : This script runs five commands in sequence within an condition: first , followed by two invocations, then again, and finally . While this might work as expected “on a sunny day” , the first command could fail for various reasons. If it does, the subsequent commands may encounter issues that the script doesn’t account for, and the outcome of this migration will be differently from what the author anticipated. For experienced users, the impact in such a case may be minimal, but for others, it may present a more significant hurdle. Furthermore, as can be seen in here , the invoking process cannot detect if only one of the five commands failed. As a result, the entire migration might be marked as skipped , despite changes being made to the system. But let’s continue to look into specifically the migrations in just a moment. The real concern here, however, is the widespread absence of exception handling, either through status code checks for previously executed commands or via dependent executions (e.g., ). In most scripts, there is no validation to ensure that actions have the desired effect and the current state actually represents the desired outcome. Almost all sequentially executed commands depend upon one another, yet the author doesn’t make sure that if fails the script won’t just blindly run . Note: Although sets , which would cause a script like the one presented above to fail when the first command fails, the migrations are invoked by sourcing the script. This script, in turn, invokes the script using the helper function . However, this function executes the script in the following way: In this case, the options are not inherited by the actual migration , meaning it won’t stop immediately when an error occurs. This behavior makes sense, as abruptly stopping the installation would leave the system in an undefined state. But even if we ignored that and assumed that migrations would stop when the first command would fail, it still wouldn’t actually handle the exception, but merely stop the following commands from performing actions on an unexpected state. To understand the broader issue and its impact on security, we need to dive deeper into the system’s functioning, and especially into migrations . This helps illustrate how the fragile nature of Omarchy could take a dangerous turn, especially considering the lack of tests, let alone any dedicated testing infrastructure. Let’s start by adding some context and examining how configurations are applied in Omarchy . Inspired by his work as a web developer, Hansson has attempted to bring concepts from his web projects into the scripts that shape his Linux setup. In Omarchy , configuration changes are handled through migration scripts, as we just saw, which are in principle similar to the database migrations you might recall from Rails projects. However, unlike SQL or the Ruby DSL used in Active Record Migrations , these Bash scripts do not merely contain a structured query language; They execute actual system commands during installation. More importantly: They are not idempotent by default! While the idea of migrations isn’t inherently problematic, in this case, it can (and has) introduce(d) issues that go/went unnoticed by the Omarchy maintainers for extended periods, but more on that in a second. The migration files in Omarchy are a collection of ambiguously named scripts, each containing a set of changes to the system. These changes aren’t confined to specific configuration files or components. They can be entirely arbitrary, depending on what the migration is attempting to implement at the time it is written. To modify a configuration file, these migrations typically rely on the command. For instance, the first migration intended to change from to might execute something like . The then following one would have to account for the previous change: . Another common approach involves removing a specific line with and appending the new settings via . However, since multiple migrations are executed sequentially, often touching the same files and running the same commands, determining the final state of a configuration file can become a tedious process. There is no clear indication of which migration modifies which file, nor any specific keywords (e.g., ) to grep for and help identify the relevant migration(s) when searching through the code. Moreover, because migrations rely on fixed paths and vary in their commands, it’s impossible to test them against mock files/folders, to predict their outcome. These scripts can invoke anything from sourcing other scripts to running commands, with no restrictions on what they can or cannot do. There’s no “framework” or API within which these scripts operate. To understand what I mean by that, let’s take a quick look at a fairly widely used pile of scripts that is of similar importance to a system’s functionality: OpenRC . While the init.d scripts in OpenRC are also just that, namely scripts, they follow a relatively well-defined API : Note: I’m not claiming that OpenRC ’s implementation is flawless or the ultimate solution, far from it. However, given the current state of the Omarchy project, it’s fair to say that OpenRC is significantly better within its existing constraints. Omarchy , however, does not use any sort of API for that matter. Instead, scripts can basically do whatever they want, in whichever way they deem adequate. Without such well defined interfaces , it is hard to understand the effects that migrations will have, especially when changes to individual services are split across a number of different migration scripts. Here’s a fun challenge: Try to figure out how your folder looks after installation by only inspecting the migration files. To make matters worse, other scripts (outside the migration folder) may also modify configurations that were previously altered by migrations , at runtime, such as . Note: To the disappointment of every NixOS user, unlike database migrations in Rails , the migrations in Omarchy don’t support rollbacks and, judging by their current structure, are unlikely to do so moving forward. The only chance Omarchy users have in case a migration should ever brick their existing system is to make use of the available snapshots . All of this (the lack of interfaces , the missing exception handling and checks for desired outcomes, the overlapping modification, etc.) creates a chaotic environment that is hard to overview and maintain, which can severely compromise system integrity and, by extension, security. Want an example? On my fresh installation, I wanted to validate the following claim from the manual : Firewall is enabled by default: All incoming traffic by default except for port 22 for ssh and port 53317 for LocalSend. We even lock down Docker access using the ufw-docker setup to prevent that your containers are accidentally exposed to the world. What I discovered upon closer inspection, however, is that Omarchy ’s firewall doesn’t actually run, despite its pre-configured ruleset . Yes, you read that right, everyone installing the v3.0.2 ISO (and presumably earlier versions) of Omarchy is left with a system that doesn’t block any of the ports that individual software might open during runtime. Please bear in mind that apart from the full-disk encryption, the firewall is the only security measure that Omarchy puts in place. And it’s off by default. Only once I manually enabled and started using / , it did activate the rules mentioned in the handbook. As highlighted in the original issue , it appears that, with the chaos that are the migration- , preflight- and first-run- scripts no one ever realized that you need to tell to explicitly enable a service for it to actually run. And because it’s all made up of Bash scripts that can do whatever they want, you cannot easily test these things to notice that the state that was expected for a specific service was not reached. Unlike in Rails , where you can initialize your (test) database and run each migration manually if necessary to make sure that the schema reaches the desired state and that the database is seeded correctly, this agglomeration of Bash scripts is not structured data. Hence, applying the same principle to something as arbitrary as a Bash script is not as easily possible, at least not without clearly defined structures and interfaces . As a user who trusted Omarchy to secure their installation, I would be upset, to say the least. The system failed to keep users safe, and more importantly, nobody noticed for a long time. There was no hotfix ISO issued, nor even a heads-up to existing users alongside the implemented fix ( e.g. ). While mistakes happen, simply brushing them under the rug feels like rather negligent behavior. When looking into the future, the mess that is the Bash scripts certainly won’t decrease in complexity, making me doubt that things like these won’t happen again. Note: The firewall fix was listed in v2.1.1. However, on my installation of v3.0.2 the firewall would still not come up automatically. I double-checked this by running the installation of v3.0.2 twice, and both times the firewall would not autostart after the second reboot. While writing this post, v3.1.0 ( update: v3.1.1 ) was released and I also checked the issue there. v3.1.0 appears to have finally fixed the firewall issue. Having that said, it shows how much of a mess the whole system is, when things that were identified and supposedly fixed multiple versions ago still don’t work in newer releases weeks later. Tl;dr: v3.1.0 appears to be the first release to actually fix the firewall issue, even though it was identified and presumably fixed in v2.1.1, according to the changelog. With the firewall active, it becomes apparent that Omarchy ’s configuration does indeed leave port 22 (SSH) open, even though the SSH daemon is not running by default. While I couldn’t find a clear explanation for why this port is left open on a desktop system without an active SSH server, my assumption is that it’s intended to allow the user to remotely access their workstation should they ever need to. It’s important to note that the file in Omarchy , like many other system files, remains unchanged. Users might reasonably assume that, since Omarchy intentionally leaves the SSH port open, it must have also configured the SSH server with sensible defaults. Unfortunately, this is not the case. In a typical Arch installation, users would eventually come across the “Protection” section on the OpenSSH wiki page, where they would learn about the crucial settings that should be adjusted for security reasons. However, when using a system like Omarchy , which is marketed as an opinionated setup that takes security seriously , users might expect these considerations to be handled for them, making it all the more troubling that no sensible configuration is in place, despite the deliberate decision to leave the SSH port open for future use. Hansson seemingly struggles to get even basics like right. The fact that there’s so little oversight, that users are allowed to set weak password for both, their account and drive encryption, and that the only other security measure put in place, the firewall, simply hasn’t been working, does not speak in favor of Omarchy . Info: is abstraction layer that simplifies managing the powerful / firewall and it stands for “ u ncomplicated f ire w all”. Going into this review I wasn’t expecting a hardened Linux installation with SELinux , intrusion detection mechanisms, and all these things. But Hansson is repeatedly addressing users of Windows and macOS (operating systems with working firewalls and notably more security measures in place) who are frustrated with their OS, as a target audience. At this point, however, Omarchy is a significantly worse option for those users. Not only does Omarchy give a hard pass on Linux Security Modules , linux-hardened , musl , hardened_malloc , or tools like OpenSnitch , and fails to properly address security-related topics like SSH, GPG or maybe even AGE and AGE/Yubikey , but it in fact weakens the system security with changes like the increase of and login password retries and the decrease of faillock timeouts . Omarchy appears to be undoing security measures that were put in place by the software- and by the Arch -developers, while the basis it uses for building the system does not appear to be reliable enough to protect its users from future mishaps. Then there is the big picture of Omarchy that Hansson tries to curate, which is that of a TUI-centered, hacker -esque desktop that promises productivity and so on. He even goes as far as calling it “a pro system” . However, as we clearly see from the implementation, configuration and the project’s approach to security, this is unlike anything you would expect from a pro system . The entire image of a TUI-centered productivity environment is further contradicted in many different places, primarily by the lack of opinions and configuration . If the focus is supposed to be on “pro” usage, and especially the command-line, then… The configuration doesn’t live up to its sales pitch, and there are many aspects that either don’t make sense or aren’t truly opinionated , meaning they’re no different from a standard Arch Linux installation. In fact, I would go as far as to say that Omarchy is barely a ready-to-use system at all out of the box and requires a lot of in-depth configuration of the underlying Arch distribution for it to become actually useful. Let’s look at only a few details. There are some fairly basic things you’ll miss on the “lightweight” 15GB installation of Omarchy : With the attention Omarchy is receiving, particularly from Framework (Computer Inc.) , it is surprising that there is no option to install the system on RAID1 hardware: I would argue that RAID1 is a fairly common use case, especially with Framework (Computer Inc.) 16" laptops, which support a secondary storage device. Considering that Omarchy is positioning itself to compete against e.g. macOS with TimeMachine , yet it does not include an automated off-drive backup solution for user data by default – which by the way is just another notable shortcoming we could discuss – and given that configuring a RAID1 root with encryption is notoriously tedious on Linux, even for advanced users, the absence of this option is especially disappointing for the intended audience. Even moreso when neither the installer nor the post-installation process provides any means to utilize the additional storage device, leaving inexperienced users seemingly stuck with the command. Omarchy does not come with a dedicated swap partition, leaving me even more puzzled about its use of 15GB of disk space. I won’t talk through why having a dedicated swap partition that is ideally encrypted using the same mechanisms already in place is a good idea. This topic has been thoroughly discussed and written about countless times. However, if you, like seemingly the Omarchy author, are unfamiliar with the benefits of having swap on Linux, I highly recommend reading this insightful write-up to get a better understanding. What I will note, however, is that the current configuration does not appear to support hibernation via the command through the use of a dynamic swap file . This leads me to believe that hibernation may not function on Omarchy . Given the ongoing battery drain issues with especially Framework (Computer Inc.) laptops while in suspend mode, it’s clear that hibernation is an essential feature for many Linux laptop users. Additionally, it’s hard to believe that Hansson , a former Apple evangelist , wouldn’t be accustomed to the simple act of closing the lid on his laptop and expecting it to enter a light sleep mode, and eventually transitioning into deep sleep to preserve battery life. If he had ever used Omarchy day-to-day on a laptop in the same way most people use their MacBooks , he would almost certainly have noticed the absence of these features. This further reinforces the impression that Omarchy is a project designed to appear robust at first glance, but reveals a surprisingly hollow foundation upon closer inspection. Let’s keep our focus on laptop use. We’ve seen Hansson showcasing his Framework (Computer Inc.) laptop on camera, so it’s reasonable to assume he’s using Omarchy on a laptop. It’s also safe to say that many users who might genuinely want to try Omarchy will likely do so on a laptop as well. That said, as we’ve established before, closing the laptop lid doesn’t seem to trigger hibernate mode in Omarchy . But if you close the lid and slip the laptop into your backpack, surely it would activate some power-saving measures, right? At the very least, it should blank the screen, switch the CPU governor to powersaving , or perhaps even initiate suspend to RAM ? Well… Of course, I can’t test these scenarios firsthand, as I’m evaluating Omarchy within a securely confined virtual machine, where any unintended consequences are contained. Still, based on the system’s configuration, or more accurately the lack thereof, it seems unlikely that an Omarchy laptop will behave as expected. The system might switch power profiles due to the power-profiles-daemon when not plugged in, yet its functionality is not comparable to a properly configured or similar. It seems improbable that it will enter suspend to RAM or hibernate mode, and it’s doubtful any other power-saving measures (like temporarily halting non-essential background processes) will be employed to conserve battery life. Although the configuration comes with an “app” for mail, namely HEY , that platform does not support standard mail protocols . I don’t think it’s a hot take to say that probably 99% of Omarchy ’s potential users will need to work with an email system that does support IMAP and SMTP, however. Yet, the base system offers zero tools for that. I’m not even asking for anything “fancy” like ; Omarchy unfortunately doesn’t even come with the most basic tools like the command out of the box. Whether you want to send email through your provider, get a simple summary for a scheduled Cron job delivered to your local mailbox, or just debug some mail-related issue, the command is relatively essential, even on a desktop system, but it is nowhere to be found on Omarchy . Speaking of which: Cron jobs? Not a thing on Omarchy . Want to automate backing up some files to remote storage? Get ready to dive into the wonderful world of timers , where you’ll spend hours figuring out where to create the necessary files, what they need to contain, and how to activate them. Omarchy could’ve easily included a Cron daemon or at least for the sake of convenience. But I guess this is a pro system , and if the user needs periodic jobs, they will have to figure out . Omarchy is, after all, -based … … and that’s why it makes perfect sense for it to use rootless Podman containers instead of Docker. That way, users can take advantage of quadlets and all the glorious integration. Unfortunately, Omarchy doesn’t actually use Podman . It uses plain ol’ Docker instead. Like most things in Omarchy , power monitoring and alerting are handled through a script , which is executed every 30 seconds via a timer. That’s your crash course on timers right there, Omarchy users! This script queries and then uses to parse the battery percentage and state. It’s almost comical how hacky the implementation is. Given that the system is already using UPower , which transmits power data via D-Bus , there’s a much cleaner and more efficient way to handle things. You could simply use a piece of software that connects to D-Bus to continuously monitor the power info UPower sends. Since it’s already dealing with D-Bus , it can also send a desktop notification directly to whatever notification service you’re using (like in Omarchy ’s case). No need for , , or a periodic Bash script triggered by a timer. “But where could I possibly find such a piece of software?” , you might ask. Worry not, Hr. Hansson , I have just the thing you need ! That said, I can understand that you, Hr. Hansson , might be somewhat reluctant to place your trust in software created by someone who is actively delving into the intricacies of your project, rather than merely offering a superficial YouTube interview to casually navigate the Hyprland UI for half an hour. Of course, Hr. Hansson , you could have always taken the initiative to develop a more robust solution yourself, in a proper, lower-level language, and neatly integrated it into your Omarchy repository. But we will explore why this likely hasn’t been a priority for you, Hr. Hansson , in just a moment. While the author’s previous attempt for a developer setup still came with Zellij , this time his opinions seemingly changed and Omarchy doesn’t include Zellij , or Tmux or even screen anymore. And nope, picocom isn’t there either, so good luck reading that Arduino output from . That moment, when you realize that you’ve spent hours figuring out timers , only to find out that you can’t actually back up those files to a remote storage because there’s no , let alone or . At least there is the command. :-) Unfortunately not, but Omarchy comes with and by default. I could go on and on, and scavenge through the rest of the unconfigured system and the scripts, like for example the one, where Omarchy once again seems to prefer -ing random scripts from the internet (or anyone man-in-the-middle -ing it) rather than using the system package manager to install Tailscale . But, for the sake of both your sanity and mine, I’ll stop here. As we’ve seen, Omarchy is more unconfigured than it is opinionated . Can you simply install all the missing bits and piece and configure them yourself? Sure! But then what is the point of this supposed “perfect developer setup” or “pro system” to begin with? In terms of the “opinionated” buzzword, most actual opinions I’ve come across so far are mainly about colors, themes, and security measures. I won’t dare to judge the former two, but as for the latter, well, unfortunately they’re the wrong opinions . In terms of implementation: Omarchy is just scripts, scripts, and more scripts, with no proper structure or (CI) tests. BTW: A quick shout out to your favorite tech influencer , who probably has at least one video reviewing the Omarchy project without mentioning anything along these lines. It is unfortunate that these influential people barely scratch the surface on a topic like this, and it is even more saddening that recording a 30 minute video of someone clicking around on a UI seemingly counts as a legitimate “review” these days. The primary focus for many of these people is seemingly on pumping out content and generating hype for views and attention rather than providing a thoughtful, thorough analysis. ( Alright, we’re almost there. Stick with me, we’re in the home stretch. ) The Omarchy manual : The ultimate repository of Omarchy wisdom, all packed into 33 pages, clocking in at little over 10,000 words. For context, this post on Omarchy alone is almost 10,000 words long. As is the case with the rest of the system, the documentation also adheres to Hansson ’s form over function approach. I’ve mentioned this before, but it bears repeating: Omarchy doesn’t offer any built-in for its scripts, let alone auto-completion, nor does it come with traditional pages. The documentation is tucked away in yet another SaaS product from Hansson ’s company ( Writebook ) and its focus is predominantly on themes, more themes, creating your own themes, and of course, the ever-evolving hotkeys. Beyond that, the manual mostly covers how to locate configuration files for individual UI components and offers guidance on how to configure Hyprland for a range of what feels like outrageously expensive peripherals. For the truly informative content, look no further than the shell function guide, with gems such as: : Format an entire disk with a single ext4 partition. Be careful! Wow, thanks, Professor Oak, I will be! :-) On a more serious note, though, the documentation leaves much to be desired, as evidenced by the user questions over on the GitHub discussions page . Take this question , which unintentionally sums up the Omarchy experience for probably many inexperienced users: I installed this from github without knowing what I was getting into (the page is very minimal for a project of this size, and I forgot there was a link in the footnotes). Please tell me there’s a way to remove Omarchy without wiping my entire computer. I lost my flashdrive, and don’t have a way to back up all my important files anymore. While this may seem comical on the surface, it’s a sad testament to how Omarchy appears to have a knack for luring in unsuspecting users with flashy visuals and so called “reviews” on YouTube, only to leave them stranded without adequate documentation. The only recourse? Relying on the solid Arch docs, which is an abrupt plunge into the deep end, given that Arch assumes you’re at least familiar with its very basics and that you know how you set up your own system. Maybe GitHub isn’t the most representative forum for the project’s support; I haven’t tried Discord, for example. But no matter where the community is, users should be able to fend for themselves with proper documentation, turning to others only as a last resort. It’s difficult to compile a list of things that could have made Omarchy a reasonable setup for people to consider, mainly because, in my opinion, the core of the setup – scripts doing things they shouldn’t or that should have been handled by other means (e.g., the package manager) – is fundamentally flawed. That said, I do think it’s worth mentioning a few improvements that, if implemented, could have made Omarchy a less bad option. Configuration files should not be altered through loose migration scripts. Instead, updated configuration files should be provided directly (ideally via packages, see below) and applied as patches using a mechanism similar to etc-update or dpkg . This approach ensures clarity and reduces confusion, preserves user modifications, and aligns with established best practices. Improve on the user experience where necessary and maybe even contribute improvements back. Use proper software implementations where appropriate. Want a fancy screensaver? Extend Hyprlock instead of awkwardly repurposing a fullscreen terminal window to mimic one. Need to display power status notifications without relying on GNOME or KDE components? Develop a lightweight solution that integrates cleanly with the desktop environment, or extend the existing Waybar battery widget to send notifications. Don’t like existing Linux “App Store” options? Build your own, rather than diverting a launcher from its intended use only to run Bash scripts that install packages from third-party sources on a system that has a perfectly good package manager in place. Arguably the most crucial improvement: Package the required software and install it via the system’s package manager. Avoid relying on brittle scripts, third-party tools like mise , or worse, piping scripts directly into . I understand that the author is coming from an operating system where it’s sort of fine to and use software like to manage individual Ruby versions. However, we have to take into consideration that specifically macOS has a significantly more advanced security architecture in place than (unfortunately) most out-of-the-box Linux installations have, let alone Omarchy . On Hanssons setup the approach is neither sensible nor advisable, especially given that it’s ultimately a system that is built around a proper package manager. If you want multiple versions of Ruby, package them and use slotting (or the equivalent of it on the distribution that you’re using, e.g. installation to version-specific directories on Arch ). Much of what the migrations and other scripts attempt to do could, and should have been achieved through well-maintained packages and the proven mechanisms of a package manager. Whether it’s Gentoo , NixOS , or Ubuntu , each distribution operates in its own unique way, offering users a distinct set of tools and defaults. Yet, they all share one common trait: A set of strong, well-defined opinions that shape the system. Omarchy , in contrast, feels little more than a glorified collection of Hyprland configurations atop an unopinionated, barebones foundation. If you’re going to have opinions, don’t limit them to just nice colors and cute little wallpapers. Form opinions on the tools that truly matter, on how those tools should be configured, and on the more intricate, challenging aspects of the system, not just the surface-level, easy choices. Have opinions on the really sticky and complicated stuff, like power-saving modes, redundant storage, critical system functionality, and security. Above all, cultivate reasonable opinions, ones that others can get behind, and build a system that reflects those. Comprehensive documentation is essential to help users understand how the system works. Currently, there’s no clear explanation for the myriad Bash scripts, nor is there any user-facing guidance on how global system updates affect individual configuration files. ( finally… ) Omarchy feels like a project created by a Linux newcomer, utterly captivated by all the cool things that Linux can do , but lacking the architectural knowledge to get the basics right, and the experience to give each tool a thoughtful review. Instead of carefully selecting software and ensuring that everything works as promised, the approach seems to be more about throwing everything that somehow looks cool into a pile. There’s no attention to sensible defaults, no real quality control, and certainly no verification that the setup won’t end up causing harm or, at the very least, frustration for the user. The primary focus seems to be on creating a visually appealing but otherwise hollow product . Moreover, the entire Omarchy ecosystem is held together by often poorly written Bash scripts that lack any structure, let alone properly defined interfaces . Software packages are being installed via or similar mechanisms, rather than provided as properly packaged solutions via a package manager. Hansson is quick to label Omarchy a Linux distribution , yet he seems reluctant to engage with the foundational work that defines a true distribution: The development and proper packaging (“distribution”) of software . Whenever Hansson seeks a software (or software version) that is unavailable in the Arch package repositories, he bypasses the proper process of packaging it for the system. Instead, he resorts to running arbitrary scripts or tools that download the required software from third-party sources, rather than offering the desired versions through a more standardized package repository. Hansson also appears to avoid using lower-level programming languages to implement features in a more robust and maintainable manner at all costs , often opting instead for makeshift solutions, such as executing “hacky” Bash scripts through timers. A closer look at his GitHub profile and Basecamp’s repositories reveals that Hansson has seemingly worked exclusively with Ruby and JavaScript , with most contributions to more complex projects, like or , coming from other developers. This observation is not meant to diminish the author’s profession and accomplishments as a web developer, but it highlights the lack of experience in areas such as systems programming, which are crucial for the type of work required to build and maintain a proper Linux distribution. Speaking of packages, the system gobbles up 15GB of storage on a basic install, yet fails to deliver truly useful or high-quality software. It includes a hodgepodge of packages, like OpenJDK and websites of paid services in “App” -disguise, but lacks any real optimization for specific use cases. Despite Omarchy claiming to be opinionated most of the included software is left at its default settings, straight from the developers. Given Hansson ’s famously strong opinions on everything, it makes me wonder if the Omarchy author simply hasn’t yet gained the experience necessary to develop clear, informed stances on individual configurations. Moreover, his prioritization of his paid products like Basecamp and HEY over his own free software like Rails leaves a distinctly bitter aftertaste when considering Omarchy . What’s even more baffling is that seemingly no one at Framework (Computer Inc.) or Cloudflare appears to have properly vetted the project they’re directing attention (and sometimes financial support) to. I find it hard to believe that knowledgeable people at either company have looked at Omarchy and thought, “Out of all the Linux distributions out there, this barely configured stack of poorly written Bash scripts on top of Arch is clearly the best choice for us to support!” In fact, I would go as far as to call it a slap in the face to each and every proper distro maintainer and FOSS developer. Furthermore, I fail to see the supposed gap Omarchy is trying to fill. A fresh installation of Arch Linux, or any of its established derivatives like Manjaro , is by no means more complicated or time-consuming than Omarchy . In fact, it is Omarchy that complicates things further down the line, by including a number of unnecessary components and workarounds, especially when it comes to its chosen desktop environment. The moment an inexperienced user wants or needs to change anything, they’ll be confronted with a jumbled mess that’s difficult to understand and even harder to manage. If you want Arch but are too lazy to read through its fantastic Wiki , then look at Manjaro , it’ll take care of you. If that’s still not to your liking, maybe explore something completely different . On the other hand, if you’re just looking to tweak your existing desktop, check out other people’s dotfiles and dive into the unixporn communities for inspiration. As boring as Fedora Workstation or Ubuntu Desktop might sound, these are solid choices for anyone who doesn’t want to waste time endlessly configuring their OS and, more importantly, wants something that works right out of the box and actually keeps them safe. Fedora Workstation comes with SELinux enabled in “enforcing” mode by default, and Ubuntu Desktop utilizes AppArmor out of the box. Note: Yes, I hear you loud and clear, SuSE fans. The moment your favorite distro gets its things together with regard to the AppArmor-SELinux transition and actually enables SELinux in enforcing mode across all its different products and versions I will include it here as well. Omarchy is essentially an installation routine for someone else’s dotfiles slapped on top of an otherwise barebones Linux desktop. Although you could simply run its installation scripts on your existing, fully configured Arch system, it doesn’t seem to make much sense and it’s definitely not the author’s primary objective. If this was just Hansson’s personal laptop setup, nobody, including myself, would care about the oversights or eccentricities, but it is not. In fact, this project is clearly marketed to the broader, less experienced user base, with Hansson repeatedly misrepresenting Omarchy as being “for developers or anyone interested in a pro system” . I emphasize marketed here, because Hansson is using his reach and influence in every possible way to advertise and seemingly monetize Omarchy ; Apart from the corporate financial support, the project even has its own merch that people can spend money on. Given that numerous YouTubers have been heavily promoting the project over the past few weeks, often in the same breath with Framework (Computer Inc.) , it wouldn’t be surprising to see the company soon offering it as a pre-installation option on their hardware. If you’re serious about Linux, you’re unlikely to fall for the Omarchy sales pitch. However, if you’re an inexperienced user who’s heard about Omarchy from a tech-influencer raving about it, I strongly recommend starting your Linux journey elsewhere, with a distribution that actually prioritizes your security and system integrity, and is built and maintained by people who live and breathe systems, and especially Linux. Alright, that’s it. Why don’t any of the Bash scripts and functions provide a flag or maybe even autocompletions? Why are there no Omarchy -related pages? Why does the system come with GNOME Files , which requires several gvfs processes running in the background, yet it lacks basic command-line file managers like or ? Why would you define as an for unconditionally, but not install Rails by default? Why bother shipping tools like and but fail to provide aliases for , , etc to make use of these tools by default? Why wouldn’t you set up an O.G. alias like in your defaults ? Why ship the GNOME Calculator but not include any command-line calculators (e.g., , ), forcing users to rely on basics like ? Why ship the full suite of LibreOffice, but not a single useful terminal tool like , , , etc.? Why define functions like with and without an option to enable encryption, when the rest of the system uses and ? And if it’s intended for use by inexperienced users primarily for things like USB sticks, why not make it instead of so the drive works across most operating systems? Why not define actually useful functions like or / ? Why doesn’t your Bash configuration include history- and command-flag-based auto-suggestions? Or a terminal-independent vi mode ? Or at least more consistent Emacs-style shortcuts? Why don’t you include some quality-of-life tools like or some other command-line community favorites? If you had to squeeze in ChatGPT , why not have Crush available by default? Why does the base install with a single running Alacritty window occupy over 2.2GB of RAM right after booting? For comparison: My Gentoo system with a single instance of Ghostty ends up at around half of that. Why set up NeoVim but not define as an alias for , or even create a symlink? And speaking of NeoVim , why does the supposedly opinionated config make NeoVim feel slower than VSCode ?

0 views
Evan Hahn 1 months ago

Scripts I wrote that I use all the time

In my decade-plus of maintaining my dotfiles , I’ve written a lot of little shell scripts. Here’s a big list of my personal favorites. and are simple wrappers around system clipboard managers, like on macOS and on Linux. I use these all the time . prints the current state of your clipboard to stdout, and then whenever the clipboard changes, it prints the new version. I use this once a week or so. copies the current directory to the clipboard. Basically . I often use this when I’m in a directory and I want use that directory in another terminal tab; I copy it in one tab and to it in another. I use this once a day or so. makes a directory and s inside. It’s basically . I use this all the time —almost every time I make a directory, I want to go in there. changes to a temporary directory. It’s basically . I use this all the time to hop into a sandbox directory. It saves me from having to manually clean up my work. A couple of common examples: moves and to the trash. Supports macOS and Linux. I use this every day. I definitely run it more than , and it saves me from accidentally deleting files. makes it quick to create shell scripts. creates , makes it executable with , adds some nice Bash prefixes, and opens it with my editor (Vim in my case). I use this every few days. Many of the scripts in this post were made with this helper! starts a static file server on in the current directory. It’s basically but handles cases where Python isn’t installed, falling back to other programs. I use this a few times a week. Probably less useful if you’re not a web developer. uses to download songs, often from YouTube or SoundCloud, in the highest available quality. For example, downloads that video as a song. I use this a few times a week…typically to grab video game soundtracks… similarly uses to download something for a podcast player. There are a lot of videos that I’d rather listen to like a podcast. I use this a few times a month. downloads the English subtitles for a video. (There’s some fanciness to look for “official” subtitles, falling back to auto-generated subtitles.) Sometimes I read the subtitles manually, sometimes I run , sometimes I just want it as a backup of a video I don’t want to save on my computer. I use this every few days. , , and are useful for controlling my system’s wifi. is the one I use most often, when I’m having network trouble. I use this about once a month. parses a URL into its parts. I use this about once a month to pull data out of a URL, often because I don’t want to click a nasty tracking link. prints line 10 from stdin. For example, prints line 10 of a file. This feels like one of those things that should be built in, like and . I use this about once a month. opens a temporary Vim buffer. It’s basically an alias for . I use this about once a day for quick text manipulation tasks, or to take a little throwaway note. converts “smart quotes” to “straight quotes” (sometimes called “dumb quotes”). I don’t care much about these in general, but they sometimes weasel their way into code I’m working on. It can also make the file size smaller, which is occasionally useful. I use this at least once a week. adds before every line. I use it in Vim a lot; I select a region and then run to quote the selection. I use this about once a week. returns . (I should probably just use .) takes JSON at stdin and pretty-prints it to stdout. I use this a few times a year. and convert strings to upper and lowercase. For example, returns . I use these about once a week. returns . I use this most often when talking to customer service and need to read out a long alphanumeric string, which has only happened a couple of times in my whole life. But it’s sometimes useful! returns . A quick way to do a lookup of a Unicode string. I don’t use this one that often…probably about once a month. cats . I use for , for a quick “not interested” response to job recruiters, to print a “Lorem ipsum” block, and a few others. I probably use one or two of these a week. Inspired by Ruby’s built-in REPL, I’ve made: prints the current date in ISO format, like . I use this all the time because I like to prefix files with the current date. starts a timer for 10 minutes, then (1) plays an audible ring sound (2) sends an OS notification (see below). I often use to start a 5 minute timer in the background (see below). I use this almost every day as a useful way to keep on track of time. prints the current time and date using and . I probably use it once a week. It prints something like this: extracts text from an image and prints it to stdout. It only works on macOS, unfortunately, but I want to fix that. (I wrote a post about this script .) (an alias, not a shell script) makes a happy sound if the previous command succeeded and a sad sound otherwise. I do things like which will tell me, audibly, whether the tests succeed. It’s also helpful for long-running commands, because you get a little alert when they’re done. I use this all the time . basically just plays . Used in and above. uses to play audio from a file. I use this all the time , running . uses to show a picture. I use this a few times a week to look at photos. is a little wrapper around some of my favorite internet radio stations. and are two of my favorites. I use this a few times a month. reads from stdin, removes all Markdown formatting, and pipes it to a text-to-speech system ( on macOS and on Linux). I like using text-to-speech when I can’t proofread out loud. I use this a few times a month. is an wrapper that compresses a video a bit. I use this about once a month. removes EXIF data from JPEGs. I don’t use this much, in part because it doesn’t remove EXIF data from other file formats like PNGs…but I keep it around because I hope to expand this one day. is one I almost never use, but you can use it to watch videos in the terminal. It’s cursed and I love it, even if I never use it. is my answer to and , which I find hard to use. For example, runs on every file in a directory. I use this infrequently but I always mess up so this is a nice alternative. is like but much easier (for me) to read—just the PID (highlighted in purple) and the command. or is a wrapper around that sends , waits a little, then sends , waits and sends , waits before finally sending . If I want a program to stop, I want to ask it nicely before getting more aggressive. I use this a few times a month. waits for a PID to exit before continuing. It also keeps the system from going to sleep. I use this about once a month to do things like: is like but it really really runs it in the background. You’ll never hear from that program again. It’s useful when you want to start a daemon or long-running process you truly don’t care about. I use and most often. I use this about once a day. prints but with newlines separating entries, which makes it much easier to read. I use this pretty rarely—mostly just when I’m debugging a issue, which is unusual—but I’m glad I have it when I do. runs until it succeeds. runs until it fails. I don’t use this much, but it’s useful for various things. will keep trying to download something. will stop once my tests start failing. is my emoji lookup helper. For example, prints the following: prints all HTTP statuses. prints . As a web developer, I use this a few times a month, instead of looking it up online. just prints the English alphabet in upper and lowercase. I use this surprisingly often (probably about once a month). It literally just prints this: changes my whole system to dark mode. changes it to light mode. It doesn’t just change the OS theme—it also changes my Vim, Tmux, and terminal themes. I use this at least once a day. puts my system to sleep, and works on macOS and Linux. I use this a few times a week. recursively deletes all files in a directory. I hate that macOS clutters directories with these files! I don’t use this often, but I’m glad I have it when I need it. is basically . Useful for seeing the source code of a file in your path (used it for writing up this post, for example!). I use this a few times a month. sends an OS notification. It’s used in several of my other scripts (see above). I also do something like this about once a month: prints a v4 UUID. I use this about once a month. These are just scripts I use a lot. I hope some of them are useful to you! If you liked this post, you might like “Why ‘alias’ is my last resort for aliases” and “A decade of dotfiles” . Oh, and contact me if you have any scripts you think I’d like. to start a Clojure REPL to start a Deno REPL (or a Node REPL when Deno is missing) to start a PHP REPL to start a Python REPL to start a SQLite shell (an alias for )

0 views
Simon Willison 1 months ago

Claude Code for web - a new asynchronous coding agent from Anthropic

Anthropic launched Claude Code for web this morning. It's an asynchronous coding agent - their answer to OpenAI's Codex Cloud and Google's Jules , and has a very similar shape. I had preview access over the weekend and I've already seen some very promising results from it. It's available online at claude.ai/code and shows up as a tab in the Claude iPhone app as well: As far as I can tell it's their latest Claude Code CLI app wrapped in a container (Anthropic are getting really good at containers these days) and configured to . It appears to behave exactly the same as the CLI tool, and includes a neat "teleport" feature which can copy both the chat transcript and the edited files down to your local Claude Code CLI tool if you want to take over locally. It's very straight-forward to use. You point Claude Code for web at a GitHub repository, select an environment (fully locked down, restricted to an allow-list of domains or configured to access domains of your choosing, including "*" for everything) and kick it off with a prompt. While it's running you can send it additional prompts which are queued up and executed after it completes its current step. Once it's done it opens a branch on your repo with its work and can optionally open a pull request. Claude Code for web's PRs are indistinguishable from Claude Code CLI's, so Anthropic told me it was OK to submit those against public repos even during the private preview. Here are some examples from this weekend: That second example is the most interesting. I saw a tweet from Armin about his MiniJinja Rust template language adding support for Python 3.14 free threading. I hadn't realized that project had Python bindings, so I decided it would be interesting to see a quick performance comparison between MiniJinja and Jinja2. I ran Claude Code for web against a private repository with a completely open environment ( in the allow-list) and prompted: I’m interested in benchmarking the Python bindings for https://github.com/mitsuhiko/minijinja against the equivalente template using Python jinja2 Design and implement a benchmark for this. It should use the latest main checkout of minijinja and the latest stable release of jinja2. The benchmark should use the uv version of Python 3.14 and should test both the regular 3.14 and the 3.14t free threaded version - so four scenarios total The benchmark should run against a reasonably complicated example of a template, using template inheritance and loops and such like In the PR include a shell script to run the entire benchmark, plus benchmark implantation, plus markdown file describing the benchmark and the results in detail, plus some illustrative charts created using matplotlib I entered this into the Claude iPhone app on my mobile keyboard, hence the typos. It churned away for a few minutes and gave me exactly what I asked for. Here's one of the four charts it created: (I was surprised to see MiniJinja out-performed by Jinja2, but I guess Jinja2 has had a decade of clever performance optimizations and doesn't need to deal with any extra overhead of calling out to Rust.) Note that I would likely have got the exact same result running this prompt against Claude CLI on my laptop. The benefit of Claude Code for web is entirely in its convenience as a way of running these tasks in a hosted container managed by Anthropic, with a pleasant web and mobile UI layered over the top. It's interesting how Anthropic chose to announce this new feature: the product launch is buried half way down their new engineering blog post Beyond permission prompts: making Claude Code more secure and autonomous , which starts like this: Claude Code's new sandboxing features, a bash tool and Claude Code on the web, reduce permission prompts and increase user safety by enabling two boundaries: filesystem and network isolation. I'm very excited to hear that Claude Code CLI is taking sandboxing more seriously. I've not yet dug into the details of that - it looks like it's using seatbelt on macOS and Bubblewrap on Linux. Anthropic released a new open source (Apache 2) library, anthropic-experimental/sandbox-runtime , with their implementation of this so far. Filesystem sandboxing is relatively easy. The harder problem is network isolation, which they describe like this: Network isolation , by only allowing internet access through a unix domain socket connected to a proxy server running outside the sandbox. This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains. And if you’d like further-increased security, we also support customizing this proxy to enforce arbitrary rules on outgoing traffic. This is crucial to protecting against both prompt injection and lethal trifecta attacks. The best way to prevent lethal trifecta attacks is to cut off one of the three legs, and network isolation is how you remove the data exfiltration leg that allows successful attackers to steal your data. If you run Claude Code for web in "No network access" mode you have nothing to worry about. I'm a little bit nervous about their "Trusted network access" environment. It's intended to only allow access to domains relating to dependency installation, but the default domain list has dozens of entries which makes me nervous about unintended exfiltration vectors sneaking through. You can also configure a custom environment with your own allow-list. I have one called "Everything" which allow-lists "*", because for projects like my MiniJinja/Jinja2 comparison above there are no secrets or source code involved that need protecting. I see Anthropic's focus on sandboxes as an acknowledgment that coding agents run in YOLO mode ( and the like) are enormously more valuable and productive than agents where you have to approve their every step. The challenge is making it convenient and easy to run them safely. This kind of sandboxing kind is the only approach to safety that feels credible to me. Update : A note on cost: I'm currently using a Claude "Max" plan that Anthropic gave me in order to test some of their features, so I don't have a good feeling for how Claude Code would cost for these kinds of projects. From running (an unofficial cost estimate tool ) it looks like I'm using between $1 and $5 worth of daily Claude CLI invocations at the moment. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Add query-string-stripper.html tool against my simonw/tools repo - a very simple task that creates (and deployed via GitHub Pages) this query-string-stripper tool. minijinja vs jinja2 Performance Benchmark - I ran this against a private repo and then copied the results here, so no PR. Here's the prompt I used. Update deepseek-ocr README to reflect successful project completion - I noticed that the README produced by Claude Code CLI for this project was misleadingly out of date, so I had Claude Code for web fix the problem.

0 views

When it comes to MCPs, everything we know about API design is wrong

TL;DR: I built a lightweight Chrome MCP. Scroll to the end to learn how to install it. Read the whole post to learn a little bit about the Zen of MCP design. Claude Code has built in tools to fetch web pages and to search the web – they actually run through Anthropic's servers, if I recall correctly. They do clever things to carefully manage context and to return information in a format that's easy for Claude to digest. These tools work really well. Right up to the point where they completely fall apart. An uncoached testimonial from the only customer who matters. Last week, I somehow got it into my head that I should update my custom blogging client to use Apple's new Liquid Glass look and feel. The first issue I ran into was that Claude was absolutely sure that macOS 26 wasn't out yet. (Amusingly, when asked to review a draft of this post, one of the things it flagged was: ' Inconsistent model naming - You refer to "macOS 26" but I believe you mean "macOS 15" (Sequoia). macOS 26 would be way in the future.') Claude was, however, happy to speculate about what a "Liquid Glass" UI might look like. Once I reminded the model that it had memory issues and Apple had indeed released the new version of their operating system, it was ready to get to work. I told it to go read Apple's Human Interface Guidelines and make a plan. This is what Claude saw: It turns out that Apple no longer offer a downloadable version of the HIG. And the online version requires JavaScript . After a bit of flailing, Claude reached for the industry-standard Playwright MCP from Microsoft. The Playwright MCP is a collection of 21 tools covering all aspects of driving a browser and debugging webapps, from to to . Just having the Playwright MCP available costs 13,678 tokens (7% of the whole context window) in every single session, even if you never use it. (Yes, the Google Chrome team has their own Chrome MCP. Its API surface is even bigger ) And once you do start using it, things get worse. Some of its tools return the entire DOM of the webpage you're working with. This means that simple requests fail because they return more tokens than Claude can handle in a response: It's frustrating to see a coding agent trying over and over to use a tool the way it's supposed to and having that tool just fail to return useful data. After hearing me complain about this a few times, Dan Grigsby commented that he'd had success just asking Claude to teach itself a skill: Using the raw Dev Tools remote control protocol to drive Chrome. This seemed like a neat trick, so I asked my Claude to take a swing at it. Claude was only too happy to try to speak raw JSON-RPC to Chrome on port 9292. It...just worked. But it was also very clunky and wasteful feeling. Claude was writing raw JSON-RPC command lines for each and every interaction. It was very verbose and required the LLM to get a whole bunch of details right on every single command invocation. It was time to make a proper Skill. After thinking about it for a moment, I asked Claude to write a little zero-dependency command-line tool called that it could run with the Bash tool to control Chrome, as well as a new file explaining how to use that script . encapsulated the complexity and made Chrome easily scriptable from the command line. The skill sets up the basics of web browsing with its tools and uses progressive disclosure to tell Claude how to get more information, but only when it has a need to know. For example, these examples of how to use the tool . Claude didn't always reach for the skill, so it wasn't aware of its new command-line tool, but once I pointed it in the right direction, it worked surprisingly well . This setup was incredibly token efficient – Nothing in the context window at startup other than a skill and in the system prompt. What was a little frustrating for me was that any time Claude wanted to do anything with the browser, it had to run a custom Bash command that I had to approve. Every click. Every navigation. Every javascript expression. It got old really, really fast. There's no real way to fix that without creating a custom MCP. But that would put us right back where we were with the official Playwright MCP, right? Nearly two dozen tools and 13k tokens spilled on the floor every time we started a session. Even trimming things down to only the dozen most important commands is still a bunch of tools, most of which Claude won't use in a given session. If you've ever done API design, you probably know how important it is to name your methods well. You know that every method should do one thing and only one thing. You know that you really need to type (and validate) all your parameters to make sure your callers can tell what they're supposed to be passing in and to make bad method calls fail as soon as possible. It would be absolutely unhinged to have a method called that took a parameter called that was itself a method dispatcher, a parameter called , and a parameter called . You'd have to be crazy to think that it's acceptable API design to have the optional, untyped field just have a description like And yet. That is exactly how I designed it. And it's just great. The high-level tool description reads: At session startup, the whole MCP config weighs in at just 947 tokens. I'm pretty sure I can shave at least 30-40 more. It's optimized to make Claude's life as easy as possble. Rather than having a method to start the browser, the MCP...just does it when it needs to. Same with opening a new tab if there wasn't one waiting. The tool description tells Claude what to do and where to read up when it needs more help. At least so far, it works just great for me. One of the mistakes I made while developing the MCP was to instruct Claude to cut down the API surface by only accepting CSS selectors, rather than accepting CSS or XPath. It seemed natural to me that a smaller, simpler API would be easier for Claude to work with and reason about. Right up until I saw the MCP tool description containing multiple admonitions like . The whole thing just...worked better when I let the selector fields accept either CSS or XPath. Another thing that Claude got not-quite-right when it first implemented the MCP was that it included detailed human-readable text for all the method parameters. Because LLMs that are using MCPs can see both the and the actual JSON schema, you don't need to repeat things like lists of values for an enum or type validations. One trick you can use is to ask your agent to tell you exactly what it can see about how to use an API. One of the weirdest realizations I had while building is this: I have no doubt that there are a dozen similar tools out there, but it was literally faster and easier to build the tool that I thought should exist than to test out a dozen tools to see if any of them work the way I think they should. Over the last couple of decades, the common wisdom has become that Postel's Law (aka the robustness principle) is dated and wrong and that APIs should be rigid and rigorous. That's the wrong choice when you're designing for use by LLMs. This might be a hard lesson to hear, but tools you build for LLMs are going to work much, much better if you think of your end-user as a "person" rather than a computer. Build your tools like they're a set of scripts you're handing to that undertrained kid who just got hired in the NOC. They are going to page you at 2AM when they can't figure out what's going on or when they misuse the tools in a way they can't unwind. Names and method descriptions matter far more than they ever have before. Automatic recovery is hugely important. Designing for error recovery rather than failing fast will make the whole system more reliable and less expensive to operate. When errors are unaviodable, your error messages should tell the user how to fix or work around the problem in plain English. If you can't give the user exactly what they asked for, but you can give them a partial answer or related information, do that. Claude absolutely does not care about the architectural purity of your API. It just wants to help you get work done with the limited resources at its disposal. This new MCP and skill for Claude Code, is called superpowers-chrome . You can install it like this: If you're already using Superpowers , you can just type /plugin, navigate to 'Install plugins', pick 'superpowers-marketplace' and then you should see . I'd love to hear from you if you find it helpful. I'd also love patches and pull requests.

0 views
allvpv’s space 1 months ago

Environment variables are a legacy mess: Let's dive deep into them

Programming languages have rapidly evolved in recent years. But in software development, the new often meets the old, and the scaffolding that OS gives for running new processes hasn’t changed much since Unix. If you need to parametrize your application at runtime by passing a few ad-hoc variables (without special files or a custom solution involving IPC or networking), you’re doomed to a pretty awkward, outdated interface: There are no namespaces for them, no types. Just a flat, embarrassingly global dictionary of strings. But what exactly are these envvars? Is it some kind of special dictionary inside the OS? If not, who owns them and how do they propagate? In a nutshell: they’re passed from parent to child. On Linux, a program must use the syscall to execute another program. Whether you type in Bash, call in Python, or launch a code editor, it ultimately comes down to , preceded by a / . The family of C functions also relies on . This system call takes three arguments: , , . For example, for an invocation: By default, all envvars are passed from the parent to the child. However, nothing prevents a parent process from passing a completely different or even empty environment when calling ! In practice, most tooling passes the environment down: Bash, Python’s , the C library , and so on. And this is what you expect – variables are inherited by child processes. That’s the point – to track the environment. Which tools do not pass the parent’s environment? For example, the executable, used when signing into a system, sets up a fresh environment for its children. After launching the new program, the kernel dumps the variables on the stack as a sequence of null-terminated strings which contain the envvar definitions. Here is a hex view: This static layout can’t easily be modified or extended; the program must copy those variables into its own data structure. Let’s look at how Bash, C, and Python store envvars internally. I analyzed their source code and here is a summary. It stores the variables in a hashmap . Or, more precisely, in a stack of hashmaps . When you spawn a new process using Bash, it traverses the stack of hashmaps to find variables marked as exported and copies them into the environment array passed to the child. Side note: Why is traversing the stack needed? Each function invocation in Bash creates a new local scope – a new entry on the stack. If you declare your variable with , it ends up in this locally-scoped hashmap. What’s interesting is that you can export a variable too! I wouldn’t have learned this without diving into Bash source. My intuitive (wrong) assumption was that automatically makes the variable global – like ! Super interesting stuff. exposes a dynamic array, managed via and library functions. It uses an array, so the time complexity of and is linear in the number of envvars. Remember – envvars are not a high-performance dictionary and you should not abuse them. Python couples its environment to the C library, which can cause surprising inconsistencies. If you’ve programmed some Python, you’ve probably used the dictionary. On startup, is built from the C library’s array. But those dictionary values are NOT the “ground truth” for child processes. Rather, each change to invokes the native function, which in turn calls the C library’s . Note that the propagation is one-directional: modifying will call , but not the other way around. Call , and won’t be updated. The Linux kernel is very liberal about the format of environment variables, and so is . For example, your C program can manipulate the environment – the global array – such that several variables share the same name but have different values. And when you execute a child process, it will inherit this “broken” setup. You don’t even need an equals sign separating name from value! The usual entry is , but nothing prevents you from adding to the array. The kernel happily accepts any null-terminated string as an “environment variable” definition. It just imposes a size limitation: Single variable : 128 KiB on a typical x64 Intel CPU. This is for the whole definition – name + equal sign + value. It’s computed as . No modern hardware uses pages smaller than 4 KiB, so you can treat it as a lower bound, unless you need to deal with some legacy embedded systems. Total : 2 MiB on a typical machine. This limit is shared by envvars and the command line arguments. The calculation is a bit more complicated (see the man page): On a typical system, the limiting factor is the . Remember, initially the envvars are dumped on the stack! To prevent unpredictable crashes, the system allows only 1/4 of the stack for the envvars. But the fact that you can do something does not mean that you should. For example, if you start Bash with the “broken” environment – duplicated names and entries without – it deduplicates the variables and drops the nonsense. One interesting edge case is a space inside the variable name . My beloved shell – Nushell – has no problem with the following assignment: Python is fine with it, too. Bash, on the other hand, can’t reference it because whitespace isn’t allowed in variable names. Fortunately, the variable isn’t lost – Bash keeps such entries in a special hashmap called and still passes them to child processes. So what name and value can you safely use for your envvar? A popular misconception, repeated on StackOverflow and by ChatGPT, is that POSIX permits only uppercase envvars, and everything else is undefined behavior. But this is seriously NOT what the standard says : These strings have the form name=value; names shall not contain the character ‘=’. For values to be portable across systems conforming to POSIX.1-2017, the value shall be composed of characters from the portable character set (except NUL and as indicated below). There is no meaning associated with the order of strings in the environment. If more than one string in an environment of a process has the same name, the consequences are undefined. Environment variable names used by the utilities in the Shell and Utilities volume of POSIX.1-2017 consist solely of uppercase letters, digits, and the <underscore> ( ‘_’ ) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities. Yes, POSIX-specified utilities use uppercase envvars, but that’s not prescriptive for your programs. Quite the contrary: you’re encouraged to use lowercase for your envvars so they don’t collide with the standard tools. The only strict rule is that a variable name cannot contain an equals sign. POSIX requires compliant applications to preserve all variables that conform to this rule. But in reality, not many applications use lowercase. The proper etiquette in software development is to use . …to use for names, and UTF-8 for values. You shouldn’t hit problems on Linux. If you want to be super safe: instead of UTF-8, use the POSIX-mandated Portable Character Set (PCS) – essentially ASCII without control characters. …and I hope it wasn’t a boring read. is the (the executable path), is the array of command line arguments – the implicit first (“zero”) argument is usually the executable name, is the array of envvars (typically much longer). Single variable : 128 KiB on a typical x64 Intel CPU. This is for the whole definition – name + equal sign + value. It’s computed as . No modern hardware uses pages smaller than 4 KiB, so you can treat it as a lower bound, unless you need to deal with some legacy embedded systems. Total : 2 MiB on a typical machine. This limit is shared by envvars and the command line arguments. The calculation is a bit more complicated (see the man page): On a typical system, the limiting factor is the . Remember, initially the envvars are dumped on the stack! To prevent unpredictable crashes, the system allows only 1/4 of the stack for the envvars.

0 views
André Arko 2 months ago

Announcing <code>rv</code> 0.2

With the help of many new contributors, and after many late nights wrestling with make, we are happy to (slightly belatedly) announce the 0.2 release of rv ! This version dramatically expands support for Rubies, shells, and architectures. Rubies: we have added Ruby 3.3, as well as re-compiled all Ruby 3.3 and 3.4 versions with YJIT. On Linux, YJIT increases our glibc minimum version to 2.35 or higher. That means most distro releases from 2022 or later should work, but please let us know if you run into any problems. Shells: we have added support for bash, fish, and nushell in addition to zsh. Architectures: we have added Ruby compiled for macOS on x86, in addition to Apple Silicon, and added Ruby compiled for Linux on ARM, in addition to x86. Special thanks to newest member of the maintainers’ team @adamchalmers for improving code and tests, adding code coverage and fuzzing, heroic amounts of issue triage, and nushell support. Additional thanks are due to all the new contributors in version 0.2, including @Thomascountz , @lgarron , @coezbek , and @renatolond . To upgrade, run , or check the release notes for other options.

0 views
flak 2 months ago

backporting go on openbsd

The OpenBSD ports tree generally tracks current, but sometimes backports (and stable packages) are made for more serious issues. As was the case for git 2.50.1. However, the go port has not seen a backport in quite some time. The OpenBSD release schedule aligns with the go schedule such that we always get the latest release, but not minor revisions. For instance, OpenBSD 7.7 shipped with go 1.24.1, but there’s a few minor revisions after that. We maybe don’t care about many of these backports, but issue 73570 is a backported fix for a bug specific to OpenBSD, so let’s say we want that. I always forget the procedures for building ports from scratch and waste a bunch of time running and cancelling and rerunning commands. So here’s a recipe that worked. If we don’t have the ports tree, we need to get that. If we don’t have bash, we need to install that. (There’s a magic formula to have ports install packages, but this is simpler.) We need to update the go port to a suitable revision. The port is currently on 1.25, but I’d rather stick with 1.24, so we go back a little ways. The OpenBSD port was never updated for 1.24.7, but those changes don’t look very exciting. Maybe next time I’ll try a custom update to a new version. We build the port. The bootstrap flavor is important, or we’ll end up building it twice. Tick, tock, ding, ding. Running will build and install a package. Check. Looks good. redux What if we want a version that’s not in ports? I figured this post would be pretty boring, but go just released 1.24.8, which includes security fixes I’d like, so now we definitely need to try building a new version. Let’s edit the Makefile . Now tell the ports system to download the new version and update the checksum. This downloads the new version and prints it’s checksum. Okay? Well, the go downloads page shows checksums in hex, but we can redo it to check. Looks good. Now run and again. Uh oh. Fucking FIPS, every fucking time. I don’t want to think too much about what this is doing, but the file has been renamed, so we need to update the pkg/PLIST file. Hopefully this is an aberration, as the go team is usually conservative about backporting changes, but one never knows what to expect. Now the package builds and installs correctly. And then rebuild everything that uses go.

0 views

LLMs Eat Scaffolding for Breakfast

We just deleted thousands of lines of code. Again. Each time a new LLM model comes out, that’s the same story. LLMs have limitations so we build scaffolding around them. Each models introduce new capabilities so that old scaffoldings must be deleted and new ones be added. But as we move closer to super intelligence, less scaffoldings are needed. This post is about what it takes to build successfully in AI today. Every line of scaffolding is a confession: the model wasn’t good enough. LLMs can’t read PDF? Let’s build a complex system to convert PDF to markdown LLMs can’t do math? Let’s build compute engine to return accurate numbers LLMs can’t handle structured output? Let’s build complex JSON validators and regex parsers LLMs can’t read images? Let’s use a specialized image to text model to describe the image to the LLM LLMs can’t read more than 3 pages? Let’s build a complex retrieval pipeline with a search engine to feed the best content to the LLM. LLMs can’t reason? Let’s build chain-of-thought logic with forced step-by-step breakdowns, verification loops, and self-consistency checks. etc, etc... millions of lines of code to add external capabilities to the model. But look at models today: GPT-5 is solving frontier mathematics, Grok-4 Fast can read 3000+ pages with its 2M context window, Claude 4.5 sonnet can ingest images or PDFs, all models have native reasoning capabilities and support structured outputs. The once essential scaffolding are now obsolete. Those tools are backed in the model capabilities. It’s nearly impossible to predict what scaffolding will become obsolete and when. What appears to be essential infrastructure and industry best practice today can transform into legacy technical debt within months. The best way to grasp how fast LLMs are eating scaffolding is to look at their system prompt (the top-level instruction that tells the AI how to behave). Looking at the prompt used in Codex, OpenAI coding agent from GPT-o3 model to GPT-5 is mind-blowing. GPT-o3 prompt: 310 lines GPT-5 prompt: 104 lines The new prompt removed 206 lines. A 66% reduction. GPT-5 needs way less handholding. The old prompt had complex instructions on how to behave as a coding agent (personality, preambles, when to plan, how to validate). The new prompt assumes GPT-5 already knows this and only specifies the Codex-specific technical requirements (sandboxing, tool usage, output formatting). The new prompt removed all the detailed guidance about autonomously resolving queries, coding guidelines, git usage. It’s also less prescriptive. Instead of “do this and this” it says “here are the tools at your disposal.” As we move closer to super intelligence, the models require more freedom and leeway (scary, lol!). Advanced models require simple instructions and tooling. Claude Code, the most sophisticated agent today, relies on a simple filesystem instead of a complex index and use bash commands (find, read, grep, glob) instead of complex tools. It moves so fast. Each model introduces a new paradigm shift. If you miss a paradigm shift, you’re dead. Having an edge in building AI applications require deep technical understanding, insatiable curiosity, and low ego. By the way, because everything changes, it’s good to focus on what won’t change Context window is how much text you can feed the model in a single conversation. Early model could only handle a couple of pages. Now it’s thousands of pages and it’s growing fast. Dario Amodei the founder of Anthropic expects 100M+ context windows while Sam Altman hinted at billions of context tokens . It means the LLMs can see more context so you need less scaffolding like retrieval augmented generation. November 2022 : GPT-3.5 could handle 4K context November 2023 : GPT-4 Turbo with 128K context June 2024 : Claude 3.5 Sonnet with 200K context June 2025 : Gemini 2.5 Pro with 1M context September 2025 : Grok-4 Fast with 2M context Models used to stream at 30-40 tokens per second. Today’s fastest models like Gemini 2.5 Flash and Grok-4 Fast hit 200+ tokens per second. A 5x improvement. On specialized AI chips (LPUs), providers like Cerebras push open-source models to 2,000 tokens per second. We’re approaching real-time LLM: full responses on complex task in under a second. LLMs are becoming exponentially smarter. With every new model, benchmarks get saturated. On the path to AGI, every benchmark will get saturated. Every job can be done and will be done by AI. As with humans, a key factor in intelligence is the ability to use tools to accomplish an objective. That is the current frontier: how well a model can use tools such as reading, writing, and searching to accomplish a task over a long period of time. This is important to grasp. Models will not improve their language translation skills (they are already at 100%), but they will improve how they chain translation tasks over time to accomplish a goal. For example, you can say, “Translate this blog post into every language on Earth,” and the model will work for a couple of hours on its own to make it happen. Tool use and long-horizon tasks are the new frontier. The uncomfortable truth: most engineers are maintaining infrastructure that shouldn’t exist. Models will make it obsolete and the survival of AI apps depends on how fast you can adapt to the new paradigm. That’s what startups have an edge over big companies. Bigcorp are late by at least two paradigms. Some examples of scaffolding that are on the decline: Vector databases : Companies paying thousands/month for when they could now just put docs in the prompt or use agentic-search instead of RAG ( my article on the topic ) LLM frameworks : These frameworks solved real problems in 2023. In 2025? They’re abstraction layers that slow you down. The best practice is now to use the model API directly. Prompt engineering teams : Companies hiring “prompt engineers” to craft perfect prompts when now current models just need clear instructions with open tools Model fine-tuning : Teams spending months fine-tuning models only for the next generation of out of the box models to outperform their fine-tune (cf my 2024 article on that ) Custom caching layers : Building Redis-backed semantic caches that add latency and complexity when prompt caching is built into the API. This cycle accelerates with every model release. The best AI teams master have critical skills: Deep model awareness : They understand exactly what today’s models can and cannot do, building only the minimal scaffolding needed to bridge capability gaps. Strategic foresight : They distinguish between infrastructure that solves today’s problems versus infrastructure that will survive the next model generation. Frontier vigilance : They treat model releases like breaking news. Missing a single capability announcement from OpenAI, Anthropic, or Google can render months of work obsolete. Ruthless iteration : They celebrate deleting code. When a new model makes their infrastructure redundant, they pivot in days, not months. It’s not easy. Teams are fighting powerful forces: Lack of awareness : Teams don’t realize models have improved enough to eliminate scaffolding (this is massive btw) Sunk cost fallacy : “We spent 3 years building this RAG pipeline!” Fear of regression : “What if the new approach is simple but doesn’t work as well on certain edge cases?” Organizational inertia : Getting approval to delete infrastructure is harder than building it Resume-driven development : “RAG pipeline with vector DB and reranking” looks better on a resume than “put files in prompt” In AI the best team builds for fast obsolescence and stay at the edge. Software engineering sits on top of a complex stack. More layers, more abstractions, more frameworks. Complexity was a sophistication. A simple web form in 2024? React for UI, Redux for state, TypeScript for types, Webpack for bundling, Jest for testing, ESLint for linting, Prettier for formatting, Docker for deployment…. AI is inverting this. The best AI code is simple and close to the model. Experienced engineers look at modern AI codebases and think: “This can’t be right. Where’s the architecture? Where’s the abstraction? Where’s the framework?” The answer: The model ate it bro, get over it. The worst AI codebases are the ones that were best practices 12 months ago. As models improve, the scaffolding becomes technical debt. The sophisticated architecture becomes the liability. The framework becomes the bottleneck. LLMs eat scaffolding for breakfast and the trend is accelerating. Thanks for reading! Subscribe for free to receive new posts and support my work. LLMs can’t read PDF? Let’s build a complex system to convert PDF to markdown LLMs can’t do math? Let’s build compute engine to return accurate numbers LLMs can’t handle structured output? Let’s build complex JSON validators and regex parsers LLMs can’t read images? Let’s use a specialized image to text model to describe the image to the LLM LLMs can’t read more than 3 pages? Let’s build a complex retrieval pipeline with a search engine to feed the best content to the LLM. LLMs can’t reason? Let’s build chain-of-thought logic with forced step-by-step breakdowns, verification loops, and self-consistency checks. Vector databases : Companies paying thousands/month for when they could now just put docs in the prompt or use agentic-search instead of RAG ( my article on the topic ) LLM frameworks : These frameworks solved real problems in 2023. In 2025? They’re abstraction layers that slow you down. The best practice is now to use the model API directly. Prompt engineering teams : Companies hiring “prompt engineers” to craft perfect prompts when now current models just need clear instructions with open tools Model fine-tuning : Teams spending months fine-tuning models only for the next generation of out of the box models to outperform their fine-tune (cf my 2024 article on that ) Custom caching layers : Building Redis-backed semantic caches that add latency and complexity when prompt caching is built into the API. Deep model awareness : They understand exactly what today’s models can and cannot do, building only the minimal scaffolding needed to bridge capability gaps. Strategic foresight : They distinguish between infrastructure that solves today’s problems versus infrastructure that will survive the next model generation. Frontier vigilance : They treat model releases like breaking news. Missing a single capability announcement from OpenAI, Anthropic, or Google can render months of work obsolete. Ruthless iteration : They celebrate deleting code. When a new model makes their infrastructure redundant, they pivot in days, not months. Lack of awareness : Teams don’t realize models have improved enough to eliminate scaffolding (this is massive btw) Sunk cost fallacy : “We spent 3 years building this RAG pipeline!” Fear of regression : “What if the new approach is simple but doesn’t work as well on certain edge cases?” Organizational inertia : Getting approval to delete infrastructure is harder than building it Resume-driven development : “RAG pipeline with vector DB and reranking” looks better on a resume than “put files in prompt”

0 views