Posts in Devops (20 found)

AI for Production

☕ Welcome to The Coder Cafe! These days, most posts about AI for production circle the same ideas: automated remediation, anomaly detection, alerting triage, etc. These are interesting starting points, but they share a common assumption: that AI’s job is to replace what SREs do. In this post, I want to explore the idea of having AI as a cognitive partner, something that extends what a single engineer can hold in their head at once. Get cozy, grab a coffee, and let’s begin!

At Google, I’m an SRE on the Google Distributed Cloud team, where the infrastructure stack spans Kubernetes, Borg, distributed storage, virtualization, networking, and more. Over the past months, I’ve been experimenting with ways AI can help not only by automating work away, but also by reducing the cognitive overhead that makes production work quite overwhelming sometimes. Here are three directions that changed how I thought about the problem.

Situation Awareness

In my team, we have hundreds of dashboards: Kubernetes clusters, Borg jobs, storage metrics, VM utilization, network metrics, etc. Each one tells part of the story. When something went wrong, and I wanted to understand the current state of the system, I needed to spend a significant amount of time opening tabs and cross-referencing panels to get a complete picture.

This is a fundamentally human bottleneck. Each dashboard was designed to answer a specific question. The question “What is the current situation?” doesn’t map to any single dashboard, and navigating all of them to reconstruct an answer takes time we often don’t have.

Interestingly, this is where AI can change the equation. Instead of navigating dashboards, imagine describing your system to an AI agent with access to your observability stack and simply asking: “What’s going on?” The agent queries across your telemetry data, picks out what stands out, and hands you back a coherent narrative, something you can actually act on. Like: “This specific cluster has had an issue for the past two hours with all the containers using distributed storage that run on that specific node.”

This shifts your role from navigator (opening dashboards one by one) to interpreter (acting on a synthesized summary). And that shift matters: every minute you spend navigating is a minute you’re not spending on the actual problem.

Telemetry Archaeology

A few months ago, I was investigating a storage incident on a cluster. The failure itself was clear: a disk issue that surfaced as elevated latency and eventually a service degradation. What wasn’t clear was why it happened when it did.

I used Gemini CLI to navigate the metrics data around the event window. What it surfaced surprised me: the root cause signals had been present in the telemetry hours before the incident triggered any alert. They were subtle correlations across metrics that individually looked like noise: disk read latency creeping slightly upward, I/O wait ticking up on specific nodes, a minor memory pressure pattern. Together, they pointed directly at the failure that was coming.

A human reviewing those dashboards in real time would almost certainly have missed it. Each individual signal was within an acceptable range. The pattern only became visible when we looked at all of them together, across time.

This is what I’d call telemetry archaeology: using AI to go back through your metrics data and surface the correlations an alerting system wasn’t designed to catch.

It’s worth being precise about what makes this different from anomaly detection. Anomaly detection tells you when something looks wrong. Telemetry archaeology is about finding the patterns that appear before anything looks wrong at all, relationships that no one thought to encode into an alert, because no one knew they existed until the incident happened.

The practical implication is significant. If these correlations exist in your past incidents, they likely exist in future ones. An AI agent that continuously monitors for these multi-signal patterns could surface a warning (“This looks like the early stages of what happened last time”) long before your system starts showing symptoms.

Incident Co-Pilot

Active incidents can be cognitively brutal. You can be debugging a live system, managing communication with stakeholders, coordinating with other engineers, and trying to remember what you checked 20 minutes ago, all at the same time.

A common consequence is that the engineer with the deepest system knowledge gets pulled out of deep focus to write status updates, summarize what’s been tried, and maintain a running timeline. This work is necessary, but it’s expensive. Every context switch makes it harder to hold the full mental model of the incident in your head. And once that model fragments, rebuilding it takes time you don’t have.

NOTE: This is actually one of the reasons Google developed the IMAG process, with clear role separation: the Incident Commander (IC) coordinates the overall response, the Communications Lead (CL) handles stakeholder updates, and the Operations Lead (OL) focuses on mitigating the issue. The explicit goal is to prevent any single person from being pulled in too many directions at once.

AI can absorb most of this overhead. Think of it as a second brain that’s been in the room the whole time: it tracks what hypotheses have been tested, which ones were ruled out and why, what changed in the system during the incident window, and what hasn’t been explored yet. When a new engineer joins the investigation, instead of spending ten minutes getting them up to speed, you ask the AI for a summary.

AI’s role here is handling the administrative layer of the incident: the parts that pull you out of flow, so you can stay in the problem instead of constantly being yanked out of it. I’ve been using AI this way during my own shifts.
Even without a purpose-built tool, maintaining a running log with AI (e.g., what we’ve tried, what we know, what’s next) noticeably changes how an incident feels.

AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide.

Summary

- The common “AI for production” narrative focuses on automation and replacement; cognitive augmentation is the underexplored angle.
- Situation awareness: AI can synthesize across hundreds of dashboards to answer “What’s the current situation?” in seconds, shifting your role from navigator to interpreter.
- Telemetry archaeology: AI can surface hidden correlations across metrics that individually look like noise, revealing root cause signals that were present hours before any alert fired.
- Incident co-pilot: AI can absorb the administrative layer of an active incident (status updates, running timeline, hypothesis tracking), keeping the engineer in deep focus instead of constant context switching.
- None of this requires replacing the engineer. The value is in extending what one person can hold in their head under pressure.
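
To make the telemetry archaeology idea concrete, here is a minimal sketch of how several signals can each stay inside an acceptable range while their combination stands out. It is not the author's method or what Gemini CLI does internally; the metric names, sample values, thresholds, and the scaled-sum combination (which treats the signals as independent) are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def zscore(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    sigma = stdev(baseline)
    if sigma == 0:
        return 0.0
    return (mean(recent) - mean(baseline)) / sigma


def joint_drift(metrics, per_signal_limit=3.0, joint_limit=3.0):
    """metrics maps a metric name to (baseline_samples, recent_samples)."""
    scores = {name: zscore(base, recent) for name, (base, recent) in metrics.items()}
    # Sum of z-scores scaled by sqrt(n); assumes independent signals (a simplification).
    combined = sum(scores.values()) / len(scores) ** 0.5
    # "Quiet" means no single signal would have crossed its own alert threshold.
    individually_quiet = all(abs(z) < per_signal_limit for z in scores.values())
    return {
        "scores": scores,
        "combined": combined,
        "early_warning": individually_quiet and combined > joint_limit,
    }


# Hypothetical samples: each metric creeps upward but stays under its own limit.
sample = {
    "disk_read_latency_ms": ([10, 11, 9, 10, 10, 11, 9, 10], [11.5, 12.0, 11.5, 12.0]),
    "io_wait_pct": ([2, 3, 2, 3, 2, 3, 2, 3], [3.5, 3.6, 3.7, 3.6]),
    "memory_pressure_pct": ([40, 42, 38, 40, 41, 39, 40, 40], [42.0, 42.5, 42.0, 42.5]),
}

report = joint_drift(sample)
```

With these made-up numbers every per-signal score stays below 3, yet the combined score exceeds it, so `report["early_warning"]` is set. A real system would need to account for correlated metrics and seasonality before trusting a score like this.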

0 views
Zak Knill Yesterday

LLMs are breaking 20 year old system design

The ‘cloud-native’ architecture of the last decade is built on a 20-year-old assumption: that state lives in the database, and compute is stateless. If you want to scale, you scale the database vertically (get a larger machine), or design the database schema around partitioning the data, and you scale your application servers horizontally (add more boxes). Any request can hit any server, the load balancer doesn’t care, and the database is the single source of truth.
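
As a toy illustration of that assumption, the sketch below keeps all state in one shared store and lets a round-robin balancer send each request to whichever server is next. The names (`DATABASE`, `make_server`, `route`) and the in-memory dict standing in for a database are hypothetical, not taken from the post.

```python
from itertools import cycle

DATABASE = {}  # stand-in for the single source of truth


def make_server(name):
    """Build a stateless request handler that keeps nothing between calls."""
    def handle(user, delta):
        # Read, update, and write back through the shared database only.
        DATABASE[user] = DATABASE.get(user, 0) + delta
        return name, DATABASE[user]
    return handle


# Horizontal scaling: add more boxes to this list.
servers = cycle([make_server("app-1"), make_server("app-2"), make_server("app-3")])


def route(user, delta):
    """Round-robin load balancer: any request may land on any server."""
    return next(servers)(user, delta)


results = [route("alice", 1) for _ in range(4)]
```

Each request for the same user lands on a different server, and the counter still advances correctly because no server holds state of its own.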

0 views
neilzone 2 days ago

Fixing a proxying problem with my HomeAssistantOS installation by replacing nginx proxy manager

tl;dr: I removed the “nginx proxy manager” add-on, and replaced it with the Let’s Encrypt add-on and (second) the nginx add-on.

A couple of months ago, I moved my HomeAssistant installation to HAos. I think that it is fair to say that I was not overly pleased with this. Honestly, I preferred the “Core” python-venv approach, but I also wanted a “supported” installation, and so I switched to HAos.

I got it up and running okay, and I thought that I had got proxying working too, using an add-on called “nginx proxy manager”. This is not something that I had used before; I’d rather just configure nginx myself. Well, either I got something wrong, or it just does not work very well, as I kept having problems using HomeAssistant: it would get stuck on a “loading data” screen, or simply not respond. This bugged me for quite a while.

Annoyingly, the logs available to me within HAos were unhelpful. I couldn’t spot anything indicating a problem. Using the console in my web browser, I noted that some files were not loading correctly, but why that was the case, I wasn’t sure. I thought that I’d had a similar issue with my “Core” installation years ago, which I got down to the issue of the in the file, but that looked correct here (which I was able to check, using the SSH add-on). I tried various parameters in the nginx proxy manager add-on, but to no avail.

In the end, I tried removing the nginx proxy manager add-on, and replacing it with the Let’s Encrypt add-on (which I installed, configured, and ran first), and then the nginx add-on. And it immediately started working correctly. So I don’t know exactly why my original set-up was not working, but at least it is working better now.

0 views

When Escalator Breaks, It Turns Stairs

Read on the website: We need resilient systems that fall back to sanity when broken / discriminating. And not whatever.

0 views
Rob Zolkos 6 days ago

Watch Your Agents

I’ve been telling developers to watch their logs for years. Not just when something is broken. Not just when production is on fire. Watch them while you are building.

Your logs are the closest thing you have to x-ray vision for a web application. Click a button in the browser, watch the request move through the app, and you can see what is really happening behind the scenes. The habit is simple: keep the server log visible while you work. When you do, you start spotting problems long before they become production issues:

- the same query firing 50 times because of an N+1
- a page that feels fine locally but is doing way too much work
- a slow query that needs an index
- an unexpected redirect or extra request
- a cache miss you thought was a cache hit
- a background job being enqueued more often than expected
- parameters coming through in a shape you did not expect

The logs give you immediate feedback. They make the invisible visible.

Coding agents need the same treatment. When you are working with an agent, do not just look at the final diff. Watch what it is doing. Watch the commands it runs, the files it opens, the mistakes it repeats, and the little bits of glue code it keeps inventing along the way. That is the agent equivalent of watching your development log. You are not only checking whether this turn succeeded. You are looking for patterns that can make future turns better.

Most coding agents keep some kind of session history: transcripts, tool calls, command output, file edits, errors, retries, and sometimes timing information. Those logs are useful after the fact. Point the agent at its own session logs and ask it to look for patterns:

- What tasks did you repeat multiple times in this session?
- What code did you generate only to throw away later?
- Which commands failed, and what would have prevented those failures?
- Did you write any one-off scripts that should become checked-in tools?
- Did you repeatedly search for the same files or project conventions?
- Were there project rules you had to infer that should be documented?
- Which parts of the workflow were deterministic enough to automate?
- What should be added to , a skill, or a script?
- If a smaller model had to do this next time, what tools or instructions would it need?

A prompt I like for this:

This is the same habit as watching the Rails log after clicking around a page. You are looking for the part of the system that is doing too much work, guessing too often, or hiding useful signal.

A useful signal is when the model keeps generating code to do the same mechanical task. For example, imagine you have a skill for publishing blog posts. Every time you run it, the model writes a small Ruby or Python snippet to:

- parse front matter
- validate the title, summary, badge, tags, and date
- derive the final filename
- move the draft into

If the agent is generating that code every time, that is a smell. The model is doing work that should probably be deterministic. Ask the agent to turn that behavior into a script:

Then update the skill so future agents call the script instead of improvising the logic.

Bad pattern: every publishing session, the agent manually inspects YAML front matter and tries to remember the required fields.
Better pattern: create that exits non-zero when , , , or are missing or malformed. Now the agent does not need to reason about the rules from scratch. It runs the command and reacts to the result.

Bad pattern: the agent repeatedly writes one-off Python to resize screenshots, compare image dimensions, or calculate visual diffs.
Better pattern: create with clear output like: The agent can use the result without reinventing image processing each time.

Bad pattern: the agent keeps constructing ad hoc SQL to answer common questions like “which users have duplicate active subscriptions?” or “which jobs are stuck?”
Better pattern: create named scripts or Rails tasks: Now the workflow is repeatable, reviewable, and safe to run again.

Bad pattern: the agent writes custom code every time it needs to build a fake webhook payload or API response.
Better pattern: create or a small fixture library that produces known-good examples. The agent stops guessing at payload shapes and starts using something the test suite can trust.

Moving repeated agent behavior into deterministic tools gives you a few wins:

- Dependability: the same input produces the same output.
- Determinism: fewer “creative” variations in routine work.
- Testability: scripts can have tests; improvised reasoning usually cannot.
- Reviewability: a script can be read, improved, and versioned.
- Cost: once the workflow is encoded, you may be able to use a smaller model for that task.
- Speed: future turns spend less time rediscovering the same procedure.

Watch the agent the way you watch your logs. When you see friction, repetition, or uncertainty, ask whether the agent needs better instructions or a better tool. Sometimes the answer is a clearer prompt. Sometimes it is a skill. And sometimes the best thing you can do is take the fragile reasoning out of the model entirely and give it a boring, deterministic script to call. That is not making the agent less useful. That is making the whole system more useful.
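
The front matter example above could look roughly like the following. This is a sketch under stated assumptions, not the author's script: the file name `validate_front_matter.py`, the function names, and the ISO date format are hypothetical, and the parser only understands flat single-line `key: value` pairs, not full YAML. The required fields are the ones listed in the post (title, summary, badge, tags, date).

```python
"""Hypothetical validate_front_matter.py: report missing or malformed fields."""
import sys
from datetime import date

REQUIRED = ("title", "summary", "badge", "tags", "date")


def parse_front_matter(text):
    """Return the key/value pairs found between the leading '---' fences.

    Simplification: only single-line `key: value` pairs are understood.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("missing closing '---'")


def validate(text):
    """Return a list of problems; an empty list means the post is valid."""
    try:
        fields = parse_front_matter(text)
    except ValueError as exc:
        return [str(exc)]
    problems = [f"missing field: {name}" for name in REQUIRED if not fields.get(name)]
    if fields.get("date"):
        try:
            date.fromisoformat(fields["date"])  # assumes dates like 2026-05-01
        except ValueError:
            problems.append(f"malformed date: {fields['date']}")
    return problems


def main(argv):
    """Return a process exit code: 0 when valid, 1 on problems, 2 on misuse."""
    if len(argv) != 2:
        print("usage: validate_front_matter.py POST.md", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        problems = validate(handle.read())
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

Calling `sys.exit(main(sys.argv))` under the usual `if __name__ == "__main__":` guard turns this into a command the agent can run and react to. The design point is that the rules live in one reviewable place, so the agent only has to read an exit code and a short list of problems.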

0 views
David Bushell 6 days ago

Unscrewing lightbulbs

“Giving lightbulbs a MAC address was a mistake that I’m living with. I’m literally unscrewing lightbulbs to renew their DHCP lease” (@dbushell.com, Bluesky)

Instead of enjoying the bank holiday Monday I updated my homelab software. I was ‘inspired’ by the Copy Fail Linux bug to run full distro upgrades. This is my self-hosted update for Spring 2026 (rough documentation to give future me a chance).

Monday’s fun risked a week of pain. I do have backups but restoring them on a broken LAN is tricky. I have an ISP provided wifi router to dust off in an emergency. Along with an absurdly long 15 metre HDMI cable I do not care to unravel. My winter update added a hardware fallback but that too requires careful rejigging.

I have Proxmox hosts, virtual machines, and Raspberry DietPis. They were all on Debian 12 (Bookworm) with a kernel potentially susceptible to the bug. Minimal Debian installs are perfect because I run everything in Docker anyway. Data volumes are easy to back up or network mount. I can change host at will for any service. Debian is just sensible, well-documented, no-fuss Linux.

I used to run “minimal” Ubuntu server. Following 24.04 I found myself debloating most of the Ubuntu part (i.e. snaps). It sounds like the new coreutils are a CVE party. Glad I escaped before that drama! As it happens, this week’s Linux Unplugged episode had Canonical’s VP of Engineering spewing embarrassing AI platitudes. “Ubuntu is not for you” was the only thing said worth remembering.

I updated most of my VMs first because they’re easy to restore if anything fails. I followed Lubos Rendek’s guide. Start with a full package update and then change the package sources before running another step-by-step upgrade. The only non-Debian sources I have are Docker and Tailscale. Yes that means I run Docker inside Proxmox VMs, and you can’t stop me! That’s not even my worst crime…

After the Trixie upgrade I found VMs were failing to obtain a LAN IP address. The virtual network device had been renamed from to . I edited and just changed the reference. There is surely a better/more predictable fix but this was the quickest. The same name was used across all VMs so I guess 18 is the magic number. Everything has been stable so far. If issues arise I’ll just nuke and pave from a Debian 13 ISO. Docker config and volumes are backed up independently of the VM images.

DietPi has a long Trixie upgrade post I didn’t read. I just curled to bash: I gave the script a cursory glance before hitting enter. I have a Pi 4 running failover DNS and a Pi 5 running my public Forgejo instance. DietPi is ideal because of the tiny footprint; I run Docker here too. Raspberry Pi still hasn’t merged upstream Copy Fail fixes. I’m already in trouble if this bug can be exploited but I did the temporary fix out of caution.

I wasn’t going to bother with Proxmox 9 but after a GUI update I was informed version 8 “end of life” was August 2026. That is soon! I followed the official upgrade guide on my Mini-ITX server. Proxmox has a tool to check compatibility. I saw no red lights so I stopped all VMs, updated package sources to Trixie, and ran the upgrade. It is critical to run again before rebooting. I ran into the systemd-boot issue. Apparently if this is not removed the system fails to boot. If my particular box fails to boot I’m in big trouble because I broke video output and have yet to fix it.

I have another Proxmox machine running virtualised OPNsense for my home router. I can’t stop the OPNsense VM and upgrade the host to Proxmox 9 because the host would have no network access. I had two options:

1. Use my failover VM
2. YOLO it live

I specifically set up option 1 for such a purpose. I went with option 2. I figured any software running in memory is still alive until I reboot, right? I didn’t question whether Proxmox would kill any processes itself (it didn’t). The update was suspiciously fast. I ran again and saw a lot of yellow warnings. Yikes. Eventually I noticed I’d failed to update some sources to Trixie and I’d installed a franken-distro. After fixing mistakes all I could do was reboot and pray for an agonising two minutes.

OPNsense is the only non-Debian operating system in my homelab. I manage it entirely via the web GUI. The 26.1 update had quite a few significant changes. My DHCP setup was considered “legacy” and my firewall rules required a manual migration.

Despite dumbening my smart home my lightbulbs still demand a WiFi connection. I program them myself to avoid Home Assistant and proprietary apps. Turns out I hard-coded IP addresses (discovery protocols are a joke). Despite having dynamic IPs they remained stable until the OPNsense 26.1 DHCP update. I had no easy way to identify each light. Why would they name themselves anything useful? That’s how I ended up unscrewing the bulbs one by one to see which MAC address fell off the network. I gave them static IPs on a VLAN for future me to appreciate.

And with that, my home network is up to date! Thanks for reading!

0 views
Sean Goedecke 6 days ago

Notes on incidents

Incidents are boring. Most of what you actually do during an incident is wait: for some other team to investigate, or for a deploy to finish, or for the result of some change to become apparent, or for someone else who’s been paged to come online. It’s stressful, but there’s often just not that much to do.

Most incidents resolve on their own. People love to share war stories about incidents where some hero engineer improvised a clever fix that instantly repaired the system. That rarely happens. Well-designed software systems tend to come good by themselves, and many modern systems are at least partly well-designed, by virtue of being built out of really solid pieces. If a server process is crashing or leaking memory, Kubernetes will kill the pod and bring it back up. If a service is overloaded and jammed up, clients will (hopefully) trigger circuit breakers and back off until it can recover. Temporary spikes in expensive operations will often just fill up a queue instead of taking the entire system down. Most incident calls I’ve been on - well over half - would have come good by themselves in roughly the same time without any human intervention.

Most incident-resolving actions make incidents worse. Engineers jump too quickly to resolve incidents. Oh, the queue size is huge? Don’t worry, I’m here in a production console to clear the queue! Unfortunately, some of the jobs I just nuked were doing important billing work and aren’t automatically re-queued, so this queue-latency incident just became a billing incident as well. Another classic in this genre is “engineer forces a series of redeploys to ‘fix’ a concerning-looking metric, and the concurrent deploys cause far more stress on the system than whatever was causing the metric to look weird”.

For that reason, the first thing you should do in an incident is nothing. When I was paged late at night, I used to have a habit of pouring myself a glass of scotch before I joined the call. This was only partly for the tranquilizing effects of alcohol: the main reason was to have a ritual I could go through to convince myself that I wasn’t rushing, and that it was OK to take a few breaths and relax before jumping into the problem [1]. Making a cup of tea or going for a walk around the house would probably have served as well.

Effective incident-resolving actions are often dull. Typically the action needed to resolve the incident - assuming it doesn’t resolve on its own - is to temporarily disable some problematic feature until the system recovers. This is never a complex code change. Typically someone spends five minutes putting together the patch, and then an hour waiting for reviews, CI, and deploying. If you’re very lucky, you’ll get to write a “wrap a cache around it” code change.

In an incident, there is no substitute for knowledge of the system. Five strong engineers can troubleshoot on an incident call and get nowhere, while one half-drunk engineer who’s familiar with the codebase can swan in and immediately fix the problem. This is because the kinds of actions that resolve incidents are so simple: if you’ve been the one working on the project, you likely already know exactly what feature flag to check and disable, or what code change to revert.

Resolving incidents requires courage. Incident calls can be scary. When engineers are scared, they often reach for consensus: hedging their statements, asking the group if they agree a particular course of action is safe, deferring to each other, and so on. But if you’re the one with knowledge of the system, you have to be decisive. Say “I’m going to do X”, wait thirty seconds, then do it. While it’s usually net-negative to have a powerful manager fidgeting on the incident call, this is one of the rare cases where it can be helpful - executives are very comfortable saying “okay, do it now” about technical courses of action they don’t fully understand.

Resolving incidents buys a lot of political credit. One thing that I think surprises a lot of engineers who are new to on-call is how grateful managers and executives are for even really simple fixes (i.e. “turn off the feature flag”). This is because incidents are one of the few times that non-technical leadership are directly confronted with their lack of control over the technical sphere. When the team is building a product, your VP has a lot of freedom to guide the process and make decisions. But when there’s an active incident, they have to just sit there and trust that their technical employees are going to pull them out of the fire. It’s a scary situation, particularly for someone who’s used to exercising a degree of power in the workplace.

However, always resolving incidents is (by itself) not a durable position of power. This is a little counter-intuitive. Surely if you’re always resolving incidents, you’re indispensable? The problem is that incident-resolving work is almost always so technical as to be completely opaque to executives. They know the incident has resolved, but they don’t know if you made a heroic effort or merely did the obvious thing. They also can’t point to your successes as theirs (which is always the most reliable way to get VPs and directors on your side), because incidents are expected to be fixed, and it’s always better not to have had the incident at all.

[1] I don’t need to do this anymore because I just don’t get as keyed up about incidents as I used to.
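
For readers unfamiliar with the client-side circuit breakers mentioned earlier in this post, here is a minimal sketch of the idea: after repeated failures the client stops calling the service for a cooldown period, which gives the service room to recover without human intervention. The class name, thresholds, and injectable clock are hypothetical choices for illustration and do not come from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected to give the service time to recover."""


class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: back off instead of adding load.
                raise CircuitOpenError("backing off")
            # Cooldown elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping outbound calls as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is whatever client function you already have) is enough for an overloaded dependency to see traffic drop off on its own.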

0 views

Building the deployment tool I wish I had

Deptool is a new declarative configuration deployment tool that I built for myself. In this post I describe the design, and I explain what problems it solves.

0 views
iDiallo 1 weeks ago

AI didn't delete your database, you did

Last week, a tweet went viral showing a guy claiming that a Cursor/Claude agent deleted his company's production database . We watched from the sidelines as he tried to get a confession from the agent: "Why did you delete it when you were told never to perform this action?" Then he tried to parse the answer to either learn from his mistake or warn us about the dangers of AI agents. I have a question too: why do you have an API endpoint that deletes your entire production database? His post rambled on about false marketing in AI, bad customer support, and so on. What was missing was accountability. I'm not one to blindly defend AI, I always err on the side of caution. But I also know you can't blame a tool for your own mistakes. In 2010, I worked with a company that had a very manual deployment process. We used SVN for version control. To deploy, we had to copy trunk, the equivalent of the master branch, into a release folder labeled with a release date. Then we made a second copy of that release and called it "current." That way, pulling the current folder always gave you the latest release. One day, while deploying, I accidentally copied trunk twice. To fix it via the CLI, I edited my previous command to delete the duplicate. Then I continued the deployment without any issues... or so I thought. Turns out, I hadn't deleted the duplicate copy at all. I had edited the wrong command and deleted trunk instead. Later that day, another developer was confused when he couldn't find it. All hell broke loose. Managers scrambled, meetings were called. By the time the news reached my team, the lead developer had already run a command to revert the deletion. He checked the logs, saw that I was responsible, and my next task was to write a script to automate our deployment process so this kind of mistake couldn't happen again. Before the day was over, we had a more robust system in place. One that eventually grew into a full CI/CD pipeline. 
Automation helps eliminate the silly mistakes that come with manual, repetitive work. We could have easily gone around asking "Why didn't SVN prevent us from deleting trunk?" But the real problem was our manual process. Unlike machines, we can't repeat a task exactly the same way every single day. We are bound to slip up eventually. With AI generating large swaths of code, we get the illusion of that same security. But automation means doing the same thing the same way every time. AI is more like me copying and pasting branches: it's bound to make mistakes, and it's not equipped to explain why it did what it did. The terms we use, like "thinking" and "reasoning," may look like reflection from an intelligent agent. But these are marketing terms slapped on top of AI. In reality, the models are still just generating tokens. Now, back to the main problem this guy faced. Why does a public-facing API that can delete all your production databases even exist? If the AI hadn't called that endpoint, someone else eventually would have. It's like putting a self-destruct button on your car's dashboard. You have every reason not to press it, because you like your car and it takes you from point A to point B. But a motivated toddler who wiggles out of his car seat will hit that big red button the moment he sees it. You can't then interrogate the child about his reasoning. Mine would have answered simply: "I did it because I did it." I suspect a large part of this company's application was vibe-coded. The software architects used AI to spec the product from AI-generated descriptions provided by the product team. The developers used AI to write the code. The reviewers used AI to approve it. Now, when a bug appears, the only option is to interrogate yet another AI for answers, probably not even running on the same GPU that generated the original code. You can't blame the GPU! The simple solution is to know what you're deploying to production.
The more realistic one is, if you're going to use AI extensively, build a process where competent developers use it as a tool to augment their work, not a way to avoid accountability. And please, don't let your CEO or CTO write the code.
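For the SVN story above, the automation could have looked roughly like the sketch below. It is not the script we actually wrote; the `releases/` folder name, the example repository URL, and the function names are all hypothetical. It only builds the `svn copy` and `svn delete` commands, so the plan can be reviewed before anything runs, and every deletion goes through a guard that refuses to touch trunk.

```python
from datetime import date

# Paths that the script must never delete, whatever the caller asks for.
PROTECTED_PATHS = {"trunk"}


def delete_command(repo_url: str, path: str) -> list[str]:
    """Build an `svn delete` command, refusing to target a protected path."""
    clean = path.strip("/")
    if clean in PROTECTED_PATHS:
        raise PermissionError(f"refusing to delete protected path: {clean}")
    return ["svn", "delete", f"{repo_url}/{clean}", "-m", f"Remove {clean}"]


def release_commands(repo_url: str, day: date, current_exists: bool) -> list[list[str]]:
    """Plan a release: copy trunk to a dated folder, then repoint `current` at it."""
    release = f"releases/{day.isoformat()}"
    commands = [
        ["svn", "copy", f"{repo_url}/trunk", f"{repo_url}/{release}",
         "-m", f"Release {day.isoformat()}"],
    ]
    if current_exists:
        # Deletions always go through the guard, so this step cannot remove trunk.
        commands.append(delete_command(repo_url, "current"))
    commands.append(
        ["svn", "copy", f"{repo_url}/{release}", f"{repo_url}/current",
         "-m", f"Point current at {release}"]
    )
    return commands
```

The same idea applies to the endpoint in the viral story: the destructive operation should not be reachable at all from the path that routine work (human or AI) takes.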

0 views
マリウス 1 weeks ago

I Do Not Recommend Bitwarden

Almost four years ago I published a guide on how to run your own LastPass on hardened OpenBSD, in which I explained how to set up an OpenBSD instance, either as a cloud instance or as a Raspberry Pi bare metal installation, that would host Vaultwarden as a backend for the Bitwarden client applications. After having used a similar approach for myself for several years now, I came to the conclusion that I do not recommend the use of Bitwarden any longer. Let me explain. Wikipedia describes Bitwarden as a freemium open-source password management service that is used to store sensitive information […] owned and developed by Bitwarden, Inc., and that is now almost ten years old. The company behind the software is not only developing the Bitwarden server, as well as client applications for most platforms, but it is also offering a SaaS product for users who don’t want to put up with hosting this unwieldy beast on their own. More on this in just a moment. Bitwarden’s pricing for their hosted offering is similar to their competitors' offerings, albeit with differences in terms of functionality. Regardless of whether one picks their hosted offering or decides to self-host, however, the client applications remain the same. Since 2022, Bitwarden is also backed by $100M of PSG growth equity, joined by Battery Ventures. A password manager that wants to remain open-source is one thing, but the same password manager with an investor on its board that needs to see a return on $100M is another. Without wanting to sound overly cynical, this is usually the point in time in which the rent-seeking begins and the product slowly shifts from serving its users to serving its investors. If you decide to self-host Bitwarden, however, you will relatively quickly find yourself in what I would describe as enterprise software hell.
The standard Bitwarden server deployment is a heavy-weight C# backend that ships with MSSQL Express and won’t work with more Linux-native databases like PostgreSQL or MariaDB . Depending on the size of the deployment and the requirements with regard to high availability, you might want to utilize Kubernetes, which in turn adds additional overhead and complexity. Because of this, many smaller to medium-sized deployments prefer to look into Vaultwarden instead, which is an unofficial Bitwarden-compatible server written in Rust™ . The simple and lightweight nature of Vaultwarden compared to the official Bitwarden server makes such a big difference for administrators that the unofficial server project has seemingly three times the stargazers on GitHub as compared to Bitwarden ’s official implementation. This should make you think, especially as a series B -funded company with $100M, whether your (technical) users appreciate the current direction your software stack is heading towards, or whether you might want to look into bringing the people that built a vastly more successful backend implementation on-board to optimize and accelerate your official stack. And surely that’s what Bitwarden decided to do, right? Sadly, however, it seems that Bitwarden ’s NIH syndrome was too strong to simply take over Vaultwarden as an official project. Instead, the company seemingly hired the main developer of the Vaultwarden project and decided to publish a “lighter” version of their existing backend dubbed Bitwarden unified lite , which is still a service built on Microsoft ’s .NET , and which still appears to require more than three times the RAM a Vaultwarden instance usually consumes. Regarding the open-source part of Bitwarden , things have been getting murkier over the past year or so. In late 2024, users started noticing that a new dependency, , had been pulled into the clients. 
Its license read: You may not use this SDK to develop applications for use with software other than Bitwarden (including non-compatible implementations of Bitwarden) or to develop another SDK. For a product that prides itself on being open-source, this is a fairly significant plot twist . After considerable backlash in the community, however, Bitwarden called it a “packaging bug” and eventually relicensed the SDK under GPLv3 . Technically, the issue is resolved. Philosophically, however, this episode tells you all you need to know about where Bitwarden is heading: The freeware parts are bait , the actual product is the SaaS subscription, and the community is there to contribute issues and translations as long as it doesn’t cost the company anything. Setting aside the backend, however, the real culprit with regard to Bitwarden are the client applications. Advertised functions do not work as expected, basic features are non-existent (after ten years!) and the user interface is poor to put it mildly, especially when compared to equally priced alternatives. And don’t get me wrong, if Bitwarden was purely a FOSS-effort and not funded by venture capital all these flaws could be brushed aside because, after all, it would be a community effort. However, Bitwarden isn’t a community effort , which is reflected very noticeably in the bureaucratic processes they drowned the community in, but more on this in a moment. About a year ago, I supported someone who tried to switch from a competitor to Bitwarden under the thought of rather supporting open-source software with a yearly subscription than some proprietary platform that one has no insights into. Part of the migration was naturally importing existing vaults from the previous password manager into the new Bitwarden account. As can be seen in my bug report on GitHub , however, this went sideways very quickly, and resulted in at least one vault requiring significant technical workarounds for the import to work. 
The response from what sounded like an official Bitwarden employee left me frankly stunned. Despite the migration/import feature being advertised in multiple places throughout Bitwarden ’s marketing materials and documentation, and despite dozens of users having already complained about the exact same issue, Bitwarden simply decided to ignore the issue report and instead requested opening another likely dead-ended discussion in their community forum. This level of corporate bureaucracy is not at all what open-source software should look and feel like, and it is definitely completely unjustified for a feature that is being advertised on both the open-source software, as well as the paid product, but that simply does not work as advertised. Similarly, many other issues are funneled through this process of community discussions , which more often than not turn out as not much more than lengthy threads of pointless back-and-forth, and almost never materialize in actual implementations. Note: The same import was tested with proprietary alternatives to Bitwarden and worked flawlessly. Migration pain is not limited to the initial import. Even when you’re already inside Bitwarden and simply want to shuffle entries between an organization vault and your individual vault, or the other way around, there is, to this day, no proper “move the selected items to …” feature. For a handful of logins you can clone/edit each one manually, but anyone who has ever tried this with a few hundred items (say, after cleaning up a collection , leaving a company, or consolidating several organizations ) knows that this quickly becomes a carpal tunnel -inducing exercise. The official workaround that Bitwarden support and community threads recommend is to export the source vault as unencrypted JSON , edit the file, and then re-import it into the destination vault. 
Setting aside the obvious security footgun of having 500+ credentials sitting in plain text in , or worse, a directory that’s silently synced to the cloud (think Dropbox , OneDrive , iCloud , …) while you figure out where to put them, the process happily loses a non-trivial amount of data along the way: […] if there are file attachments in any of your vault items, then these will not be included in the export […] the export will not include items in the Trash , or any password histories or timestamps. For any organization that relies on attachments (e.g. SSH key files, licence keys, recovery codes as images) or on password history for compliance/audit reasons, this is plainly unacceptable. For a product whose entire job is to be the source of truth for your credentials, the complete absence of a “move these 500 items to that vault, keep everything intact, click OK” button in year ten of its existence speaks volumes about where Bitwarden ’s engineering priorities lie. Another example concerns client updates. It appears that Bitwarden pushes new updates to their clients that can lead to vaults becoming inaccessible (on the client side) at random, without any heads-up to the users. I personally encountered this issue while travelling. When I had my phone plugged-in overnight, F-Droid decided it’s a good time to update a few apps, one of which was Bitwarden . The next morning I had to log into my banking and when I opened the Bitwarden app on my phone I was unable to access my vault. It took some time to figure out what was going on ( via Vaultwarden ), and I was lucky that I had my UPDC (which hosts my Bitwarden backend) with me, as otherwise I could have ended up in a pretty bad situation with my whole vault being unavailable. The sheer irresponsibility with which Bitwarden appears to push what looks like breaking protocol changes between the clients and the backend is frightening. 
As someone who relies heavily on my password manager to work in offline mode, this experience taught me that Bitwarden cannot be trusted. From that moment on, I disabled automatic updates for the Bitwarden clients and exported a current snapshot of all passwords to a local backup in KeePassChi / KeePassXC / KeePassDX . This is, by the way, not a Vaultwarden -specific issue, despite Bitwarden staff claiming so. Searches through the repository return a long list of very similar reports, for example around the 2025.12.x release introducing regressions that prompted users for the master password twice after login and then crashed the app, or the 2025.6.0 release that simply crashed on startup for many users. The Android app in particular went through a full rewrite from .NET MAUI to native Kotlin in 2024, which shipped alongside a trail of regressions that continue to show up in quarterly releases. Aside from the aforementioned technical details, Bitwarden is (and has always been) one of the subjectively worst applications on my phones and my desktop in terms of user interface. The UI/UX is in fact so horrible, that even after years of use I still dread opening the ungoogled-chromium extension, let alone any of the desktop and mobile apps. Aside from the fact that building the Electron -based desktop app from source is a huge PITA and that the pre-built Flatpaks are not working properly on Wayland , one more general, major issue that I’m experiencing with the Bitwarden client applications (and extensions) is the fact that while they clearly support offline use, they’re not intentionally built for it. Hence, whenever I open the mobile app or the browser extension, there’s a noticeable delay that sometimes takes literal seconds or even minutes, in which the client application seemingly tries to reach the backend, which often isn’t around (because I’m not hosting my Bitwarden backend on the open internet). 
While this sounds like a nitpick, it truly slows down things whenever one has to unlock Bitwarden (which is almost always, as I do not trust especially the browser extension to remain unlocked all the time). Sadly, there seems to be no way to turn off syncing when unlocking the vault to prevent the clients from waiting unnecessarily. Another example of a bad user experience is the logins overview (titled Vault ). Whenever I am on a website (in my desktop browser) and I would like Bitwarden to fill the login form, I tend to click the extension’s icon in the toolbar and then click the entry in the list. This has been how all other password manager UIs that I have used in the past have worked; Not Bitwarden , though. There, you need to click the small Fill button on the right side of the list item. If you click the big list item itself, which is highlighted on mouse-over, you simply open that item to show its details. Instead of allowing the user to click the big UI element (which is the whole list item), Bitwarden forces them to click a significantly smaller, harder to hit UI element (a button on top of a clickable list item). As with the syncing feature, there’s also no way to flip this behavior, so that clicking the list item would fill in the form, while clicking the tiny button would open the item’s details page. I’m apparently not alone in this sentiment. A quick glance at recurring Hacker News threads on the topic reveals that users have been complaining about pretty much every single one of these issues, ranging from the desktop app not focusing correctly when opened , to “loading for over 5 minutes before showing my passwords” , to the browser extension asking to save passwords that are already there , to broken biometric login on iOS, laggy mobile apps, and, of course, the famous “Log-In suggestions not showing” . 
Feature requests that have been sitting in the community forum since 2021 (such as a simple edit history for entries) remain untouched, which is a pattern that MSP resellers also called out publicly as “glacial feature development” . Speaking about lists, the Bitwarden CLI has an equally bad user interface. For example, the command of the tool will unexpectedly output every detail of every item, including passwords and TOTP codes, without the need for an additional e.g. flag. There’s no way that reasonable engineers looked at this and said “Yep, that’s how we do things, because we cannot imagine a single situation in which anyone might mistakenly pipe to some place and unintentionally expose all their credentials” . Also, can we take a step back and talk about the fact that the Bitwarden CLI is a terminal tool built in TypeScript ? Not only because it requires a metric ton of runtime and dependencies, but also because JavaScript isn’t exactly the stack anymore that you’d run carefree on your continuous integration environments. “Why?” , you ask? Hold my beer… A password manager has, essentially, one job : Keeping the user safe, by keeping their credentials safe. For a product that has been around since 2016 , Bitwarden has accumulated a surprisingly long list of incidents in which it at least partially failed at exactly that task. And no, I’m not talking about theoretical vulnerabilities, I’m talking about things that actually shipped to production. In January 2023, shortly after the LastPass breach had the entire industry questioning the real-world strength of cloud-hosted password vaults, security researcher Wladimir Palant published an analysis showing that Bitwarden ’s advertised 200,001 PBKDF2 iterations were, in practice, closer to 100,000 . The reason was that the additional server-side iterations were only applied to the master password hash used for login , but not to the encryption key protecting the vault data. 
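The split between the login hash and the vault key can be illustrated with a simplified model. This is a sketch of the scheme as the paragraph above describes it, not Bitwarden’s actual code: the function names, the choice of salts, and the single-iteration login hash are assumptions made to keep the example short. It uses Python’s standard `hashlib.pbkdf2_hmac`.

```python
import hashlib


def client_master_key(password: str, email: str, client_iterations: int) -> bytes:
    """Client side: stretch the master password; this key protects the vault data."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), email.encode(), client_iterations)


def client_login_hash(master_key: bytes, password: str) -> bytes:
    """Client side: the value sent to the server to authenticate (not the key itself)."""
    return hashlib.pbkdf2_hmac("sha256", master_key, password.encode(), 1)


def server_stored_hash(login_hash: bytes, server_salt: bytes, server_iterations: int) -> bytes:
    """Server side: extra iterations applied to the login hash before storing it."""
    return hashlib.pbkdf2_hmac("sha256", login_hash, server_salt, server_iterations)


def iterations_per_vault_guess(client_iterations: int, server_iterations: int) -> int:
    """Work per password guess for an attacker holding the leaked, encrypted vault.

    A guess is checked by deriving the master key and trying to decrypt the
    vault, so the server-side iterations never enter the calculation.
    """
    return client_iterations
```

In this model the server-side iterations only harden the stored login hash. Anyone attacking the vault data directly pays the client-side cost and nothing more, which is why the advertised total overstated the real protection.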
An attacker with access to a leaked vault could therefore bypass the server entirely and was left with the same effective security as with LastPass . Additionally, the default client-side iteration count was still at 100,000 , below OWASP recommendations at the time, and a concern that had been raised as far back as 2020 . Bitwarden eventually raised the default to 600,000 and added Argon2 support, but (mirroring LastPass ’ earlier mistakes) the change initially applied only to new accounts, leaving existing users responsible for manually updating their own KDF settings. Still in 2023, RedTeam Pentesting disclosed “Bitwarden Heist” ( CVE-2023-27706 ), a vulnerability in the Windows desktop client that allowed attackers with domain-administrator access to extract the vault decryption key from the local DPAPI storage without ever prompting Windows Hello or the master password. In the words of the researchers: Any process running as the low-privileged user session can simply ask DPAPI for the credentials to unlock the vault, no questions asked. The fix eventually shipped in version 2023.4.0 , months after initial disclosure. Also in 2023, CVE-2023-27974 was disclosed. The vulnerability was about the Bitwarden browser extension, which happily offered to fill credentials into cross-domain iframes embedded on trusted pages, as long as the base domain matched. Meaning, if embedded an iframe from (e.g. on a subdomain controlled by a third party), credentials could be stolen. Bitwarden ’s response was that iframes “must be handled this way for compatibility reasons” , and that “Auto-fill on page load” was not enabled by default. Small comfort if you did enable it. Fast-forward to August 2025, when security researcher Marek Tóth publicly disclosed a class of DOM-based clickjacking attacks that could trick the Bitwarden browser extension into autofilling credit card details and personal information after a single click on a malicious page. 
The vulnerability had been reported four months earlier, in April 2025, but was classified by Bitwarden as “moderate severity” and was not patched until version 2025.8.2 , shipped on the very day the researcher’s embargo expired. And then, a few days before I started writing this post, news broke that the official Bitwarden CLI client ( ) was compromised in the ongoing Checkmarx supply chain attack : The affected package version appears to be , and the malicious code was published in , a file included in the package contents. The attack appears to have leveraged a compromised GitHub Action in Bitwarden’s CI/CD pipeline , consistent with the pattern seen across other affected repositories in this campaign. Organizations that installed the malicious Bitwarden npm package should treat this incident as a credential exposure and CI/CD compromise event . The payload downloaded the Bun runtime, decrypted a second-stage Shai-Hulud worm and started harvesting GitHub and npm tokens, SSH keys, shell history, AWS , GCP , Azure credentials, GitHub Actions secrets, and even MCP configuration files used by AI tooling. The data was then exfiltrated by auto-creating a public repository on the victim’s own GitHub account and uploading the stolen credentials there. Bitwarden ’s npm distribution pipeline stayed compromised for approximately 19 hours and 334 developers had enough time to pull the malicious package before it was caught. Bitwarden ’s official statement emphasised that no end-user vault data was accessed , which is technically true and entirely beside the point. Everyone running in a CI pipeline just handed the attackers whatever else happened to live on that machine. For a company whose one job is keeping secrets safe, distributing an actively malicious CLI through its official channels is not a great look. It also ties back nicely to the earlier rant about shipping a password manager CLI as a Node package. 
Had been a single statically-linked binary in Go or Rust (as most of the ecosystem has moved towards) the npm -shaped blast radius simply wouldn’t exist in that form. And while supply-chain attacks within the Go and Rust ecosystems are on the rise as well, the barriers for successful attacks are still higher. Note: None of the above incidents are world-ending on their own. Every non-trivial piece of software will ship with bugs, and critical vulnerabilities happen to everyone. What bothers me is the pattern . The reactive (rather than proactive) security posture, the “working-as-intended” responses to embarrassing findings, the reliance on a Node.js toolchain for a security-critical CLI, and the fact that several of these issues had been quietly flagged by external researchers long before they were actually addressed. As this post is not an ad-driven hit-piece by any of Bitwarden ’s competitors, you won’t be reading anything along the lines of "… switch to <insert SaaS product here> now and get 50% off your first year with promo code SWORDFISH" . Instead, I will describe the approach that I’m taking moving forward, which might be something that you, as an equally frustrated long-time Bitwarden user, might be interested in exploring as well. Over the past years, I came to the conclusion that there’s no single password manager that will work perfectly for every use case and setup. For example, in my personal life, I do not need the ability to share vaults or individual passwords with other people. In my professional life, however, that is a fairly common occurrence. Similarly, the login credentials for bank accounts or insurance portals do not need to be available through a CLI tool, but they have to be available across multiple devices. Secrets for cloud storage or SSH private keys for deployments, however, don’t need to sync to any of my phones , but they do need to be accessible from a command-line tool that can be invoked programmatically. 
With these requirements in mind, it only makes sense to think of a way to better compartmentalize each set of credentials, rather than trying to find a single piece of software or platform that can kill ten birds with one stone. Also, looking at it from a security perspective, it makes total sense to split up these password groups across different software and services in order to minimize the impact that a data breach might have. Generally, the approach that I came up with splits my credentials into the following groups:
A: Credentials for professional/client projects (think platform logins, etc.)
B: Credentials for accounts containing PII (think bank accounts, online shops, etc.)
C: Credentials for accounts that do not contain PII (think accounts on internet forums, online platforms, etc.)
D: Credentials for infrastructure (think server logins, SSH keys)
E: One-off credentials (think API keys, tokens, etc.)
For group A I’m going with a SaaS password manager that offers proper vault sharing, integrates with the tools clients actually use (SSO, browser extensions on corporate machines, audit logs), and takes the hosting burden off my plate. The platform is proprietary, which I would normally not be thrilled about, but given that the scope of this group is client work only, I’m accepting the trade-off. For group B, the rationale is a bit counter-intuitive at first. The accounts tied to these credentials already contain personal information like name, address, date of birth, maybe payment details, which is regularly leaked by the very same services anyway, as a quick look at Have I Been Pwned confirms. A breach of the password manager itself would therefore not meaningfully expand the attacker’s knowledge. With TOTP and Passkeys in place, it frankly doesn’t even matter anymore at this point. What does matter here is cross-device availability, reliability and offline capabilities. I’m using a second, separate cloud-based password manager for this group, from a different vendor, with a different master password and different recovery mechanisms, so that a compromise of group A doesn’t automatically compromise group B and vice-versa. As I will be running their mobile app on at least one GrapheneOS device, I prefer a solution that doesn’t depend on Google Play Services and ideally offers an open-source/source-available client.
Group C covers all the accounts I have on internet forums, websites, privacy-respecting services, and anything that doesn’t hold PII. For these, I don’t need, nor do I want, a cloud service. I’m using KeePassChi / KeePassXC / KeePassDX with the database file sitting in a folder that is being synced across my devices via Syncthing, which is an approach I have already written about in the past. The file is itself encrypted, which means that even if Syncthing were compromised (and the attacker somehow got their hands on the file), they would still need to break the KeePassChi / KeePassXC encryption to get anything useful out of it. On mobile, KeePassDX on Android reads the same file without fuss. For group D, I’m using a mixed approach of storing personal credentials using the same approach taken in group C, and credentials that are actually used by scripts, CI jobs, and remote servers, using HashiCorp Vault, which is the same one I was already running for PKI in my OpenBSD setup. Vault is a bit of an overkill for a single user, but it gives me proper access policies, token-based authentication for automated agents, short-lived credentials for things that support it, and audit logs. Having said that, I’m looking into Infisical. For group E, the API keys, personal access tokens, and random secrets that I only ever use from the command line, I’ve settled on the venerable utility. It stores each secret as an individual GPG-encrypted file in a Git repository, which is conceptually simple, easy to audit, and cooperates perfectly with shell scripts and my dotfiles. The Git repository lives on my own infrastructure, not on GitHub, and it’s only synced manually when I actually need to access it from a different machine. This might all sound like a lot of moving parts, and I understand if it looks like overkill for someone coming from a single-vault world.
The reality, however, is that after years of using Bitwarden as a one size fits all solution, I realised that one size fits all meant one size fits poorly. Splitting credentials across multiple tools turned out to be significantly less painful than I had initially assumed, mostly because each tool is individually well-suited to its specific task. And if any one of them gets breached, the blast radius is limited to one category of secrets, not the whole lot. After several years of self-hosting Bitwarden, I’ve come to the conclusion that the product has drifted further and further away from what I originally signed up for. The enterprise-first architecture that barely fits on a Raspberry Pi, the half-hearted attempt at a “lighter” backend, the SDK licensing situation, the slow pace at which features are being addressed, the avoidable UX paper-cuts that haven’t been fixed in years, and finally the string of security issues that shouldn’t have shipped in the first place, all paint a picture that I find hard to reconcile with the “open-source password manager for everyone” narrative. I’m not suggesting that the alternatives are universally better or free of their own issues, because password managers are simply hard, and every player in this space has its fair share of skeletons. What I am suggesting is that you take a hard look at how much trust you are placing into a single piece of software for all of your credentials, and whether that bet is still the right one, which for me, it no longer was. Here are some other views on this topic:
Ask HN: Alternatives to Bitwarden?
Bitwarden CLI Compromised in Ongoing Checkmarx Supply Chain Campaign
Concerns Over Bitwarden Moving Away from Open Source

0 views
Allen Pike 1 weeks ago

We Can Do Hard Things

Years ago, back when I was leading a mobile dev team, my friend had an idea for a business. You see, back then the most frustrating thing about mobile dev was the final step: getting your app on actual phones. Builds, provisioning, and code signing made for a harrowing trial, festooned with obtuse errors and other sharp spikes. So, Dennis had a pitch for me. “What if,” he asked, “we did all your apps’ builds and provisioning and signing for you, in the cloud?” I raised an eyebrow. “Well, obviously that would be great. In theory. But it would be too annoying to build that. Apple drops Xcode versions and switches submission requirements with no warning. And you’d need to make sure that…” He stopped me with a wave. “Right, but: if we did it, and it worked. Would you use it?” “Well, of course we would. But I don’t think you want to run this.” My attempt to discourage him didn’t work. Perversely, the idea that this was a hard problem got him more excited. He immediately dove in. Three years later, Buddybuild was acquired with fanfare. They’d accomplished what they set out to do, made a tidy profit, and they were even able to keep their team here in Vancouver. Wisely they ignored me, and chose to do the hard thing. Doing something hard yet pointless is foolish. But doing something hard yet valuable has a lot of benefits. It’s easier to recruit a great team to tackle hard, worthwhile problems. It leads to less competition, due to schlep blindness. It’s a great way to hone your ambition and discipline – over time, working on hard things feels less hard. Consider that. If you have a great team, less competition, but more ambition and discipline, then you’re set up to do well. These days are well suited to attempting hard things. Our tools are improving so fast that a project which seemed straightforward last year might be trivial next year. Better to dial up the ambition a bit. Of course, there are a few pitfalls to trying hard things. You’re more likely to burn out, for one – it’s very important to sleep, exercise, and manage your own energy when your work is kicking your ass.
And it can sometimes be difficult to tell when the “hard and purposeful” parts end, and when the “overcomplicating things” or “naive folly” begins. I highly recommend having a co-founder that finds hard and purposeful problems motivating, yet takes a dim view of overcomplication. Doing hard things is best not attempted alone. But, all in all, it’s a good default. We can do hard things.

0 views
daniel.haxx.se 2 weeks ago

Approaching zero bugs?

In this era of powerful tools to find software bugs, we now see tools find a lot of problems at a high speed. This causes problems for developers, as dealing with the growing list of issues is hard. It may take a longer time to address the problems than to find them – not to mention to put them into releases and then it takes yet another extended time until users out in the wild actually get that updated version into their hands. In order to find many bugs fast, they have to already exist in source code. These new tools don’t add or create the problems. They just find them, filter them out and bring them to the surface for exposure. A better filter in the pool filters out more rubbish. The more bugs we fix, the fewer bugs remain in the code. Assuming the developers manage to fix problems at a decent enough pace. For every bugfix we merge, there is a risk that the change itself introduces one or more new separate problems. We also tend to keep adding features and changing behavior as we want to improve our products, and when doing so we occasionally slip up and introduce new problems as well. Source code analysis tools are a concept as old as source code itself. There have always been tools that tried to identify coding mistakes. They just recently got better, so they can find more mistakes. These new tools, similar to the old ones, don’t find all the problems. Even these new modern tools sometimes suggest fixes to the problems they find that are incomplete and in fact sometimes downright buggy. Undoubtedly code analyzer tooling will improve further. The tools of tomorrow will find even more bugs, some of which were not found when the current generation of tools scanned the code yesterday. Of course, we now also introduce these tools in CI and general development pipelines, which should make us land better code with fewer mistakes going forward. Ideally.
If we assume that we fix bugs faster than we introduce new ones and we assume that the AI tools can improve further, the question is then more how much more they can improve and for how long that improvement can go on. Will the tools find 10% more bugs? 100%? 1000%? Is the tool improvement going to continue gradually for the next two, ten or fifty years? Can they actually find all bugs? Can we reach the utopia where we have no bugs left in a given software project and when we do merge a new one, it gets detected and fixed almost instantly? If we assume that there is at least a theoretical chance to reach that point, how would we know when we reach it? Or even just if we are getting closer? I propose that one way to measure if we are getting closer to zero bugs is to check the age of reported and fixed bugs. If the tools are this good, we should soon only be fixing bugs we introduced very recently. In the curl project we don’t keep track of the age of regular bugs, but we do for vulnerabilities. The worst kind of bugs. If the tools can find almost all problems, they should soon only be finding very recently added vulnerabilities too. The age of new finds should plummet and go towards zero. If newly reported vulnerabilities are getting younger, it should make the average and median age of the total collection go down over time. [Figure: Accumulated vulnerability age when reported. The average and median time vulnerabilities had existed in the curl source code by the time they were found and reported to the project.] Bugfixes. When the tools have found most problems there should be fewer bugs left to fix. The bugfix rate should go down rapidly – independently of how you count them or how liberal we are in counting exactly what is a bugfix. [Figure: Bugfixes] Given the data from the curl project, there does not seem to be fewer bugfixes done – yet. Maybe the bugfix speed goes up before it goes down? Given the look of these graphs I don’t think we are close to zero bugs yet.
These two curves do not seem to even start to fall yet. Yes, these graphs are based on data from a single project, which makes it super weak to draw statistical conclusions from, but this is all I have to work with. I think that’s mostly an indication of what you believe the tooling can do and how good they can eventually end up becoming. I don’t know. I will keep fixing bugs.
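The age metric proposed above can be computed with very little code. The following is a minimal sketch, not the curl project's actual tooling: the records in `VULNS` are invented for illustration, and the function names are hypothetical. It tracks the running mean and median age of vulnerabilities at the time each one was reported, which is the quantity that should trend toward zero if the tools were catching nearly everything.

```python
from datetime import date
from statistics import mean, median

# Hypothetical records: (date the flaw was introduced, date it was reported).
# These values are invented for illustration; they are not curl data.
VULNS = [
    (date(2010, 3, 1), date(2018, 3, 1)),
    (date(2014, 6, 1), date(2019, 6, 1)),
    (date(2021, 1, 1), date(2022, 1, 1)),
    (date(2023, 2, 1), date(2023, 8, 1)),
]


def age_years(introduced, reported):
    """Age of a vulnerability, in years, at the time it was reported."""
    return (reported - introduced).days / 365.25


def accumulated_age(vulns):
    """Running (report date, mean age, median age) over all reports so far."""
    ordered = sorted(vulns, key=lambda v: v[1])  # oldest report first
    ages, series = [], []
    for introduced, reported in ordered:
        ages.append(age_years(introduced, reported))
        series.append((reported, mean(ages), median(ages)))
    return series


for reported, avg, med in accumulated_age(VULNS):
    print(f"{reported}  mean={avg:.2f}y  median={med:.2f}y")
```

With real data, a flat or rising curve would suggest old bugs are still being found, while a curve that keeps falling would be the signal the post is looking for.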

0 views
David Bushell 2 weeks ago

GitHub is sinking

TL;DR: GitHub used to be cool and now it’s a lame slop graveyard. GitHub is racing towards the mythical zero nines of uptime. Users are starting to notice that GitHub is now a Microsoft product. Eww! Official uptime paints a concerning chart. The missing status page tells a far worse story. Whatever the truth, it’s impossible to miss the delightful experience that is Microsoft GitHub if you use it semi-regularly. Microsoft acquired GitHub and applied their unique brand of enshittification. Amongst their achievements was the spawning of the Copilot circle of hell. Now they’re effectively DDoSing themselves with slop. I won’t dwell on what else went wrong. I don’t know and I don’t care. GitHub is impressively bad now. It’s embarrassing. Shameful. As I write this the obituaries are flooding in (see the links at the end of this post). It’s long past time to get off this sinking ship! GitHub has become synonymous with “source control” and I worry too many users don’t know that Git is not GitHub. The core technology of Git is open source. It’s distributed, meaning that all repositories are equal. Git works without a centralised service. Such a practice is a construct of social convenience. GitHub was a useful add-on. Microsoft has turned GitHub into an expensive liability. Network effects are hard to topple but if anyone can do it, Microsoft can. GitHub’s fake star economy is worthless. GitHub is inundated with bots and drowning in slop and doing everything to encourage it. Microsoft is turning GitHub into the Moltbook of code, it ain’t for you and me anymore. Your CI pipeline is over-engineered and GitHub Actions are an abomination (see: [1] [2]). Finding another solution is an absolute chore but do you trust GitHub to be reliable? Look, the ship is sinking! Sure, the water looks freezing. Don’t hang around and allow Microsoft to pull you under. You don’t need to move everything in one go. Start the process. The nearest lifeboat to escape GitHub is another centralised Git forge.
Just sign up and push your repo to the new upstream. Some services can automate the migration and maybe even import issues. Personally I’d leave issues behind in a tragic boating accident. Codeberg — a non-profit and community-led project with an established track record. This is the safe alternative that’ll stick around. It’s the flagship instance of Forgejo . Tangled — an alpha stage start-up with interesting AT protocol integration. Worth considering for smaller solo projects. Seems cool. Gitea — they offer cloud managed Git hosting. It’s the original open source project that Codeberg/Forgejo forked away from. GitLab — enterprise grade, meaning it’s bloated and confusing but it’ll impress your boss. This could be the choice if you need multiple meetings to make the choice. Bitbucket — trade one soul destroying corpo fun vacuum for another. Strongly discouraged, but Bitbucket does technically fit the anything but GitHub category. If you’re cool like me , you or your organisation can self-host a Git forge with actions and releases . My recommendation is Forgejo . There is talk of federation between Forgejo instances but it’s not happening anytime soon. If you want open collaboration push a copy to Codeberg. Gitea and GitLab also have self-hosted options. Be aware, GitLab is a comparative chonker. When I said “Git is not GitHub” the same applies to other forges. Do you need those add-ons? Nothing is stopping you from raw-doggin’ Git over SSH: How you manage collaboration is another question. If Linux can be maintained by sending patches to an email mailing list, “doesn’t work at scale” arguments are skill issues. But seriously, a centralised Git forge is a decent compromise in my opinion. Maybe they collapse like GitHub in future. Always have an exit plan. Just use anything but GitHub. Thanks for reading! Follow me on Mastodon and Bluesky . Subscribe to my Blog and Notes or Combined feeds. 
Ditching GitHub (Lonami); Ghostty Is Leaving GitHub (Mitchell Hashimoto); Before GitHub (Armin Ronacher); From GitHub to Codeberg/Forgejo (Jonas Hietala)

0 views
iDiallo 2 weeks ago

Don't use localhost:3000, use your own custom domain

After presenting a demo of how an internal tool works, I was flooded with questions. Not about the tool, but about why I had bought a domain just to run the demo. "Why didn't you use the staging server?" they asked. I was confused. I didn't buy a domain. I was running it locally. But instead of the URL being , it was a fully formed domain. . In fact, some people told me that they couldn't access the website on their devices. They thought I had to whitelist their IP to grant them access. To feel young again... Setting up a custom domain locally was common practice when I started web programming. But with the advent of Node.js (and rails?), everyone has resorted to just pointing to with an incrementing port number. The main reason is that the webserver is often bundled into the application itself. It’s easy to just run and call it a day. However, if you have multiple long-term projects running locally, especially if they need to communicate with one another, then managing a mental map of ports like , , and quickly gets tiring. This is where my old school approach shines. By combining the system hosts file with a reverse proxy like Nginx, you can run different projects locally with actual domain names. I usually end up with for active development, for a stable local build, and the actual production URL for the live site. Here is how to set it up. First, we need to tell your computer where to find these domains. Think of as your computer's personal contact list. When you type a URL, your computer looks here first. By adding an entry, you are telling your computer: "Don't bother checking the internet when I ask for myproject.com, I am actually talking about this machine." It creates a manual override that maps a friendly name directly to your machine's IP address. You can edit the file here: Linux/macOS: Windows: Open the file in your editor. 
In this file, right after the block of entries for Adobe (active.adobe.com...), add this line: Now, when you access those domains in your browser, they don't point to the wider internet, but directly to your own machine. Now that the domain is pointed to your own machine, we want to redirect it to the right application. If your app runs on port , navigating to will default to port and fail. This is where Nginx comes in. It listens on port and forwards the traffic to the specific port your app is running on. Here is a simplified Nginx config to make it work: Restart Nginx, and voilà! You have clean, professional URLs for your local environment. If you are running your services inside Windows Subsystem for Linux (WSL2), networking is handled a little differently because the Linux instance has its own virtual IP. You can get your instance's IP address with this command: You would use that IP address in your Windows hosts file instead of . After that demo, some people were disappointed to learn the trick. They thought I was so committed that I had bought a domain name just to give them the raw deal with my demo. Someone mused about a shirt with the words "real men don't use localhost:3000". That could have started a whole new motivational speaking career for me. A custom domain just looks very professional and is practical for separating environments. It just feels cooler than staring at all day. That's how you separate yourself from vibe-coders. Anyway, back to earth. I feel like this is a lost skill and I'm keeping it alive by sharing it. That's how you run a custom URL locally.
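As a rough illustration of the two pieces described above (hosts-file entries plus an Nginx reverse proxy), here is a small sketch that generates both from a domain-to-port map. The subdomains `dev.myproject.com` and `stable.myproject.com`, the ports, and the helper names are assumptions made for this example, not the author's actual setup. The Nginx directives used (`listen`, `server_name`, `location`, `proxy_pass`, `proxy_set_header`) are standard ones.

```python
# Hypothetical mapping of local domain names to the port each app listens on.
SITES = {
    "dev.myproject.com": 3000,
    "stable.myproject.com": 3001,
}

# This machine. Under WSL2 the post suggests using the instance's IP instead.
LOOPBACK = "127.0.0.1"


def hosts_lines(sites, ip=LOOPBACK):
    """Lines to append to the system hosts file, one per domain."""
    return [f"{ip} {domain}" for domain in sites]


def nginx_server_block(domain, port):
    """A minimal reverse-proxy block: listen on port 80, forward to the app."""
    return (
        "server {\n"
        "    listen 80;\n"
        f"    server_name {domain};\n"
        "    location / {\n"
        f"        proxy_pass http://127.0.0.1:{port};\n"
        "        proxy_set_header Host $host;\n"
        "    }\n"
        "}\n"
    )


print("\n".join(hosts_lines(SITES)))
for domain, port in SITES.items():
    print(nginx_server_block(domain, port))
```

The printed hosts lines go into the hosts file and the server blocks into an Nginx site config. Nginx needs a reload afterwards, as the post notes.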

0 views
Ahmad Alfy 3 weeks ago

Stop Hardcoding Your Timeouts

A developer rant about tools built for one kind of internet. Recently, I’ve been losing my mind to hardcoded timeouts. Silent, arbitrary, unconfigurable time limits baked into tools by developers who apparently have never had to wait more than 200ms for anything in their lives. Let me tell you about my week. Now that coding agents are everywhere, everyone is using skills. The popular way to add them is through packages developed by vercel-labs, and the go-to collection is awesome-copilot, a curated set of skills sitting at 30K+ stars at the time of writing. Except I can’t use it. The repository is too big, and the installer just chokes and dies. An issue about this (#278 on the vercel-labs/skills repo) has been open since February, and no one has responded. I’d be happy to send a PR and fix it myself. I just need someone to acknowledge it exists. Is there a configuration option? A flag? An environment variable? No, there is nothing. The workaround I found? Clone the repo manually first, then install from the local copy. It works, mostly. Except now it points to a path on my machine. My colleagues cannot use it. I also have to update my copy every time I want to update my skills. One workaround creates a lot of other problems. Then came Docker Gordon, the AI-powered debugging assistant baked into Docker. Useful concept. I was stepping through a container build issue, the kind that requires iteration: tweak, rebuild, inspect, repeat. I’d never used Gordon, but when the error manifested itself, it came with a suggestion to try Gordon, and so I did. Except Gordon has a hard limit: if your container doesn’t finish building within two minutes, it gives up. The session dies. You start over. A two-minute build might sound like plenty if you’re in a fast environment with warm caches and pulled base images. But if you’re pulling a fresh base image over a slower connection? Debugging a multi-stage build with several heavy layers? Forget it. Gordon has already moved on.
There is no way to configure this. No env var. No flag. Nothing. The tool just assumes that two minutes is forever, and if you need more, that’s your problem. Developers often work on fast machines, in offices or homes with gigabit connections, in cities with world-class infrastructure. They build tools with timeout defaults that reflect their own experience. And then they ship those tools to the whole world, with no knobs to turn. The thing is, timeouts need to exist. Infinite waits are bad. Hanging processes are bad. I’m not arguing against timeouts. I’m arguing against unconfigurable timeouts. Against the implicit message that says: if you can’t do this in 60 seconds, your environment is wrong, not my assumption. A timeout should be: a safe default for the common case; clearly documented so users know it exists; and overridable via a flag, an environment variable, a config file, something. This isn’t hard. It’s respect for your users. I’m writing this from Cairo. My internet is decent, better than many places in the world. But it’s not 1 Gbps symmetric fiber. It’s not co-located next to an npm registry mirror. A clone of a large repo takes time. Pulling a Docker image takes time. These are not failures. They are physics. When your tool dies silently after 60 seconds without any way to change that limit, you haven’t built a tool for the world. You’ve built a tool for your office. And this matters more than most developers acknowledge. The global developer community isn’t located in San Francisco or Amsterdam or London. It’s in Lagos, in Karachi, in Cairo. It’s people on 4G connections, on shared broadband, on connections that have real latency because the nearest CDN edge is 50ms away instead of 5. When you assume a fast connection, you’re not making a neutral technical decision. You’re making a statement about whose experience matters. I don’t think anyone is doing this maliciously. I think it’s a blind spot. Your internet is fast, so a 60-second timeout feels generous. Your machines are powerful, so a 2-minute build window seems like plenty.
But please: before you ship a timeout, ask yourself: What if the user is on a slower connection? What if their repo is larger than mine? What if they’re debugging something slow, and that’s the whole point? And then add a config option. One environment variable. One flag. That’s all it takes to go from “this tool doesn’t work for me” to “this tool works for me.” As Bruce Lawson once said: it’s the World Wide Web, not the Wealthy Western Web. The web and the tools we build on top of it are for everyone. Let’s start acting like it.
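The three properties asked for above (a safe default, documented, overridable) fit in a few lines. This is a minimal sketch for a made-up CLI: the tool name `mytool`, the `--timeout` flag, and the `MYTOOL_TIMEOUT` environment variable are hypothetical names chosen for illustration, and the precedence order (flag, then environment variable, then default) is one reasonable choice rather than a standard.

```python
import argparse
import os

DEFAULT_TIMEOUT_SECONDS = 120.0  # safe default for the common case


def resolve_timeout(cli_value=None, env=None, env_var="MYTOOL_TIMEOUT"):
    """Pick the timeout: CLI flag first, then environment variable, then default.

    MYTOOL_TIMEOUT and --timeout are hypothetical names for this sketch.
    """
    env = os.environ if env is None else env
    raw = cli_value if cli_value is not None else env.get(env_var)
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    timeout = float(raw)
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return timeout


def build_parser():
    """Expose the timeout as a documented flag so users can find it in --help."""
    parser = argparse.ArgumentParser(prog="mytool")
    parser.add_argument(
        "--timeout",
        type=float,
        default=None,
        help="seconds to wait before giving up "
             f"(default: {DEFAULT_TIMEOUT_SECONDS:g}, or $MYTOOL_TIMEOUT if set)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"using a timeout of {resolve_timeout(args.timeout):g}s")
```

Someone on a slow link could then run `mytool --timeout 900`, or export `MYTOOL_TIMEOUT=900` once in their shell profile, without the author having to guess the right number for them.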

0 views
The Coder Cafe 3 weeks ago

Systems Thinking Explained

☕ Welcome to The Coder Cafe! In a previous post , I briefly touched on systems thinking after reading Learning Systems Thinking . My honest take: it was an interesting introduction, but I wasn’t fully convinced. The concepts felt abstract, the examples too sparse. Then I read Thinking in Systems by Donella Meadows. It might be one of the best books I’ve read in my career (and it’s not even a computer science book). This post is my own introduction to the core concepts, grounded in a real example from my experience. Get cozy, grab a coffee, and let’s begin! Introduction Have you ever fixed an incident, only to see it come back two weeks later? Or made a change that improved one metric while quietly degrading another? Or spent months firefighting without ever feeling like things were actually getting better? These aren’t signs of bad engineering. They’re signs of reacting to events without understanding the structures that produce them. Understanding those structures requires a different kind of thinking, and that’s exactly what systems thinking is: the ability to shift from reacting to events through responsive patterns of behaviors to generating improved systemic structures. This post is an introduction to systems thinking, covering the core concepts through a real example from my experience at Google. First, let’s define what a system is. In essence, a system is: A set of elements Interconnected To achieve something Distributed systems are an obvious example. For example, a 3-node, single leader database is composed of: 3 nodes (elements) Connections from the leader to the replicas (interconnections) With the goal of storing data reliably over time Interestingly, this is why distributed systems can surprise even their own designers: add enough nodes, replication lag, and competing writes, and the system starts behaving in ways no single component would predict. 
To reason about how systems change over time, we need two important concepts: stocks and flows. A stock is an accumulation of material or information that has built up in a system over time. For example: the number of machines available in a cluster, the size of a message queue, the amount of technical debt in a codebase. A flow is what changes a stock: material or information entering or leaving it. For example: machines being added or removed from service, messages being enqueued and consumed, or requests being received and processed. The key thing to keep in mind: stocks take time to change because flows take time to flow. You can’t instantly restore machine availability or drain a queue with a single action. This has real consequences for how systems behave under pressure. We will come back to it. One of the most important concepts in systems thinking is the feedback loop. A feedback loop is what the system does automatically because its own result feeds back into it. Said differently: if A causes B, then B influences A. Let’s take a concrete example. Suppose you live in a house with a central thermostat set at 20°C. It turns the heating on when the temperature drops to 19°C, and off when it reaches 21°C. The feedback loop works like this. A: Temperature change. B: Thermostat turns heating on or off. The thermostat turning on or off (B) is caused by the temperature change (A). But the temperature change (A) is in turn influenced by the thermostat (B). Each effect feeds back into its own cause. This is a feedback loop. There are two kinds of feedback loops. A balancing feedback loop resists change: It pushes the system back toward a goal or limit. Think of it as a stabilizer: when something moves away from the target, the loop acts to bring it back. The thermostat is a perfect example. As the temperature drifts away from 20°C, the thermostat reacts, and the system returns to equilibrium. A reinforcing feedback loop amplifies change: More leads to more, less leads to less.
An action produces a result that drives more of the same action, generating growth or decline at an accelerating rate. The YouTube algorithm is a clear illustration: the more a video is viewed, the more the algorithm surfaces it; the more it’s surfaced, the more views it gets. More formally, there are four cases of feedback loops, for two quantities A and B where A affects B and B feeds back into A. Balancing ceiling: A rises, and the resulting change in B pushes A back down. Balancing floor: A falls, and the resulting change in B pushes A back up. Reinforcing growth: A rises, and the resulting change in B pushes A up further. Reinforcing collapse: A falls, and the resulting change in B pushes A down further. The more feedback loops a system contains, the more complex and surprising its behavior becomes, especially when those loops interact. An often overlooked but critical property of feedback loops is the delay between an action and its effects. Delays are pervasive in systems and strong determinants of behavior. When the gap between action and effect is long, two things happen. Foresight becomes essential: Acting only when a problem becomes obvious means missing the window to address it early. Oscillations become likely: We overreact because the system hasn’t had time to respond, then overreact again in the other direction. Think of an autoscaler that takes 3 minutes to provision new instances. By the time the new capacity is ready, the traffic spike has already peaked. The window to act had opened before the problem was even visible on the dashboard. This is why foresight matters: when there is a significant delay between action and effect, reacting to what you see now means always acting too late. And the consequences compound. The autoscaler, still responding to the old signal, overshoots. Then it sees too much capacity and scales down, right before the next spike arrives. One example, two problems: a system that needed foresight got a reaction, and then oscillated because of it. The delay didn’t change the goal. It made the system work against itself. System boundaries are artificial.
They help us frame a problem, but in reality, everything is interconnected. The boundaries we draw determine what we see and, therefore, what we miss. Consider a microservices architecture in which each team owns a service. Every team has solid SLOs, careful on-call rotations, and clean dashboards. And yet end-to-end latency keeps creeping up, and users are complaining. Each team looks at its own service and sees green. The problem is that the boundary is wrong; no one is looking at the system as a whole . This is one of the most common traps in engineering: optimizing within a boundary while the real issue lives outside it. Before changing a system, it is worth asking: Am I looking at the right boundary? When something goes wrong in a system, what do we actually see? Usually just the surface: an incident, a spike, an outage. The iceberg model gives us a way to think beneath it. The model has four levels: Events are what’s visible: the incident alert, the latency spike on the dashboard. This is where most of our attention goes, and where reactive thinking lives. Patterns and trends are what you find when you zoom out. Has this happened before? At what frequency? Under what circumstances? Patterns reveal that what felt like a one-off event is actually part of a larger rhythm. Structure is the underlying system design: the feedback loops, the incentives, the processes that produce the patterns. You can’t fix a pattern without understanding the structure that generates it. Mental models are the beliefs and assumptions that shaped the structure in the first place. They’re the hardest to see and the hardest to change. Credits Most incident response lives at the event level. Systems thinking asks us to go deeper. As an SRE, this model resonates: we’re trained not just to react to incidents but to understand the why: the patterns, the structures, and eventually the assumptions that caused them. 
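Going back to feedback loops and delays for a moment: the effect of a stale signal on a balancing loop is easy to see in a small simulation. The sketch below is a toy model under stated assumptions, not a real autoscaler: the stock is capacity, the flow each step is `gain * (goal - observed)`, and `delay` is how many steps old the observation is. The goal of 100, gain of 0.4, and delay of 3 are arbitrary illustrative values.

```python
def simulate(goal=100.0, gain=0.4, delay=0, steps=40, start=0.0):
    """Balancing loop: each step, add gain * (goal - observed stock).

    `delay` is how many steps old the observation is. All numbers are
    illustrative, not measurements from a real autoscaler.
    """
    history = [start]  # the stock over time (for example, capacity)
    for t in range(steps):
        observed = history[max(0, t - delay)]  # possibly stale view of the stock
        flow = gain * (goal - observed)        # corrective inflow or outflow
        history.append(history[-1] + flow)
    return history


def overshoot(history, goal=100.0):
    """How far above the goal the stock peaked (zero or negative: no overshoot)."""
    return max(history) - goal


for delay in (0, 3):
    run = simulate(delay=delay)
    print(f"delay={delay}: peak={max(run):.1f}, overshoot={overshoot(run):.1f}")
```

With no delay the stock approaches the goal from below and settles. With the same gain and a three-step delay, the loop keeps adding capacity based on an outdated reading, sails past the goal, then corrects in the other direction. The goal is identical in both runs; only the delay differs.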
Let me now bring all of these concepts together through a concrete example from my previous role at Google, where I worked on the systems powering Google’s ML infrastructure. I was heavily focused on a system called the Safe Removal Service 1 (SRS). This service had a simple API and one core responsibility: to say yes or no when another system requested permission to disrupt a given entity . Indeed, most disruptive services at Google, the ones that reboot machines, drain jobs, or take clusters offline, were designed to ask this service before acting. In our context, the key constraint was preserving capacity, meaning ML TPUs and GPUs. For example, within a given cluster, at least 90% of TPUs must remain available at all times. So if 95% were currently available, SRS could approve disruptions, as long as availability didn’t drop below 90%. NOTE : The threshold values and other details have been altered for confidentiality reasons. The API was deliberately simple: “ Can I reboot this machine? ” → Yes/No “ Can I drain this job? ” → Yes/No “ Can I take down this cluster? ” → Yes/No SRS implemented several balancing feedback loops . For example, when available capacity dropped toward 90%, the service would start refusing disruptive requests, pushing availability back up. This was the primary loop: a governor that kept the system in a safe zone. There was also an implicit reinforcing loop on the positive side: by allowing maintenance to proceed when capacity was healthy, the service enabled machines to be upgraded, patched, and kept in good shape, which in turn kept capacity high. So far, so good. But here’s where it gets interesting. The balancing loop protected current capacity. What it didn’t account for was what happened when capacity was already constrained. When available capacity hovered near 90%, SRS would block most maintenance requests. Machines couldn’t be patched. Hardware with known error trends couldn’t be swapped. Security upgrades were deferred. 
Maintenance debt accumulated, silently, invisibly. This created a first hidden reinforcing loop: Less capacity → Deferred maintenance → More failures → Even less capacity The balancing loop was actively feeding the very problem it was trying to prevent. A second reinforcing loop emerged from human behavior: Low capacity → More incidents → Bypass mechanisms invoked → Riskier actions taken → Capacity lower still When the system was under stress, operators would sometimes override SRS to unblock critical work. Each bypass, reasonable in isolation, eroded the safety margins that the balancing loop was designed to protect. There’s a principle from Thinking in Systems that describes this precisely: System behavior is particularly sensitive to the goals of feedback loops . If the goals—the indicators of satisfaction of the rules—are defined inaccurately or incompletely, the system may obediently work to produce a result that is not really intended or wanted. Specify indicators and goals that reflect the real welfare of the system . Be especially careful not to confuse effort with result or you will end up with a system that is producing effort, not result. SRS was measuring the right-looking metric: current capacity. But the current capacity was not the same as the real health . A cluster at 92% availability, accumulating maintenance debt and hardware errors, was far more fragile than a cluster at 91% that was fully patched and stable. The balancing loop couldn’t tell the difference. The deeper fix wasn’t just tuning the threshold. It was making the controller health-aware, not just capacity-aware . Rather than gating only on “ % available right now ,” the system needed to incorporate slow indicators: maintenance backlog growth rate, share of fleet on known-bad firmware versions, hardware error trendlines, override and bypass rates. By the time the reinforcing loops made their effects visible, the stock (cluster health) had already been degrading for weeks. 
The delay between cause and effect made the problem invisible until it was expensive to fix. This example was not about a flawed design. It was about a structure that, taken as a whole, was quietly working against itself. A system is a set of elements interconnected to achieve a goal. Stocks are accumulations that change over time through flows; stocks take time to change. A feedback loop occurs when an effect feeds back into its own cause. Balancing feedback loops resist change and push toward equilibrium; reinforcing feedback loops amplify change. Delays between action and effect can cause oscillations and make problems invisible until too late. System boundaries are artificial; the boundary we draw determines what we see and miss. The iceberg model: events are visible, but patterns, structure, and mental models lie beneath. System goals must reflect real welfare, not just what’s measurable; inaccurate goals lead to unwanted behaviors. A well-designed balancing loop can mask hidden reinforcing dynamics. The most dangerous moment is when a system appears to be working. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. Working on Complex Systems Probabilistic Increment Thinking In Systems Learning Systems Thinking Leverage Points: Places to Intervene in a System ❤️ If you enjoyed this post, please hit the like button. 💬 Have you ever built or maintained a system that looked healthy on the dashboard while something was quietly accumulating underneath? Leave a comment I already mentioned that service in a previous post. You can find more information in this whitepaper: VM Live Migration At Scale . Introduction Have you ever fixed an incident, only to see it come back two weeks later? Or made a change that improved one metric while quietly degrading another? 
Or spent months firefighting without ever feeling like things were actually getting better? These aren’t signs of bad engineering. They’re signs of reacting to events without understanding the structures that produce them. Understanding those structures requires a different kind of thinking, and that’s exactly what systems thinking is: the ability to shift from reacting to events through responsive patterns of behaviors to generating improved systemic structures. This post is an introduction to systems thinking, covering the core concepts through a real example from my experience at Google. What Is a System? First, let’s define what a system is. In essence, a system is: A set of elements Interconnected To achieve something 3 nodes (elements) Connections from the leader to the replicas (interconnections) With the goal of storing data reliably over time A stock is an accumulation of material or information that has built up in a system over time. For example: the number of machines available in a cluster, the size of a message queue, the amount of technical debt in a codebase. A flow is what changes a stock: material or information entering or leaving it. For example: machines being added or removed from service, messages being enqueued and consumed, or requests being received and processed. : Temperature change : Thermostat turns heating on or off A balancing feedback loop resists change : It pushes the system back toward a goal or limit. Think of it as a stabilizer: when something moves away from the target, the loop acts to bring it back. The thermostat is a perfect example. As the temperature drifts away from 20°C, the thermostat reacts, and the system returns to equilibrium. A reinforcing feedback loop amplifies change : More leads to more, less leads to less. An action produces a result that drives more of the same action, generating growth or decline at an accelerating rate. 
The YouTube algorithm is a clear illustration: the more a video is viewed, the more the algorithm surfaces it; the more it’s surfaced, the more views it gets.

Balancing ceiling: If causes , then influences
Balancing floor: If causes , then influences
Reinforcing growth: If causes , then influences
Reinforcing collapse: If causes , then influences

Foresight becomes essential: Acting only when a problem becomes obvious means missing the window to address it early.
Oscillations become likely: We overreact because the system hasn’t had time to respond, then overreact again in the other direction.

Events are what’s visible: the incident alert, the latency spike on the dashboard. This is where most of our attention goes, and where reactive thinking lives.
Patterns and trends are what you find when you zoom out. Has this happened before? At what frequency? Under what circumstances? Patterns reveal that what felt like a one-off event is actually part of a larger rhythm.
Structure is the underlying system design: the feedback loops, the incentives, the processes that produce the patterns. You can’t fix a pattern without understanding the structure that generates it.
Mental models are the beliefs and assumptions that shaped the structure in the first place. They’re the hardest to see and the hardest to change.

Most incident response lives at the event level. Systems thinking asks us to go deeper. As an SRE, this model resonates: we’re trained not just to react to incidents but to understand the why: the patterns, the structures, and eventually the assumptions that caused them.

A Concrete Example: Safe Removal Service

Let me now bring all of these concepts together through a concrete example from my previous role at Google, where I worked on the systems powering Google’s ML infrastructure. I was heavily focused on a system called the Safe Removal Service¹ (SRS).
This service had a simple API and one core responsibility: to say yes or no when another system requested permission to disrupt a given entity. Indeed, most disruptive services at Google, the ones that reboot machines, drain jobs, or take clusters offline, were designed to ask this service before acting. In our context, the key constraint was preserving capacity, meaning ML TPUs and GPUs. For example, within a given cluster, at least 90% of TPUs must remain available at all times. So if 95% were currently available, SRS could approve disruptions, as long as availability didn’t drop below 90%.

NOTE: The threshold values and other details have been altered for confidentiality reasons.

The API was deliberately simple:
“Can I reboot this machine?” → Yes/No
“Can I drain this job?” → Yes/No
“Can I take down this cluster?” → Yes/No
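The capacity rule described for SRS can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual service: `ClusterCapacity`, `can_disrupt`, and `min_percent` are hypothetical names invented for the example, and the 90% threshold is the (altered) value quoted in the post.

```python
from dataclasses import dataclass


@dataclass
class ClusterCapacity:
    """Hypothetical snapshot of a cluster's TPU capacity."""
    total: int      # all TPUs in the cluster
    available: int  # TPUs currently available


def can_disrupt(cluster: ClusterCapacity, units: int, min_percent: int = 90) -> bool:
    """Answer yes/no: would disrupting `units` TPUs keep availability
    at or above `min_percent` of the cluster total?"""
    if units < 0 or units > cluster.available:
        return False
    remaining = cluster.available - units
    # Compare with integers only, to avoid floating-point rounding at the boundary.
    return remaining * 100 >= min_percent * cluster.total


if __name__ == "__main__":
    cluster = ClusterCapacity(total=100, available=95)
    print(can_disrupt(cluster, 5))  # 90 TPUs would remain
    print(can_disrupt(cluster, 6))  # 89 TPUs would remain
```

With 100 TPUs and 95 available, a request to disrupt 5 is approved because 90 remain, while a request for 6 is denied because availability would drop to 89%.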

blog.philz.dev 3 weeks ago

Ralph and Lisa

We know the Ralph Wiggum loop: The Ralph loop is about context management; doing things one "turn" at a time can be more effective and cheaper than doing many turns (see "Expensively Quadratic"). Let me introduce you to the Lisa loop:

Write a script that does this or that. If it fails, be sure to exit non-zero, and another agent will fix it.

Script failed (with last output ). It was intending to do what was described in . Fix it in place.

When your agent loop finishes, it will execute again, and the agent will be re-invoked if there are errors. Use to leave notes for future iterations.

Lisa is self-sufficient. Lisa is self-healing. Lisa is smart.
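A rough sketch of the Lisa loop in Python, assuming the generated script is an ordinary command and the fixing agent is reachable through some callable. The names `lisa_loop`, `fix`, and `max_attempts` are hypothetical, and how the agent is actually invoked is deliberately left out.

```python
import subprocess
from typing import Callable, Sequence


def lisa_loop(
    script_cmd: Sequence[str],
    fix: Callable[[str], None],
    max_attempts: int = 5,
) -> bool:
    """Run the script; on a non-zero exit, hand its last output to `fix`
    (a stand-in for the repairing agent) and run the script again."""
    for _ in range(max_attempts):
        result = subprocess.run(script_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Pass only the tail of the output so the fixer's context stays small.
        last_output = (result.stdout + result.stderr)[-2000:]
        fix(last_output)
    return False
```

The attempt cap is a design choice of this sketch: it keeps a script that can never be repaired from looping forever.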

Martin Fowler 3 weeks ago

Fragments: April 21

Last week Thoughtworks released the 34th volume of our Technology Radar. This radar is our biannual survey of our experience of the technology scene, highlighting tools, techniques, platforms, and languages that we’ve used or that have otherwise caught our eye. This edition contains 118 blips, each briefly describing our impressions of one of these elements. As we would expect, the radar is dominated by AI-oriented topics. Part of this is revisiting familiar ground with LLM-assisted eyes:

An interesting consequence of AI in software development is that it’s not only forcing us to look to the future; it’s also pushing us to revisit the foundations of our craft. While assembling this edition, we found ourselves returning to many established techniques, from pair programming to zero trust architecture, and from mutation testing to DORA metrics. We also revisited core principles of software craftsmanship, such as clean code, deliberate design, testability and accessibility as a first-class concern. This is not nostalgia, but a necessary counterweight to the speed at which AI tools can generate complexity.

We also observed a resurgence of the command line: After years of abstracting it away in the name of usability, agentic tools are bringing developers back to the terminal as a primary interface.

I was especially happy to see my colleague Jim Gumbley added to the writing team; he’s been a regular source of security information for me over the years, including working on this site’s Threat Modeling Guide. Having a strong security presence on the radar team is especially important given the serious security concerns around using LLMs. One of the themes of the radar is securing “permission hungry” agents:

“Permission hungry” describes the bind at the heart of the current agent moment: the agents worth building are the ones that need access to everything. OpenClaw and Claude Cowork supervise real work tasks; Gas Town coordinates agent swarms across entire codebases.
These agents require broad access to private data, external communication and real systems — each arguing that the payoff justifies it. However, like a skier who’s just learned to turn and confidently points themselves at the hardest black run, the safeguards haven’t caught up with that ambition. The appetite for access collides with unsolved problems. Prompt injection means models still can’t reliably distinguish trusted instructions from untrusted input.

Given all of this, many of this radar’s blips are about Harness Engineering; indeed, the radar meeting was a major source of ideas for Birgitta’s excellent article on the subject. The radar includes several blips suggesting the guides and sensors necessary for a well-fitting harness. I expect that when the next radar appears in six months’ time, that list will increase.

❄                ❄                ❄                ❄                ❄

Mike Mason looks at what happens when developers aren’t reading the code.

The Python codebase Claude produced was largely working. Unit tests passed, and a few hours of real-world testing showed it was successfully managing a fairly complex piece of my infrastructure. But somewhere around 100KB of total code I noticed something: the main file had grown to about 50KB (2,000 lines) and Claude Code, when it needed to make edits, had started reaching for sed to find and modify code within that file. When I saw that, it was a serious alarm bell.

As well as the experience of “a friend”, he ponders the 500,000 lines of Claude Code after the leak.

Both things are true: there is good architecture in Claude Code, and there is also an incomprehensible mess. That’s actually the point. You don’t get to know which is which without reading the code.

His conclusion is a rough framework. Throw-away analysis scripts are fine to vibe away.
Tooling you need to maintain, and durable code, need regular human review, even if it’s just a human asking a model to evaluate the code with some hints as to what good code looks like.

The moment you say “I’m getting uncomfortable with how big this is getting, can we do something better?” it does the right thing: sensible decomposition, new classes, sometimes even unit tests for the new thing. It knew, it just didn’t volunteer it.

He does recommend being serious with , I don’t know if he’s tried many of the patterns that Rahul Garg has recently posted to break the similar frustration loop that he saw.

❄                ❄                ❄                ❄                ❄

Dan Davies poses an annoying philosophy thought experiment for us to consider how we feel about LLMs indulging in ghost writing.

❄                ❄                ❄                ❄                ❄

DOGE dismantled many useful things during their brief period with the wood chipper. One of these was Direct File, a government program that supported people filing their taxes online. Don Moynihan, who has talked to many folks involved in Direct File, has penned a worthwhile essay that isn’t just relevant to Direct File and other U.S. government technology projects, but indeed any technology initiative in a large organization. Moynihan highlights:

a paradox of government reform: the simpler a potential change appears, the more likely that it has not been implemented because it features deceptive complexity that others have tried and failed to resolve.

I’ve heard that tale in many a large corporation too.

One way government initiatives are different is that, at their best, they’re built on an attitude of public service.

Many who worked on Direct File drew a sharp contrast with DOGE and their approach to building tech products.
One point of distinction was DOGE’s seeming disinterest in public interest goals and of the public itself: “if you do not think government has a responsibility to serve people, I think it draws into question how good are you going to be at making government work better for people if you just don’t believe in that underlying principle”

The tragedy for U.S. taxpayers like me is that we’ve lost an effective way to go through the annual hassle of taxes. In addition, the IRS is much weaker: it’s lost 25% of its staff and its budget is 40% below what it was in 2010. Much though we hate tax collectors, this isn’t a good thing. An efficient tax system is an important part of national security; many historians consider the ability to raise taxes effectively to have been an important reason why Britain won its century-long struggle with France in the eighteenth century. A wonky tax system is also a major reason why the French monarchy, so powerful at the start of that century, fell to revolution. Indeed, there is considerable evidence that increasing the budget of the IRS would more than pay for itself by increasing revenue.

Nelson Figueroa 3 weeks ago

How to Install a Specific Version of a Homebrew Package with brew extract

I previously wrote about how to install older versions of Homebrew packages. That method involves installing a package from a Ruby file, but it’s outdated and doesn’t always work. There’s a better way with brew extract, although it still comes with caveats. I’ll be using Hugo as an example. Let’s say I wanted to install v0.145.0 because v0.146.0 introduced breaking changes that broke my theme. To install hugo v0.145.0: Note that this process will point your command to the older version, but you can switch between versions with . It will enable developer mode. This is normal and safe. Next, run . At the time of writing, it’s a 1.3GB download. This is necessary to get this working because Homebrew no longer keeps homebrew-core cloned locally. The command needs the full git history to search for older versions. Now we can use . This command will find a commit where the formula was at the version we want and copy that locally as . In this case we want Hugo v0.145.0, so we run : This isn’t needed for every formula and is something I ran into specifically with Hugo. Without this patch, you’ll run into errors. After running , edit the file: . Change this line: We need to patch this file because it prevents the error: It’s a mismatch between the path Homebrew expects ( ) vs the path that is created when using on Hugo ( ). Now that Hugo is extracted and patched, we can install with : Hugo v0.145.0 is now installed. There’s a warning with long output in the previous example due to the normal Hugo package being already installed, but that is expected. Homebrew is now pointing the binary to v0.145.0 instead of the latest version (v0.160.1 at the time of writing). We can verify with : We can also see that Hugo v0.145.0 is installed along with the latest version with : Currently the command is pointing to v0.145.0.
To have it point back to the regular version, run : And if we want to point back to the old version, run

At first I expected to work right off the bat, but running both and is necessary to switch between versions properly. This is because Homebrew tracks linked formulas and actual symlinks on disk separately. To help Homebrew track things properly we need to run both to clean the records, then to write the new symlinks. There’s no need to use to prevent the older version of Hugo from updating. Since this is a local copy, there is no remote repository that would be updated that would in turn update our local version. You can even try running to see the warning message: If you no longer need Hugo v0.145.0 you can run : If you don’t have any other packages you extracted with , you can also remove your local tap with

Finally, if you don’t plan on using again in the future, you can remove the local clone of homebrew-core with . This will clean up the 1.3GB of files that were downloaded: Then re-link to the latest version with :

Create a local tap with
Tap homebrew/core which is a 1.3GB clone at the time of writing
Extract the formula with
Patch the formula. This isn’t needed for every formula.
Install as usual

https://docs.brew.sh/Manpage
https://github.com/orgs/Homebrew/discussions/2941
https://emmer.dev/blog/installing-old-homebrew-formula-versions/

Krebs on Security 1 month ago

Patch Tuesday, April 2026 Edition

Microsoft today pushed software updates to fix a staggering 167 security vulnerabilities in its Windows operating systems and related software, including a SharePoint Server zero-day and a publicly disclosed weakness in Windows Defender dubbed “BlueHammer.” Separately, Google Chrome fixed its fourth zero-day of 2026, and an emergency update for Adobe Reader nixes an actively exploited flaw that can lead to remote code execution.

Redmond warns that attackers are already targeting CVE-2026-32201, a vulnerability in Microsoft SharePoint Server that allows attackers to spoof trusted content or interfaces over a network. Mike Walters, president and co-founder of Action1, said CVE-2026-32201 can be used to deceive employees, partners, or customers by presenting falsified information within trusted SharePoint environments. “This CVE can enable phishing attacks, unauthorized data manipulation, or social engineering campaigns that lead to further compromise,” Walters said. “The presence of active exploitation significantly increases organizational risk.”

Microsoft also addressed BlueHammer (CVE-2026-33825), a privilege escalation bug in Windows Defender. According to BleepingComputer, the researcher who discovered the flaw published exploit code for it after notifying Microsoft and growing exasperated with their response. Will Dormann, senior principal vulnerability analyst at Tharros, says he confirmed that the public BlueHammer exploit code no longer works after installing today’s patches.

Satnam Narang, senior staff research engineer at Tenable, said April marks the second-biggest Patch Tuesday ever for Microsoft. Narang also said there are indications that a zero-day flaw Adobe patched in an emergency update on April 11 — CVE-2026-34621 — has seen active exploitation since at least November 2025.
Adam Barnett, lead software engineer at Rapid7, called the patch total from Microsoft today “a new record in that category” because it includes nearly 60 browser vulnerabilities. Barnett said it might be tempting to imagine that this sudden spike was tied to the buzz around the announcement a week ago today of Project Glasswing — a much-hyped but still unreleased new AI capability from Anthropic that is reportedly quite good at finding bugs in a vast array of software. But he notes that Microsoft Edge is based on the Chromium engine, and the Chromium maintainers acknowledge a wide range of researchers for the vulnerabilities which Microsoft republished last Friday. “A safe conclusion is that this increase in volume is driven by ever-expanding AI capabilities,” Barnett said. “We should expect to see further increases in vulnerability reporting volume as the impact of AI models extend further, both in terms of capability and availability.”

Finally, no matter what browser you use to surf the web, it’s important to completely close out and restart the browser periodically. This is really easy to put off (especially if you have a bajillion tabs open at any time) but it’s the only way to ensure that any available updates get installed. For example, a Google Chrome update released earlier this month fixed 21 security holes, including the high-severity zero-day flaw CVE-2026-5281.

For a clickable, per-patch breakdown, check out the SANS Internet Storm Center Patch Tuesday roundup. Running into problems applying any of these updates? Leave a note about it in the comments below and there’s a decent chance someone here will pipe in with a solution.
