Posts in Devops (20 found)
Martin Fowler 3 days ago

The VibeSec Reckoning

Vibe coding has significantly accelerated software prototyping, but AI agents frequently recommend insecure configurations, creating security problems. Gautam Koul, Lucian Moss, Neil Drew-Lopez, and Daberechi Ruth Edeokoh share their experience of building applications for Thoughtworks's global marketing. They learned that, to combat this, we need to write a security context file to guide the AI, be cautious with AI permission requests, create a daily security intelligence feed, and provide builders with a secure-by-default harness and templates.

The Coder Cafe 3 days ago

Metastable Failures in Distributed Systems

☕ Welcome to The Coder Cafe! Today, we explore one of the nastiest failure patterns in distributed systems: metastable failures. Based on the Metastable Failures in Distributed Systems whitepaper, we break down why these failures happen, why they persist, what we can do about them, and why our instinct to fix them is probably wrong. Get cozy, grab a coffee, and let’s begin!

Stable, Vulnerable, Metastable

Metastable failures borrow their name from physics, where metastable means something that looks stable but isn’t. To understand how a distributed system can end up in such a state, we need to look at three distinct states it can be in:

Stable: The system recovers on its own after any disruption. This is what we call resilience in Resilient, Fault-tolerant, Robust, or Reliable.

Vulnerable: The system looks perfectly healthy, but it's operating above its hidden capacity: the load level below which it can self-heal from any disruption. It responds fast, metrics are green, and nothing is alarming. Many production systems deliberately operate here because it's more efficient: resources are used closer to their limit. But there's no slack left. And the deeper the system operates in a vulnerable state, the smaller the trigger needed to push it over the edge. Indeed, a system just above its hidden capacity can survive large disruptions; a system near its advertised capacity can be tipped by almost anything.

Metastable failure: A trigger (e.g., a network blip, a deployment, a traffic spike) pushes the system over its hidden capacity. The system is not fully broken: processes are alive, and it’s still running. But goodput collapses: it’s no longer doing any useful work. Technically up, effectively down. And unlike a regular outage, removing the trigger doesn’t fix it. Getting out requires a strong corrective push: a restart, a dramatic load reduction, a manual intervention.
NOTE: If you’re not familiar with the concept of goodput, it’s the throughput of useful work completed successfully. For example, in a web application receiving 1000 requests per second but returning errors for 800 of them, the goodput is only 200 RPS.

Figure: The three states of a metastable failure. A system can drift into the vulnerable state unnoticed, and a single trigger is enough to push it into the metastable state it cannot escape on its own.

The most disorienting property of a metastable failure: stopping the trigger doesn’t stop the failure. To understand why, we need to talk about feedback loops. In a previous post on Systems Thinking Explained, we defined a feedback loop as: If A causes B, then B influences A.

A feedback loop is exactly the mechanism that keeps a system stuck in the metastable state. There is always a sustaining effect, a feedback loop, that prevents recovery. The trigger is just what pushes the system over the edge. The loop is what keeps it there. Blaming the trigger is the natural instinct, and almost always the wrong diagnosis.

Let’s discuss a concrete example to make this clear. Imagine a web application that queries a database. The database comfortably handles up to 300 QPS. The application retries any query that doesn’t respond within 1 second. The system is running at 280 QPS, healthy and fast, within the database’s capacity.

Then, a transient network issue occurs for 10 seconds. When the issue is over, all the queued requests flood in at once. The database gets hit with a surge it can’t absorb: latency spikes and queries start timing out. So the application retries them. This doubles the effective load to 560 QPS. The database, already struggling, falls further behind. More timeouts. More retries. The loop is now self-sustaining:

High load → Timeouts → Retries → Higher load → More timeouts → More retries

The transient network issue was fixed minutes ago. Yet, the system is still completely broken.
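The retry loop in this example can be sketched as a toy simulation. This is only an illustration under stated assumptions, none of which come from the post or the paper: time advances in one-second steps, the database serves its queue in FIFO order, each request is retried at most once, timed-out copies stay queued as wasted work, and the database serves nothing during the blip. The function name `simulate` and its parameters are hypothetical.

```python
from collections import deque


def simulate(base_qps, capacity=300, blip=(10, 20), duration=400):
    """Return per-second goodput for a toy retry-on-timeout model.

    Assumptions (not from the post): 1-second time steps, FIFO service,
    at most one retry per request after a 1-second timeout, and timed-out
    copies remain queued, so serving them is wasted work.
    """
    queue = deque()  # batches of [remaining_count, kind, sent_at]
    goodput = []
    for t in range(duration):
        # Originals sent one second ago and still unserved have timed out,
        # so the application sends one retry for each of them.
        retries = sum(b[0] for b in queue if b[1] == "orig" and b[2] == t - 1)
        queue.append([base_qps, "orig", t])
        if retries:
            queue.append([retries, "retry", t])

        # During the network blip the database serves nothing.
        budget = 0 if blip[0] <= t < blip[1] else capacity
        useful = 0
        while budget > 0 and queue:
            batch = queue[0]
            served = min(budget, batch[0])
            batch[0] -= served
            budget -= served
            if batch[2] == t:  # answered before its timeout: useful work
                useful += served
            if batch[0] == 0:
                queue.popleft()
        goodput.append(useful)
    return goodput


if __name__ == "__main__":
    for load in (280, 140):
        result = simulate(load)
        state = "recovered" if result[-1] == load else "stuck"
        print(load, "QPS ->", state)
```

In this model, new work plus retries at 280 QPS arrives faster than the 300 QPS capacity can drain it, so the backlog never clears and goodput stays at zero long after the blip ends. At 140 QPS the same blip eventually clears on its own, because even the doubled load fits within capacity.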
The trigger is gone; the feedback loop is not. The only way out is to dramatically cut the load or disable retries entirely.

This is a metastable failure. The system was vulnerable because it was operating close to its hidden capacity. A minor, transient trigger pushed it over the edge and into a self-sustaining failure state it couldn’t escape on its own. The retry mechanism, a feature designed to improve reliability, became the very thing that prevented recovery. This is one example, but the same pattern appears with caches, connection pools, failover logic, and more. The shape is always the same: a feedback loop that turns a temporary problem into a permanent one.

Two things make metastable failures particularly nasty.

We can be tempted to blame the wrong thing. When an outage happens, the trigger is what’s visible and recent: a spike, a deployment, a hardware fault. It’s the obvious culprit. But the trigger only exposed the problem; it didn’t create it. The sustaining feedback loop was already there, structural and invisible. When analyzing the problem in retrospect, teams focus on the trigger; fixes address the trigger; and the system remains vulnerable to the next one. The authors of the paper observed teams declare a metastable failure “resolved” multiple times before realizing the real cause had never been touched.

The feedback loop grows stronger with scale. Small-scale tests won’t reveal it. A staging environment running at 10% capacity may handle the same trigger without falling into a metastable state, because the loop isn’t strong enough at that scale to be self-sustaining. This means these failures can slip past even rigorous testing regimes and only manifest in production at full load.

We defined hidden capacity earlier as the load level below which the system can self-heal from any disruption. It’s different from, and always lower than, the advertised capacity.
In our example, the numbers make it concrete: the advertised capacity is 300 QPS, but the hidden capacity is only 150 QPS, because retries double the load under failure. The gap between those two numbers is where vulnerability lives.

Measuring the hidden capacity is not straightforward, though. One possible approach is to apply a trigger at a given load level and observe whether the system recovers on its own: If it does, we are below the hidden capacity. If it doesn’t, we are above it. We can also estimate it indirectly: in the retry example, retries double the load under failure, so the hidden capacity is roughly half the advertised capacity.

Metastable failures are not bugs. We can’t write a unit test that catches them. They are emergent behaviors: properties that arise from the interaction of a system’s components under specific conditions, not logic errors in any individual component. No single piece of code is buggy, no single configuration is wrong. The failure is a consequence of how everything fits together under load.

This changes how we need to think about them. The right question after an outage is not “What failed?” but “What loop sustained it?” And before an outage, the danger is not having bugs; it’s optimizing so aggressively for efficiency that we push the system deeper into the vulnerable state without realizing it.

Retries, caches, failover logic, connection pools: these are all features that improve reliability in the common case. They are also, under the right conditions, the sustaining mechanisms of metastable failures. The same design decision that makes a system more resilient in normal operation can also prevent it from recovering when things go wrong.

The paper describes several approaches to reduce the risk of metastable failures:

Retry budgets and circuit breakers: Instead of retrying indefinitely, cap the total number of retries in flight at any given time. This directly weakens the feedback loop by limiting work amplification.
LIFO scheduling under overload: Counterintuitively, switching from FIFO to LIFO when the system is overloaded allows some requests to complete within their deadline, preserving goodput instead of letting every request time out. NOTE: I already wrote a post about that approach in Adaptive LIFO.

Fast error paths: Success paths are heavily optimized, but error paths often aren’t. An expensive error path (stack traces, DNS lookups, disk writes) under high failure rates can itself become a sustaining mechanism. Optimizing error paths reduces this risk.

Read-through caches over look-aside caches: A read-through cache (where the cache itself fetches missing data from the database) can continue filling itself even when the application has given up on a request, steadily increasing the hit rate and helping the system recover. A look-aside cache (where the application is responsible for populating the cache) can’t.

Production stress testing: Small-scale tests won’t reveal metastable failures. Testing against a portion of production traffic, with engineers ready to intervene, is the most reliable way to surface them.

A note of humility from the paper: there is no systematic solution yet. These are ad-hoc mitigations developed in response to known failures. Detecting vulnerable states before they collapse remains an open problem.

A distributed system can pass through three states: stable, vulnerable, and metastable. The vulnerable state looks healthy, but it isn’t.
The threshold between stable and vulnerable is invisible. Systems can operate in the vulnerable state for months without any sign of trouble.
When a trigger pushes a vulnerable system into a metastable failure, a feedback loop sustains the failure even after the trigger is gone.
The trigger is not the root cause. The feedback loop is. Fixing the trigger leaves the system vulnerable to the next one.
Reliability features like retries and caches can become the sustaining mechanism of a metastable failure under the right conditions.
Metastable failures are emergent behaviors, not bugs. We can’t unit test for them, and optimizing for efficiency makes them more likely.
Mitigations exist (retry budgets, circuit breakers, LIFO scheduling, fast error paths), but they are all ad-hoc responses to known failures. Detecting vulnerable states before they collapse remains an open problem.

Resilient, Fault-tolerant, Robust, or Reliable?
Adaptive LIFO
Fail Open vs. Fail Closed
Metastable Failures in Distributed Systems
Metastability and Distributed Systems
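
As a coda to the retry-budget mitigation listed in the post, here is a minimal sketch of a cap on retries in flight. The class name `RetryBudget` and its methods are hypothetical and not taken from the paper or any library, and the sketch is single-threaded; a real client would need locking or an atomic counter.

```python
class RetryBudget:
    """Caps how many retries may be in flight at once (hypothetical helper)."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Reserve a retry slot; return False when the budget is exhausted."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Give the slot back once the retry has completed or failed."""
        self.in_flight = max(0, self.in_flight - 1)


if __name__ == "__main__":
    budget = RetryBudget(max_in_flight=2)
    # The third retry is refused, so the caller fails fast instead of
    # adding more load to an already struggling dependency.
    print([budget.try_acquire() for _ in range(3)])
```

Refusing a retry when the budget is spent bounds the extra load that failures can generate, which is what weakens the feedback loop.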

Andre Garzia 3 days ago

We need to own our computing experience

Originally, when I talked about owning our own platform in this blog, I meant owning the stack that powers and serves the blog: moving to your own VPS, servers, or static pages so that you don't depend on some *Blog As A Service* company such as WordPress.com. Eventually, I [started talking about owning the workflow that empowered your blog experience](https://andregarzia.com/2026/02/building-your-own-blogging-tools-is-a-fun-journey.html), not only your posting experience but your reading experience. To that effect, I showed how I created my own blog reader and integrated it into Firefox, and also my own blog editor.

Recently, I have come to think that we need to move further into owning more and more of our computing experience. The avalanche of LLM/AI-based slop solutions being force-fed into our lives is radicalising me towards a very specific path in which owning my own platform now needs to mean controlling my own computing experience.

I have been an Apple user for a very long time and have [spoken previously about my recent desire to leave the platform](https://andregarzia.com/2026/03/apple-just-lost-me.html) because of a recent decrease in the quality of macOS, a change in Apple's priorities regarding independent developers in their ecosystem, and a general feeling that I must move away from big tech. In that post, I outlined my desire to move to an [MNT Pocket Reform](https://mntre.com), a [Fairphone Gen 6 with potentially Murena /e/OS](https://fairphone.com) and maybe a NAS. I already purchased the Pocket Reform and am waiting for assembly and shipment, but I changed my approach for the next two items in that list.

Instead of buying a NAS, I decided first to experiment with self-hosting and homelabbing by converting an old x86 MacBook Pro into a server using [Yunohost](https://yunohost.org). That server is going surprisingly well for me and I am moving more and more of my computing to inside the house.
I will eventually get a proper NAS or build one, but at the moment that server is all I need. I am even hosting my [fediverse account](https://social.soapdog.org/@soapdog) on it using [GoToSocial](https://gotosocial.org).

I reckon that I will spend close to 500 pounds to get the Fairphone with /e/OS. I don't have that budget right now and am afraid of doing it blind, because I have been checking the forums and it seems like WhatsApp stopped working in the last update and not all features of the Halifax UK bank app are working. I don't want a switch to a deGoogled OS to prevent me from talking to my friends or using my bank. I know that sucks, but those are not easily solvable problems. As with my original plan for the NAS, I think I might be able to test the waters of /e/OS by buying an old second-hand smartphone, installing it, and seeing for myself how well it works. That will cost me much less, and then if I like it enough, I can make the move to a Fairphone. So now the issue is figuring out what phone to buy on a budget of 150 pounds or less.

Moving back to Linux on open hardware and to Android but deGoogled is my slow journey towards computing autonomy. Google was never worthy of trust, but the recent move to prevent side-loading on Android and stop showing links on their search result page, becoming a de facto slop-as-a-service engine, is something I can't really abide. Apple's hypermaniacal need to control the experience of their users and milk both developers and users as much as possible reached a tipping point for me. My MacBook Air doesn't feel like mine, since the frictions keep piling up when I try to run software that does not come from the App Store. I'm done with that.

What is left then? We need to return to a human-focused FOSS community. Not the fast-turnaround LLM/AI commits into every single repo because whoever is sponsoring this project needs it to move FAST. The best thing about the free and open source community has never been the code, but the ethos.
Made by humans, to be understandable by humans, to be modifiable by humans. This crazy trend towards LLM-assisted coding is removing the understandable part. Lots of commits are being generated by machines and reviewed by machines without a single person actually having read the whole thing. That will erode skills and also lead to code that is impossible to maintain, because no one has ever fully understood it.

Hence why I am starting to also build my own tools. There are of course tools I depend on that are too large for me to build from scratch (goddess forbid trying to build a web browser); in those cases it is okay to use a FOSS solution like Firefox. But for things that are dear to me, like blogging, well, I can build my own tools. Or epub manipulation tools, or small decentralisation apps. The more I build, the more I can be sure I can maintain it in the long run.

I don't want a Web where all we do as creators is feed training models so that gigantic greedy corporations can get it all wrong and regurgitate shit to users. FAANG erected a wall inside the internet and creators are now on the outside. Fighting back is not done by creating local models or ethical AI companies; fighting back is done by walking away and playing a different game.

We can't win over Google and Apple at their own game. It is rigged. But we can play a different game in which they don't matter. For me that game is building offline-first, local-first, decentralised tools and apps for my friends and whoever else can benefit from them. Create for those around you, for those that matter. Forget web scale, think in terms of a village.

Get back to Linux, deGoogle yourself if you're able to. Create FOSS and also use the tools you create. Use repairable tech if you can afford it, and make sure to step out of this consumption and slop cycle the digital world has become.

James Stanley 1 week ago

How to publish your secrets on Docker Hub

This week I have been looking inside public Docker images, with the aim of finding API keys etc. inside, and then reporting them and claiming bug bounties. It has been a partial success, in the sense that I found loads of private credentials inside public Docker images, and a partial failure, in the sense that I have not (yet?) received any bug bounties.

There is an article on this kind of thing from flare.io in December. Feroz pointed out that all of the low-hanging fruit will have been picked already, and the remaining intersection between companies that leak secrets on Docker Hub, and companies that pay bug bounties, will be approximately 0.

To do this work I built a tool to automatically pull down the latest pushed images on Docker Hub and grep them for secrets. I'm not releasing this because of the obvious potential for abuse. But I have released a public Docker Explorer tool for looking inside images manually. It's kind of surprising that Docker Hub doesn't have this kind of thing built-in. (Btw, pulling down lots of Docker images is very disk-intensive and my tool is very much vibe-coded, so it is possible that it will fall over soon, sorry.) It lets you put in a public Docker image and look at the Dockerfile directives that built it, as well as the file contents of each layer (even if later deleted), extracts .zip and .jar files, and lets you explore bundled git repositories with gitweb.

Docker Explorer is hosted on exe.dev. My brief review of exe.dev is that it is refreshingly geek-friendly, allowing configuration over SSH as well as the web interface. The billing model is a flat monthly fee for resources allocated, regardless of how many VMs you attach to them, which means you avoid the "surprise bankruptcy via AWS" scenario, and you also avoid paying another $10/mo every time you want to add a new VM. It automatically acquires TLS certificates for you, which is very convenient.
The biggest downside is that as far as I can tell it only supports HTTP; you can't just run random other services and expose them to the internet. So it would be no good for hosting Protohackers solutions, for example. Also no good for hosting a mail server, DNS server, IRC server, etc.; it's only for websites.

From looking in public Docker images so far I have come across:

AWS keys
Google Cloud keys
SSH keys
Stripe keys
GitHub access tokens
GitHub passwords
OpenAI/Anthropic/OpenRouter API keys
SMTP passwords
Telegram bot tokens
MongoDB passwords
Postgres passwords
And an extremely long tail of API keys for various services I've never heard of before

In many cases these seem to be included accidentally (e.g. a developer had the credentials on their local disk when they built the image and didn't realise they would be copied into it), but probably in most cases I think people put them in the image on purpose, to use them, but didn't realise that the image would be public! There is kind of a footgun with the Docker Hub free tier where it only lets you have one private image, and if you push any more images then they are just automatically public. So obviously watch out for that.

What follows is a list of ways to publish these things on Docker Hub.

Hard-code the secrets into your source code

If you're looking to accidentally publish secrets, then you should be doing this already. Hard-coding secrets in the source code means you get to publish them in both your git repository and your container image without any extra work.

Put them in a .env file

Preferably you will commit the .env file to git so as to increase the attack surface. Putting secrets in a .env file makes them particularly easy to find because you can find them just by looking at filenames, without having to grep over the entire codebase. But even if you don't commit them to git, if you put them in the Docker image with "COPY . ."
then they will get included anyway if present on your local machine when you build the image.

Put them in the Dockerfile

Dockerfile:

This does successfully avoid writing the secret to the image filesystem, but it is easy to see that the information is still there, otherwise your daemon wouldn't be able to read it. And in fact the environment variables are straightforwardly stored in the JSON metadata of the image. ARG is similar but for values that are only present while building the image, rather than running it. These also leak into the image metadata, so I would also suggest putting secrets in ARG directives if you want to leak them.

Delete them at build time

Dockerfile:

If you docker run -it --rm image bash then you'll find that /root/.ssh/id_rsa has indeed been deleted. But because Docker builds up a container image as a series of "layers" that are applied on top of one another, you are free to extract the content at the layer created by the "COPY" line, and grab out the private SSH key. The Docker Build secrets documentation has suggestions for what to do if you don't want to leak credentials in your public images.

Hide them with .dockerignore

.dockerignore:

Now when you copy your working directory into the Docker image with COPY . . , your .env file will be ignored. Boo! But your .git directory will still be included, so if .env was committed to git then it will still be accessible via the .git directory.

Leave them in .git/config

.git/config:

Including your .git directory in the image not only leaks your entire git repository contents, it also leaks the URLs to your remotes (typically just an "origin" on GitHub), which you may want to keep private, and credentials if you have configured any. Even if your project is open source and your git repository is public, your .git/config may contain secrets that you don't want to be made public. Namely, your GitHub credentials.
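
To make the .git/config point concrete, here is a minimal sketch of a check for the two kinds of entry that can carry credentials. Everything in it is an assumption rather than the post's own tooling: the function name `find_git_config_secrets` is hypothetical, the sample remote URL is invented, and the `extraheader` check assumes a CI checkout stored its token as an Authorization header in the config.

```python
import re
from urllib.parse import urlsplit


def find_git_config_secrets(config_text):
    """Return (line_number, reason) pairs for suspicious .git/config lines."""
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = re.match(r"\s*(\w+)\s*=\s*(.+)", line)
        if not match:
            continue  # section headers and blank lines
        key, value = match.group(1).lower(), match.group(2).strip()
        if key == "url":
            parts = urlsplit(value)
            # https://user:token@host/... embeds a credential in the URL.
            if parts.scheme in ("http", "https") and (parts.username or parts.password):
                findings.append((number, "credential in remote URL"))
        elif key == "extraheader" and "authorization" in value.lower():
            # Assumed form of a persisted CI token.
            findings.append((number, "authorization header"))
    return findings


if __name__ == "__main__":
    sample = (
        '[remote "origin"]\n'
        "\turl = https://alice:s3cret@example.com/org/repo.git\n"
    )
    print(find_git_config_secrets(sample))
```

An scp-style remote such as `git@example.com:org/repo.git` carries no password, so this sketch leaves it unflagged.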
When the image is built using GitHub's actions/checkout to clone the repository, it will be a "shallow clone" (i.e. it only contains the most recent commit), and will contain a GitHub token which expires when the job finishes, so it will already be revoked by the time you see it. The most recent commit still contains the committer name and email address as well as the commit message, so for a private repo it's still worth including if your goal is to leak secrets. I'd recommend always bundling .git into the image, because you never know, it might work.

Finally: never check

Having built a Docker image, never check it to see if there is anything inside that you didn't expect; that way you won't have to find out if you leaked any secrets and you can sleep easily.

What to actually do, real talk

Obviously, do the opposite of all of this! Don't commit secrets to git. Don't put .env files containing secrets into your Docker image. That much is obvious. Less obvious is don't put secrets in the Dockerfile. Don't put secrets into the image and then delete them later on. Don't copy the .git directory into the image. And maybe glance over your public images on Docker Explorer to check that you aren't leaking anything.
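
The "delete them at build time" leak can be illustrated with a small sketch that looks for files removed by a later layer but still present in an earlier one. It assumes layers are plain tar archives in build order and that deletions are recorded as whiteout entries named `.wh.<basename>`, as in the OCI image layer format. The helper names `make_layer` and `deleted_but_recoverable` are hypothetical, and the layers here are built in memory, not pulled from a real image.

```python
import io
import tarfile


def make_layer(files: dict[str, bytes]) -> bytes:
    """Build an in-memory tar archive standing in for one image layer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def deleted_but_recoverable(layers: list[bytes]) -> dict[str, bytes]:
    """Return files that a later layer whites out but an earlier layer holds."""
    seen: dict[str, bytes] = {}
    recovered: dict[str, bytes] = {}
    for layer in layers:  # oldest layer first
        with tarfile.open(fileobj=io.BytesIO(layer)) as tar:
            for member in tar.getmembers():
                directory, _, base = member.name.rpartition("/")
                if base.startswith(".wh."):
                    # Whiteout marker: the named file is hidden from here on.
                    target = base[len(".wh."):]
                    if directory:
                        target = f"{directory}/{target}"
                    if target in seen:
                        recovered[target] = seen[target]
                elif member.isfile():
                    seen[member.name] = tar.extractfile(member).read()
    return recovered


if __name__ == "__main__":
    layers = [
        make_layer({"root/.ssh/id_rsa": b"PRIVATE KEY"}),  # the COPY layer
        make_layer({"root/.ssh/.wh.id_rsa": b""}),         # the later rm
    ]
    print(deleted_but_recoverable(layers))
```

The final filesystem no longer shows the key, yet anyone holding the layer archives can read it straight out of the earlier layer.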

Martin Fowler 1 week ago

Maintainability sensors for coding agents

In her recent article about harness engineering for coding agent users, Birgitta Böckeler laid out a mental model for expanding a coding agent harness: a system of guides and sensors that increase the probability of good agent outputs and enable self-correction before issues reach human eyes. Birgitta has now started publishing an article in which she walks through her experiences using sensors to keep a codebase maintainable. This part looks at static analysis with basic code linting.


CISA Admin Leaked AWS GovCloud Keys on GitHub

Until this past weekend, a contractor for the Cybersecurity & Infrastructure Security Agency (CISA) maintained a public GitHub repository that exposed credentials to several highly privileged AWS GovCloud accounts and a large number of internal CISA systems. Security experts said the public archive included files detailing how CISA builds, tests and deploys software internally, and that it represents one of the most egregious government data leaks in recent history.

On May 15, KrebsOnSecurity heard from Guillaume Valadon, a researcher with the security firm GitGuardian. Valadon’s company constantly scans public code repositories at GitHub and elsewhere for exposed secrets, automatically alerting the offending accounts of any apparent sensitive data exposures. Valadon said he reached out because the owner in this case wasn’t responding and the information exposed was highly sensitive.

Figure: A redacted screenshot of the now-defunct “Private CISA” repository maintained by a CISA contractor.

The GitHub repository that Valadon flagged was named “Private-CISA,” and it harbored a vast number of internal CISA/DHS credentials and files, including cloud keys, tokens, plaintext passwords, logs and other sensitive CISA assets.

Valadon said the exposed CISA credentials represent a textbook example of poor security hygiene, noting that the commit logs in the offending GitHub account show that the CISA administrator disabled the default setting in GitHub that blocks users from publishing SSH keys or other secrets in public code repositories.

“Passwords stored in plain text in a csv, backups in git, explicit commands to disable GitHub secrets detection feature,” Valadon wrote in an email. “I honestly believed that it was all fake before analyzing the content deeper. This is indeed the worst leak that I’ve witnessed in my career.
It is obviously an individual’s mistake, but I believe that it might reveal internal practices.” One of the exposed files, titled “importantAWStokens,” included the administrative credentials to three Amazon AWS GovCloud servers. Another file exposed in their public GitHub repository — “AWS-Workspace-Firefox-Passwords.csv” — listed plaintext usernames and passwords for dozens of internal CISA systems. According to Caturegli, those systems included one called “LZ-DSO,” which appears short for “Landing Zone DevSecOps,” the agency’s secure code development environment. Philippe Caturegli , founder of the security consultancy Seralys , said he tested the AWS keys only to see whether they were still valid and to determine which internal systems the exposed accounts could access. Caturegli said the GitHub account that exposed the CISA secrets exhibits a pattern consistent with an individual operator using the repository as a working scratchpad or synchronization mechanism rather than a curated project repository. “The use of both a CISA-associated email address and a personal email address suggests the repository may have been used across differently configured environments,” Caturegli observed. “The available Git metadata alone does not prove which endpoint or device was used.” The Private CISA GitHub repo exposed dozens of plaintext credentials for important CISA GovCloud resources. Caturegli said he validated that the exposed credentials could authenticate to three AWS GovCloud accounts at a high privilege level. He said the archive also includes plain text credentials to CISA’s internal “artifactory” — essentially a repository of all the code packages they are using to build software — and that this would represent a juicy target for malicious attackers looking for ways to maintain a persistent foothold in CISA systems. “That would be a prime place to move laterally,” he said. 
“Backdoor in some software packages, and every time they build something new they deploy your backdoor left and right.” In response to questions, a spokesperson for CISA said the agency is aware of the reported exposure and is continuing to investigate the situation. “Currently, there is no indication that any sensitive data was compromised as a result of this incident,” the CISA spokesperson wrote. “While we hold our team members to the highest standards of integrity and operational awareness, we are working to ensure additional safeguards are implemented to prevent future occurrences.” A review of the GitHub account and its exposed passwords show the “Private CISA” repository was maintained by an employee of Nightwing , a government contractor based in Dulles, Va. Nightwing declined to comment, directing inquiries to CISA. CISA has not responded to questions about the potential duration of the data exposure, but Caturegli said the Private CISA repository was created on November 13, 2025. The contractor’s GitHub account was created back in September 2018. The GitHub account that included the Private CISA repo was taken offline shortly after both KrebsOnSecurity and Seralys notified CISA about the exposure. But Caturegli said the exposed AWS keys inexplicably continued to remain valid for another 48 hours. CISA is currently operating with only a fraction of its normal budget and staffing levels. The agency has lost nearly a third of its workforce since the beginning of the second Trump administration, which forced a series of early retirements, buyouts, and resignations across the agency’s various divisions. The now-defunct Private CISA repo showed the contractor also used easily-guessed passwords for a number of internal resources; for example, many of the credentials used a password consisting of each platform’s name followed by the current year. 
Caturegli said such practices would constitute a serious security threat for any organization even if those credentials were never exposed externally, noting that threat actors often use key credentials exposed on the internal network to expand their reach after establishing initial access to a targeted system. “What I suspect happened is [the CISA contractor] was using this GitHub to synchronize files between a work laptop and a home computer, because he has regularly committed to this repo since November 2025,” Caturegli said. “This would be an embarrassing leak for any company, but it’s even more so in this case because it’s CISA.”
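To make the two failure modes in this story concrete, here is a minimal Python sketch of the kind of check that can run before anything is pushed: one function looks for strings shaped like AWS access key IDs, the other flags passwords that are just a platform name followed by a year. It is not GitGuardian's scanner or GitHub's secret detection. The function names, the weak-password regex, and the sample strings are assumptions made up for illustration. The only outside fact it relies on is that AWS access key IDs are 20 uppercase alphanumeric characters starting with a prefix such as AKIA or ASIA.

```python
import re

# AWS access key IDs are 20 characters: a 4-letter prefix (AKIA for long-term
# keys, ASIA for temporary ones) followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(text: str) -> list[str]:
    """Return every substring of `text` shaped like an AWS access key ID."""
    return AWS_KEY_ID.findall(text)


def is_platform_year_password(password: str, platform: str, year: int) -> bool:
    """Flag passwords that are just the platform name followed by the year.

    Hypothetical check for the weak pattern described in the article; the
    optional separator and trailing punctuation are assumptions.
    """
    pattern = re.compile(
        rf"^{re.escape(platform)}[-_@ ]?{year}[!?.]*$", re.IGNORECASE
    )
    return bool(pattern.match(password))


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\nnote: nothing else here"
    print(find_aws_key_ids(sample))
    print(is_platform_year_password("Jenkins2025", "jenkins", 2025))
```

A pattern match only shows that a string looks like a key ID, not that the key is live, so a hit is a prompt to rotate and investigate rather than proof of exposure.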

The Coder Cafe 2 weeks ago

AI for Production

☕ Welcome to The Coder Cafe! These days, most posts about AI for production circle the same ideas: automated remediation, anomaly detection, alerting triage, etc. These are interesting starting points, but they share a common assumption: that AI’s job is to replace what SREs do. In this post, I want to explore the idea of having AI as a cognitive partner, something that extends what a single engineer can hold in their head at once. Get cozy, grab a coffee, and let’s begin!

At Google, I’m an SRE on the Google Distributed Cloud team, where the infrastructure stack spans Kubernetes, Borg, distributed storage, virtualization, networking, and more. Over the past months, I’ve been experimenting with ways AI can help not only by automating work away, but also by reducing the cognitive overhead that makes production work quite overwhelming sometimes. Here are three directions that changed how I thought about the problem.

Situation Awareness

In my team, we have hundreds of dashboards. Kubernetes clusters, Borg jobs, storage metrics, VM utilization, network metrics, etc.
Each one tells part of the story. When something went wrong, and I wanted to understand the current state of the system, I needed to spend a significant amount of time opening tabs and cross-referencing panels to get a complete picture.

This is a fundamentally human bottleneck. Each dashboard was designed to answer a specific question. The question “What is the current situation?” doesn’t map to any single dashboard, and navigating all of them to reconstruct an answer takes time we often don’t have.

Interestingly, this is where AI can change the equation. Instead of navigating dashboards, imagine describing your system to an AI agent with access to your observability stack and simply asking: “What’s going on?” The agent queries across your telemetry data, picks out what stands out, and hands you back a coherent narrative, something you can actually act on. Like: “For the past two hours, this specific cluster has had an issue with all the containers using distributed storage running on that specific node.”

This shifts the focus from navigator (opening dashboards one by one) to interpreter (acting on a synthesized summary). And that shift matters: every minute you spend navigating is a minute you're not spending on the actual problem.

Telemetry Archaeology

A few months ago, I was investigating a storage incident on a cluster. The failure itself was clear: a disk issue that surfaced as elevated latency and eventually a service degradation. What wasn’t clear was why it happened when it did.

I used Gemini CLI to navigate the metrics data around the event window. What it surfaced surprised me: the root cause signals had been present in the telemetry hours before the incident triggered any alert. Subtle correlations across metrics that individually looked like noise: disk read latency creeping slightly upward, I/O wait ticking up on specific nodes, a minor memory pressure pattern. Together, they pointed directly at the failure that was coming.

A human reviewing those dashboards in real time would almost certainly have missed it. Each individual signal was within an acceptable range. The pattern only became visible when we looked at all of them together, across time.

This is what I’d call telemetry archaeology: using AI to go back through your metrics data and surface the correlations an alerting system wasn’t designed to catch.

It’s worth being precise about what makes this different from anomaly detection. Anomaly detection tells you when something looks wrong. Telemetry archaeology is about finding the patterns that appear before anything looks wrong at all, relationships that no one thought to encode into an alert, because no one knew they existed until the incident happened.

The practical implication is significant. If these correlations exist in your past incidents, they likely exist in future ones. An AI agent that continuously monitors for these multi-signal patterns could surface a warning (“This looks like the early stages of what happened last time”) long before your system starts showing symptoms.

Incident Co-Pilot

Active incidents can be cognitively brutal. You can be debugging a live system, managing communication with stakeholders, coordinating with other engineers, and trying to remember what you checked 20 minutes ago, all at the same time.

A common consequence is that the engineer with the deepest system knowledge gets pulled out of deep focus to write status updates, summarize what’s been tried, and maintain a running timeline. This work is necessary, but it’s expensive. Every context switch makes it harder to hold the full mental model of the incident in your head. And once that model fragments, rebuilding it takes time you don’t have.

NOTE: This is actually one of the reasons Google developed the IMAG process, with clear role separation: The Incident Commander (IC) coordinates the overall response, the Communications Lead (CL) handles stakeholder updates, and the Operations Lead (OL) focuses on mitigating the issue. The explicit goal is to prevent any single person from being pulled in too many directions at once.

AI can absorb most of this overhead. Think of it as a second brain that’s been in the room the whole time: it tracks what hypotheses have been tested, which ones were ruled out and why, what changed in the system during the incident window, and what hasn’t been explored yet. When a new engineer joins the investigation, instead of spending ten minutes getting them up to speed, you ask the AI for a summary.

AI’s role here is handling the administrative layer of the incident: the parts that pull you out of flow, so you can stay in the problem instead of constantly being yanked out of it.

I’ve been using AI this way during my own shifts. Even without a purpose-built tool, maintaining a running log with AI (e.g., what we’ve tried, what we know, what’s next) noticeably changes how an incident feels.

Summary

- The common “AI for production” narrative focuses on automation and replacement; cognitive augmentation is the underexplored angle.
- Situation awareness: AI can synthesize across hundreds of dashboards to answer “What’s the current situation?” in seconds, shifting your role from navigator to interpreter.
- Telemetry archaeology: AI can surface hidden correlations across metrics that individually look like noise, revealing root cause signals that were present hours before any alert fired.
- Incident co-pilot: AI can absorb the administrative layer of an active incident (status updates, running timeline, hypothesis tracking), keeping the engineer in deep focus instead of constant context switching.
- None of this requires replacing the engineer. The value is in extending what one person can hold in their head under pressure.
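The telemetry archaeology idea, where several signals each stay inside their acceptable range but drift together, can be illustrated with a small Python sketch. This is not what Gemini CLI did in the incident described above. It is one simple statistical stand-in, assuming you have a baseline window per metric: each signal gets a z-score against its own baseline, and the scores are combined with Stouffer's method (sum of z-scores divided by the square root of the number of signals). The metric names, thresholds, and sample numbers are hypothetical, and the method assumes the signals are roughly independent, which real telemetry often is not.

```python
from statistics import mean, stdev


def z_score(baseline: list[float], value: float) -> float:
    """How many baseline standard deviations `value` sits above the baseline mean."""
    sigma = stdev(baseline)  # needs at least two baseline points
    return 0.0 if sigma == 0 else (value - mean(baseline)) / sigma


def joint_drift(baselines: dict[str, list[float]],
                current: dict[str, float],
                per_signal_alert: float = 3.0,
                joint_alert: float = 3.0) -> dict:
    """Compare per-signal alerts with a combined score across all signals.

    The combined score is the sum of z-scores divided by sqrt(n) (Stouffer's
    method). It stays near 0 for unrelated noise but grows when several
    signals drift in the same direction at once.
    """
    if not current:
        raise ValueError("need at least one signal")
    scores = {name: z_score(baselines[name], current[name]) for name in current}
    combined = sum(scores.values()) / len(scores) ** 0.5
    return {
        "scores": scores,
        "combined": combined,
        "any_single_alert": any(z > per_signal_alert for z in scores.values()),
        "joint_alert": combined > joint_alert,
    }


if __name__ == "__main__":
    # Hypothetical baselines and current readings for three metrics.
    baselines = {
        "disk_read_latency_ms": [10, 11, 9, 10, 10, 11, 9, 10],
        "io_wait_pct": [2.0, 2.2, 1.8, 2.0, 2.0, 2.2, 1.8, 2.0],
        "memory_pressure_pct": [30, 33, 27, 30, 30, 33, 27, 30],
    }
    current = {
        "disk_read_latency_ms": 11.6,
        "io_wait_pct": 2.32,
        "memory_pressure_pct": 34.8,
    }
    print(joint_drift(baselines, current))
```

In the sample data each metric sits roughly two standard deviations above its baseline, so no single-metric alert at three sigma would fire, while the combined score crosses the joint threshold.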

Zak Knill 2 weeks ago

LLMs are breaking 20 year old system design

The ‘cloud-native’ architecture of the last decade is built on a 20-year-old assumption: that state lives in the database, and compute is stateless. If you want to scale, you scale the database vertically (get a larger machine) [1] or design the database schema around partitioning the data, and you scale your application servers horizontally (add more boxes). Any request can hit any server, the load balancer doesn’t care, and the database is the single source of truth.

neilzone 2 weeks ago

Fixing a proxying problem with my HomeAssistantOS installation by replacing nginx proxy manager

tl;dr: I removed the “nginx proxy manager” add-on, and replaced it with the Let’s Encrypt add-on and (second) the nginx add-on.

A couple of months ago, I moved my HomeAssistant installation to HAos. I think that it is fair to say that I was not overly pleased with this. Honestly, I preferred the “Core” python-venv approach, but I also wanted a “supported” installation, and so I switched to HAos.

I got it up and running okay, and I thought that I had got proxying working too, using an add-on called “nginx proxy manager”. This is not something that I had used before; I’d rather just configure nginx myself. Well, either I got something wrong, or it just does not work very well, as I kept having problems using HomeAssistant: it was stuck on a “loading data” screen, or simply not responding.

This bugged me for quite a while. Annoyingly, the logs available to me within HAos were unhelpful. I couldn’t spot anything indicating a problem. Using the console in my web browser, I noted that some files were not loading correctly, but why that was the case, I wasn’t sure. I thought that I’d had a similar issue with my “Core” installation years ago, which I traced to a setting in a configuration file, but that looked correct here (which I was able to check using the SSH add-on). I tried various parameters in the nginx proxy manager add-on, but to no avail.

In the end, I tried removing the nginx proxy manager add-on, and replacing it with the Let’s Encrypt add-on (which I installed, configured, and ran first), and then the nginx add-on. And it immediately started working correctly. So I don’t know exactly why my original set-up was not working, but at least it is working better now.


When Escalator Breaks, It Turns Stairs

We need resilient systems that fall back to sanity when broken / discriminating. And not whatever.

Rob Zolkos 3 weeks ago

Watch Your Agents

I’ve been telling developers to watch their logs for years. Not just when something is broken. Not just when production is on fire. Watch them while you are building.

Your logs are the closest thing you have to x-ray vision for a web application. Click a button in the browser, watch the request move through the app, and you can see what is really happening behind the scenes.

The habit is simple: keep the server log visible while you work. When you do, you start spotting problems long before they become production issues:

- the same query firing 50 times because of an N+1
- a page that feels fine locally but is doing way too much work
- a slow query that needs an index
- an unexpected redirect or extra request
- a cache miss you thought was a cache hit
- a background job being enqueued more often than expected
- parameters coming through in a shape you did not expect

The logs give you immediate feedback. They make the invisible visible.

Coding agents need the same treatment. When you are working with an agent, do not just look at the final diff. Watch what it is doing. Watch the commands it runs, the files it opens, the mistakes it repeats, and the little bits of glue code it keeps inventing along the way. That is the agent equivalent of watching your development log. You are not only checking whether this turn succeeded. You are looking for patterns that can make future turns better.

Most coding agents keep some kind of session history: transcripts, tool calls, command output, file edits, errors, retries, and sometimes timing information. Those logs are useful after the fact. Point the agent at its own session logs and ask it to look for patterns:

- What tasks did you repeat multiple times in this session?
- What code did you generate only to throw away later?
- Which commands failed, and what would have prevented those failures?
- Did you write any one-off scripts that should become checked-in tools?
- Did you repeatedly search for the same files or project conventions?
- Were there project rules you had to infer that should be documented?
- Which parts of the workflow were deterministic enough to automate?
- What should be added to the project instructions, a skill, or a script?
- If a smaller model had to do this next time, what tools or instructions would it need?

This is the same habit as watching the Rails log after clicking around a page. You are looking for the part of the system that is doing too much work, guessing too often, or hiding useful signal.

A useful signal is when the model keeps generating code to do the same mechanical task. For example, imagine you have a skill for publishing blog posts. Every time you run it, the model writes a small Ruby or Python snippet to:

- parse front matter
- validate the title, summary, badge, tags, and date
- derive the final filename
- move the draft into place

If the agent is generating that code every time, that is a smell. The model is doing work that should probably be deterministic. Ask the agent to turn that behavior into a script. Then update the skill so future agents call the script instead of improvising the logic.

Bad pattern: every publishing session, the agent manually inspects YAML front matter and tries to remember the required fields. Better pattern: create a script that exits non-zero when required fields are missing or malformed. Now the agent does not need to reason about the rules from scratch. It runs the command and reacts to the result.

Bad pattern: the agent repeatedly writes one-off Python to resize screenshots, compare image dimensions, or calculate visual diffs. Better pattern: create a script with clear output. The agent can use the result without reinventing image processing each time.

Bad pattern: the agent keeps constructing ad hoc SQL to answer common questions like “which users have duplicate active subscriptions?” or “which jobs are stuck?” Better pattern: create named scripts or Rails tasks. Now the workflow is repeatable, reviewable, and safe to run again.

Bad pattern: the agent writes custom code every time it needs to build a fake webhook payload or API response. Better pattern: create a dedicated tool or a small fixture library that produces known-good examples. The agent stops guessing at payload shapes and starts using something the test suite can trust.

Moving repeated agent behavior into deterministic tools gives you a few wins:

- Dependability: the same input produces the same output.
- Determinism: fewer “creative” variations in routine work.
- Testability: scripts can have tests; improvised reasoning usually cannot.
- Reviewability: a script can be read, improved, and versioned.
- Cost: once the workflow is encoded, you may be able to use a smaller model for that task.
- Speed: future turns spend less time rediscovering the same procedure.

Watch the agent the way you watch your logs. When you see friction, repetition, or uncertainty, ask whether the agent needs better instructions or a better tool. Sometimes the answer is a clearer prompt. Sometimes it is a skill. And sometimes the best thing you can do is take the fragile reasoning out of the model entirely and give it a boring, deterministic script to call. That is not making the agent less useful. That is making the whole system more useful.
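The front matter example above, a script that exits non-zero when required fields are missing or malformed, might look something like the following Python sketch. The post's own script and its name are not shown, so everything here is an assumption: the filename `validate_front_matter.py`, the `YYYY-MM-DD` date rule, and the deliberately tiny `key: value` parser (the standard library has no YAML parser, and a real script would likely use one). The required fields are taken from the list of things the agent was validating by hand.

```python
import re
import sys

# Fields taken from the post's list; the date format is an assumption.
REQUIRED_FIELDS = ("title", "summary", "badge", "tags", "date")
DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse a leading '---' delimited block of simple `key: value` lines.

    Deliberately minimal: nested YAML and multi-line values are not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    raise ValueError("missing closing '---'")


def validate(fields: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = [f"missing or empty field: {name}"
              for name in REQUIRED_FIELDS if not fields.get(name)]
    date = fields.get("date")
    if date and not DATE_FORMAT.match(date):
        errors.append(f"date must look like YYYY-MM-DD, got: {date}")
    return errors


def main(path: str) -> int:
    """Validate one file; return 0 when valid and 1 otherwise."""
    try:
        with open(path, encoding="utf-8") as handle:
            errors = validate(parse_front_matter(handle.read()))
    except (OSError, ValueError) as exc:
        errors = [str(exc)]
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0


# Only act when a path is given, so importing this file or running it
# without arguments is a no-op.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Run as `python validate_front_matter.py draft.md`, the agent only has to look at the exit code and the error lines on stderr, instead of re-deriving the rules each session.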

David Bushell 3 weeks ago

Unscrewing lightbulbs

Giving lightbulbs a MAC address was a mistake that I’m living with.

“I’m literally unscrewing lightbulbs to renew their DHCP lease” (@dbushell.com on Bluesky)

Instead of enjoying the bank holiday Monday I updated my homelab software. I was ‘inspired’ by the Copy Fail Linux bug to run full distro upgrades. This is my self-hosted update for Spring 2026 (rough documentation to give future me a chance).

Monday’s fun risked a week of pain. I do have backups but restoring them on a broken LAN is tricky. I have an ISP provided wifi router to dust off in an emergency. Along with an absurdly long 15 metre HDMI cable I do not care to unravel. My winter update added a hardware fallback but that too requires careful rejigging.

I have Proxmox hosts, virtual machines, and Raspberry DietPis. They were all on Debian 12 (Bookworm) with a kernel potentially susceptible to the bug. Minimal Debian installs are perfect because I run everything in Docker anyway. Data volumes are easy to backup or network mount. I can change host at will for any service. Debian is just sensible, well documented no-fuss Linux.

I used to run “minimal” Ubuntu server. Following 24.04 I found myself debloating most of the Ubuntu part (i.e. snaps). It sounds like the new coreutils are a CVE party. Glad I escaped before that drama! As it happens, this week’s Linux Unplugged episode had Canonical’s VP of Engineering spewing embarrassing AI platitudes. “Ubuntu is not for you” was the only thing said worth remembering.

I updated most of my VMs first because they’re easy to restore if anything fails. I followed Lubos Rendek’s guide. Start with a full package update and then change the package sources before running another step-by-step upgrade. The only non-Debian sources I have are Docker and Tailscale. Yes, that means I run Docker inside Proxmox VMs, and you can’t stop me! That’s not even my worst crime…

After the Trixie upgrade I found VMs were failing to obtain a LAN IP address. The virtual network device had been renamed. I edited the network configuration and just changed the reference. There is surely a better/more predictable fix but this was the quickest. The same name was used across all VMs so I guess 18 is the magic number.

Everything has been stable so far. If issues arise I’ll just nuke and pave from a Debian 13 ISO. Docker config and volumes are backed up independently of the VM images.

DietPi has a long Trixie upgrade post I didn’t read. I just curled the upgrade script to bash. I gave the script a cursory glance before hitting enter. I have a Pi 4 running failover DNS and a Pi 5 running my public Forgejo instance. DietPi is ideal because of the tiny footprint; I run Docker here too.

Raspberry Pi still hasn’t merged upstream Copy Fail fixes. I’m already in trouble if this bug can be exploited but I did the temporary fix out of caution.

I wasn’t going to bother with Proxmox 9 but after a GUI update I was informed version 8 “end of life” was August 2026. That is soon! I followed the official upgrade guide on my Mini-ITX server. Proxmox has a tool to check compatibility. I saw no red lights so I stopped all VMs, updated package sources to Trixie, and ran the upgrade.

It is critical to run the check again before rebooting. I ran into the systemd-boot issue. Apparently if this is not removed the system fails to boot. If my particular box fails to boot I’m in big trouble because I broke video output and have yet to fix it.

I have another Proxmox machine running virtualised OPNsense for my home router. I can’t stop the OPNsense VM and upgrade the host to Proxmox 9 because the host would have no network access. I had two options:

- Use my failover VM
- YOLO it live

I specifically set up option 1 for such a purpose. I went with option 2. I figured any software running in memory is still alive until I reboot, right? I didn’t question whether Proxmox would kill any processes itself (it didn’t).

The update was suspiciously fast. I ran the check again and saw a lot of yellow warnings. Yikes. Eventually I noticed I’d failed to update some sources to Trixie and I’d installed a franken-distro. After fixing mistakes all I could do was reboot and pray for an agonising two minutes.

OPNsense is the only non-Debian operating system in my homelab. I manage it entirely via the web GUI. The 26.1 update had quite a few significant changes. My DHCP setup was considered “legacy” and my firewall rules required a manual migration.

Despite dumbening my smart home my lightbulbs still demand a WiFi connection. I program them myself to avoid Home Assistant and proprietary apps. Turns out I hard-coded IP addresses (discovery protocols are a joke). Despite having dynamic IPs they remained stable until the OPNsense 26.1 DHCP update.

I had no easy way to identify each light. Why would they name themselves anything useful? That’s how I ended up unscrewing the bulbs one by one to see which MAC address fell off the network. I gave them static IPs on a VLAN for future me to appreciate.

And with that, my home network is up to date!

Thanks for reading! Follow me on Mastodon and Bluesky. Subscribe to my Blog and Notes or Combined feeds.
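The bulb-identification trick at the end, unscrew one bulb and see which MAC address falls off the network, boils down to diffing two snapshots of the devices the router can see. A minimal Python sketch of that comparison follows. It assumes the snapshots are already available as plain MAC-to-IP mappings (for example copied from a lease or ARP table); it does not talk to OPNsense, and the function names and sample addresses are made up for illustration.

```python
def normalise_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' separators so snapshots compare equal."""
    return mac.strip().lower().replace("-", ":")


def missing_macs(before: dict[str, str], after: dict[str, str]) -> dict[str, str]:
    """Return {mac: last_known_ip} for devices present in `before` but not `after`."""
    seen_after = {normalise_mac(mac) for mac in after}
    return {
        normalise_mac(mac): ip
        for mac, ip in before.items()
        if normalise_mac(mac) not in seen_after
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after unscrewing one bulb.
    before = {
        "AA-BB-CC-00-00-01": "192.168.10.21",
        "aa:bb:cc:00:00:02": "192.168.10.22",
    }
    after = {"aa:bb:cc:00:00:01": "192.168.10.21"}
    print(missing_macs(before, after))
```

Normalising the addresses first matters because different tools print the same MAC with different case and separators.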

Sean Goedecke 3 weeks ago

Notes on incidents

Incidents are boring. Most of what you actually do during an incident is wait: for some other team to investigate, or for a deploy to finish, or for the result of some change to become apparent, or for someone else who’s been paged to come online. It’s stressful, but there’s often just not that much to do. Most incidents resolve on their own. People love to share war stories about incidents where some hero engineer improvised a clever fix that instantly repaired the system. That rarely happens. Well-designed software systems tend to come good by themselves, and many modern systems are at least partly well-designed, by virtue of being built out of really solid pieces. If a server process is crashing or leaking memory, Kubernetes will kill the pod and bring it back up. If a service is overloaded and jammed up, clients will (hopefully) trigger circuit breakers and back off until it can recover. Temporary spikes in expensive operations will often just fill up a queue instead of taking the entire system down. Most incident calls I’ve been on - well over half - would have come good by themselves in roughly the same time without any human intervention. Most incident-resolving actions make incidents worse. Engineers jump too quickly to resolve incidents. Oh, the queue size is huge? Don’t worry, I’m here in a production console to clear the queue! Unfortunately, some of the jobs I just nuked were doing important billing work and aren’t automatically re-queued, so this queue-latency incident just became a billing incident as well. Another classic in this genre is “engineer forces a series of redeploys to “fix” a concerning-looking metric, and the concurrent deploys cause far more stress on the system than whatever was causing the metric to look weird”. For that reason, the first thing you should do in an incident is nothing . When I was paged late at night, I used to have a habit of pouring myself a glass of scotch before I joined the call. 
This was only partly for the tranquilizing effects of alcohol: the main reason was to have a ritual I could go through to convince myself that I wasn’t rushing, and that it was OK to take a few breaths and relax before jumping into the problem 1 . Making a cup of tea or going for a walk around the house would probably have served as well. Effective incident-resolving actions are often dull. Typically the action needed to resolve the incident - assuming it doesn’t resolve on its own - is to temporarily disable some problematic feature until the system recovers. This is never a complex code change. Typically someone spends five minutes putting together the patch, and then an hour waiting for reviews, CI, and deploying. If you’re very lucky, you’ll get to write a “wrap a cache around it” code change. In an incident, there is no substitute for knowledge of the system. Five strong engineers can troubleshoot on an incident call and get nowhere, while one half-drunk engineer who’s familiar with the codebase can swan in and immediately fix the problem. This is because the kinds of actions that resolve incidents are so simple: if you’ve been the one working on the project, you likely already know exactly what feature flag to check and disable, or what code change to revert. Resolving incidents requires courage. Incident calls can be scary. When engineers are scared, they often reach for consensus: hedging their statements, asking the group if they agree a particular course of action is safe, deferring to each other, and so on. But if you’re the one with knowledge of the system, you have to be decisive. Say “I’m going to do X”, wait thirty seconds, then do it. While it’s usually net-negative to have a powerful manager fidgeting on the incident call, this is one of the rare cases where it can be helpful - executives are very comfortable saying “okay, do it now” about technical courses of action they don’t fully understand. Resolving incidents buys a lot of political credit. 
One thing that I think surprises a lot of engineers who are new to on-call is how grateful managers and executives are for even really simple fixes (i.e. “turn off the feature flag”). This is because incidents are one of the few times that non-technical leadership are directly confronted with their lack of control over the technical sphere. When the team is building a product, your VP has a lot of freedom to guide the process and make decisions. But when there’s an active incident, they have to just sit there and trust that their technical employees are going to pull them out of the fire. It’s a scary situation, particularly for someone who’s used to exercising a degree of power in the workplace. However, always resolving incidents is (by itself) not a durable position of power. This is a little counter-intuitive. Surely if you’re always resolving incidents, you’re indispensable? The problem is that incident-resolving work is almost always so technical as to be completely opaque to executives. They know the incident has resolved, but they don’t know if you made a heroic effort or merely did the obvious thing. They also can’t point to your successes as theirs (which is always the most reliable way to get VPs and directors on your side), because incidents are expected to be fixed, and it’s always better not to have had the incident at all.
1. I don’t need to do this anymore because I just don’t get as keyed up about incidents as I used to. ↩
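The post mentions clients tripping circuit breakers and backing off until an overloaded service recovers. A minimal sketch of that idea follows. The class name, thresholds, and the injectable `clock` are illustrative assumptions, not anything taken from the post or from a particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call so the backend can recover."""


class CircuitBreaker:
    # Hypothetical defaults; real values depend on the service being protected.
    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast instead of adding load.
                raise CircuitOpenError("backing off; not calling the service")
            # Cooldown elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open) the circuit and restart the cooldown.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each outbound request in `breaker.call(...)` means a jammed-up dependency sees less traffic while it recovers, which is one way a system can "come good by itself" without anyone touching a production console.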

0 views

Building the deployment tool I wish I had

Deptool is a new declarative configuration deployment tool that I built for myself. In this post I describe the design, and I explain what problems it solves.

0 views
iDiallo 3 weeks ago

AI didn't delete your database, you did

Last week, a tweet went viral showing a guy claiming that a Cursor/Claude agent deleted his company's production database . We watched from the sidelines as he tried to get a confession from the agent: "Why did you delete it when you were told never to perform this action?" Then he tried to parse the answer to either learn from his mistake or warn us about the dangers of AI agents. I have a question too: why do you have an API endpoint that deletes your entire production database? His post rambled on about false marketing in AI, bad customer support, and so on. What was missing was accountability. I'm not one to blindly defend AI, I always err on the side of caution. But I also know you can't blame a tool for your own mistakes. In 2010, I worked with a company that had a very manual deployment process. We used SVN for version control. To deploy, we had to copy trunk, the equivalent of the master branch, into a release folder labeled with a release date. Then we made a second copy of that release and called it "current." That way, pulling the current folder always gave you the latest release. One day, while deploying, I accidentally copied trunk twice. To fix it via the CLI, I edited my previous command to delete the duplicate. Then I continued the deployment without any issues... or so I thought. Turns out, I hadn't deleted the duplicate copy at all. I had edited the wrong command and deleted trunk instead. Later that day, another developer was confused when he couldn't find it. All hell broke loose. Managers scrambled, meetings were called. By the time the news reached my team, the lead developer had already run a command to revert the deletion. He checked the logs, saw that I was responsible, and my next task was to write a script to automate our deployment process so this kind of mistake couldn't happen again. Before the day was over, we had a more robust system in place. One that eventually grew into a full CI/CD pipeline. 
Automation helps eliminate the silly mistakes that come with manual, repetitive work. We could have easily gone around asking "Why didn't SVN prevent us from deleting trunk?" But the real problem was our manual process. Unlike machines, we can't repeat a task exactly the same way every single day. We are bound to slip up eventually. With AI generating large swaths of code, we get the illusion of that same security. But automation means doing the same thing the same way every time. AI is more like me copying and pasting branches: it's bound to make mistakes, and it's not equipped to explain why it did what it did. The terms we use, like "thinking" and "reasoning," may look like reflection from an intelligent agent. But these are marketing terms slapped on top of AI. In reality, the models are still just generating tokens. Now, back to the main problem this guy faced. Why does a public-facing API that can delete all your production databases even exist? If the AI hadn't called that endpoint, someone else eventually would have. It's like putting a self-destruct button on your car's dashboard. You have every reason not to press it, because you like your car and it takes you from point A to point B. But a motivated toddler who wiggles out of his car seat will hit that big red button the moment he sees it. You can't then interrogate the child about his reasoning. Mine would have answered simply: "I did it because I did it." I suspect a large part of this company's application was vibe-coded. The software architects used AI to spec the product from AI-generated descriptions provided by the product team. The developers used AI to write the code. The reviewers used AI to approve it. Now, when a bug appears, the only option is to interrogate yet another AI for answers, probably not even running on the same GPU that generated the original code. You can't blame the GPU! The simple solution is to know what you're deploying to production.
The more realistic one is, if you're going to use AI extensively, to build a process where competent developers use it as a tool to augment their work, not as a way to avoid accountability. And please, don't let your CEO or CTO write the code.
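The SVN story above ends with a script replacing the manual copy steps. A rough sketch of what such a script might plan is shown here. The repository layout (`releases/<date>` and `releases/current`), the function names, and the example URL are assumptions for illustration; the post does not describe the actual script. The commands are only built, never executed.

```python
from datetime import date


def check_safe(command, trunk_url):
    """Reject any planned svn command that would delete or overwrite trunk."""
    action = command[1]
    if action == "delete" and trunk_url in command[2:]:
        raise ValueError("refusing to delete trunk")
    if action == "copy" and command[3] == trunk_url:
        raise ValueError("refusing to copy over trunk")


def plan_release(repo_url, release_date):
    """Return the svn commands for one release without running them."""
    trunk = f"{repo_url}/trunk"
    release = f"{repo_url}/releases/{release_date.isoformat()}"
    current = f"{repo_url}/releases/current"
    commands = [
        # Snapshot trunk into a dated release folder.
        ["svn", "copy", trunk, release, "-m", f"Release {release_date.isoformat()}"],
        # Replace the old "current" copy (fails harmlessly if none exists yet).
        ["svn", "delete", current, "-m", "Retire previous current"],
        ["svn", "copy", release, current, "-m", "Point current at latest release"],
    ]
    for command in commands:
        check_safe(command, trunk)
    return commands


if __name__ == "__main__":
    for cmd in plan_release("https://svn.example.invalid/repo", date.today()):
        print(" ".join(cmd))
```

The point of the sketch is that the same three steps run the same way every time, and that the one destructive step is checked against trunk before anything is executed.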

0 views
マリウス 4 weeks ago

I Do Not Recommend Bitwarden

Almost four years ago I published a guide on how to run your own LastPass on hardened OpenBSD, in which I explained how to set up an OpenBSD instance, either as a cloud instance or as a Raspberry Pi bare metal installation, that would host Vaultwarden as a backend for the Bitwarden client applications. After having used a similar approach for myself for several years now, I came to the conclusion that I do not recommend the use of Bitwarden any longer. Let me explain. Wikipedia describes Bitwarden as a freemium open-source password management service that is used to store sensitive information […] owned and developed by Bitwarden, Inc., and that is now almost ten years old. The company behind the software is not only developing the Bitwarden server, as well as client applications for most platforms, but it is also offering a SaaS product for users who don’t want to put up with hosting this unwieldy beast on their own. More on this in just a moment. Bitwarden’s pricing for their hosted offering is similar to their competitors' offerings, albeit with differences in terms of functionality. Regardless of whether one picks their hosted offering or decides to self-host, however, the client applications remain the same. Since 2022, Bitwarden is also backed by $100M of PSG growth equity, joined by Battery Ventures. A password manager that wants to remain open-source is one thing, but the same password manager with an investor on its board that needs to see a return on $100M is another. Without wanting to sound overly cynical, this is usually the point in time in which the rent-seeking begins and the product slowly shifts from serving its users to serving its investors. If you decide to self-host Bitwarden, however, you will relatively quickly find yourself in what I would describe as enterprise software hell.
The standard Bitwarden server deployment is a heavy-weight C# backend that ships with MSSQL Express and won’t work with more Linux-native databases like PostgreSQL or MariaDB . Depending on the size of the deployment and the requirements with regard to high availability, you might want to utilize Kubernetes, which in turn adds additional overhead and complexity. Because of this, many smaller to medium-sized deployments prefer to look into Vaultwarden instead, which is an unofficial Bitwarden-compatible server written in Rust™ . The simple and lightweight nature of Vaultwarden compared to the official Bitwarden server makes such a big difference for administrators that the unofficial server project has seemingly three times the stargazers on GitHub as compared to Bitwarden ’s official implementation. This should make you think, especially as a series B -funded company with $100M, whether your (technical) users appreciate the current direction your software stack is heading towards, or whether you might want to look into bringing the people that built a vastly more successful backend implementation on-board to optimize and accelerate your official stack. And surely that’s what Bitwarden decided to do, right? Sadly, however, it seems that Bitwarden ’s NIH syndrome was too strong to simply take over Vaultwarden as an official project. Instead, the company seemingly hired the main developer of the Vaultwarden project and decided to publish a “lighter” version of their existing backend dubbed Bitwarden unified lite , which is still a service built on Microsoft ’s .NET , and which still appears to require more than three times the RAM a Vaultwarden instance usually consumes. Regarding the open-source part of Bitwarden , things have been getting murkier over the past year or so. In late 2024, users started noticing that a new dependency, , had been pulled into the clients. 
Its license read: You may not use this SDK to develop applications for use with software other than Bitwarden (including non-compatible implementations of Bitwarden) or to develop another SDK. For a product that prides itself on being open-source, this is a fairly significant plot twist . After considerable backlash in the community, however, Bitwarden called it a “packaging bug” and eventually relicensed the SDK under GPLv3 . Technically, the issue is resolved. Philosophically, however, this episode tells you all you need to know about where Bitwarden is heading: The freeware parts are bait , the actual product is the SaaS subscription, and the community is there to contribute issues and translations as long as it doesn’t cost the company anything. Setting aside the backend, however, the real culprit with regard to Bitwarden are the client applications. Advertised functions do not work as expected, basic features are non-existent (after ten years!) and the user interface is poor to put it mildly, especially when compared to equally priced alternatives. And don’t get me wrong, if Bitwarden was purely a FOSS-effort and not funded by venture capital all these flaws could be brushed aside because, after all, it would be a community effort. However, Bitwarden isn’t a community effort , which is reflected very noticeably in the bureaucratic processes they drowned the community in, but more on this in a moment. About a year ago, I supported someone who tried to switch from a competitor to Bitwarden under the thought of rather supporting open-source software with a yearly subscription than some proprietary platform that one has no insights into. Part of the migration was naturally importing existing vaults from the previous password manager into the new Bitwarden account. As can be seen in my bug report on GitHub , however, this went sideways very quickly, and resulted in at least one vault requiring significant technical workarounds for the import to work. 
The response from what sounded like an official Bitwarden employee left me frankly stunned. Despite the migration/import feature being advertised in multiple places throughout Bitwarden ’s marketing materials and documentation, and despite dozens of users having already complained about the exact same issue, Bitwarden simply decided to ignore the issue report and instead requested opening another likely dead-ended discussion in their community forum. This level of corporate bureaucracy is not at all what open-source software should look and feel like, and it is definitely completely unjustified for a feature that is being advertised on both the open-source software, as well as the paid product, but that simply does not work as advertised. Similarly, many other issues are funneled through this process of community discussions , which more often than not turn out as not much more than lengthy threads of pointless back-and-forth, and almost never materialize in actual implementations. Note: The same import was tested with proprietary alternatives to Bitwarden and worked flawlessly. Migration pain is not limited to the initial import. Even when you’re already inside Bitwarden and simply want to shuffle entries between an organization vault and your individual vault, or the other way around, there is, to this day, no proper “move the selected items to …” feature. For a handful of logins you can clone/edit each one manually, but anyone who has ever tried this with a few hundred items (say, after cleaning up a collection , leaving a company, or consolidating several organizations ) knows that this quickly becomes a carpal tunnel -inducing exercise. The official workaround that Bitwarden support and community threads recommend is to export the source vault as unencrypted JSON , edit the file, and then re-import it into the destination vault. 
Setting aside the obvious security footgun of having 500+ credentials sitting in plain text in , or worse, a directory that’s silently synced to the cloud (think Dropbox , OneDrive , iCloud , …) while you figure out where to put them, the process happily loses a non-trivial amount of data along the way: […] if there are file attachments in any of your vault items, then these will not be included in the export […] the export will not include items in the Trash , or any password histories or timestamps. For any organization that relies on attachments (e.g. SSH key files, licence keys, recovery codes as images) or on password history for compliance/audit reasons, this is plainly unacceptable. For a product whose entire job is to be the source of truth for your credentials, the complete absence of a “move these 500 items to that vault, keep everything intact, click OK” button in year ten of its existence speaks volumes about where Bitwarden ’s engineering priorities lie. Another example concerns client updates. It appears that Bitwarden pushes new updates to their clients that can lead to vaults becoming inaccessible (on the client side) at random, without any heads-up to the users. I personally encountered this issue while travelling. When I had my phone plugged-in overnight, F-Droid decided it’s a good time to update a few apps, one of which was Bitwarden . The next morning I had to log into my banking and when I opened the Bitwarden app on my phone I was unable to access my vault. It took some time to figure out what was going on ( via Vaultwarden ), and I was lucky that I had my UPDC (which hosts my Bitwarden backend) with me, as otherwise I could have ended up in a pretty bad situation with my whole vault being unavailable. The sheer irresponsibility with which Bitwarden appears to push what looks like breaking protocol changes between the clients and the backend is frightening. 
As someone who relies heavily on my password manager to work in offline mode, this experience taught me that Bitwarden cannot be trusted. From that moment on, I disabled automatic updates for the Bitwarden clients and exported a current snapshot of all passwords to a local backup in KeePassChi / KeePassXC / KeePassDX . This is, by the way, not a Vaultwarden -specific issue, despite Bitwarden staff claiming so. Searches through the repository return a long list of very similar reports, for example around the 2025.12.x release introducing regressions that prompted users for the master password twice after login and then crashed the app, or the 2025.6.0 release that simply crashed on startup for many users. The Android app in particular went through a full rewrite from .NET MAUI to native Kotlin in 2024, which shipped alongside a trail of regressions that continue to show up in quarterly releases. Aside from the aforementioned technical details, Bitwarden is (and has always been) one of the subjectively worst applications on my phones and my desktop in terms of user interface. The UI/UX is in fact so horrible, that even after years of use I still dread opening the ungoogled-chromium extension, let alone any of the desktop and mobile apps. Aside from the fact that building the Electron -based desktop app from source is a huge PITA and that the pre-built Flatpaks are not working properly on Wayland , one more general, major issue that I’m experiencing with the Bitwarden client applications (and extensions) is the fact that while they clearly support offline use, they’re not intentionally built for it. Hence, whenever I open the mobile app or the browser extension, there’s a noticeable delay that sometimes takes literal seconds or even minutes, in which the client application seemingly tries to reach the backend, which often isn’t around (because I’m not hosting my Bitwarden backend on the open internet). 
While this sounds like a nitpick, it truly slows down things whenever one has to unlock Bitwarden (which is almost always, as I do not trust especially the browser extension to remain unlocked all the time). Sadly, there seems to be no way to turn off syncing when unlocking the vault to prevent the clients from waiting unnecessarily. Another example of a bad user experience is the logins overview (titled Vault ). Whenever I am on a website (in my desktop browser) and I would like Bitwarden to fill the login form, I tend to click the extension’s icon in the toolbar and then click the entry in the list. This has been how all other password manager UIs that I have used in the past have worked; Not Bitwarden , though. There, you need to click the small Fill button on the right side of the list item. If you click the big list item itself, which is highlighted on mouse-over, you simply open that item to show its details. Instead of allowing the user to click the big UI element (which is the whole list item), Bitwarden forces them to click a significantly smaller, harder to hit UI element (a button on top of a clickable list item). As with the syncing feature, there’s also no way to flip this behavior, so that clicking the list item would fill in the form, while clicking the tiny button would open the item’s details page. I’m apparently not alone in this sentiment. A quick glance at recurring Hacker News threads on the topic reveals that users have been complaining about pretty much every single one of these issues, ranging from the desktop app not focusing correctly when opened , to “loading for over 5 minutes before showing my passwords” , to the browser extension asking to save passwords that are already there , to broken biometric login on iOS, laggy mobile apps, and, of course, the famous “Log-In suggestions not showing” . 
Feature requests that have been sitting in the community forum since 2021 (such as a simple edit history for entries) remain untouched, which is a pattern that MSP resellers also called out publicly as “glacial feature development” . Speaking about lists, the Bitwarden CLI has an equally bad user interface. For example, the command of the tool will unexpectedly output every detail of every item, including passwords and TOTP codes, without the need for an additional e.g. flag. There’s no way that reasonable engineers looked at this and said “Yep, that’s how we do things, because we cannot imagine a single situation in which anyone might mistakenly pipe to some place and unintentionally expose all their credentials” . Also, can we take a step back and talk about the fact that the Bitwarden CLI is a terminal tool built in TypeScript ? Not only because it requires a metric ton of runtime and dependencies, but also because JavaScript isn’t exactly the stack anymore that you’d run carefree on your continuous integration environments. “Why?” , you ask? Hold my beer… A password manager has, essentially, one job : Keeping the user safe, by keeping their credentials safe. For a product that has been around since 2016 , Bitwarden has accumulated a surprisingly long list of incidents in which it at least partially failed at exactly that task. And no, I’m not talking about theoretical vulnerabilities, I’m talking about things that actually shipped to production. In January 2023, shortly after the LastPass breach had the entire industry questioning the real-world strength of cloud-hosted password vaults, security researcher Wladimir Palant published an analysis showing that Bitwarden ’s advertised 200,001 PBKDF2 iterations were, in practice, closer to 100,000 . The reason was that the additional server-side iterations were only applied to the master password hash used for login , but not to the encryption key protecting the vault data. 
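To make the iteration-count argument concrete, here is a simplified model of the derivation chain. The function names, the salts, and the single extra iteration between master key and login hash are assumptions chosen so the total matches the advertised 200,001; this is not Bitwarden's actual code.

```python
import hashlib

CLIENT_ITERATIONS = 100_000  # client-side default described in the post
SERVER_ITERATIONS = 100_000  # additional server-side iterations described in the post


def derive_master_key(password, email, iterations=CLIENT_ITERATIONS):
    """Client-side key that ultimately protects the vault (simplified)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), email.encode(), iterations)


def derive_login_hash(master_key, password):
    """Value sent to the server for login (assumed: one extra iteration)."""
    return hashlib.pbkdf2_hmac("sha256", master_key, password.encode(), 1)


def server_stored_hash(login_hash, server_salt, iterations=SERVER_ITERATIONS):
    """Server-side hardening, applied to the login hash only."""
    return hashlib.pbkdf2_hmac("sha256", login_hash, server_salt, iterations)


def iterations_per_guess(target):
    """PBKDF2 iterations an offline attacker pays per password guess."""
    if target == "vault":
        # The vault key depends only on the client-side derivation.
        return CLIENT_ITERATIONS
    if target == "login_hash":
        return CLIENT_ITERATIONS + 1 + SERVER_ITERATIONS
    raise ValueError(f"unknown target: {target}")
```

In this model, guessing against the stored login hash costs the full chain, while guessing against leaked vault data only costs the client-side step, because the server-side iterations never feed into the vault key.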
An attacker with access to a leaked vault could therefore bypass the server entirely and was left with the same effective security as with LastPass . Additionally, the default client-side iteration count was still at 100,000 , below OWASP recommendations at the time, and a concern that had been raised as far back as 2020 . Bitwarden eventually raised the default to 600,000 and added Argon2 support, but (mirroring LastPass ’ earlier mistakes) the change initially applied only to new accounts, leaving existing users responsible for manually updating their own KDF settings. Still in 2023, RedTeam Pentesting disclosed “Bitwarden Heist” ( CVE-2023-27706 ), a vulnerability in the Windows desktop client that allowed attackers with domain-administrator access to extract the vault decryption key from the local DPAPI storage without ever prompting Windows Hello or the master password. In the words of the researchers: Any process running as the low-privileged user session can simply ask DPAPI for the credentials to unlock the vault, no questions asked. The fix eventually shipped in version 2023.4.0 , months after initial disclosure. Also in 2023, CVE-2023-27974 was disclosed. The vulnerability was about the Bitwarden browser extension, which happily offered to fill credentials into cross-domain iframes embedded on trusted pages, as long as the base domain matched. Meaning, if embedded an iframe from (e.g. on a subdomain controlled by a third party), credentials could be stolen. Bitwarden ’s response was that iframes “must be handled this way for compatibility reasons” , and that “Auto-fill on page load” was not enabled by default. Small comfort if you did enable it. Fast-forward to August 2025, when security researcher Marek Tóth publicly disclosed a class of DOM-based clickjacking attacks that could trick the Bitwarden browser extension into autofilling credit card details and personal information after a single click on a malicious page. 
The vulnerability had been reported four months earlier, in April 2025, but was classified by Bitwarden as “moderate severity” and was not patched until version 2025.8.2 , shipped on the very day the researcher’s embargo expired. And then, a few days before I started writing this post, news broke that the official Bitwarden CLI client ( ) was compromised in the ongoing Checkmarx supply chain attack : The affected package version appears to be , and the malicious code was published in , a file included in the package contents. The attack appears to have leveraged a compromised GitHub Action in Bitwarden’s CI/CD pipeline , consistent with the pattern seen across other affected repositories in this campaign. Organizations that installed the malicious Bitwarden npm package should treat this incident as a credential exposure and CI/CD compromise event . The payload downloaded the Bun runtime, decrypted a second-stage Shai-Hulud worm and started harvesting GitHub and npm tokens, SSH keys, shell history, AWS , GCP , Azure credentials, GitHub Actions secrets, and even MCP configuration files used by AI tooling. The data was then exfiltrated by auto-creating a public repository on the victim’s own GitHub account and uploading the stolen credentials there. Bitwarden ’s npm distribution pipeline stayed compromised for approximately 19 hours and 334 developers had enough time to pull the malicious package before it was caught. Bitwarden ’s official statement emphasised that no end-user vault data was accessed , which is technically true and entirely beside the point. Everyone running in a CI pipeline just handed the attackers whatever else happened to live on that machine. For a company whose one job is keeping secrets safe, distributing an actively malicious CLI through its official channels is not a great look. It also ties back nicely to the earlier rant about shipping a password manager CLI as a Node package. 
Had been a single statically-linked binary in Go or Rust (as most of the ecosystem has moved towards) the npm -shaped blast radius simply wouldn’t exist in that form. And while supply-chain attacks within the Go and Rust ecosystems are on the rise as well, the barriers for successful attacks are still higher. Note: None of the above incidents are world-ending on their own. Every non-trivial piece of software will ship with bugs, and critical vulnerabilities happen to everyone. What bothers me is the pattern . The reactive (rather than proactive) security posture, the “working-as-intended” responses to embarrassing findings, the reliance on a Node.js toolchain for a security-critical CLI, and the fact that several of these issues had been quietly flagged by external researchers long before they were actually addressed. As this post is not an ad-driven hit-piece by any of Bitwarden ’s competitors, you won’t be reading anything along the lines of "… switch to <insert SaaS product here> now and get 50% off your first year with promo code SWORDFISH" . Instead, I will describe the approach that I’m taking moving forward, which might be something that you, as an equally frustrated long-time Bitwarden user, might be interested in exploring as well. Over the past years, I came to the conclusion that there’s no single password manager that will work perfectly for every use case and setup. For example, in my personal life, I do not need the ability to share vaults or individual passwords with other people. In my professional life, however, that is a fairly common occurrence. Similarly, the login credentials for bank accounts or insurance portals do not need to be available through a CLI tool, but they have to be available across multiple devices. Secrets for cloud storage or SSH private keys for deployments, however, don’t need to sync to any of my phones , but they do need to be accessible from a command-line tool that can be invoked programmatically. 
With these requirements in mind, it only makes sense to think of a way to better compartmentalize each set of credentials, rather than trying to find a single piece of software or platform that can kill ten birds with one stone. Also, looking at it from a security perspective, it makes total sense to split up these password groups into different software and services in order to minimize the impact that a data breach might have. Generally, the approach that I came up with splits my credentials into the following groups:
A: Credentials for professional/client projects (think platform logins, etc.)
B: Credentials for accounts containing PII (think bank accounts, online shops, etc.)
C: Credentials for accounts that do not contain PII (think accounts on internet forums, online platforms, etc.)
D: Credentials for infrastructure (think server logins, SSH keys)
E: One-off credentials (think API keys, tokens, etc.)
For group A, I’m going with a SaaS password manager that offers proper vault sharing, integrates with the tools clients actually use (SSO, browser extensions on corporate machines, audit logs), and takes the hosting burden off my plate. The platform is proprietary, which I would normally not be thrilled about, but given that the scope of this group is client work only, I’m accepting the trade-off.
For group B, the rationale is a bit counter-intuitive at first. The accounts tied to these credentials already contain personal information like name, address, date of birth, maybe payment details, which is regularly leaked by the very same services anyway, as a quick look at Have I Been Pwned confirms. A breach of the password manager itself would therefore not meaningfully expand the attacker’s knowledge. With TOTP and Passkeys in place, it frankly doesn’t even matter anymore at this point. What does matter here is cross-device availability, reliability and offline capabilities. I’m using a second, separate cloud-based password manager for this group, from a different vendor, with a different master password and different recovery mechanisms, so that a compromise of group A doesn’t automatically compromise group B and vice-versa. As I will be running their mobile app on at least one GrapheneOS device, I prefer a solution that doesn’t depend on Google Play Services and ideally offers an open-source/source-available client.
Group C covers all the accounts I have on internet forums, websites, privacy-respecting services, and anything that doesn’t hold PII. For these, I don’t need, nor do I want, a cloud service. I’m using KeePassChi / KeePassXC / KeePassDX with the database file sitting in a folder that is being synced across my devices via Syncthing, which is an approach I have already written about in the past. The file is itself encrypted, which means that even if Syncthing were compromised (and the attacker somehow got their hands on the file), they would still need to break the KeePassChi / KeePassXC encryption to get anything useful out of it. On mobile, KeePassDX on Android reads the same file without fuss.
For group D, I’m using a mixed approach: personal credentials are stored using the same approach taken in group C, while credentials that are actually used by scripts, CI jobs, and remote servers are stored in HashiCorp Vault, which is the same one I was already running for PKI in my OpenBSD setup. Vault is a bit of an overkill for a single user, but it gives me proper access policies, token-based authentication for automated agents, short-lived credentials for things that support it, and audit logs. That said, I’m looking into Infisical.
For group E, the API keys, personal access tokens, and random secrets that I only ever use from the command line, I’ve settled on the venerable utility. It stores each secret as an individual GPG-encrypted file in a Git repository, which is conceptually simple, easy to audit, and cooperates perfectly with shell scripts and my dotfiles. The Git repository lives on my own infrastructure, not on GitHub, and it’s only synced manually when I actually need to access it from a different machine.
This might all sound like a lot of moving parts, and I understand if it looks like overkill for someone coming from a single-vault world. The reality, however, is that after years of using Bitwarden as a one size fits all solution, I realised that one size fits all meant one size fits poorly. Splitting credentials across multiple tools turned out to be significantly less painful than I had initially assumed, mostly because each tool is individually well-suited to its specific task. And if any one of them gets breached, the blast radius is limited to one category of secrets, not the whole lot.
After several years of self-hosting Bitwarden, I’ve come to the conclusion that the product has drifted further and further away from what I originally signed up for. The enterprise-first architecture that barely fits on a Raspberry Pi, the half-hearted attempt at a “lighter” backend, the SDK licensing situation, the slow pace at which features are being addressed, the avoidable UX paper-cuts that haven’t been fixed in years, and finally the string of security issues that shouldn’t have shipped in the first place, all paint a picture that I find hard to reconcile with the “open-source password manager for everyone” narrative. I’m not suggesting that the alternatives are universally better or free of their own issues, because password managers are simply hard, and every player in this space has its fair share of skeletons. What I am suggesting is that you take a hard look at how much trust you are placing into a single piece of software for all of your credentials, and whether that bet is still the right one, which for me, it no longer was.
Here are some other views on this topic:
Ask HN: Alternatives to Bitwarden?
Bitwarden CLI Compromised in Ongoing Checkmarx Supply Chain Campaign
Concerns Over Bitwarden Moving Away from Open Source

0 views
Allen Pike 1 month ago

We Can Do Hard Things

Years ago, back when I was leading a mobile dev team, my friend had an idea for a business. You see, back then the most frustrating thing about mobile dev was the final step: getting your app on actual phones. Builds, provisioning, and code signing made for a harrowing trial, festooned with obtuse errors and other sharp spikes. So, Dennis had a pitch for me. “What if,” he asked, “we did all your apps’ builds and provisioning and signing for you, in the cloud?” I raised an eyebrow. “Well, obviously that would be great. In theory. But it would be too annoying to build that. Apple drops Xcode versions and switches submission requirements with no warning. And you’d need to make sure that…” He stopped me with a wave. “Right, but: if we did it, and it worked. Would you use it?” “Well, of course we would. But I don’t think you want to run this.” My attempt to discourage him didn’t work. Perversely, the idea that this was a hard problem got him more excited. He immediately dove in. Three years later, Buddybuild was acquired with fanfare. They’d accomplished what they set out to do, made a tidy profit, and they were even able to keep their team here in Vancouver. Wisely they ignored me, and chose to do the hard thing. Doing something hard yet pointless is foolish. But doing something hard yet valuable has a lot of benefits:

It’s easier to recruit a great team to tackle hard, worthwhile problems.
It leads to less competition, due to schlep blindness.
It’s a great way to hone your ambition and discipline; over time, working on hard things feels less hard.

Consider that. If you have a great team, less competition, but more ambition and discipline, then you’re set up to do well. These days are well suited to attempting hard things. Our tools are improving so fast that a project which seemed straightforward last year might be trivial next year. Better to dial up the ambition a bit. Of course, there are a few pitfalls to trying hard things. You’re more likely to burn out, for one: it’s very important to sleep, exercise, and manage your own energy when your work is kicking your ass. And it can sometimes be difficult to tell when the “hard and purposeful” parts end, and when the “overcomplicating things” or “naive folly” begins. I highly recommend having a co-founder that finds hard and purposeful problems motivating, yet takes a dim view of overcomplication. Doing hard things is best not attempted alone. But, all in all, it’s a good default. We can do hard things.

0 views
daniel.haxx.se 1 month ago

Approaching zero bugs?

In this era of powerful tools to find software bugs, we now see tools find a lot of problems at high speed. This causes problems for developers, as dealing with the growing list of issues is hard. It may take longer to address the problems than to find them, not to mention getting the fixes into releases, and then it takes yet more time until users out in the wild actually get that updated version into their hands. In order to find many bugs fast, they have to already exist in source code. These new tools don’t add or create the problems. They just find them, filter them out and bring them to the surface for exposure. A better filter in the pool filters out more rubbish. The more bugs we fix, the fewer bugs remain in the code, assuming the developers manage to fix problems at a decent enough pace. For every bugfix we merge, there is a risk that the change itself introduces one or more new separate problems. We also tend to keep adding features and changing behavior as we want to improve our products, and when doing so we occasionally slip up and introduce new problems as well. Source code analysis tools are a concept as old as source code itself. There have always been tools that try to identify coding mistakes. They just recently got better, so they can find more mistakes. These new tools, similar to the old ones, don’t find all the problems. Even these new modern tools sometimes suggest fixes to the problems they find that are incomplete and in fact sometimes downright buggy. Undoubtedly code analyzer tooling will improve further. The tools of tomorrow will find even more bugs, some of which were not found when the current generation of tools scanned the code yesterday. Of course, we now also introduce these tools in CI and general development pipelines, which should make us land better code with fewer mistakes going forward. Ideally.
If we assume that we fix bugs faster than we introduce new ones and we assume that the AI tools can improve further, the question is then more how much more they can improve and for how long that improvement can go on. Will the tools find 10% more bugs? 100%? 1000%? Is the tool improvement going to continue gradually for the next two, ten or fifty years? Can they actually find all bugs? Can we reach the utopia where we have no bugs left in a given software project and when we do merge a new one, it gets detected and fixed almost instantly? If we assume that there is at least a theoretical chance to reach that point, how would we know when we reach it? Or even just if we are getting closer? I propose that one way to measure if we are getting closer to zero bugs is to check the age of reported and fixed bugs. If the tools are this good, we should soon only be fixing bugs we introduced very recently. In the curl project we don’t keep track of the age of regular bugs, but we do for vulnerabilities. The worst kind of bugs. If the tools can find almost all problems, they should soon only be finding very recently added vulnerabilities too. The age of new finds should plummet and go towards zero. If the age of newly reported vulnerabilities is going down, it should make the average and median age of the total collection go down over time. That is, the average and median time vulnerabilities had existed in the curl source code by the time they were found and reported to the project.

[Figure: Accumulated vulnerability age when reported]

Bugfixes

When the tools have found most problems there should be fewer bugs left to fix. The bugfix rate should go down rapidly, independently of how you count them or how liberal we are in counting exactly what is a bugfix.

[Figure: Bugfixes]

Given the data from the curl project, there do not seem to be fewer bugfixes done yet. Maybe the bugfix speed goes up before it goes down? Given the look of these graphs I don’t think we are close to zero bugs yet.
These two curves do not seem to even start to fall yet. Yes, these graphs are based on data from a single project, which makes it super weak to draw statistical conclusions from, but this is all I have to work with. I think that’s mostly an indication of what you believe the tooling can do and how good they can eventually end up becoming. I don’t know. I will keep fixing bugs.
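The age-based measurement proposed above could be computed roughly as follows. This is a minimal sketch in Python, assuming each vulnerability record carries the date the flaw entered the code and the date it was reported. The `Vuln` type, the `accumulated_age` helper, and the sample dates are hypothetical illustrations, not data or tooling from the curl project.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, median


@dataclass(frozen=True)
class Vuln:
    """Hypothetical record: when a flaw entered the code and when it was reported."""
    introduced: date
    reported: date

    @property
    def age_days(self) -> int:
        # How long the flaw had existed by the time it was reported.
        return (self.reported - self.introduced).days


def accumulated_age(vulns: list[Vuln]) -> list[tuple[date, float, float]]:
    """For each report date, return (date, mean age, median age) computed over
    every vulnerability reported up to and including that date."""
    ordered = sorted(vulns, key=lambda v: v.reported)
    ages: list[int] = []
    rows = []
    for v in ordered:
        ages.append(v.age_days)
        rows.append((v.reported, mean(ages), median(ages)))
    return rows


# Made-up sample data, for illustration only.
SAMPLE = [
    Vuln(date(2010, 1, 1), date(2020, 1, 1)),
    Vuln(date(2015, 1, 1), date(2021, 1, 1)),
    Vuln(date(2022, 1, 1), date(2023, 1, 1)),
]

if __name__ == "__main__":
    for reported, avg, med in accumulated_age(SAMPLE):
        print(f"{reported}  mean={avg:.0f} days  median={med:.0f} days")
```

If the tools really were finding almost everything, the rows for the most recent report dates would show both values trending downward, since only young flaws would be left to find.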

0 views
David Bushell 1 month ago

GitHub is sinking

TL;DR: GitHub used to be cool and now it’s a lame slop graveyard. GitHub is racing towards the mythical zero nines of uptime. Users are starting to notice that GitHub is now a Microsoft product. Eww! Official uptime paints a concerning chart. The missing status page tells a far worse story. Whatever the truth, it’s impossible to miss the delightful experience that is Microsoft GitHub if you use it semi-regularly. Microsoft acquired GitHub and applied their unique brand of enshittification. Amongst their achievements was the spawning of the Copilot circle of hell. Now they’re effectively DDoSing themselves with slop. I won’t dwell on what else went wrong. I don’t know and I don’t care. GitHub is impressively bad now. It’s embarrassing. Shameful. As I write this the obituaries are flooding in:

Ditching GitHub - Lonami
Ghostty Is Leaving GitHub - Mitchell Hashimoto
Before GitHub - by Armin Ronacher
From GitHub to Codeberg/Forgejo - Jonas Hietala

It’s long past time to get off this sinking ship! GitHub has become synonymous with “source control” and I worry too many users don’t know that Git is not GitHub. The core technology of Git is open source. It’s distributed, meaning that all repositories are equal. Git works without a centralised service. Such a practice is a construct of social convenience. GitHub was a useful add-on. Microsoft has turned GitHub into an expensive liability. Network effects are hard to topple but if anyone can do it, Microsoft can. GitHub’s fake star economy is worthless. GitHub is inundated with bots and drowning in slop and doing everything to encourage it. Microsoft is turning GitHub into the Moltbook of code; it ain’t for you and me anymore. Your CI pipeline is over-engineered and GitHub Actions are an abomination (see: [1] [2]). Finding another solution is an absolute chore but do you trust GitHub to be reliable? Look, the ship is sinking! Sure, the water looks freezing. Don’t hang around and allow Microsoft to pull you under. You don’t need to move everything in one go. Start the process. The nearest lifeboat to escape GitHub is another centralised Git forge. Just sign up and push your repo to the new upstream. Some services can automate the migration and maybe even import issues. Personally I’d leave issues behind in a tragic boating accident.

Codeberg: a non-profit and community-led project with an established track record. This is the safe alternative that’ll stick around. It’s the flagship instance of Forgejo.
Tangled: an alpha stage start-up with interesting AT protocol integration. Worth considering for smaller solo projects. Seems cool.
Gitea: they offer cloud managed Git hosting. It’s the original open source project that Codeberg/Forgejo forked away from.
GitLab: enterprise grade, meaning it’s bloated and confusing but it’ll impress your boss. This could be the choice if you need multiple meetings to make the choice.
Bitbucket: trade one soul destroying corpo fun vacuum for another. Strongly discouraged, but Bitbucket does technically fit the anything but GitHub category.

If you’re cool like me, you or your organisation can self-host a Git forge with actions and releases. My recommendation is Forgejo. There is talk of federation between Forgejo instances but it’s not happening anytime soon. If you want open collaboration push a copy to Codeberg. Gitea and GitLab also have self-hosted options. Be aware, GitLab is a comparative chonker. When I said “Git is not GitHub” the same applies to other forges. Do you need those add-ons? Nothing is stopping you from raw-doggin’ Git over SSH. How you manage collaboration is another question. If Linux can be maintained by sending patches to an email mailing list, “doesn’t work at scale” arguments are skill issues. But seriously, a centralised Git forge is a decent compromise in my opinion. Maybe they collapse like GitHub in future. Always have an exit plan. Just use anything but GitHub.

0 views
iDiallo 1 month ago

Don't use localhost:3000, use your own custom domain

After presenting a demo of how an internal tool works, I was flooded with questions. Not about the tool, but about why I had bought a domain just to run the demo. "Why didn't you use the staging server?" they asked. I was confused. I didn't buy a domain. I was running it locally. But instead of the URL being localhost:3000, it was a fully formed domain. In fact, some people told me that they couldn't access the website on their devices. They thought I had to whitelist their IP to grant them access. To feel young again... Setting up a custom domain locally was common practice when I started web programming. But with the advent of Node.js (and rails?), everyone has resorted to just pointing to localhost with an incrementing port number. The main reason is that the webserver is often bundled into the application itself. It’s easy to just run the bundled server and call it a day. However, if you have multiple long-term projects running locally, especially if they need to communicate with one another, then managing a mental map of ports quickly gets tiring. This is where my old school approach shines. By combining the system hosts file with a reverse proxy like Nginx, you can run different projects locally with actual domain names. I usually end up with one domain for active development, another for a stable local build, and the actual production URL for the live site. Here is how to set it up. First, we need to tell your computer where to find these domains. Think of the hosts file as your computer's personal contact list. When you type a URL, your computer looks here first. By adding an entry, you are telling your computer: "Don't bother checking the internet when I ask for myproject.com, I am actually talking about this machine." It creates a manual override that maps a friendly name directly to your machine's IP address. The file's location differs between Linux/macOS and Windows. Open the file in your editor.
In this file, right after the block of entries for Adobe (active.adobe.com...), add a line that maps your domains to your machine's IP address. Now, when you access those domains in your browser, they don't point to the wider internet, but directly to your own machine. Now that the domain is pointed to your own machine, we want to route it to the right application. If your app runs on its own port, navigating to the bare domain will default to port 80 and fail. This is where Nginx comes in. It listens on port 80 and forwards the traffic to the specific port your app is running on. A simplified Nginx config is enough to make it work. Restart Nginx, and voilà! You have clean, professional URLs for your local environment. If you are running your services inside Windows Subsystem for Linux (WSL2), networking is handled a little differently because the Linux instance has its own virtual IP. You can look up your instance's IP address from inside WSL and use it in your Windows hosts file instead of the loopback address. After that demo, some people were disappointed to learn the trick. They thought I was so committed that I had bought a domain name just to give them the real deal with my demo. Someone mused about a shirt with the words "real men don't use localhost:3000". That could have started a whole new motivational speaking career for me. A custom domain just looks very professional and is practical for separating environments. It just feels cooler than staring at localhost:3000 all day. That's how you separate yourself from vibe-coders. Anyway, back to earth. I feel like this is a lost skill and I'm keeping it alive by sharing it. That's how you run a custom URL locally.
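The hosts-file and Nginx steps above can be sketched as a small Python script that prints the two pieces of configuration involved. This is a hedged illustration, not the author's actual setup: the helper names, the `dev.` and `stable.` subdomains, and the ports are assumptions made for the example. The hosts file normally lives at `/etc/hosts` on Linux/macOS and at `C:\Windows\System32\drivers\etc\hosts` on Windows; where the Nginx config goes depends on how Nginx was installed.

```python
def hosts_entry(domain: str, ip: str = "127.0.0.1") -> str:
    """One line for the hosts file, mapping `domain` to `ip`."""
    return f"{ip}\t{domain}"


def nginx_server_block(domain: str, app_port: int) -> str:
    """A minimal Nginx server block that listens on port 80 for `domain`
    and proxies every request to the app on 127.0.0.1:`app_port`."""
    return f"""server {{
    listen 80;
    server_name {domain};

    location / {{
        proxy_pass http://127.0.0.1:{app_port};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }}
}}
"""


# Hypothetical local projects: domain name -> port the app listens on.
PROJECTS = {
    "dev.myproject.com": 3000,
    "stable.myproject.com": 3001,
}

if __name__ == "__main__":
    print("# Lines to append to the hosts file:")
    for domain in PROJECTS:
        print(hosts_entry(domain))

    print("\n# Server blocks for the Nginx config:")
    for domain, port in PROJECTS.items():
        print(nginx_server_block(domain, port))
```

Under WSL2, the same `hosts_entry` helper could be called with the instance's IP as the `ip` argument instead of the loopback default.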

0 views