Latest Posts (20 found)
Unsung Today

Early names

The original 2004 Gmail iteration of the now-ubiquitous modern status bar (here presenting undo send) was internally nicknamed a butter bar because… well, just look at it. (I believe at least Google today calls this a snackbar.)

The UI pop-up element hosting Google Talk inside Gmail – the very same thing that’s more commonly called a “toast” these days – was originally termed a mole.

The column view in NeXTSTEP was called a browser, but a few years later someone put together a different kind of browser on that very same machine, and the original term was sunset – after NeXTSTEP became Mac OS, the view was renamed to “column view.”

These three are off the top of my head. Please send in more! #history #interface design


Model-Harness-Fit

Why mixing a frontier model with a foreign harness quietly tanks performance, and what the open source code tells us about the mechanism.

I keep three coding agents alive on the same workstation. Claude Code in one terminal. Codex CLI in another. GitHub Copilot CLI in a third. Same files. Same git tree. Same bash. Three different harnesses that look indistinguishable.

A few weeks ago I ran the same prompt through all three, and the behavior was visibly different in ways that went well past the surface differences of style and speed I had expected to see across vendors. The Codex run cited a memory entry I had taught it months ago, applied the rule, and kept going without asking. The Claude Code run flagged the same context but refused to assert it without first verifying that the file path was still valid. The Copilot CLI run produced a longer, more cautious plan and asked me to approve it before taking any side effect on disk.

The hand-wave answer is that "models behave differently because they are different models." But Copilot CLI was running Claude Opus, the same family that Claude Code runs by default. Same model family, same prompt, two harnesses, materially different output. The hand wave does not cover it.

Models are post trained against the harness, not just the API. The tool names they expect, the input schemas they emit, the citation tags they wrap around remembered facts, the file structure of skills they invoke, the planning protocol they follow when the harness says "make a plan first": none of these are generic capabilities of the model. They are byte level conventions baked into the post training of one specific model against one specific harness. Pull the model out of its harness and you give up performance you cannot get back without rewriting either side.

This has a direct consequence that anyone who has tried to ship a "model agnostic" agent has run into: you cannot just swap a model. Supporting BYOK and multi model (which is the responsible posture, since relying on a single provider is risky) adds real engineering complexity, and that complexity is worth paying for. To swap a model cleanly, you have to swap the harness with it: the tool surface, the schema shapes, the skill bodies that name those tools, the citation contract, the memory ritual, the system prompt structure, sometimes the planning protocol. Everything above the model has to move when the model moves.

That is why every agent vendor that supports multiple providers ends up either (a) running a degraded variant of every model they support, or (b) maintaining a separate full stack per model and exposing the choice to the user as "you are picking a product, not just a model." Option (b) is the path that wins on quality, and it is worth the engineering cost to avoid being locked into one lab.

Swapping orchestrators is not a cosmetic change. It is a model swap in disguise. The frontier lab spent the last year shaping the model's instincts to a particular tool surface, a particular memory ritual, a particular skill format. When you mix and match, you spend that work. I think this is the single most underrated constraint in agent design today, and it has a clean name. Call it model-harness fit.
I dug into three open implementations that ship today: Codex CLI (OpenAI, fully open source, a Rust workspace of ~80 crates), Claude Code (Anthropic, closed binary, but a Rust port tracks upstream behavior closely enough to read, at ~48,600 LOC across 9 crates, and Claude Code's own runtime injects observable system reminder blocks on every turn that confirm or contradict claims from the port), and GitHub Copilot CLI, whose SDK is fully open source and MIT licensed, with five language bindings (Node.js TypeScript at 5,208 LOC across 8 files, plus Python, Go, .NET, Java) and a documented JSON RPC wire protocol (currently version 3). The CLI binary that the SDK spawns as the agent runtime server is closed, but the client wrapper, the protocol, the session lifecycle, the system prompt section overrides, and every RPC method are all open source and readable. A map of everything I cover appears at the end of this piece, under "The Evidence."

Companion piece: I covered the memory layer in detail in Agent Memory Engineering. This article is about everything else, with memory revisited only where it intersects orchestration. If you want the bottom-up tour of how MEMORY.md indexes, system reminder injection, age-in-days warnings, and signal gates work, read that one first.

Before any argument about architecture, look at the leaderboard. Terminal-Bench 2.0 evaluates agents on bash heavy multi step tasks, and it ranks by harness plus model pair, not by model alone. From the leaderboard on April 30, 2026, two things jump out.

First, Claude Opus 4.6 paired with ForgeCode hits 79.8%, while the same model paired with Capy hits 75.3%. Same weights, different harness, and a 4.5 percentage point spread between them on a benchmark where every entry is fighting for a tenth of a point.

Second, the upper rankings are not dominated by the labs that trained the models. ForgeCode is a third party harness that lands three of the top six entries by routing across model families. Stanford's IRIS Lab paired Opus 4.6 with an automated harness evolution system called Meta-Harness and pushed the same model to 76.4% on the same benchmark, well past the best baseline they started from. The harness is moving the score by more than the model upgrades are moving it.

Cursor's research team makes the point even sharper. In their April 30 post on harness engineering, they note that they took their own coding agent from "Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness." Same model. Same benchmark. Different scaffolding. A 25-position jump on a public leaderboard, attributable to the harness alone. That is not a tuning artifact. That is the entire ranking.

LangChain's Vivek Trivedy puts the same observation in one sentence: "Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses." Anthropic's flagship model in Anthropic's flagship harness loses to the same weights in third party scaffolding. If you only saw the model name on the spec sheet, you would not predict that.

This is the empirical case for model-harness fit. Hold the model fixed and swap the harness, and the pass rate moves by enough to outweigh a model generation upgrade. Anyone shipping a coding agent in 2026 who picks the model first and the harness second is leaving most of the performance on the floor.

The rest of this article is about why. What exactly does the harness do that lets two implementations of the same model produce different scores? Each harness picks a different orchestration protocol, and the model was trained on that protocol's exact wire format.
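To make "wire format" concrete, here is a rough sketch of the three envelope shapes in TypeScript. The type and field names are invented for illustration, not the actual protocol definitions; only the structural contrast matters.

```typescript
// Hypothetical shapes for contrast; field names are invented.

// Codex-style: a submission goes in, a stream of typed events comes back.
type Submission = { id: string; op: { type: "user_input"; text: string } };
type AgentEvent =
  | { id: string; msg: { type: "agent_message"; text: string } }
  | { id: string; msg: { type: "exec_command_begin"; command: string } };

// Claude-style: one conversation loop, with tool calls embedded as typed
// blocks inside the assistant message itself.
type AssistantBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Copilot-style: the host supervises a subprocess and receives JSON RPC
// notifications over stdio instead of running the loop itself.
type Notification = { jsonrpc: "2.0"; method: string; params?: unknown };
```

A model post trained on one of these shapes emits and parses it reflexively; handed one of the others, it has to reason its way through every turn.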
These are not three implementations of the same idea. They are three different contracts between model and runtime.

Codex is a typed asynchronous protocol. The model emits a submission and gets back a stream of typed events. The protocol is defined with explicit enums. There is a second protocol layered on top: 10,721 lines of JSON RPC for cross process clients (IDE plugin, desktop app), where v1 (245 lines) is frozen and all new RPCs go to v2. Methods are named with singular resource names, camelCase wire format. The two protocols stack: the agent layer for in process, the JSON RPC layer for cross process. The model was trained to emit submissions and consume events.

Claude Code is a direct typed conversation loop. The runtime consumes one typed input per turn, drawn from a small set of variants. There is no separate submission queue. The protocol is the Anthropic Messages API plus a tight in process tool dispatcher. The model was trained to emit tool calls inside an assistant message and respond to tool results in the next turn.

GitHub Copilot CLI is a supervisor protocol. The host app does not run the agent loop. It spawns the bundled binary as a subprocess, opens a channel over stdio, and sends the full configuration: model, system message, tools, MCP servers, custom agents, skill directories, hook flags. The agent loop runs inside the child process. The host gets notifications back. The model was trained to run inside this supervisor and emit JSON RPC events that the supervisor can route.

You can see the architectural commitment harden in each design. Codex's contributor guidance literally polices crate growth: the largest crate is explicitly off limits for new features, with a 500 line soft cap and an 800 line hard cap per Rust module. New features pay rent in the form of a new crate. This is a compiler toolchain attitude applied to an agent harness, and the model was trained to operate inside it. Claude Code's port enforces a different rule: "one agent loop, not a fan out of specialized agents," which is why subagents in Claude Code start with a fresh context and cannot recurse. Copilot CLI's supervisor model is what lets a single binary serve three surfaces (terminal, cloud agent, third party hosts). Each surface gets the same model behavior because the model is always running inside the same supervisor.

Now imagine you swap models. Take a model trained to emit Codex submissions and feed it Claude Code's message stream. The model has been taught one wire shape. The harness expects another. The mismatch shows up not as an outright failure but as a quiet degradation: missed tool calls, wrong reasoning effort levels, inconsistent compaction triggers, citation tags that the harness never parses. The wire format is part of the model.

This is where post training is most visible. Every harness has a tool registry. The names look similar at the top, across the shell, read, and edit basics. But once you go past the first six, the surfaces diverge in ways that the model has been taught to exploit. Codex exposes a particular vocabulary, Claude Code's port enumerates 40 tool specs, and Copilot CLI bundles a different default set, drawn from the public changelog; the per tool breakdown is in the glossary at the end of this piece.

A model trained on Codex's eight verb subagent surface knows how to send a message to a running subagent. A model trained on Claude Code's single subagent dispatch tool does not have that verb in its instinct set. The harness can paper over this with a router, but the router cannot give the model an instinct it does not have.
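To see how two registries diverge below the surface, here is the same one line edit expressed in the two dialects the next quote is about. The tool and field names are approximations for illustration, not the exact registry entries.

```typescript
// The same one-line change in two edit-tool dialects.

// Patch-style, the format OpenAI-family models are post trained to emit:
const patchCall = {
  name: "apply_patch",
  input: {
    patch: [
      "*** Update File: src/config.ts",
      "-const RETRIES = 3;",
      "+const RETRIES = 5;",
    ].join("\n"),
  },
};

// String-replacement-style, the format Anthropic-family models are post
// trained to emit:
const replaceCall = {
  name: "str_replace",
  input: {
    path: "src/config.ts",
    old_string: "const RETRIES = 3;",
    new_string: "const RETRIES = 5;",
  },
};
```

Either model can produce either payload, but only one of them comes out without spending extra reasoning tokens.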
Cursor's harness team puts the underlying mechanic plainly. From their April 30 research post: "OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes. So in our harness, we provision each model with the tool format it had during training."

This is the single cleanest description of model-harness fit I have seen from any vendor, and it is not a hand wave about model preferences but a specific, measurable cost in reasoning tokens paired with an observable increase in error rate, recorded at scale across millions of agent turns in production. This is where model-harness fit shows up most visibly. The tool surface is the model's vocabulary for the world. Cross train on a different vocabulary and you lose precision in every interaction.

Skills look interchangeable on the surface. All three harnesses use a SKILL.md file with YAML frontmatter (name, description, optional metadata). Codex even baked in cross compat: it parses Claude style markdown skills. Copilot CLI explicitly reads Claude style config. The format is so similar that the same body would parse in all three.

But skills are not just markdown. A skill carries an implicit contract about which tools it expects to call. That contract is not in the frontmatter. It is embedded in the body, in the form of imperative instructions that name specific tools by name, with specific argument shapes, and with specific verbs the model must emit.

Look at what each harness ships as a system skill. Codex ships five bootstrap skills, baked into the binary and extracted to disk on first launch. Their bodies invoke helper scripts directly, assume the model can call the shell to run a Python script, assume the model knows that scripts inside a skill folder are invokable, and assume a sparse checkout fallback for private repos. None of that is in the frontmatter. All of it is in the body.

Claude Code's skills are different. The plugin ships a set of system skills plus many more, and the bodies invoke Claude's specific tools: one to bootstrap into a workflow, one to track steps, one to dispatch parallel subagents, paired tools for file changes and for search. The skills also encode hard process rules: "Use this BEFORE any creative work," "Use when about to claim work is complete." These rules anchor on the harness's system reminder injection model, which Codex does not have in the same form.

Copilot CLI's skills are part of the plugin marketplace ecosystem, and the changelog reveals a different posture. v1.0.5 added "Embedding based dynamic retrieval of MCP and skill instructions per turn" as experimental. The model was trained to consume skill instructions delivered as a per turn injection chosen by an embedding ranker, rather than as a description match. A skill body that assumes "you will see all skills in the system reminder" does not behave the same way when the harness ranks skills via embedding and only injects the top three.

This is why "we both use SKILL.md" is misleading. The format is identical; the contract underneath is not. Skills carry tool specs implicitly, and the implicit specs are pinned to the harness that authored them.

The same applies to plugin manifests. Copilot CLI's v1.0.22 explicitly added support for plugins using Claude style manifest directories, loading their MCP and LSP servers correctly. That is GitHub treating Claude Code's plugin format as a substrate to interoperate with at the file level. But the skills inside those plugins still bring assumptions about Claude Code's tool surface.
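Here is a made-up skill that shows where the contract actually lives. The skill name and steps are invented for this sketch; the frontmatter would parse in all three harnesses, while every step in the body silently assumes one harness's tool vocabulary.

```markdown
---
name: release-notes
description: Draft release notes from the commits since the last tag
---
1. Run `git log` through the shell tool to list commits since the last tag.
2. Dispatch a parallel subagent per commit to summarize it in one line.
3. Use the string replacement edit tool to insert the new section at the
   top of CHANGELOG.md. Use this skill BEFORE claiming the release is done.
```

Step 2 assumes a subagent dispatch verb, step 3 assumes a string replacement edit tool, and the closing rule assumes the harness re-injects process rules every turn. Port the file to a harness without those and the frontmatter still parses while the body quietly stops working.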
Loading the file does not give the model the right vocabulary. The lesson generalizes. A skills marketplace that claims to be cross harness is a routing problem, not just a parsing problem. Each skill needs to either declare its target harness explicitly, or get rewritten per harness, or run inside a router that translates tool calls between dialects. None of these are free.

I covered memory in detail in Agent Memory Engineering, so I will keep this section to the parts that matter for harness fit. Three memory architectures, three different bets, and the architectural choices already differ. But the harness fit story is sharper than that. Each model was trained to write memory using a specific tool with a specific schema, and to cite memory using a specific tag with a specific format.

Codex's model writes a structured raw memory artifact via Phase 1 extraction with a strict JSON schema. The Phase 2 consolidation prompt is 841 lines. Schema validation rejects malformed output at parse time. The model's citations are wrapped in dedicated XML blocks, and the harness has a parser that increments a usage counter in the SQLite state DB whenever a citation arrives. This is the model's memory ritual. Strip the citation tag and the harness loses its decay signal.

Claude Code's model writes memory using the standard read and write tools, into one file per memory under the memory directory. There is no separate memory tool. The model picks one of four memory types by file name prefix. The body uses a fixed convention for behavioral rules. The harness wraps every body read in a system reminder block with the dynamic age in days and a verification reminder. The model was trained to read memory through that wrapper, weight it accordingly, and skip stale claims.

Copilot CLI's model invokes a dedicated memory tool. The body of the memory goes to a remote backend. Cross session memory was added in v0.0.412 as experimental. The retrieval surface is a server side query, not a local grep. The model expects the backend to be there. When the backend was unavailable (a v1.0.23 fix), the agent used to hang on the first turn. That is a load bearing dependency.

Now mix and match. Run a Codex trained model on Claude Code's harness. The model will look for a memory write tool, find the plain write tool, and write a file — but it will write a file in Codex's structured format, with headers and annotations, into a directory that Claude Code does not auto load on the next session. The harness does not know to inject the index. The next session does not see the memory. And critically, the model will emit citation blocks that Claude Code never parses. Memory effectively does not exist on the next turn.

Run a Claude trained model on Codex's harness. The model will not emit citation tags. Codex's decay signal stops incrementing. Memories that were used silently rank below memories that were not used, because the harness sees zero citations. Within a few weeks, the wrong memories are getting evicted.

Run either on Copilot CLI's harness with the remote backend. The model's local file instincts do not transfer. The memory tool is the only path, the schema is different, and the cross session retrieval is keyword search against a server, not the always loaded index plus on demand body read pattern that the model was trained on. The first turns will look fine because the model has memory shaped instincts. The retention will be different.

The memory layer is the densest collision surface for model-harness fit.
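A sketch of the collision, with every name invented for illustration: a Codex-style structured memory record and citation ritual on one side, a Claude-style file convention on the other.

```typescript
// Codex-style: a structured record written through a dedicated pipeline.
// Field names are invented; the point is the strict, validated schema.
const rawMemory = {
  kind: "preference",
  text: "User prefers pnpm over npm in this repo.",
  source_turn: 42,
  confidence: 0.9,
};

// The citation ritual: a tag appended to the assistant message whenever a
// memory was used. Strip it and the harness's decay signal stops moving.
// (Tag name is hypothetical.)
const citation = `<memory_citation ids="mem_0042"/>`;

// Claude-style, by contrast: memory is an ordinary file written with the
// ordinary write tool, and freshness is judged at read time, not tracked.
const claudeStyleMemoryPath = ".claude/memories/preference-package-manager.md";
```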
Tools, schemas, citation tags, decay signals, retrieval rituals — all of these are coupled, all of these were learned together during post training, and none of them transfer cleanly when you swap one side.

The citation tag is a microcosm of the larger problem. Codex's model emits a small XML block at the end of an assistant message whenever it pulled in memory. The harness has a parser that strips the block before showing the assistant message to the user, and uses the parsed IDs to bump usage and freshness columns in the state DB; both the parser and the SQL migration behind it are readable in the open source tree. This is the model's contract with the harness. Cite what you used, and the harness will reward what you cited by keeping it alive. The Phase 2 consolidator ranks memories by citation count and decays anything with no citations and no fresh use after 30 days.

Claude Code's model has no equivalent citation tag. The harness does not need one because memory is read via the standard read tool, and the agent's verification grep is what doubles as the "I used this" signal. The reminder text in front of every body read explicitly tells the model: "Records can become stale over time. Verify before recommending." There is no decay loop because the harness assumes the user will prune or the verification will fail in place.

Copilot CLI's model talks to a remote memory backend. The store, retrieve, and rank logic is server side. The model does not need a citation tag because the backend tracks reads on its own.

Now look at what happens in a cross harness run. A six character XML tag becomes the difference between a memory system that improves with use and one that degrades silently. This is what I mean by "the wire format is part of the model." The citation tag is not a feature on a roadmap. It is a habit the model picked up during post training, and that habit only pays off inside the harness that taught it.

The Copilot CLI SDK exposes its system prompt as a structured object with ten section IDs, visible in the open source TypeScript. Hosts can override each section, replace it, or take full control. This is not just a documentation surface. It is the public contract of the model's training distribution. Each section has a specific role, and the model was trained to read each section as a particular kind of instruction. One section binds harder than another. One is consulted when the model is mid tool call. One is what the model reads right before emitting a turn.

Codex has its own equivalent, less explicit. The developer prompt is assembled in a fixed order: memory comes after policy and identity, before behavioral overrides. The model was trained to read this exact order.

Claude Code's static prefix has a different shape, a different ordering, and a different set of precedence claims about what the model should treat as binding. The Claude trained model knows that certain instructions "OVERRIDE any default behavior and you MUST follow them exactly as written." That phrase lives inside the harness rather than inside the model itself, but the model has been trained to recognize the heading and treat its contents as binding. A model trained against this prefix will hunt for that heading and react accordingly, while a model trained against a different prefix simply will not see the heading the same way and will give it the weight of any other piece of context.

This is the same lesson as the citation tag, scaled up. The system prompt is not generic. It is a structured artifact with section conventions that the model was taught to read in a specific way.
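A sketch of the section-override idea in the spirit of the SDK, with placeholder section IDs rather than the real ten.

```typescript
// Placeholder section IDs; the real skeleton has ten, with fixed roles.
type SystemPromptSection = { id: string; content: string };

const skeleton: SystemPromptSection[] = [
  { id: "identity", content: "You are a coding agent running in a CLI." },
  { id: "safety", content: "Never run destructive commands without approval." },
  { id: "tool_guidance", content: "Prefer search before reading whole files." },
  { id: "final_reminders", content: "Keep answers terse. Cite file paths." },
];

// A host overrides one section and keeps the rest of the skeleton intact,
// so the model still sees the section order it was post trained to read.
const customized = skeleton.map((s) =>
  s.id === "final_reminders"
    ? { ...s, content: "Respond in English only." }
    : s
);
```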
Swap harnesses and you keep the model's reading habits but lose the structure they apply to.

GitHub Copilot CLI is the most interesting harness in the comparison because it explicitly tries to route across model families. Sonnet is the default. The picker exposes Sonnet, Opus, Haiku, and the GPT 5.x family. v1.0.32 added an automatic mode that selects a model per session. How does Copilot CLI handle the model-harness fit problem? Looking at the changelog, the strategy has three legs.

First, a per model tool surface. The patch tool is included only when the active model is from the Codex family (v0.0.366: "Codex specific patch toolchain"). The harness knows which models were trained on the patch format and only exposes it to those models. Anthropic models get the string replacement shape they were trained on. This is not a translation layer. It is a per model tool surface. The router does not pretend patching and string replacement are the same operation. It serves the right tool to the right model.

Second, per model tool discovery. v1.0.13: "Tool search for Claude models." The implication: Claude trained models expect a deferred tool loading pattern via tool search, so the harness only exposes the discovery loop to those models. OpenAI trained models do not get the same loop. They get the full tool list up front because that is what they were trained on.

Third, cross model review. v1.0.18: "New Critic agent automatically reviews plans and complex implementations using a complementary model to catch errors early (available in experimental mode for Claude models)." The Critic is a different model than the main agent. Plans get reviewed by the complementary model. This is multi model orchestration baked into the harness, and the routing is explicit.

This is what a real router looks like. Not "translate everything to a common dialect," but "serve the right dialect to each model." It is more code, more state, more telemetry. It is also the only way to get top performance from each model.

The cost of this approach is honesty. The harness has to admit that "Claude on Copilot CLI" and "GPT on Copilot CLI" are different products. The user picks one or the other and gets different behavior. There is no neutral common denominator. This is the right honest answer to model-harness fit, and Copilot CLI is the only harness in the open or semi open set that actually ships it.

The strategic logic is worth naming clearly. Multi model is the crucial bet for any serious agent platform in 2026, and at GitHub and Microsoft we made that bet deliberately and early. Most customers are running multi model workflows whether their vendor admits it or not, and the only way to give every model its best performance is to build the per model routing surface inside the harness itself. We committed to that answer up front, which is what positions Copilot CLI to keep pace with whatever the labs ship next without having to redo its core architecture each time the leaderboard reshuffles. The matched pair is the unit of analysis, but the matched harness across many models is the unit of platform, and that is the level we are operating at.
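The shape of that routing decision, reduced to a sketch. The family prefixes and tool names are placeholders; the point is that the registry is a function of the model, not a constant.

```typescript
// Per-model tool surface: serve each family the dialect it was trained on.
type ToolSpec = { name: string; description: string };

function toolSurface(model: string): ToolSpec[] {
  const base: ToolSpec[] = [
    { name: "shell", description: "Run a command" },
    { name: "read_file", description: "Read a file, optionally by range" },
  ];
  if (model.startsWith("gpt-") || model.startsWith("codex-")) {
    // Patch-trained family: patch tool, full tool list up front.
    return [...base, { name: "apply_patch", description: "Apply a diff" }];
  }
  // Claude family: string replacement plus the deferred discovery loop.
  return [
    ...base,
    { name: "str_replace", description: "Replace an exact string in a file" },
    { name: "tool_search", description: "Discover rarely used tools on demand" },
  ];
}
```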
The single sharpest concrete demonstration of model-harness fit comes from what happens when a user switches models mid conversation. Cursor's research team describes this carefully in their April 30 post, and the failure surface is worth walking through, because every assumption that breaks here is an assumption a single model harness pair quietly relies on. Three things break at the moment of a model switch.

First, the conversation history itself is now out of distribution. The previous model produced tool calls in its native vocabulary: patch blocks, citation tags, six or eight verb subagent dispatches. The new model was trained against a different vocabulary and now has to reason about a transcript full of tool calls it would not have emitted. Cursor handles this by injecting a custom instruction explicitly telling the model "you are taking over mid chat from another model," plus steering it away from the prior model's tools. That mitigates but does not eliminate the cost. The model is still reading a transcript that does not match its instincts.

Second, the prompt cache breaks. Caches are provider and model specific, which means a switch is a guaranteed cache miss. For a long session, this turns the first turn after the switch into a full price re entry of every byte of system prompt and conversation history. Cursor's mitigation is to summarize the conversation at switch time, which yields a shorter clean transcript that costs less to re cache, at the price of losing details that the summary did not preserve.

Third, the tools themselves change shape. The new model's harness loads its native tool set. If the user was deep into a subagent dispatch flow with one set of verbs, the next turn presents a different set. The model has to figure out whether the prior tools are still valid (they are not) and which of its own tools maps to the user's apparent intent.

Cursor's recommendation, after building the mitigations, is honest: "we generally recommend staying with one model for the duration of a conversation, unless you have a reason to switch." The cleanest workaround they describe is to spawn a subagent with a different model rather than switch the main conversation. A subagent starts with a fresh context window, no transcript bias, no cache to break, and the new model's native tool surface from the first turn.

Each of these failure modes maps directly back to the thesis. The transcript, the cache prefix, and the tool surface are all parts of the wire format the model was trained against. Change the model and you change the contract on all three sides at once. A model switch is not a model swap. It is a harness swap, a tool swap, and a cache invalidation, all at once.
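The mitigation stack Cursor describes, compressed into a sketch. The function names are mine; the steps are theirs: summarize, announce the takeover, present the new model's native tools.

```typescript
// A sketch of switch-time mitigation: summarize, announce the takeover,
// and rebuild context around the new model's native tools.
async function switchModel(
  history: string[],
  summarize: (h: string[]) => Promise<string>,
  newModelTools: string[]
): Promise<string[]> {
  // The old transcript is out of distribution for the new model, and the
  // prompt cache is provider and model specific anyway, so compress it.
  const summary = await summarize(history);

  // Fresh context: takeover note, tool inventory, compressed history.
  return [
    "You are taking over mid-chat from another model.",
    "Tools named in the summary may no longer exist; use only the tools listed below.",
    `Available tools: ${newModelTools.join(", ")}`,
    `Conversation so far: ${summary}`,
  ];
}
```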
The model-harness fit framing is no longer a subterranean observation. Two of the labs publishing the most interesting agent work in 2026 say it openly, and the AI infrastructure community has converged on a clean one line definition.

Cursor's Stefan Heule and Jediah Katz describe their harness work as "obsessively stacking small optimizations," specifically because a step change is rare and the gains compound only inside a matched pair. Their team builds in custom prompting per provider and per model version, citing OpenAI's literal precision versus Claude's tolerance for imprecise instructions as concrete differentiators that flow back into prompt design. They report driving unexpected tool call errors down by an order of magnitude in one focused sprint. Tool call reliability is not a model property. It is a harness property, and one that compounds every turn the agent stays alive.

Anthropic's Prithvi Rajasekaran ran a related experiment in his March 24 post on long running application development. The architecture: a planner, a generator, and an evaluator agent, modeled on Generative Adversarial Networks. The evaluator uses Playwright MCP to actually click through the running application as a user would, then grades against a rubric. Out of the box, Rajasekaran reports, "Claude is a poor QA agent" — it identifies legitimate issues and then talks itself into approving the work anyway. Tuning the evaluator prompt over multiple rounds is what turns it into a reliable judge. The harness creates the judgment surface; the model alone does not.

The deeper lesson from Rajasekaran's work is about how harnesses should evolve as models improve. He built one harness against Claude Sonnet 4.5, which exhibited "context anxiety" strongly enough that compaction alone was not sufficient. The harness needed full context resets between sessions, with structured handoff artifacts to carry state across the boundary. When Opus 4.6 shipped, that behavior was largely gone. Rajasekaran dropped the entire context reset machinery and ran one continuous session for over two hours. Every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions go stale. The matched pair is not static. It moves as the model matures, and the harness has to retire scaffolding that is no longer load bearing.

LangChain's Vivek Trivedy has the cleanest framing I have seen: "Agent = Model + Harness. If you're not the model, you're the harness." The harness in this view is every piece of code, configuration, and execution logic that is not the weights themselves: system prompts, tool descriptions, bundled infrastructure, orchestration logic, hooks, middleware. Working backwards from the desired agent behavior, every harness primitive earns its place by patching a specific model gap. Filesystems for durable state, bash for arbitrary action, sandboxes for safe execution, memory for continual learning, planning and self verification for long horizons. Each primitive started life as a workaround for a specific deficiency the model had at training time. Some of those primitives will get absorbed back into the model over time. Others will compound.

Trivedy also names the mechanism that makes model-harness fit so durable: a co-evolution feedback loop. "Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in." This is the pipeline that hardens the matched pair over generations. A new harness primitive ships in week one. By month three, it shows up in millions of agent traces. By month six, those traces are training data for the next model. By month twelve, the next model has the primitive baked into its instincts and the harness can lean on it. The loop is what makes "swap to a foreign harness" not just clumsy but compoundingly clumsy. The model's habits got shaped by the previous generation of its own harness, which itself was shaped by the generation before. Move sideways and you skip every cycle of that compounding.

Trivedy is honest about the cost of this loop, and I want to flag the counter argument cleanly. Quoting him: "A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting." If the model's tool format preference is overfit to its training harness, you could argue that the right long term move is to train against a more diverse set of harnesses so the model generalizes. That argument has merit. The labs that ship one model and one harness as a pair are buying near term performance at the cost of the model's portability.
Whether that trade is the right one depends on whether portability is something the customer values, and right now the customer mostly values the leaderboard.

Three independent posts published within weeks of each other, all converging on a single thesis: the model is only half of the system, the harness is the other half, the matched pair is the proper unit of analysis, and the vendors that ship the matched pair as a single product are the ones currently sitting at the top of the leaderboards.

The harness side of the contract has converged on a markdown file per concern, and the file names are now load bearing across the ecosystem. A model trained on one harness recognizes the file names and knows which one carries which kind of authority. The key observation: the file names are now part of the wire format. A model that has been trained to look for a particular block under a particular heading will hunt for that exact heading on every turn. A model trained against CLAUDE.md will look for CLAUDE.md and miss AGENTS.md. A model trained to load personality from one file will ignore the same content if you put it under a different name.

This is why the AGENTS.md feature request against Anthropic's repo matters. It is not a docs migration. It is a request for the model's training distribution to expand its file recognition vocabulary. Until Anthropic post trains Claude to read AGENTS.md, that file is invisible to Claude Code even if it sits next to CLAUDE.md in the repo.

The SOUL.md ecosystem is a stress test of this thesis. SOUL.md is not yet recognized by any major harness's default loader. So the SOUL.md repo's installation instructions are revealing: copy your SOUL.md directory into the project, then add a few lines to the harness's recognized identity file pointing the model at it. That is a manual bridge from a non-recognized convention to a recognized one. The SOUL.md authors understand that the bytes do not work unless the model knows where to look, and "where to look" is a habit fixed in post training.

The same routing problem shows up in the open. GitHub Copilot CLI v1.0.4 added: "Read .claude/settings.json and .claude/settings.local.json as additional repo config sources." v1.0.36 walked some of it back: "Custom agents, skills, and commands from ~/.claude/ are no longer loaded by the Copilot CLI." That is a router that tried to be permissive about file names, then narrowed when the user surface got confusing. The lesson sits underneath the changelog: even the harness that runs Claude models cannot treat Claude's files as authoritative without negotiating with the user about which conventions count. Pick the convention and ship the post training to match, or ship a router that explicitly maps each file to the model that recognizes it. The middle path of "be permissive and load anything that looks plausible" loses every time.
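A sketch of why the file name itself is load bearing: an identity-file loader keyed on model family. The recognition table is invented; the asymmetry is the point.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Invented recognition table: each family only "sees" the file names its
// post training taught it to look for.
const RECOGNIZED: Record<string, string[]> = {
  claude: ["CLAUDE.md"],
  codex: ["AGENTS.md"],
};

function loadIdentityFile(
  repoRoot: string,
  family: "claude" | "codex"
): string | null {
  for (const name of RECOGNIZED[family]) {
    const p = path.join(repoRoot, name);
    if (fs.existsSync(p)) return fs.readFileSync(p, "utf8");
  }
  // An AGENTS.md sitting right next to CLAUDE.md is invisible to the
  // claude branch of this table, and vice versa.
  return null;
}
```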
After months of running these three harnesses side by side, reading the open source code, and tracking the Terminal-Bench leaderboard, here is where I land. The harness is no longer a wrapper around the model. The harness is part of the model's effective parameters. The post training process embeds the harness's tool surface, schema shapes, memory rituals, citation contracts, and system prompt structure into the model's instinct set. You can take the weights to a different harness, but you cannot take the instincts. The instincts only fire when the harness presents the world the way the post training presented it.

This has three consequences worth naming.

For agent platform builders: pick a harness, pick a model, ship them as a pair. Do not pretend the model is portable. Do not pretend the harness is neutral. The frontier labs are publishing model-harness pairs whether they say so or not, and the per pair performance is the only number that matters. Copilot CLI's "different tools for different models" approach is the honest version of this. The dishonest versions ship a common denominator and underperform on every model they serve.

For model labs: the harness is product strategy, not infrastructure. The harness is where the lab's post training investment compounds. Anthropic's system reminder injection model, the typed memory taxonomy, the verification on every body read: these are not infrastructure choices. They are the surface the model was sculpted against, and they are the moat that makes the model less interchangeable than it would otherwise be. Same for Codex's two phase memory pipeline, the citation tag, the strict JSON schema. Same for Copilot CLI's ten section system prompt skeleton. The harness is where the model becomes irreplaceable.

For users: the cost of switching is higher than it looks, and lower than vendors would like you to think. Higher because the model and the harness fused over months of training and you cannot pull them apart cleanly. Lower because the simple stack underneath is shared, and the conventions on top are documentable. An honest port — replicate the tool surface, replicate the citation contract, replicate the system prompt structure, replicate the memory ritual — would close most of the gap. It just costs as much as the original post training did to set up.

The matched pair is not static. It shifts as the model matures. This is the most useful nuance from Rajasekaran's Anthropic post. A harness component that was load bearing for Sonnet 4.5 (context resets, sprint decomposition, aggressive compaction) became dead weight on Opus 4.6 because the model started doing that work natively. The right harness for a model in March is not the right harness for that model's successor in October. The discipline is to read the traces, identify which components are still earning their place, and retire the ones that are now patches over solved problems. Cursor's blog says the same thing in different words: "Every component in a harness encodes an assumption about what the model cannot do on its own, and those assumptions go stale."

So back to the question I started with. Why does the same prompt produce visibly different output across three harnesses running the same model? Because the model running on three harnesses is effectively three different models, even though the weights on disk are byte for byte identical. The instincts that fire at runtime are not stored only in the weights; they are conditioned by the harness the weights were trained against, and the instincts turn out to be most of what shows up in the assistant's output on any given turn.

The interesting design move now is not a better model. It is not a better harness either. It is the matched pair, designed end to end, where the post training and the runtime reinforce each other turn after turn until the model becomes legibly better at the things this specific harness rewards.

You can see the major builders converging on this idea from three different starting points. Anthropic shipped Claude Code as the canonical Claude harness, with the post training and the runtime co-designed as a single product. OpenAI shipped Codex CLI as the canonical Codex harness, with the same vertical integration on the OpenAI side of the house.
At GitHub and Microsoft we shipped Copilot CLI with explicit per model routing because multi model is crucial: customers run every frontier model they can get their hands on, and our job is to make each one perform at its best inside a harness designed to serve all of them well. The result is the most pragmatically honest harness in the open or semi open set today, and the one positioned to compound across model generations rather than locking to any single lab. Three different theories of what to do about model-harness fit, all three coherent, and all three paying a real engineering price for the choice they made.

The frontier work in 2026 is not about new model architectures. It is about new harness primitives. Ralph Loops, where a hook intercepts the model's exit attempt and reinjects the original prompt in a clean context window, forcing the agent to keep grinding against the goal. Just-in-time harness assembly, where the tool surface and the system prompt get composed per task instead of pre-configured per session. Self-tracing agents that read their own logs to find harness-level failure modes and patch them without human intervention. Each one of these is a primitive that some model will eventually be post trained against, and that pairing will show up at the top of the next leaderboard.

The Terminal-Bench leaderboard tells you who is paying the price right. Look at it again in six months.

The Evidence:

- Terminal-Bench 2.0: what the leaderboard actually shows about model-harness pairs
- Three Harnesses, Three Bets: SQ/EQ vs typed conversation loop vs JSON RPC supervisor
- The Tool Surface: where post training is most visible
- Skills Carry Tool Specs: why "same SKILL.md format" does not mean "interchangeable"
- The Memory Layer: synchronous live writes vs deferred batch vs server side, and why the citation tag matters
- The Citation Discipline: how the model talks back to the harness
- The System Prompt Skeleton: ten section IDs is a contract
- The Routing Reality: what GitHub Copilot CLI is actually doing about all this
- Mid-Chat Model Switching: the cleanest concrete failure mode
- What the Labs Are Saying: Cursor, Anthropic, and LangChain all converging on the same framing
- The Identity File Convention: CLAUDE.md, AGENTS.md, SOUL.md, USER.md, and what each one is for
- What This Means: the model is no longer the moat alone, and the matched pair shifts as the model matures

The tool glossary:

— Codex's custom diff format. Two flavors: a freeform Lark grammar and a JSON variant. The model was trained to emit patches in this format. It is not interchangeable with Claude Code's edit tool, which takes an old string and a new string.
— The bash family, plus a pair of primitives for long lived processes that the model can drive with stdin writes after the fact.
— The plan/todo tool. A model not trained on this tool will use a different convention to track work.
— A verb by which the model can request expanded permissions mid turn. Codex is the only harness with this exact verb.
— Multi agent orchestration. Eight verbs, and the model knows all eight.
— A pair of tools that find other tools: Codex's answer to deferred tool loading.
— The memory tools, tied to their own SQLite migration.
— A trio of tools with lower case names internally, surfaced to the model as CamelCase. The model was trained on the CamelCase variant.
— The edit tool, requiring a path and a target string, with one optional parameter. Not the same shape as Codex's patch format.
— The shell tool with the deepest sandbox surface of the three harnesses. The model knows when to set the escalation flag and pair it with the approval tool.
— The lazy load primitives.
— A single tool for subagent dispatch, with required and optional arguments. The post training has the model emit short imperative descriptions for these.
— A permission pair that toggles a worktree local override.
— A wrapper pair for subagent isolation.
— A tool that streams stdout from a background process, paired with the background shell. The model knows this pattern; Codex does not have it.
— The workflow scaffolding tool. The model writes triplets in a particular pattern.
— Search (bundled ripgrep) and file reading with explicit range params.
— A built in tool (v0.0.374) that rejects certain URLs.
— Three verb interactive shell control.
— Subagent dispatch with depth and concurrency limits.
— Multi turn subagent control. A different shape from Codex's six verb agent surface.
— Interactive clarification.
— Persistent memory tied to a remote backend. Memory is not local files here.
— A patch toolchain included specifically when serving Codex models, distinct from Codex's own.

Unsung Today

Mouse pointer as a mere mortal

I gasped when I first saw Lightroom do this: I know this won’t have the same effect on you just watching. What happened was that, after I clicked on the Disable button, Lightroom moved the mouse pointer for me. I don’t think I have ever seen anything like this, and it provoked many thoughts and emotions:

This feels wrong. If the mouse is the extension of my fingers, and the mouse pointer the extension of the mouse, this is in effect the app grabbing my hand and moving it.

I did not know this was even possible. I can see how moving the mouse pointer programmatically can be useful in very specific situations (like scrubbing, or accessibility), but… not like this.

If you do something for the user, won’t that make it harder for them to remember how to do it themselves?

I’ve seen this kind of a thing many times in my career: someone genuinely asks “hey, if this is such a huge transgression, why wasn’t it codified somewhere in the style guide?” But to me the challenge is that it’s hard to imagine everything that needs to be preemptively captured and prohibited. I have to imagine this stuff for a living, and I literally did not think anyone would just move a mouse pointer like this.

So seeing this now, yeah, I’d bundle this inside the “some interactions are 100% sacred” bucket, alongside focus never being hijacked randomly (especially in the middle of typing), avoiding scrolling anything until I specifically ask, undo and copy/paste needing utmost protection, and a few more.

In the opposite camp, here’s a fun new project by Neal Agarwal (only worth clicking on a computer with a mouse). This is a situation where it feels perfectly fine for a cursor to be hijacked; as a matter of fact, there is something really interesting about a mouse pointer feeling less like a deity floating above it all, and more like a regular in-game actor.

This reminded me of that time, in the earlier days of Figma, when I prototyped an interaction where you could select someone else’s pointer and press Backspace to delete it: We didn’t seriously consider it because it felt just too weird, and not that effective in solving “the other person’s cursor is distracting me” problem. But today it feels like it belongs to the same category as the two examples above. I’ll let you decide if it’s closer to Agarwal’s delight or Lightroom’s terror.

#games #interface design #mouse #onboarding #principles


Photo Journal - Day 4

Took my son to a park near our house. He had his kid's camera and had a blast taking photos with me.

↑ One of my son's photos of me.


AIDHD - AI coding workflow as an exclusion machine

I was recently diagnosed with an attention disorder with sustained attention issues, combined with planning and initiation difficulties. I favour quality over speed, which means it takes longer to do certain tasks, and the longer time spent on them increases fatigue exponentially. For a while I thought my inability to stay interested and accomplish tasks others could do in minutes was a lack of determination or character, a personality flaw. Learning that my brain is objectively bad at them was a huge relief.

Unfortunately, reviewing code is among those difficult tasks; it demands a huge mental effort from me. I can write code for hours because I am the one doing it, and doing it delivers dopamine shots every few minutes, when I compile/refresh my page to see my changes. It keeps me focused and interested. But code reviewing? It's such a drag to me. It's long, non-interactive, boring, and worst of all it requires sustained attention due to the multiple parameters you have to take into account. That's not something you can botch, and so it fries my brain. Any distraction is an opportunity to stop, which drags it out even longer. It's a vicious cycle.

AI driven development, where the code is generated by the machine and reviewed by a human, will therefore never be compatible with my brain. If it becomes mandatory to work this way (as it increasingly seems it will be), it will exclude not just me, but all other people with a similar neurodiversity. For a technology that is all about empowering people, I find this perspective ironic, even if unsurprising.


Gas & Fonts

I was at the gas station filling up the tank (unfortunately we still have one gas vehicle) when I noticed the numbers were in the typical 7-segment display style, but the screen was a modern LCD. It's curious that they made the intentional choice to imitate an older technology on a modern display. I imagine it must be due either to familiarity (most gas pumps still use actual 7-segment displays) or readability. I also wonder why they chose to use an LCD. The numbers only occupied a small corner of the screen, and the only other active pixels were showing a tiny padlock icon. Maybe the screen facilitates maintenance operations not typically seen by a customer? Unfortunately I didn't snap a picture; I still don't feel comfortable pulling out my phone near a gas pump!


Scaling, stretching and shifting sinusoids

This is a brief and simple [1] explanation of how to adjust the standard sinusoid sin(x) to change its amplitude, frequency and phase shift. More precisely, given the general function:

s(x) = A\sin(wx + \theta)

We'll see how adjusting the parameters A, w and \theta affects the shape of s(x). Each section below covers one of these aspects mathematically, and you can use the demo at the bottom to experiment with the topic visually.

Scaling is conceptually the simplest change; we adjust A to increase or decrease the amplitude (maximal height) of s(x). Setting A=2 will make the value twice as large (in both the positive and negative direction) as the original function.

Stretching changes the frequency of sin(x), which is inversely proportional to its period. The baseline function sin(x) has a period of 2\pi, meaning it repeats every 2\pi. In other words, sin(x)=sin(x+2\pi) for any x. If we set w=2, we get sin(2x). This function repeats itself twice as fast as sin(x), because x is multiplied by 2 before being fed into the sinusoid. If x changes by \pi, the sinusoid's input changes by 2\pi. Therefore, the period of sin(2x) is \pi, the period of sin(4x) is \frac{\pi}{2} and so on. [2] More generally, the period of sin(wx) is \frac{2\pi}{w}. Play with the demo below to see this in action, by changing w and observing how the waveform changes. If we know the period p we want, we can easily calculate the w that gives us this period:

w = \frac{2\pi}{p}

The final parameter we discuss is \theta; it's called the phase of the sinusoid. In the baseline sin(x), \theta=0. The sinusoid is 0 at x=0, achieves its positive peak at x=\frac{\pi}{2}, crosses 0 again at x=\pi, hits its negative peak at x=\frac{3\pi}{2} and returns to its original position at x=2\pi, where the repetition begins. By adding a non-zero \theta, we don't affect the sinusoid's amplitude or frequency, but we do shift it right or left along the x axis. For example, suppose we use the function sin(x+\theta) with \theta=\frac{\pi}{2}. Then when x=0, we have sin(\frac{\pi}{2}), so the sinusoid is already at its positive peak; at x=\frac{\pi}{2}, the sinusoid crosses 0 into the negatives, etc. Everything happens earlier (by exactly the value of \theta=\frac{\pi}{2}) than in the baseline sinusoid. In other words, we've shifted the function left by \frac{\pi}{2}. Similarly, when \theta is negative, everything happens later, and the function is shifted right.

We've now gone over all the parameters for the function s(x) = A\sin(wx + \theta). Use the demo below to adjust these parameters and observe their effect on the sinusoid:

- A controls the scaling factor (amplitude).
- w is the frequency and controls the repetition period.
- \theta controls the phase - how much the sinusoid is shifted left or right.
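If you'd rather poke at the numbers than the sliders, here is a tiny self-contained snippet (TypeScript, not the post's demo) that evaluates s(x) = A\sin(wx + \theta) and its period:

```typescript
// Evaluate s(x) = A * sin(w * x + theta) and compute the period 2*pi/w.
function sinusoid(A: number, w: number, theta: number) {
  return (x: number) => A * Math.sin(w * x + theta);
}

const s = sinusoid(2, 2, Math.PI / 2); // doubled amplitude and frequency, shifted left
console.log(s(0));               // 2 * sin(pi/2) = 2: already at its peak at x = 0
console.log((2 * Math.PI) / 2);  // period of sin(2x): pi
```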


A moment with a silly creature

It’s so funny how much a creature like this silly dog can change someone’s life. He certainly changed mine, for better or for worse, and he also changed me in the process. Both physically and spiritually.

ava's blog Yesterday

small thoughts part 10

In ‘small thoughts’ posts, I’m posting a collection of short thoughts and opinions that don’t warrant their own post. :)

I have a hard time finding the words, but I am sad that so many of our inventions seem to reinforce shame we already have culturally in real life. I think I feel it the most with the internet - it's like a place you get to have a second life in, where you can have a different name, different looks, different gender, different presentation, voice, interests, whatever you want. But commercialization of it all and the need to have a real name presence in some places or industries online destroys it; people wanting to hunt you down or misuse your material destroy it; social media sites linking your profiles together and letting people find you by number, email, or suggestions by default destroy it. You always have to be afraid of someone linking it back to you, of it costing you your job or career, of someone tracking you down, whatever. I feel like realistically, the internet should have been this place where you can be as you cannot be in real life, but we increasingly did away with that for no good reason. I know some people might say you can still just make an anonymous account, but that misses how there is still metadata like device and location data, stuff in the background of your pictures, recognizable tattoos or jewelry, recognizable writing style, the things you talk about matching your real life in some unavoidable ways, and more.

It reminds me of how I would be okay with walking somewhere topless (aside from honking cars and harassment), being naked in saunas or at lakes, or wearing some more daring or weird outfits. But the ubiquity of cameras everywhere and the need for people to film everything for clout or take pictures secretly has ruined things and made it unsafe to enjoy that. The tech extends the social shame and reinforces it; it covers me up, it dresses me a certain way. Then at least people should be able to live that online, no? But it's not so; nudity is only okay when it is about earning money or if it's art, and not even then (oftentimes). You always have to be afraid of posing a problem for platforms, worried about minors, worried about opsec and about mixing up the accounts, worried about whether you should split certain aspects of yourself across 5 different accounts or not, worried about someone creating content off of it, worried about offending conservative people. Online is not a safe space to be your true self for many, and it sucks.

Something good about weirdness is how ephemeral it can be. You can wear this or expose that or dye your hair this color, and technically there doesn't need to be any proof that it happened at all, aside from someone's memory they can't externalize into proof for millions of people. But now everything needs to be shown off, or people take pictures and videos of others like it's nothing, and there are fewer and fewer spaces where it is understood or even mandated that this is a space where we don't record things. Nothing feels intimate anymore, or ephemeral. It's seemingly not okay for things to pass without proof they happened. And I think it affects how weird people are willing to be! Both in personality and clothing and hobbies. You might want to engage in weird things, but you don't want to leave a record of it, which is understandable. We live in an era where leaving a record of it all is very easy and normalized, and happens even against your will.
But not everyone has the skin and stomach to be weird on the record. We have reinforced the steel bars of the cage, and the people who treat others like a spectacle to consume have done the most to build it.

“Being invited out but you have to do Pilates and eat clean and do your skincare routine and be in bed by 9pm” isn’t something to brag about. Isolating yourself to participate in trend consumption under the guise of selfcare isn’t desirable. “She works on herself for herself by herself” isn’t a flex. You shouldn’t have to do it by yourself. No one asked you to. Your self improvement or fitness or education doesn’t have to come at the cost of relationships. You can share recipes together, go to the gym together, hold each other accountable, have study sessions together. You can let new people in and make new friends on the path you’re on. Telling yourself only you can understand you and only you have your back and you’re all you need is a coping mechanism for profound loneliness. It’s self-obsession as self-protection. You don’t have to accept or anticipate being alone for your most transformative and important times, and the sooner you see that other people aren’t just a distraction but also an offer to transform you and help you, the better.

Published 02 May, 2026

iDiallo Yesterday

Editing my LLM assisted Articles

Last year, I used AI to help me write articles. As I've mentioned before, it's convenient because it saves you time. But the problem comes up when you try to quote those articles back. Whatever you think you wrote is not what's in there. I always cringe when I read them back. As I've said before, I'm rewriting those articles so that they capture my voice, and so that I can actually quote the real thoughts I had in mind at the time of writing. I will show you exactly what the article looked like before and after: from prompt, to the final post, to the new edit.

Prompt: Feb 4th, 2025 on DeepSeek

I am writing a blog post, help me refine it. It should be a fun read that helps convince readers that building useless tools is part of the journey and career growth. Programmers don't use calculators, they build them and abandon them 3 quarters completed. The start of a project is always exciting. You have an brilliant idea, you have the skills to build it, and you have a blind spot. This blind spot propels you into a journey of excitement, setting up the project, deciding on frameworks, paying for AWS servers. You even get to test some of your ideas in a prototype version of your project. As you keep building the project, and resolving the low hanging fruits, you realize that it takes more than the CRUD part to bring your project to life. This is where we start making excuses. The clearer the blind spot gets, the more we lose interest. Resistance is the secret to creating a collection of abandoned projects. Early in my career, i found myself with a brand new life altering idea several times a week. My weekends were spent in front of a laptop, typing away on my IDE and testing an idea. I have to admit I often get distracted by creating meta code, like a ORM that should be ultra efficient, or a html parser that will allow me to use custom html tags that automatically render into components in real time. At the end, I have a robust framework with 3 quarters of an application that is getting less and less exciting. But this is not a warning call to tell you your idea sucks and shouldn't work on it. It's a confession to tell you that those useless projects are exactly how developers hone their skills. We are tricked into wanting to change the world by our own mind. But the result is that we end up learning new marketable skills. I wrote about the importance of kids playing before, it helps them hone physical skills that you can't learn from a book. You have to climb the monkey bar at the park to develop the strength of your arms. In that same way, we need to build 3 quarters of a calculator to learn how to properly compartmentalize your util functions

DeepSeek "thought" for 38 seconds, and gave me a draft titled "Why Building Useless Tools is the Secret Sauce of Developer Growth".

First Draft:

Title: Why Building Useless Tools is the Secret Sauce of Developer Growth

Hook: Programmers don’t use calculators—they build them, abandon them three-quarters finished, and then accidentally invent the next big thing while trying to fix a typo in the README. Every developer knows the rush: you’re struck by a “life-altering” idea (your 14th this week), and suddenly, you’re setting up repositories, debating frameworks, and buying AWS servers at 2 AM. The blind spot? You’re convinced this time, you’ll finish. Spoiler: You won’t. But that’s okay. “The first 90% of the project takes 10% of the time. The remaining 10% takes another lifetime.” —Ancient Developer Proverb (probably).
That initial blind spot isn’t a flaw—it’s rocket fuel. It lets you ignore the boring bits (user auth, error handling, documentation) and dive headfirst into the fun: prototyping wild features, inventing a “revolutionary” ORM named LazyLlama, or crafting custom HTML tags that’ll “change frontend development forever.” Sure, your enthusiasm fades faster than a free-tier server under load, but along the way, you’ve accidentally learned:

- How to optimize database queries (to make LazyLlama slightly less slow).
- Why shadow DOM exists (after your custom tag broke 14 legacy sites).

As the blind spot clears, reality hits: your app needs more than CRUD and caffeine to survive. Suddenly, you’re procrastinating by rewriting config files or alphabetizing your npm dependencies. This resistance isn’t failure—it’s a teacher. Every abandoned project is a masterclass in:

- Scope creep: “Wait, should the calculator also solve quantum physics?”
- Tooling: “I spent 3 days building a CI/CD pipeline… for a todo app.”
- Prioritization: “The login page can wait—let’s add animated emojis first.”

Kids don’t climb monkey bars to “build upper body strength”—they do it because it’s fun. Likewise, developers don’t build half-finished tools to pad their résumés. We do it because playing with code is joyful. And just like kids, we walk away stronger:

- That abandoned Twitter-for-Cats app taught you websockets.
- The “ultra-efficient” ORM? Now you understand query optimization.
- The calculator left at 75%? You’ve mastered state management… and humility.

Your GitHub is a museum of “almosts,” and that’s beautiful. Every skeleton in your code closet taught you something. Remember:

- Instagram started as a check-in app.
- Slack was a gaming company’s side hustle.
- Your Blockchain-Based Pet Rock project? It taught you smart contracts—and that not every idea needs to see daylight.

The next time guilt whispers, “You never finish anything,” laugh in its face. You’re not failing—you’re iterating. The world needs developers who’ve climbed the mountain of “useless” projects, survived the valleys of tech debt, and lived to code another day. So go ahead: build that calculator. Abandon it at 75%. Then use what you learned to accidentally create something actually world-changing.

Final Thought: The only thing better than a finished project? A dozen unfinished ones that turned you into a coding wizard.

Call to Action: What’s your favorite abandoned project? Tag it #UselessButBrilliant and confess your coding sins. We’ve all got a LazyLlama in our closets.

I've made some edits, asked to remove the headings, and this is what I got:

Title: Why Building Useless Tools is the Secret Sauce of Developer Growth

Programmers don’t use calculators—they build them, abandon them three-quarters finished, and then accidentally invent the next big thing while trying to fix a typo in the README. Every developer knows the rush: you’re struck by a “life-altering” idea (your 14th this week), and suddenly, you’re setting up repositories, debating frameworks, and buying AWS servers at 2 AM. The blind spot? You’re convinced this time, you’ll finish. Spoiler: You won’t. But that’s okay. That initial blind spot isn’t a flaw—it’s rocket fuel.
It lets you ignore the boring bits (user auth, error handling, documentation) and dive headfirst into the fun: prototyping wild features, inventing a “revolutionary” ORM named LazyLlama, or crafting custom HTML tags that’ll “change frontend development forever.” Sure, your enthusiasm fades faster than a free-tier server under load, but along the way, you’ve accidentally learned how to optimize database queries (to make LazyLlama slightly less slow) or why shadow DOM exists (after your custom tag broke 14 legacy sites). As the blind spot clears, reality hits: your app needs more than CRUD and caffeine to survive. Suddenly, you’re procrastinating by rewriting config files or alphabetizing your npm dependencies. This resistance isn’t failure—it’s a teacher. Every abandoned project is a masterclass in scope creep (“Wait, should the calculator also solve quantum physics?”), tooling (“I spent 3 days building a CI/CD pipeline… for a todo app”), and prioritization (“The login page can wait—let’s add animated emojis first”). Kids don’t climb monkey bars to “build upper body strength”—they do it because it’s fun. Likewise, developers don’t build half-finished tools to pad their résumés. We do it because playing with code is joyful. And just like kids, we walk away stronger. That abandoned Twitter-for-Cats app taught you websockets. The “ultra-efficient” ORM? Now you understand query optimization. The calculator left at 75%? You’ve mastered state management… and humility. Your GitHub is a museum of “almosts,” and that’s beautiful. Every skeleton in your code closet taught you something. Instagram started as a check-in app. Slack was a gaming company’s side hustle. Your Blockchain-Based Pet Rock project? It taught you smart contracts—and that not every idea needs to see daylight. The next time guilt whispers, “You never finish anything,” laugh in its face. You’re not failing—you’re iterating. The world needs developers who’ve climbed the mountain of “useless” projects, survived the valleys of tech debt, and lived to code another day. So go ahead: build that calculator. Abandon it at 75%. Then use what you learned to accidentally create something actually world-changing. The only thing better than a finished project? A dozen unfinished ones that turned you into a coding wizard.

I've retitled it, removed the headings from the draft, edited it down, and published this version:

Article Published on March 6th, 2025:

Every developer knows the rush. You are driving and suddenly you’re struck by a “life-altering” idea (your 14th this week). At the next red light, you record an audio while driving, avoiding eye contact with what clearly looks like a cop’s car. At 2 AM, you wake abruptly remembering the recording. Now you’re setting up repositories, debating frameworks, and buying AWS servers in the middle of the night. The blind spot? You’re convinced this time, you’ll finish. Spoiler: You won’t. But that’s okay. Programmers don’t use calculators. They build them, abandon them three-quarters finished, and then accidentally invent the next big thing while trying to fix a typo in the README. That initial blind spot isn’t a flaw. It’s rocket fuel.
It lets you ignore the boring bits (user auth, error handling, documentation) and dive headfirst into the fun: prototyping wild features, inventing a “revolutionary” ORM named LazyLlama, or crafting custom HTML tags that’ll “change frontend development forever.” Sure, your enthusiasm fades faster than a free-tier server under load, but along the way, you’ve accidentally learned how to optimize database queries (to make LazyLlama slightly less slow) or why shadow DOM exists (after your custom tag broke the browser's rendering engine). As the blind spot clears, reality hits: your app needs more than CRUD and caffeine to survive. Suddenly, you’re procrastinating by rewriting config files or alphabetizing your npm dependencies. This resistance isn’t failure. It’s a teacher. Every abandoned project is a masterclass in scope creep (“Wait, should the calculator also solve quantum physics?”), tooling (“I spent 3 days building a CI/CD pipeline… for a todo app”), and prioritization (“The login page can wait—let’s add animated emojis first”). Kids don’t climb monkey bars to “build upper body strength.” They do it because it’s fun. Likewise, developers don’t build half-finished tools to pad their résumés. We do it because playing with code is joyful. And just like kids, we walk away stronger. That abandoned Twitter-for-Cats app taught you websockets. The “ultra-efficient” ORM? Now you understand query optimization. The calculator left at 75%? You’ve mastered state management… and humility. Your GitHub is a museum of “almosts,” and that’s beautiful. Every skeleton in your code closet taught you something. Instagram started as a check-in app. Slack was a gaming company’s side hustle. Your Blockchain-Based Pet Rock project? It taught you smart contracts and that not every idea needs to see daylight. The next time guilt whispers, “You never finish anything,” laugh in its face. “You’ve got the wrong fellow,” you answer. You’re not failing, you’re iterating. The world needs developers who’ve climbed the mountain of “useless” projects, survived the valleys of tech debt, and lived to code another day. So go ahead: build that calculator. Abandon it at 75%. Then use what you learned to accidentally create something actually world-changing. The only thing better than a finished project? A dozen unfinished ones that turned you into a coding wizard.

It sounds very much like any LLM, and I couldn't stand reading it. At the time, I was trying to save time with my heavy schedule of writing every other day for a whole year, but I ended up with this. If you read it, it captures the idea I was trying to share. As far as being functional, it did exactly what it was supposed to do. But it wasn't my human experience with the subject. In my new edit, I've removed things that do not sound like me, phrasings that are awkward to me. I'm happy with the result. It's not a banger, but it captures my sentiment on why developers build calculators.

Read edited article here (May 1st 2026): It's the only way to learn

マリウス Yesterday

I Do Not Recommend Bitwarden

Almost four years ago I published a guide on how to run your own LastPass on hardened OpenBSD, in which I explained how to set up an OpenBSD instance, either as a cloud instance or as a Raspberry Pi bare-metal installation, that would host Vaultwarden as a backend for the Bitwarden client applications. After having used a similar approach myself for several years now, I have come to the conclusion that I no longer recommend the use of Bitwarden. Let me explain.

Wikipedia describes Bitwarden as "a freemium open-source password management service that is used to store sensitive information […] owned and developed by Bitwarden, Inc.", and the project is now almost ten years old. The company behind the software not only develops the Bitwarden server, as well as client applications for most platforms, but also offers a SaaS product for users who don't want to put up with hosting this unwieldy beast on their own. More on this in just a moment. Bitwarden's pricing for their hosted offering is similar to their competitors' offerings, albeit with differences in terms of functionality. Regardless of whether one picks their hosted offering or decides to self-host, however, the client applications remain the same.

Since 2022, Bitwarden has also been backed by $100M of PSG growth equity, joined by Battery Ventures. A password manager that wants to remain open-source is one thing, but the same password manager with an investor on its board that needs to see a return on $100M is another. Without wanting to sound overly cynical, this is usually the point at which the rent-seeking begins and the product slowly shifts from serving its users to serving its investors.

If you decide to self-host Bitwarden, however, you will relatively quickly find yourself in what I would describe as enterprise software hell. The standard Bitwarden server deployment is a heavyweight C# backend that ships with MSSQL Express and won't work with more Linux-native databases like PostgreSQL or MariaDB. Depending on the size of the deployment and the requirements with regard to high availability, you might want to utilize Kubernetes, which in turn adds additional overhead and complexity. Because of this, many smaller to medium-sized deployments prefer to look into Vaultwarden instead, an unofficial Bitwarden-compatible server written in Rust™. The simple and lightweight nature of Vaultwarden compared to the official Bitwarden server makes such a big difference for administrators that the unofficial server project has seemingly three times the stargazers on GitHub that Bitwarden's official implementation has.

If you are a series-B-funded company sitting on $100M, this should make you wonder whether your (technical) users appreciate the direction your software stack is heading, or whether you might want to look into bringing the people who built a vastly more successful backend implementation on board to optimize and accelerate your official stack. And surely that's what Bitwarden decided to do, right? Sadly, it seems that Bitwarden's NIH syndrome was too strong to simply take over Vaultwarden as an official project. Instead, the company seemingly hired the main developer of the Vaultwarden project and decided to publish a "lighter" version of their existing backend dubbed Bitwarden unified lite, which is still a service built on Microsoft's .NET, and which still appears to require more than three times the RAM a Vaultwarden instance usually consumes.
Regarding the open-source part of Bitwarden, things have been getting murkier over the past year or so. In late 2024, users started noticing that a new SDK dependency had been pulled into the clients. Its license read:

You may not use this SDK to develop applications for use with software other than Bitwarden (including non-compatible implementations of Bitwarden) or to develop another SDK.

For a product that prides itself on being open-source, this is a fairly significant plot twist. After considerable backlash in the community, however, Bitwarden called it a "packaging bug" and eventually relicensed the SDK under GPLv3. Technically, the issue is resolved. Philosophically, however, this episode tells you all you need to know about where Bitwarden is heading: the freeware parts are bait, the actual product is the SaaS subscription, and the community is there to contribute issues and translations as long as it doesn't cost the company anything.

Setting aside the backend, however, the real culprit with regard to Bitwarden is the client applications. Advertised functions do not work as expected, basic features are non-existent (after ten years!), and the user interface is poor, to put it mildly, especially when compared to equally priced alternatives. And don't get me wrong: if Bitwarden were purely a FOSS effort and not funded by venture capital, all these flaws could be brushed aside because, after all, it would be a community effort. However, Bitwarden isn't a community effort, which is reflected very noticeably in the bureaucratic processes they have drowned the community in, but more on this in a moment.

About a year ago, I supported someone who tried to switch from a competitor to Bitwarden, on the thought of rather supporting open-source software with a yearly subscription than some proprietary platform that one has no insight into. Part of the migration was naturally importing existing vaults from the previous password manager into the new Bitwarden account. As can be seen in my bug report on GitHub, however, this went sideways very quickly, and resulted in at least one vault requiring significant technical workarounds for the import to work. The response from what sounded like an official Bitwarden employee left me frankly stunned. Despite the migration/import feature being advertised in multiple places throughout Bitwarden's marketing materials and documentation, and despite dozens of users having already complained about the exact same issue, Bitwarden simply decided to ignore the issue report and instead requested opening another likely dead-ended discussion in their community forum. This level of corporate bureaucracy is not at all what open-source software should look and feel like, and it is definitely completely unjustified for a feature that is advertised on both the open-source software and the paid product, but that simply does not work as advertised. Similarly, many other issues are funneled through this process of community discussions, which more often than not turn out to be not much more than lengthy threads of pointless back-and-forth, and almost never materialize in actual implementations.

Note: The same import was tested with proprietary alternatives to Bitwarden and worked flawlessly.

Migration pain is not limited to the initial import.
Even when you’re already inside Bitwarden and simply want to shuffle entries between an organization vault and your individual vault, or the other way around, there is, to this day, no proper “move the selected items to …” feature. For a handful of logins you can clone/edit each one manually, but anyone who has ever tried this with a few hundred items (say, after cleaning up a collection, leaving a company, or consolidating several organizations) knows that this quickly becomes a carpal-tunnel-inducing exercise. The official workaround that Bitwarden support and community threads recommend is to export the source vault as unencrypted JSON, edit the file, and then re-import it into the destination vault (a sketch of what that workaround amounts to follows below). Setting aside the obvious security footgun of having 500+ credentials sitting in plain text on disk, or worse, in a directory that’s silently synced to the cloud (think Dropbox, OneDrive, iCloud, …) while you figure out where to put them, the process happily loses a non-trivial amount of data along the way:

[…] if there are file attachments in any of your vault items, then these will not be included in the export […] the export will not include items in the Trash, or any password histories or timestamps.

For any organization that relies on attachments (e.g. SSH key files, licence keys, recovery codes as images) or on password history for compliance/audit reasons, this is plainly unacceptable. For a product whose entire job is to be the source of truth for your credentials, the complete absence of a “move these 500 items to that vault, keep everything intact, click OK” button in year ten of its existence speaks volumes about where Bitwarden’s engineering priorities lie.

Another example concerns client updates. It appears that Bitwarden pushes new updates to their clients that can lead to vaults becoming inaccessible (on the client side) at random, without any heads-up to the users. I personally encountered this issue while travelling. When I had my phone plugged in overnight, F-Droid decided it was a good time to update a few apps, one of which was Bitwarden. The next morning I had to log into my banking, and when I opened the Bitwarden app on my phone I was unable to access my vault. It took some time to figure out what was going on (via Vaultwarden), and I was lucky that I had my UPDC (which hosts my Bitwarden backend) with me, as otherwise I could have ended up in a pretty bad situation with my whole vault being unavailable. The sheer irresponsibility with which Bitwarden appears to push what look like breaking protocol changes between the clients and the backend is frightening. As someone who relies heavily on my password manager working in offline mode, this experience taught me that Bitwarden cannot be trusted. From that moment on, I disabled automatic updates for the Bitwarden clients and exported a current snapshot of all passwords to a local backup in KeePassChi / KeePassXC / KeePassDX. This is, by the way, not a Vaultwarden-specific issue, despite Bitwarden staff claiming so. Searches through the repository return a long list of very similar reports, for example around the 2025.12.x release introducing regressions that prompted users for the master password twice after login and then crashed the app, or the 2025.6.0 release that simply crashed on startup for many users. The Android app in particular went through a full rewrite from .NET MAUI to native Kotlin in 2024, which shipped alongside a trail of regressions that continue to show up in quarterly releases.
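To make concrete just how clunky that officially sanctioned “move” operation is, here is roughly what it amounts to. This is a hedged sketch, not a recipe: the file names are mine, and the field names (items, organizationId, collectionIds) are assumptions based on Bitwarden’s unencrypted JSON export format.

```javascript
// Sketch of the official "move items between vaults" workaround:
// export as unencrypted JSON, rewrite the ownership fields, re-import.
// Field names are assumptions based on the unencrypted export format.
const fs = require("fs");

// 500+ credentials, sitting in plain text on disk.
const vault = JSON.parse(fs.readFileSync("bitwarden_export.json", "utf8"));

for (const item of vault.items) {
  // Detach each item from the organization so it imports into an
  // individual vault instead.
  item.organizationId = null;
  item.collectionIds = null;
}

fs.writeFileSync("bitwarden_import.json", JSON.stringify(vault, null, 2));

// What this cannot recover, because the export never contained it:
// file attachments, trashed items, password histories, timestamps.
```

And of course both files still need to be shredded afterwards, assuming nothing silently synced them to the cloud in the meantime.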
Aside from the aforementioned technical details, Bitwarden is (and has always been) one of the subjectively worst applications on my phones and my desktop in terms of user interface. The UI/UX is in fact so horrible that even after years of use I still dread opening the ungoogled-chromium extension, let alone any of the desktop and mobile apps. Aside from the fact that building the Electron-based desktop app from source is a huge PITA and that the pre-built Flatpaks are not working properly on Wayland, one more general, major issue that I’m experiencing with the Bitwarden client applications (and extensions) is that while they clearly support offline use, they’re not intentionally built for it. Hence, whenever I open the mobile app or the browser extension, there’s a noticeable delay that sometimes takes literal seconds or even minutes, in which the client application seemingly tries to reach the backend, which often isn’t around (because I’m not hosting my Bitwarden backend on the open internet). While this sounds like a nitpick, it truly slows things down whenever one has to unlock Bitwarden (which is almost always, as I do not trust especially the browser extension to remain unlocked all the time). Sadly, there seems to be no way to turn off syncing when unlocking the vault to prevent the clients from waiting unnecessarily.

Another example of a bad user experience is the logins overview (titled Vault). Whenever I am on a website (in my desktop browser) and I would like Bitwarden to fill the login form, I tend to click the extension’s icon in the toolbar and then click the entry in the list. This is how all other password manager UIs that I have used in the past have worked; not Bitwarden, though. There, you need to click the small Fill button on the right side of the list item. If you click the big list item itself, which is highlighted on mouse-over, you simply open that item to show its details. Instead of allowing the user to click the big UI element (which is the whole list item), Bitwarden forces them to click a significantly smaller, harder-to-hit UI element (a button on top of a clickable list item). As with the syncing behaviour, there’s also no way to flip this around, so that clicking the list item would fill in the form, while clicking the tiny button would open the item’s details page.

I’m apparently not alone in this sentiment. A quick glance at recurring Hacker News threads on the topic reveals that users have been complaining about pretty much every single one of these issues, ranging from the desktop app not focusing correctly when opened, to “loading for over 5 minutes before showing my passwords”, to the browser extension asking to save passwords that are already there, to broken biometric login on iOS, laggy mobile apps, and, of course, the famous “Log-In suggestions not showing”. Feature requests that have been sitting in the community forum since 2021 (such as a simple edit history for entries) remain untouched, a pattern that MSP resellers have also called out publicly as “glacial feature development”.

Speaking about lists, the Bitwarden CLI has an equally bad user interface. For example, the tool’s list command will unexpectedly output every detail of every item, including passwords and TOTP codes, without requiring any extra flag to reveal the secrets.
There’s no way that reasonable engineers looked at this and said “Yep, that’s how we do things, because we cannot imagine a single situation in which anyone might mistakenly pipe that output to some place and unintentionally expose all their credentials”. Also, can we take a step back and talk about the fact that the Bitwarden CLI is a terminal tool built in TypeScript? Not only because it requires a metric ton of runtime and dependencies, but also because JavaScript isn’t exactly the stack anymore that you’d run carefree on your continuous integration environments. “Why?”, you ask? Hold my beer…

A password manager has, essentially, one job: keeping the user safe, by keeping their credentials safe. For a product that has been around since 2016, Bitwarden has accumulated a surprisingly long list of incidents in which it at least partially failed at exactly that task. And no, I’m not talking about theoretical vulnerabilities, I’m talking about things that actually shipped to production.

In January 2023, shortly after the LastPass breach had the entire industry questioning the real-world strength of cloud-hosted password vaults, security researcher Wladimir Palant published an analysis showing that Bitwarden’s advertised 200,001 PBKDF2 iterations were, in practice, closer to 100,000. The reason was that the additional server-side iterations were only applied to the master password hash used for login, but not to the encryption key protecting the vault data. An attacker with access to a leaked vault could therefore bypass the server entirely and was left with the same effective security as with LastPass. Additionally, the default client-side iteration count was still at 100,000, below OWASP recommendations at the time, and a concern that had been raised as far back as 2020. Bitwarden eventually raised the default to 600,000 and added Argon2 support, but (mirroring LastPass’ earlier mistakes) the change initially applied only to new accounts, leaving existing users responsible for manually updating their own KDF settings.

Still in 2023, RedTeam Pentesting disclosed “Bitwarden Heist” (CVE-2023-27706), a vulnerability in the Windows desktop client that allowed attackers with domain-administrator access to extract the vault decryption key from the local DPAPI storage without ever prompting Windows Hello or the master password. In the words of the researchers:

Any process running as the low-privileged user session can simply ask DPAPI for the credentials to unlock the vault, no questions asked.

The fix eventually shipped in version 2023.4.0, months after initial disclosure.

Also in 2023, CVE-2023-27974 was disclosed. This vulnerability concerned the Bitwarden browser extension, which happily offered to fill credentials into cross-domain iframes embedded on trusted pages, as long as the base domain matched. Meaning, if a trusted page embedded an iframe served from a subdomain controlled by a third party, credentials could be stolen. Bitwarden’s response was that iframes “must be handled this way for compatibility reasons”, and that “Auto-fill on page load” was not enabled by default. Small comfort if you did enable it.

Fast-forward to August 2025, when security researcher Marek Tóth publicly disclosed a class of DOM-based clickjacking attacks that could trick the Bitwarden browser extension into autofilling credit card details and personal information after a single click on a malicious page.
The vulnerability had been reported four months earlier, in April 2025, but was classified by Bitwarden as “moderate severity” and was not patched until version 2025.8.2, shipped on the very day the researcher’s embargo expired.

And then, a few days before I started writing this post, news broke that the official Bitwarden CLI client was compromised in the ongoing Checkmarx supply chain attack:

The malicious code was published in a file included in the package contents. The attack appears to have leveraged a compromised GitHub Action in Bitwarden’s CI/CD pipeline, consistent with the pattern seen across other affected repositories in this campaign. Organizations that installed the malicious Bitwarden npm package should treat this incident as a credential exposure and CI/CD compromise event.

The payload downloaded the Bun runtime, decrypted a second-stage Shai-Hulud worm, and started harvesting GitHub and npm tokens, SSH keys, shell history, AWS, GCP, Azure credentials, GitHub Actions secrets, and even MCP configuration files used by AI tooling. The data was then exfiltrated by auto-creating a public repository on the victim’s own GitHub account and uploading the stolen credentials there. Bitwarden’s npm distribution pipeline stayed compromised for approximately 19 hours, and 334 developers had enough time to pull the malicious package before it was caught. Bitwarden’s official statement emphasised that no end-user vault data was accessed, which is technically true and entirely beside the point. Everyone who ran the compromised CLI in a CI pipeline just handed the attackers whatever else happened to live on that machine. For a company whose one job is keeping secrets safe, distributing an actively malicious CLI through its official channels is not a great look. It also ties back nicely to the earlier rant about shipping a password manager CLI as a Node package. Had the CLI been a single statically-linked binary in Go or Rust (as most of the ecosystem has moved towards), the npm-shaped blast radius simply wouldn’t exist in that form. And while supply-chain attacks within the Go and Rust ecosystems are on the rise as well, the barriers to successful attacks are still higher.

Note: None of the above incidents is world-ending on its own. Every non-trivial piece of software will ship with bugs, and critical vulnerabilities happen to everyone. What bothers me is the pattern: the reactive (rather than proactive) security posture, the “working as intended” responses to embarrassing findings, the reliance on a Node.js toolchain for a security-critical CLI, and the fact that several of these issues had been quietly flagged by external researchers long before they were actually addressed.

As this post is not an ad-driven hit piece by any of Bitwarden’s competitors, you won’t be reading anything along the lines of “… switch to <insert SaaS product here> now and get 50% off your first year with promo code SWORDFISH”. Instead, I will describe the approach that I’m taking moving forward, which might be something that you, as an equally frustrated long-time Bitwarden user, might be interested in exploring as well.

Over the past years, I have come to the conclusion that there’s no single password manager that will work perfectly for every use case and setup. For example, in my personal life, I do not need the ability to share vaults or individual passwords with other people. In my professional life, however, that is a fairly common occurrence.
Similarly, the login credentials for bank accounts or insurance portals do not need to be available through a CLI tool, but they have to be available across multiple devices. Secrets for cloud storage or SSH private keys for deployments, however, don’t need to sync to any of my phones, but they do need to be accessible from a command-line tool that can be invoked programmatically. With these requirements in mind, it only makes sense to think of a way to better compartmentalize each set of credentials, rather than trying to find a single software or platform that can kill ten birds with one stone. Also, looking at it from a security perspective, it makes total sense to split these password groups across different software and services in order to minimize the impact that a data breach might have. Generally, the approach that I came up with splits my credentials into the following groups:

A: Credentials for professional/client projects (think platform logins, etc.)
B: Credentials for accounts containing PII (think bank accounts, online shops, etc.)
C: Credentials for accounts that do not contain PII (think accounts on internet forums, online platforms, etc.)
D: Credentials for infrastructure (think server logins, SSH keys)
E: One-off credentials (think API keys, tokens, etc.)

For group A I’m going with a SaaS password manager that offers proper vault sharing, integrates with the tools clients actually use (SSO, browser extensions on corporate machines, audit logs), and takes the hosting burden off my plate. The platform is proprietary, which I would normally not be thrilled about, but given that the scope of this group is client work only, I’m accepting the trade-off.

For group B, the rationale is a bit counter-intuitive at first. The accounts tied to these credentials already contain personal information like name, address, date of birth, maybe payment details, which is regularly leaked by the very same services anyway, as a quick look at Have I Been Pwned confirms. A breach of the password manager itself would therefore not meaningfully expand the attacker’s knowledge. With TOTP and Passkeys in place, it frankly doesn’t even matter anymore at this point. What does matter here is cross-device availability, reliability, and offline capability. I’m using a second, separate cloud-based password manager for this group, from a different vendor, with a different master password and different recovery mechanisms, so that a compromise of group A doesn’t automatically compromise group B and vice-versa. As I will be running their mobile app on at least one GrapheneOS device, I prefer a solution that doesn’t depend on Google Play Services and ideally offers an open-source/source-available client.

Group C covers all the accounts I have on internet forums, websites, privacy-respecting services, and anything that doesn’t hold PII. For these, I don’t need, nor do I want, a cloud service. I’m using KeePassChi / KeePassXC / KeePassDX with the database file sitting in a folder that is being synced across my devices via Syncthing, which is an approach I have already written about in the past. The file is itself encrypted, which means that even if Syncthing were compromised (and the attacker somehow got their hands on the file), they would still need to break the KeePassChi / KeePassXC encryption to get anything useful out of it. On mobile, KeePassDX on Android reads the same file without fuss.

For group D, I’m using a mixed approach: personal credentials are stored using the same approach taken in group C, and credentials that are actually used by scripts, CI jobs, and remote servers live in HashiCorp Vault, the same one I was already running for PKI in my OpenBSD setup. Vault is a bit of overkill for a single user, but it gives me proper access policies, token-based authentication for automated agents, short-lived credentials for things that support it, and audit logs. That said, I’m looking into Infisical.

For group E, the API keys, personal access tokens, and random secrets that I only ever use from the command line, I’ve settled on the venerable pass utility. It stores each secret as an individual GPG-encrypted file in a Git repository, which is conceptually simple, easy to audit, and cooperates perfectly with shell scripts and my dotfiles. The Git repository lives on my own infrastructure, not on GitHub, and it’s only synced manually when I actually need to access it from a different machine.

This might all sound like a lot of moving parts, and I understand if it looks like overkill for someone coming from a single-vault world. The reality, however, is that after years of using Bitwarden as a one-size-fits-all solution, I realised that one size fits all meant one size fits poorly. Splitting credentials across multiple tools turned out to be significantly less painful than I had initially assumed, mostly because each tool is individually well-suited to its specific task. And if any one of them gets breached, the blast radius is limited to one category of secrets, not the whole lot.

After several years of self-hosting Bitwarden, I’ve come to the conclusion that the product has drifted further and further away from what I originally signed up for. The enterprise-first architecture that barely fits on a Raspberry Pi, the half-hearted attempt at a “lighter” backend, the SDK licensing situation, the slow pace at which features are being addressed, the avoidable UX paper-cuts that haven’t been fixed in years, and finally the string of security issues that shouldn’t have shipped in the first place, all paint a picture that I find hard to reconcile with the “open-source password manager for everyone” narrative. I’m not suggesting that the alternatives are universally better or free of their own issues, because password managers are simply hard, and every player in this space has its fair share of skeletons. What I am suggesting is that you take a hard look at how much trust you are placing in a single piece of software for all of your credentials, and whether that bet is still the right one. For me, it no longer was.

Here are some other views on this topic:

Ask HN: Alternatives to Bitwarden?
Bitwarden CLI Compromised in Ongoing Checkmarx Supply Chain Campaign
Concerns Over Bitwarden Moving Away from Open Source

Phil Eaton Yesterday

Automating Hermitage to see how transactions differ in MySQL and MariaDB

This is an external post of mine. Click here if you are not redirected.

Julia Evans Yesterday

Testing Vue components in the browser

Hello! One of my long-term projects on here is figuring out how to write frontend Javascript without using Node or any other server JS runtime. One issue I run into a lot in my frontend JS projects is that I don’t know how to write tests for them. I’ve tried to use Playwright in the past, but it felt slow and unwieldy to be starting these new browser processes all the time, and it involved some Node code to orchestrate the tests. The result is that I just don’t test my frontend code, which doesn’t feel great. Usually I don’t update my projects much either so it doesn’t come up that much, but it would be nice to be able to make changes with more confidence! So a way to do frontend testing that I like has been on my wishlist for a long time.

Alex Chan wrote a great post a while back called Testing JavaScript without a (third-party) framework in response to one of my previous posts in this series that explained how to write a tiny unit-testing framework that runs in a page in the browser. I loved this post at the time, but it only talked about unit testing, and I wanted to write end-to-end integration tests for my Vue components, and I didn’t know how to do that. So when I was talking to Marco the other day and he said something like “you know, you can just run tests for your Vue components in the browser”, I thought “hey, I should try that again!!!”

I just did all of this yesterday so certainly there’s a lot to improve, but I wanted to write down a few things I noticed about the process before I forget. This was a bit tricky for me because the Vue site usually assumes that you’re using Node as part of your build process in some way (there’s a lot of “step 1: npm install …”), and I didn’t want to use Node/Deno/etc. But it turned out to not be too complicated. The project I’m going to talk about testing is this zine feedback site I wrote in 2023.

I used QUnit. It worked great but I don’t have anything interesting to say about how it works so I’ll leave it at that. I think that Alex’s “write your own test framework” approach would have worked too. I followed these directions. I did appreciate that QUnit has a “rerun test” button that will only rerun 1 test. Because there are so many network requests in my tests, having a way to run just 1 test makes it a lot less confusing to debug the test.

The first thing I needed to do was get my Vue components set up in the test environment. I changed my main app to put all my components into one object, and then I was able to write a function which does basically exactly the same thing my normal main app does (render a tiny template with the component I want to use). The only differences are:

- I can optionally pass some extra data for the component to use as its props.
- It mounts the component to a temporary invisible div which will get removed from the DOM after the test is done. The div is positioned off the page so you can’t see it.

The result is a div where I can programmatically click, fill in form data, check that the right content appears, etc.

Because I was writing end-to-end integration tests to make sure my client JS worked properly with my server, I needed to have some test data in my database. So I wrote ~25 lines of SQL to set up some test data in my database, and added an endpoint to my dev server to run the SQL to reset the test data to a known state. Then I just call that at the beginning of any test that needs the test data. My reset function actually doesn’t always totally reset everything, which is kind of bad, but it was workable to start with and can always be improved.

Basically, a test renders the component into the div and makes sure it contains some approximately correct data. Those are all the basic pieces!
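Roughly, the pieces fit together like the sketch below. The names here are illustrative, not the real ones: mountComponent, resetTestData, the /reset-test-data endpoint, and the FeedbackForm component are all placeholders.

```javascript
// A sketch of the setup (helper, endpoint, and component names are illustrative).
import { createApp } from "./vue.esm-browser.js";
import { components } from "./components.js"; // all my components in one object

// Render a component the same way the main app does, into a throwaway div.
function mountComponent(name, props = {}) {
  const div = document.createElement("div");
  div.style.position = "absolute";
  div.style.left = "-10000px"; // off the page, so you can't see it
  document.body.appendChild(div);
  const app = createApp(components[name], props);
  app.mount(div);
  return {
    div,
    cleanup() {
      app.unmount();
      div.remove(); // remove the temporary div after the test is done
    },
  };
}

// Ask the dev server to reset the test data to a known state.
async function resetTestData() {
  await fetch("/reset-test-data", { method: "POST" });
}

QUnit.test("renders the feedback form", async (assert) => {
  await resetTestData();
  const { div, cleanup } = mountComponent("FeedbackForm", { zine: "example-zine" });
  assert.ok(div.querySelector("textarea"), "the form has a textarea");
  cleanup();
});
```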
Now here are a few issues I ran into along the way.

I have a lot of network requests in my tests, and it takes time for them to finish and for the Vue code to do what it has to do with the results and update the DOM. I think we all learned a long time ago that putting random sleeps in your tests and hoping that the timings are right is slow and flaky and extremely frustrating, so I needed a different way. As far as I can tell, the normal way to deal with this is to figure out a way to tell from the DOM whether it’s okay to proceed or not. Like “if this button is visible, we can click it”. So I wrote a little function that polls every 20ms to see if a condition has become true yet, and times out after 2 seconds (there’s a sketch of it below). It looks like there are a lot of implementations of this concept out there and they’re all better thought-through than mine (from a quick Google: qunit-wait-for, playwright expect.poll).

In some cases I thought I’d identified the right thing to wait for in the DOM (“just wait for this textarea to appear!”) but it turned out that because of some internal details of how my program works, actually I needed to wait for something else later on which was hard to pin down. I ended up changing one of my components to add some random value to the DOM when it finished an important action, which didn’t feel great. My best guess is that the right way to fix this kind of test issue is a refactor that also makes the app more reliable for the users: if there’s an element in the DOM that isn’t actually ready for the user to interact with, maybe I shouldn’t be displaying it yet!

I ended up adding a few classes to HTML elements that I needed to find in the tests, either because I needed to click on them or wait for them to appear in the DOM. I might want to change this approach later - frontend testing frameworks seem to suggest avoiding using CSS classes and instead using something like getByRole or, as a last resort, something like a data-testid. Feels like there’s a way to make the app more accessible and easier to test at the same time.

To fill out a form, I can’t just set the element’s value, I also need to dispatch an event to tell Vue that the element has changed. For example, text inputs and select elements need different kinds of events. This is kind of annoying, and it made me realize why I might want to use some kind of UI testing library.
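Here is roughly what those two helpers look like. This is a sketch: the names waitFor and fill are mine.

```javascript
// Poll every 20ms until `condition` returns something truthy;
// give up after 2 seconds.
function waitFor(condition) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const timer = setInterval(() => {
      if (condition()) {
        clearInterval(timer);
        resolve();
      } else if (Date.now() - start > 2000) {
        clearInterval(timer);
        reject(new Error("waitFor: timed out after 2 seconds"));
      }
    }, 20);
  });
}

// Set a form control's value the way Vue expects: set it, then dispatch
// the event the control actually listens for ("input" for text fields
// and textareas, "change" for selects and checkboxes).
function fill(el, value) {
  el.value = value;
  const eventName = el.tagName === "SELECT" ? "change" : "input";
  el.dispatchEvent(new Event(eventName, { bubbles: true }));
}

// Used together in a test:
// await waitFor(() => div.querySelector("textarea"));
// fill(div.querySelector("textarea"), "this zine was great!");
```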
I wanted to have an idea of what my test coverage was, and it turns out that Chrome actually has a built-in code coverage feature for JS and CSS! My JS is bundled into a single file with esbuild, so I could just look at that file and see which lines weren’t covered. The process was a little finicky: I had to turn off sourcemaps in the Chrome devtools to get this to work, and there’s a specific, not super obvious series of actions I have to do in order to see the coverage data.

As usual with these posts, I’ve never really worked as a frontend or backend developer (other than for myself!) and I feel like I’m constantly learning how to do super basic tasks. I really had a blast doing this. My frontend projects always feel so fragile because they’re untested, and maybe one day I’ll have a test suite I’m confident in!

Some things I’m still thinking about:

- While writing this post I found a frontend testing library called Testing Library that has a lot of guidelines for how to write tests, which are very different from my initial ideas. Their example of filling out a form looks extremely different from what I’m doing, and they distribute a file that works without Node. I experimented with rewriting everything to use Testing Library and it felt pretty good, so we’ll see how that goes.
- Vue Test Utils: their section on form handling looks like it simplifies this a lot.
- I’m not sure how I feel about not having a way to run these tests on the command line at all. Maybe there’s a simple way to work primarily in the browser but have a way to run them in CI too if I want?
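For comparison, the Testing Library style looks something like the sketch below, reusing the mountComponent and waitFor helpers from earlier. This is hedged too: it assumes @testing-library/dom’s standalone browser bundle (which exposes a TestingLibraryDom global), and the labels, button names, and expected text are made up.

```javascript
// Testing Library finds elements the way a user would: by label and
// role rather than by CSS class. (Sketch; component, label, and button
// names are made up.)
const { getByLabelText, getByRole, fireEvent } = TestingLibraryDom;

QUnit.test("submits feedback", async (assert) => {
  const { div, cleanup } = mountComponent("FeedbackForm", { zine: "example-zine" });
  // Fill in the form and submit it the way a user would
  fireEvent.input(getByLabelText(div, "Your feedback"), {
    target: { value: "this zine was great!" },
  });
  fireEvent.click(getByRole(div, "button", { name: "Send" }));
  // Wait for the success message instead of sleeping
  await waitFor(() => div.textContent.includes("Thanks!"));
  assert.ok(true, "feedback was submitted");
  cleanup();
});
```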

iDiallo Yesterday

Disable Auto-Update

How is it possible that a feature I use every day, in an app I rely on daily, entirely offline, just disappeared from my phone?

I use a fitness app. My metrics, such as steps, workout routines, and heart rate, are collected from a wearable device like a smartwatch and sent to the app via Bluetooth. No third-party servers are involved in that transaction. The data lives on the phone. It costs the developer nothing to maintain, because there's nothing to maintain on their end. Then the app updates, just once, and that data is no longer accessible. Not because it was deleted or corrupted. Because the developer decided you now need to create an account on their servers to access information that already exists on your own device.

That's why I have auto-update disabled on every device I own. Some of the apps on my phone are older than my children. You couldn't download them today even if you wanted to. The developer no longer offers that version. One of my apps is a single screen that displays information based on GPS data and compass orientation. I downloaded it in 2014. I've switched phones twice since then, and each time I've made sure to carry that app with me. I didn't keep it out of nostalgia, or because I have a hard time letting go. I kept it because the current version has three ads crammed onto that single screen. A full-screen ad hijacks the display at random intervals, complete with one of those countdown timers that slows down as it approaches zero. And of course, there are notifications now. None of that is for my benefit. I just need that one screen. Open the app, read the information, put it away.

You might say I'm being cheap. That if I've used the app for over a decade, I clearly value it. So I should pay for the subscription and lose the ads. Fair point. But I have the old version. It was free, had no ads, and worked flawlessly. No future version can improve on that. On top of it, those ads expose me. Advertising is one of the most common attack vectors in mobile security. Malvertising is a real thing. Updating to the ad-supported version wouldn't make my phone more secure.

I don't update apps unless I've read about a specific vulnerability. Even then, I'll often delete the app rather than update it. I can't accept software that changes arbitrarily, especially when those changes almost never benefit me and almost always serve someone's bottom line. As a developer myself, I have the advantage of actually reading changelogs. When an update says "bug fixes," that's not a reason for me to act, unless I've encountered those bugs personally. Every user engages with a different 20% of an app's features. Someone else's bugs may never be mine.

And why do developers push account creation so aggressively? Because your account is the product. An account means data. Data means third-party revenue.

Every update is a decision point for me. It requires me to set aside time, read about the changes, and think about what I'm about to get into. My workflow matters. My data matters. My time matters. If a developer breaks what worked for me without a compelling reason, I'll find another app that respects those things. There's always one out there, probably one that hasn't been "improved" yet.


Photo Journal - Day 3

Life has been busy and I missed the past 2 days, but thankfully I remembered to bring the camera with me today! I snuck out in the brief calm between rain storms; I don't particularly want to test how waterproof my camera is.

↑ This is the side of the building I'm coworking in today.

↑ Sometimes I really wish I had a macro lens!

↑ I love how this one turned out.

Stratechery Yesterday

2026.18: Long-term, Peripheral & Myopic Visions

Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we’re sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Additionally, you have complete control over what we send to you. If you don’t want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings. On that note, here were a few of our favorites this week. This week’s Stratechery video is on Tim Cook’s Impeccable Timing.

Amazon and AI. When it comes to AI, every quarter seems to bring a new winner and loser. For my part, the company that I find increasingly compelling is Amazon. Things didn’t look promising a couple of years ago, when training was the most important infrastructure use case, but Amazon — whether through vision or good fortune — was positioning itself well for a world defined by inference (given that their inference chip is called “Trainium”, I’m going with a little bit of column A and a little bit of column B). Now the company is adding OpenAI’s models to its offerings, and collaborating with the frontier lab on an entirely new kind of enterprise product: Bedrock Managed Agents, the subject of a Stratechery Interview with AWS CEO Matt Garman and OpenAI CEO Sam Altman. — Ben Thompson

The Future of AR Devices. Amidst a never-ending conversation about AI, software and infrastructure spending, it was refreshing this week to dream about the possibilities for the future of hardware. Ben’s Daily Update on Monday traced his experience with the Meta Display glasses and culminated with an epiphany on what the future of AR should look like. We dove deeper on Sharp Tech with an extended conversation about why the Display glasses are superior to Meta’s Orion prototype, notes on what future VR headsets should emphasize, and whether phones (or books?) should be characterized as AR devices. — Andrew Sharp

Beijing’s Myopia in AI and Elsewhere. On Sharp China this week Bill and I unpacked the implications of a terrific mess in Singapore, as China’s National Development and Reform Commission has moved to block Meta’s $2 billion acquisition of Manus, a formerly Chinese AI company that had reincorporated in Singapore and had already received payment and integrated its products and employees into Meta’s operations. Then, on Sharp Text this morning, I wrote about Beijing’s geopolitical behavior in 2026, what Western media tends to get wrong, and — with the Manus decision being a good example — why the CCP’s geopolitical and domestic strategies are generally reactive, not proactive, and often counterproductive. — AS

AI Hardware, Meta Display, Redefining VR and AR — I finally tried the Meta Ray-Ban Display, and it completely changed how I think about AR and VR.

An Interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman About Bedrock Managed Agents — An interview with OpenAI CEO Sam Altman and AWS CEO Matt Garman about their new partnership, plus my thoughts on OpenAI and Microsoft’s new deal.

Intel Earnings, Intel’s Differentiation?, Whither Terafab — Intel’s earnings were very impressive, but the chief driver was a structural shift in demand for CPUs for AI. Plus, what is going on with Terafab?

Amazon Earnings, Trainium and Commodity Markets, Additional Amazon Notes — Amazon’s earnings suggest that the shift away from training towards inference and agents means their bet on Trainium is paying off. Plus, additional notes on ads, agents, and sports rights.
Beijing Is Not Playing the Long Game — Every single week, someone in the Western media will tell you that China is playing “the long game.” Don’t believe them.

Meta Ray-Ban Display
OpenAI, Musk & Microsoft
Fanuc and the Numerical Control Revolution
Beijing Kills Meta’s Manus Deal; April Politburo Takeaways; Foreign Forces Afflicting the Youth; US Countermeasures Mounting
NAW and CJ and CA CAWWWWW, DEFCON 2 for Jokic and the Nuggets, Notes on OKC, Toronto, and VJ Edgecombe
Playoff Stock Watch: Scottie Barnes Awareness, Pistons Repricing, Jokic Market Corrections, and Lots More
AWS History and Trainium’s AI Future, OpenAI Makes a Deal With Microsoft, Meta and the Future of Wearable Devices

Unsung Yesterday

“Examining the changelog in its entirety would be a massive task, given that it was now over 200,000 words long.”

I had some idea that many popular games have mods to tweak them – from small appearance changes and fan-made translations, to bigger gameplay or UI changes (and even an occasional trojan horse). What I didn’t know was that for some games there is a whole community of modders who do one thing and one thing only: they fix bugs that the developer didn’t bother fixing. This 1.5-hour (sic!) video by Fredrik Knudsen tells the story of such a community for the popular game Elder Scrolls V: Skyrim.

I won’t lie: this video was a bit of a frustrating watch. The presentation is dry and takes its time. I was annoyed at Bethesda for not fixing the bugs to begin with and creating the whole mess. Also, some of the people in this story do not appear very mature, and post-Gamergate I have little patience for that kind of behaviour. On the other hand, this covers so, so many interesting things and provoked so many thoughts:

how hard it is to agree what a bug even is,
how a bug fix can introduce more bugs and be an overall net negative,
how a new distribution method for something can drastically change its nature,
that everything, as always, boils down to communication,
that in community- and volunteer-led projects, not spending time on governance will come back and bite you.

Not to mention these topics:

dependencies
change management
centralization vs. federation
copyright and DMCA
version control
volunteer burnout
issues of trust and ego and power

If you are responsible for bug-fixing processes at a company or with a community, I am curious if you find this video valuable. I did. The funniest moment was that the drama/debacle about a certain in-game portal was nicknamed… Gategate. Not to mention the ending is truly poetic, and not something I expected.

#bugs #games #process #software evolution #youtube


re:My Fear of Flying

This is in reply to Kev writing about his fears of flying.

The first time I flew was also the first time I left the country. In September of 2012, my mother dropped me off at Columbus International Airport for a morning flight destined for Narita International Airport. I was fucking terrified. I was so deep inside my own head with fears of flying that I nearly missed my flight. The loudspeakers called out my name for a final boarding call... I was sitting right in front of the gate, completely oblivious to the fact that the whole plane had boarded. Once I was in the air, my fears started to ease as the excitement of experiencing air travel took over. It also helped that I had my first (and second) legal beer (after the stewardess confirmed we were safely over Canada).

I flew semi-frequently after that: yearly trips to Mexico, visiting family (and getting married) in China, etc. Flying became normal, and my fears were mostly gone. But after my son was born nearly 4 years ago, we stopped traveling.

Last year, my wife and I were lucky enough to visit family in Australia for 2 weeks. That flight was terrifying for me. I'm not sure what changed, but I could not stop thinking of how high and vulnerable one is when flying. I calmed my nerves a bit with bad in-flight movies, but was still extremely relieved when we finally landed. During our 2 weeks in Australia, the D.C. AA 5342 disaster occurred, on top of reduced and overworked ATC staffing due to "government efficiency". The flight home was extremely terrifying; my hands were completely covered in sweat as we finally landed.

I haven't flown since, admittedly less due to fear and more due to a second kid now keeping us even busier. I did opt out of attending a conference that would require air travel, though. I'm sure I'll have to fly again within the next year or so, potentially to China. I'm curious where my comfort level will sit; if I had to guess, I would say somewhere between calm and terrified.

Kev Quirk Yesterday

Thoughts on Leaving GitHub

I've read a few posts about people leaving GitHub recently, and following my short note to the Fediverse, a number of people have piped up saying they're not fans of GitHub, either. From the reading I've done, these frustrations are usually threefold:
Microsoft ownership
Microsoft training Copilot on open source software
Large amounts of downtime

In all honesty, none of the factors above really bother me that much. I think that's because I don't rely on GitHub for anything significant. I'm not a professional software developer, so my livelihood doesn't depend on it. As for Copilot being trained on open source software, and them repeatedly ignoring the GPL to do so, it does irk me, but I kind of expect shit like this from Microsoft at this point. I went into using GitHub assuming that any code I upload there can (and probably will) be used for shitty stuff. But even that isn't enough in isolation to put me off GitHub. The way I see it, public code is for the public, and while Microsoft using my code in that way isn't ideal, it doesn't piss me off that much.

So why think about moving at all? Well, for me it's about reliance on big tech. I'm trying to reduce it where possible, but the social and "centre of mass" aspects of GitHub are giving me pause. For example, the Simple.css repo has a whopping 5,000 stars! Do I really want to lose that visibility? Buuuuuuuuuut, I can always redirect any popular repos to another platform, just like I did with 512KB Club when I handed that to Brad. Plus, let's be honest, it's all just popularity bullshit. It doesn't really mean anything. What's important is that the code is readily available for people to use. It's like leaving Facebook - when I was thinking about it, I was worried I'd miss my friends or be out of the loop. It's been over a decade at this point and I don't miss it one bit - no regrets whatsoever. I think moving off of GitHub would be the same.

I plan to slowly start migrating public repositories over to Codeberg so that all my projects are hosted there. I'll also use it as an opportunity to archive off any old repos that I no longer need. Codeberg also supports logging in with GitHub and Gitea, so anyone who contributes to my projects on GitHub should be able to do so easily on Codeberg too. Then, for my private repos (of which there are many that host personal projects) I've installed Synology's Git server on my Synology, and have been playing with that for a few days. It works extremely well, so all my private repos will live there, safe and sound, away from Microsoft's greasy mitts.

Ultimately it's personal choice. For me it's about reducing my reliance on big tech, but also making my private repos more private. I won't be deleting my GitHub account though, as I think it will be important to use as a marker for anyone who wants to find my source code when it moves.

Have you thought about leaving GitHub?

Thanks for reading this post via RSS. RSS is ace, and so are you. ❤️ You can reply to this post by email, or leave a comment.
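For anyone planning a similar migration, the mechanical part is small: a mirror clone carries every branch and tag across in two git commands per repository. Below is a minimal sketch in Python; the username and repo names are hypothetical placeholders (not Kev's actual accounts), and it assumes the empty destination repos have already been created on Codeberg, since a push won't create them for you.

```python
# Minimal sketch: mirror public repos from GitHub to Codeberg.
# USER and REPOS are hypothetical placeholders, not real accounts.
import subprocess

USER = "example"
REPOS = ["some-css-framework", "some-other-project"]

for name in REPOS:
    src = f"https://github.com/{USER}/{name}.git"
    dst = f"https://codeberg.org/{USER}/{name}.git"
    # --mirror clones all refs (branches, tags, notes), not just the default branch
    subprocess.run(["git", "clone", "--mirror", src, f"{name}.git"], check=True)
    # push --mirror replicates every ref to the new remote in one shot
    subprocess.run(["git", "-C", f"{name}.git", "push", "--mirror", dst], check=True)
```

The same two commands also work against a self-hosted Git server like the Synology one mentioned above, which is what makes this kind of move reasonably painless.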

Unsung Yesterday

CleanShot’s onboarding via settings

I recently installed the screenshotting utility CleanShot , and I was enamored with its settings. There’s much to like here – thoughtful grouping and layout, good explanations, more details than expected.

There are some nice interaction moments, for example the hints swapping to reflect the current status. Or the fact that the tool allows you to override its single-key shortcuts, which are the hardest to change using third-party keyboard customization apps. Or, when you want to customize the key visualization, Settings shows a nice preview. There was even a lil molly guard .

But also, just the settings themselves gave me a sort of competence contact high. A few clicks in, and I thought “oh, they do know what they’re talking about.” So many things here were for me, to solve specific problems I encountered. It all gave me confidence this is the right tool for the job. (Also, perhaps a corollary: has there ever been a bad tool with well-designed settings?)

Compare with the also-new-to-me settings from Affinity, which I was much less impressed with. It uses the troubled right-aligned style originating in iOS, the capitalization is clumsy, and the navigation is muddy (it feels like in-page links on the web, which are always confusing).

Is this a fair comparison? Not at all. I don’t actually want to say that CleanShot is better and Affinity is worse. This is so very much east coast apples and west coast oranges. I don’t even want to say settings are always worth designing well in the traditional sense; sometimes the only thing between you and 20 unnecessary options in your app is simply having no surface that could host them. A limited (but never unpleasant!) settings UI might be an intentional design decision.

But there was a nice quote in the Shadow of the Colossus book : “I often find myself exploring simply because it’s beautiful.” I too became a tourist in all of CleanShot’s settings because they were put together so well, and I was so curious what’s behind the next corner. Its creators understood that the best way to get to know what the tool is capable of is to take a peek through the settings.

I think it’s a good case study in how a proper welcome mat doesn’t always have to be a few onboarding tooltips flying spastically around the screen. Sometimes it won’t look like a welcome mat at all.

#above and beyond #onboarding #writing
