Model-Harness-Fit
Why mixing a frontier model with a foreign harness quietly tanks performance, and what the open source code tells us about why.

I keep three coding agents alive on the same workstation. Claude Code in one terminal. Codex CLI in another. GitHub Copilot CLI in a third. Same files. Same git tree. Same bash. Three different harnesses that look indistinguishable.

A few weeks ago I ran the same prompt through all three, and the behavior was visibly different in ways that went well past the surface differences of style and speed I had expected to see across vendors. The Codex run cited a memory entry I had taught it months ago, applied the rule, and kept going without asking. The Claude Code run flagged the same context but refused to assert it without first verifying that the file path was still valid. The Copilot CLI run produced a longer, more cautious plan and asked me to approve it before taking any side effect on disk.

The hand wave answer is that "models behave differently because they are different models." But Copilot CLI was running Claude Opus, the same family that Claude Code runs by default. Same model family, same prompt, two harnesses, materially different output. The hand wave does not cover it.

Models are post trained against the harness, not just the API. The tool names they expect, the input schemas they emit, the citation tags they wrap around remembered facts, the file structure of skills they invoke, the planning protocol they follow when the harness says "make a plan first": none of these are generic capabilities of the model. They are byte level conventions baked into the post training of one specific model against one specific harness. Pull the model out of its harness and you give up performance you cannot get back without rewriting either side.

This has a direct consequence that anyone who has tried to ship a "model agnostic" agent has run into. You cannot just swap a model. Supporting BYOK and multi model (which is the responsible posture, since relying on a single provider is risky) adds real engineering complexity, and that complexity is worth paying. To swap a model cleanly, you have to swap the harness with it: the tool surface, the schema shapes, the skill bodies that name those tools, the citation contract, the memory ritual, the system prompt structure, sometimes the planning protocol. Everything above the model has to move when the model moves.

That is why every agent vendor that supports multiple providers ends up either (a) running a degraded variant of every model they support, or (b) maintaining a separate full stack per model and exposing the choice to the user as "you are picking a product, not just a model." Option (b) is the path that wins on quality, and it is worth the engineering cost to avoid being locked into one lab.

Swapping orchestrators is not a cosmetic change. It is a model swap in disguise. The frontier lab spent the last year shaping the model's instincts to a particular tool surface, a particular memory ritual, a particular skill format. When you mix and match, you spend that work. I think this is the single most underrated constraint in agent design today, and it has a clean name. Call it model harness fit.
I dug into three open implementations that ship today: Codex CLI (OpenAI, fully open source, a Rust workspace of ~80 crates), Claude Code (Anthropic, closed binary, but a Rust port tracks upstream behavior closely enough to read, at ~48,600 LOC across 9 crates, and Claude Code's own runtime injects observable blocks on every turn that confirm or contradict claims from the port), and GitHub Copilot CLI, where the SDK is fully open source and MIT licensed, with five language bindings (Node.js TypeScript at 5208 LOC across 8 files, plus Python, Go, .NET, Java), and the JSON RPC wire protocol is documented (currently version 3). The CLI binary that the SDK spawns as the agent runtime server is closed, but the client wrapper, the protocol, the session lifecycle, the system prompt section overrides, and every RPC method are all open source and readable. Here is what I will cover: the full outline is collected under "The Evidence" at the end of this article.

Companion piece: I covered the memory layer in detail in Agent Memory Engineering. This article is about everything else, with memory revisited only where it intersects orchestration. If you want the bottom up tour of how MEMORY.md indexes, system reminder injection, age in days warnings, and signal gates work, read that one first.

Before any argument about architecture, look at the leaderboard. Terminal-Bench 2.0 evaluates agents on bash heavy multi step tasks, and it ranks by harness plus model pair, not by model alone. From the leaderboard snapshot on April 30, 2026, two things jump out.

First, Claude Opus 4.6 paired with ForgeCode hits 79.8%, while the same model paired with Capy hits 75.3%. Same weights, different harness, and a 4.5 percentage point spread between them on a benchmark where every entry is fighting for a tenth of a point. Second, the upper rankings are not dominated by the labs that trained the models. ForgeCode is a third party harness that lands three of the top six entries by routing across model families. Stanford's IRIS Lab paired Opus 4.6 with an automated harness evolution system called Meta-Harness and pushed the same model to 76.4% on the same benchmark, well past the best baseline they started from. The harness is moving the score by more than the model upgrades are moving it.

Cursor's research team makes the point even sharper. In their April 30 post on harness engineering, they note that they took their own coding agent from "Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness." Same model. Same benchmark. Different scaffolding. A 25-position jump on a public leaderboard, attributable to the harness alone. That is not a tuning artifact. That is the entire ranking.

LangChain's Vivek Trivedy puts the same observation in one sentence: "Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses." Anthropic's flagship model in Anthropic's flagship harness loses to the same weights in third party scaffolding. If you only saw the model name on the spec sheet, you would not predict that.

This is the empirical case for model harness fit. Hold the model fixed and swap the harness, and the pass rate moves by enough to outweigh a model generation upgrade. Anyone shipping a coding agent in 2026 who picks the model first and the harness second is leaving most of the performance on the floor. The rest of this article is about why. What exactly does the harness do that lets two implementations of the same model produce different scores? Each harness picks a different orchestration protocol. The model was trained on that protocol's exact wire format.
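To make "exact wire format" concrete, here is a minimal sketch of the two tool call shapes the labs' public APIs expose today. The tool names and values are illustrative, and neither snippet is lifted from any of the three harnesses.

```typescript
// Two assistant-turn wire shapes for the same action ("run ls -la").
// OpenAI-style chat completions: tool calls in a `tool_calls` array,
// with arguments serialized as a JSON string.
const openaiStyleTurn = {
  role: "assistant",
  tool_calls: [
    {
      id: "call_1",
      type: "function",
      function: { name: "shell", arguments: '{"command":["bash","-lc","ls -la"]}' },
    },
  ],
};

// Anthropic-style Messages API: a `tool_use` content block,
// with arguments as a structured `input` object.
const anthropicStyleTurn = {
  role: "assistant",
  content: [
    { type: "text", text: "Listing the directory first." },
    { type: "tool_use", id: "toolu_1", name: "Bash", input: { command: "ls -la" } },
  ],
};

// A harness built around one shape parses, caches, and routes on these exact
// fields. A model post trained to emit the other shape is already out of
// distribution before the first tool ever runs.
console.log(openaiStyleTurn, anthropicStyleTurn);
```

Both say "run a command," but nothing below that semantic level is shared.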
These are not three implementations of the same idea. They are three different contracts between model and runtime.

Codex is a typed asynchronous protocol. The model emits a submission and gets back a stream of typed messages. The protocol is defined with explicit enums. There is a second protocol layered on top: 10,721 lines of JSON RPC for cross process clients (IDE plugin, desktop app), where v1 (245 lines) is frozen and all new RPCs go to v2. Methods are named with singular resource names, camelCase wire format. The two protocols stack: agent layer for in process, JSON RPC layer for cross process. The model was trained to emit submissions and consume events.

Claude Code is a direct typed conversation loop. The runtime's loop consumes one typed message per turn, with a handful of explicit variants. There is no separate submission queue. The protocol is the Anthropic Messages API plus a tight in process tool dispatcher. The model was trained to emit tool calls inside an assistant message and respond to tool results in the next turn.

GitHub Copilot CLI is a supervisor protocol. The host app does not run the agent loop. It spawns the bundled binary as a subprocess, opens a channel over stdio, and sends the full configuration: model, system message, tools, MCP servers, custom agents, skill directories, hook flags. The agent loop runs inside the child process. The host gets notifications back. The model was trained to run inside this supervisor and emit JSON RPC events that the supervisor can route.

You can see the architectural commitment harden in each design. Codex's contributor guidance literally polices crate growth: "Resist adding code to . The largest crate is explicitly off limits for new features." A 500 line soft cap, 800 line hard cap per Rust module. New features pay rent in the form of a new crate. This is a compiler toolchain attitude applied to an agent harness, and the model was trained to operate inside it. Claude Code's port enforces a different rule: "one agent loop, not a fan out of specialized agents," which is why subagents in Claude Code start with a fresh context and cannot recurse. Copilot CLI's supervisor model is what lets a single binary serve three surfaces (terminal, cloud agent, third party hosts). Each surface gets the same model behavior because the model is always running inside the same supervisor.

Now imagine you swap models. Take a model trained to emit submissions and feed it Claude Code's message stream. The model has been taught one wire shape. The harness expects another. The mismatch shows up not as an outright failure but as a quiet degradation: missed tool calls, wrong reasoning effort levels, inconsistent compaction triggers, citation tags that the harness never parses. The wire format is part of the model.

This is where post training is most visible. Every harness has a tool registry. The first half dozen names look similar across all three. But once you go past those, the surfaces diverge in ways that the model has been taught to exploit. Codex's registry exposes one particular vocabulary. Claude Code's port enumerates 40 tool specs. Copilot CLI bundles a different default set, drawn from the public changelog. A model trained on Codex's eight verb subagent surface knows how to send a message to a running subagent. A model trained on Claude Code's single subagent dispatch tool does not have that verb in its instinct set. The harness can paper over this with a router, but the router cannot give the model an instinct it does not have.
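The divergence is easiest to see in the edit tools. A minimal sketch of the two input shapes, with field names invented rather than copied from either harness's real schema:

```typescript
// Two input shapes for "apply this edit." Both are sketches, not real schemas.

// Patch-style edit: one argument carrying a diff-like envelope for one or
// more files, the format the article says OpenAI models are trained to emit.
interface PatchEditInput {
  patch: string; // e.g. "*** Begin Patch\n*** Update File: src/app.ts\n...\n*** End Patch"
}

// String-replacement edit: a target file plus exact old/new strings that must
// match verbatim, the format the article says Anthropic models are trained on.
interface StringReplaceEditInput {
  file_path: string;
  old_string: string;
  new_string: string;
}

// Either model can be forced through either shape. The cost is the point:
// the unfamiliar shape burns reasoning tokens on format compliance and raises
// the edit failure rate, so a routing harness serves each model its native shape.
const edits: [PatchEditInput, StringReplaceEditInput] = [
  { patch: "*** Begin Patch\n*** Update File: src/app.ts\n*** End Patch" },
  { file_path: "src/app.ts", old_string: "const port = 3000", new_string: "const port = 8080" },
];
console.log(edits.length);
```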
Cursor's harness team puts the underlying mechanic plainly. From their April 30 research post: "OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes. So in our harness, we provision each model with the tool format it had during training." This is the single cleanest description of model harness fit I have seen from any vendor. It is not a hand wave about model preferences but a specific, measurable cost in reasoning tokens paired with an observable increase in error rate, recorded at scale across millions of agent turns in production.

This is where model harness fit shows up most visibly. The tool surface is the model's vocabulary for the world. Cross train on a different vocabulary and you lose precision in every interaction.

Skills look interchangeable on the surface. All three harnesses use a SKILL.md file with YAML frontmatter (a couple of header fields plus optional metadata). Codex even baked in cross compat: its loader parses Claude style markdown skills. Copilot CLI explicitly reads the Claude style config. The format is so similar that the same body would parse in all three. But skills are not just markdown. A skill carries an implicit contract about which tools it expects to call. That contract is not in the frontmatter. It is embedded in the body, in the form of imperative instructions that name specific tools by name, with specific argument shapes, and with specific verbs the model must emit.

Look at what each harness ships as a system skill. Codex's bootstrap skills, baked in and extracted on first launch, are five. The body invokes helper scripts directly. It assumes the model can call the shell to run a Python script. It assumes the model knows that scripts in a well known subfolder of a skill folder are invokable. It assumes a sparse checkout fallback for private repos. None of that is in the frontmatter. All of it is in the body.

Claude Code's skills are different. The plugin ships a set of workflow skills, plus many more. The bodies invoke Claude's specific tools: one to bootstrap into a workflow, one to track steps, one to dispatch parallel subagents, a pair for file changes, a pair for search. The skills also encode hard process rules: "Use this BEFORE any creative work," "Use when about to claim work is complete." These rules anchor on the harness's injection model, which Codex does not have in the same form.

Copilot CLI's skills are part of the plugin marketplace ecosystem, and the changelog reveals a different posture. v1.0.5 added "Embedding based dynamic retrieval of MCP and skill instructions per turn" as experimental. The model was trained to consume skill instructions delivered as a per turn injection chosen by an embedding ranker, rather than as a description match. A skill body that assumes "you will see all skills in the system reminder" does not behave the same way when the harness ranks skills via embedding and only injects the top three.

This is why "we both use SKILL.md" is misleading. The format is identical; the contract underneath is not. Skills carry tool specs implicitly, and the implicit specs are pinned to the harness that authored them. The same applies to plugin manifests. Copilot CLI's v1.0.22 explicitly added: "Plugins using or manifest directories now load their MCP and LSP servers correctly." That is GitHub treating Claude Code's plugin format as a substrate to interoperate with at the file level. But the skills inside those plugins still bring assumptions about Claude Code's tool surface.
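Here is a small sketch of that contract problem. The frontmatter parses anywhere; the body is pinned to one harness's verbs. The skill, the tool names, and the dialect lists below are all made up for illustration:

```typescript
// A SKILL.md-style body embedded as a string. The YAML frontmatter is the
// portable part; the imperative instructions underneath name specific tools.
const skillBody = `---
name: release-notes
description: Draft release notes from merged pull requests
---
1. Use TodoWrite to plan the steps before writing anything.
2. Use Grep to collect merged PR titles since the last tag.
3. Use Edit to update CHANGELOG.md with the new section.`;

// Each harness's tool vocabulary, also hypothetical, keyed by dialect.
const dialects: Record<string, string[]> = {
  claudeCodeLike: ["TodoWrite", "Grep", "Edit", "Task", "Bash"],
  codexLike: ["shell", "apply_patch", "update_plan"],
};

// Which dialect does this skill actually assume? Count tool-name mentions.
function detectDialect(body: string): string | null {
  let best: { name: string; hits: number } | null = null;
  for (const [name, tools] of Object.entries(dialects)) {
    const hits = tools.filter((t) => body.includes(t)).length;
    if (hits > 0 && (!best || hits > best.hits)) best = { name, hits };
  }
  return best ? best.name : null;
}

console.log(detectDialect(skillBody)); // "claudeCodeLike": the body is pinned to that tool surface
```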
Loading the file does not give the model the right vocabulary. The lesson generalizes. A skills marketplace that claims to be cross harness is a routing problem, not just a parsing problem. Each skill needs to either declare its target harness explicitly, or get rewritten per harness, or run inside a router that translates tool calls between dialects. None of these are free.

I covered memory in detail in Agent Memory Engineering, so I will keep this section to the parts that matter for harness fit. Three memory architectures, three different bets: deferred batch consolidation, synchronous live file writes, and a server side store. The architectural choices already differ. But the harness fit story is sharper than that. Each model was trained to write memory using a specific tool with a specific schema, and to cite memory using a specific tag with a specific format.

Codex's model writes a structured raw memory artifact via Phase 1 extraction with a strict JSON schema. The Phase 2 consolidation prompt is 841 lines. Schema validation rejects malformed output at parse time. The model's citations are wrapped in dedicated blocks. The harness has a parser that increments a use counter in the SQLite state DB whenever a citation arrives. This is the model's memory ritual. Strip the citation tag and the harness loses its decay signal.

Claude Code's model writes memory using the standard file tools, into one file per memory under a dedicated directory. There is no separate memory tool. The model picks one of four memory types by file name prefix. The body uses a convention for behavioral rules. The harness wraps every body read in a wrapper block with the dynamic age in days and a verification reminder. The model was trained to read memory through that wrapper, weight it accordingly, and skip stale claims.

Copilot CLI's model invokes memory as a dedicated tool. The body of the memory goes to a remote backend. Cross session memory was added in v0.0.412 as experimental. The retrieval surface is a server side query, not a local grep. The model expects the backend to be there. When the backend is unavailable (v1.0.23 fix), the agent used to hang on the first turn. That is a load bearing dependency.

Now mix and match. Run a Codex trained model on Claude Code's harness. The model will look for a memory write tool, find the file tools instead, and write a file — but it will write a file in Codex's structured format, with headers and annotations, into a directory that Claude Code does not auto load on the next session. The harness does not know to inject the index. The next session does not see the memory. And critically, the model will emit citation blocks that Claude Code never parses. Memory effectively does not exist on the next turn.

Run a Claude trained model on Codex's harness. The model will not emit citation tags. Codex's decay signal stops incrementing. Memories that were used rank no higher than memories that were not, because the harness sees zero citations for both. Within a few weeks, the wrong memories are getting evicted.

Run either on Copilot CLI's harness with the remote backend. The model's local file instincts do not transfer. The memory tool is the only path, the schema is different, and the cross session retrieval is keyword search against a server, not the always loaded index plus on demand body read pattern that the model was trained on. The first turns will look fine because the model has memory shaped instincts. The retention will be different.

The memory layer is the densest collision surface for model harness fit.
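To see how tightly the pieces interlock, here is a sketch of the Codex style half of the ritual as described above: a structured record on the way in, a citation tag on the way out, and a counter the consolidation pass reads. The tag name, fields, and bookkeeping are assumptions, not the real implementation:

```typescript
// A structured memory record of the kind a Phase 1 extraction pass might emit.
// Field names are illustrative, not the real schema.
interface MemoryRecord {
  id: string;
  kind: "preference" | "fact" | "workflow";
  content: string;
  source_turn: number;
}

const example: MemoryRecord = {
  id: "mem_42",
  kind: "preference",
  content: "Build scripts must stay POSIX sh compatible",
  source_turn: 12,
};

// Citation ritual: the model ends its message with a small XML block naming
// the memories it actually used. Tag name is hypothetical.
const CITATION_BLOCK = /<memory_citations>([\s\S]*?)<\/memory_citations>\s*$/;

function stripAndCountCitations(assistantText: string): { visibleText: string; citedIds: string[] } {
  const match = assistantText.match(CITATION_BLOCK);
  if (!match) return { visibleText: assistantText, citedIds: [] };
  const citedIds = Array.from(match[1].matchAll(/id="([^"]+)"/g), (m) => m[1]);
  return { visibleText: assistantText.replace(CITATION_BLOCK, "").trimEnd(), citedIds };
}

// The harness bumps a use counter per cited id; a consolidation pass later
// evicts records whose counter never moves. A model that never emits the tag
// starves every memory of citations, and the wrong records get evicted.
const useCount = new Map<string, number>();
const turn =
  "Done. Kept the build script POSIX only, as recorded.\n" +
  `<memory_citations><cite id="${example.id}"/></memory_citations>`;
const { citedIds } = stripAndCountCitations(turn);
for (const id of citedIds) useCount.set(id, (useCount.get(id) ?? 0) + 1);
console.log(useCount.get("mem_42")); // 1
```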
Tools, schemas, citation tags, decay signals, retrieval rituals — all of these are coupled, all of these were learned together during post training, and none of them transfer cleanly when you swap one side.

The citation tag is a microcosm of the larger problem. Codex's model emits a small XML block at the end of an assistant message whenever it pulled in memory. The harness has a parser that strips the block before showing the assistant message to the user, and uses the parsed ids to bump the relevant columns in the state DB. The parser and the SQL migration are both there in the open source tree. This is the model's contract with the harness. Cite what you used. The harness will reward what you cited by keeping it alive. The Phase 2 consolidator ranks memories by citation count and decays anything with no citations and no fresh use after 30 days.

Claude Code's model has no equivalent citation tag. The harness does not need one because memory is read via the standard read tool, and the agent's verification grep is what doubles as the "I used this" signal. The reminder text in front of every body read explicitly tells the model: "Records can become stale over time. Verify before recommending." There is no decay loop because the harness assumes the user will prune or the verification will fail in place.

Copilot CLI's model talks to a remote memory backend. The store, retrieve, and rank logic is server side. The model does not need a citation tag because the backend tracks reads on its own.

Now look at what happens in a cross harness run. A six character XML tag becomes the difference between a memory system that improves with use and one that degrades silently. This is what I mean by "the wire format is part of the model." The citation tag is not a feature on a roadmap. It is a habit the model picked up during post training, and that habit only pays off inside the harness that taught it.

The Copilot CLI SDK exposes its system prompt as a structured object with ten section IDs. Hosts can override each section, replace it, or take full control; the section list is right there in the open source TypeScript. This is not just a documentation surface. It is the public contract of the model's training distribution. Each section has a specific role, and the model was trained to read each section as a particular kind of instruction. One section binds harder than another. One is consulted when the model is mid tool call. One is what the model reads right before emitting a turn.

Codex has its own equivalent, less explicit. The developer prompt is assembled in a fixed order: memory comes after policy and identity, before behavioral overrides. The model was trained to read this exact order.

Claude Code's static prefix is a different shape, a different ordering, and a different set of precedence claims about what the model should treat as binding. The Claude trained model knows that instructions under one particular heading "OVERRIDE any default behavior and you MUST follow them exactly as written." That phrase lives inside the harness rather than inside the model itself, but the model has been trained to recognize the heading and treat its contents as binding. A model trained against this prefix will hunt for that heading and react accordingly, while a model trained against a different prefix simply will not see the heading the same way and will give it the weight of any other piece of context.

This is the same lesson as the citation tag, scaled up. The system prompt is not generic. It is a structured artifact with section conventions that the model was taught to read in a specific way.
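A sketch of what a sectioned system prompt contract looks like from the host side. The section IDs, the override shape, and the assembly order here are invented for illustration; the real list lives in the SDK source:

```typescript
// A sectioned system prompt as a host-facing contract. Section IDs invented.
type SectionId =
  | "identity"
  | "safety"
  | "tool_use_policy"
  | "memory"
  | "response_style"
  | "final_reminder";

// Hosts can keep a section, replace its text, or drop it entirely.
type SectionOverride =
  | { mode: "default" }
  | { mode: "replace"; text: string }
  | { mode: "omit" };

const overrides: Partial<Record<SectionId, SectionOverride>> = {
  response_style: { mode: "replace", text: "Answer in terse bullet points." },
  final_reminder: { mode: "default" },
};

// Assembly order is part of the contract too: in this sketch, memory arrives
// after identity and policy, and the last section is what the model reads
// right before emitting a turn.
const assemblyOrder: SectionId[] = [
  "identity", "safety", "tool_use_policy", "memory", "response_style", "final_reminder",
];

function assemble(defaults: Partial<Record<SectionId, string>>): string {
  return assemblyOrder
    .map((id) => {
      const o: SectionOverride = overrides[id] ?? { mode: "default" };
      if (o.mode === "omit") return "";
      return o.mode === "replace" ? o.text : defaults[id] ?? "";
    })
    .filter(Boolean)
    .join("\n\n");
}

console.log(
  assemble({
    identity: "You are a coding agent operating inside this harness.",
    final_reminder: "Cite any memory you relied on before ending the turn.",
  })
);
```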
Swap harnesses and you keep the model's reading habits but lose the structure they apply to.

GitHub Copilot CLI is the most interesting harness in the comparison because it explicitly tries to route across model families. Sonnet is the default. The picker exposes Sonnet, Opus, Haiku, and the GPT 5.x family. v1.0.32 added a mode that selects the model per session. How does Copilot CLI handle the model harness fit problem? Looking at the changelog, the strategy has three legs.

First, the patch tool is included only when the active model is from the Codex family. v0.0.366: "Codex specific patch toolchain." The harness knows which models were trained on the patch format and only exposes it to those models. Anthropic models get the string replacement shape they were trained on. This is not a translation layer. It is a per model tool surface. The router does not pretend the two edit tools are the same operation. It serves the right tool to the right model.

Second, v1.0.13: "Tool search for Claude models." The implication: Claude trained models expect a deferred tool loading pattern via tool search. The harness only exposes the discovery loop to those models. OpenAI trained models do not get the same loop. They get the full tool list up front because that is what they were trained on.

Third, v1.0.18: "New Critic agent automatically reviews plans and complex implementations using a complementary model to catch errors early (available in experimental mode for Claude models)." The Critic is a different model than the main agent. Plans get reviewed by the complementary model. This is multi model orchestration baked into the harness, and the routing is explicit.

This is what a real router looks like. Not "translate everything to a common dialect," but "serve the right dialect to each model." It is more code, more state, more telemetry. It is also the only way to get top performance from each model. The cost of this approach is honesty. The harness has to admit that "Claude on Copilot CLI" and "GPT on Copilot CLI" are different products. The user picks one or the other and gets different behavior. There is no neutral common denominator. This is the right honest answer to model harness fit, and Copilot CLI is the only harness in the open or semi open set that actually ships it.

The strategic logic is worth naming clearly. Multi model is the crucial bet for any serious agent platform in 2026, and at GitHub and Microsoft we made that bet deliberately and early. Most customers are running multi model workflows whether their vendor admits it or not, and the only way to give every model its best performance is to build the per model routing surface inside the harness itself. We committed to that answer up front, which is what positions Copilot CLI to keep pace with whatever the labs ship next without having to redo its core architecture each time the leaderboard reshuffles. The matched pair is the unit of analysis, but the matched harness across many models is the unit of platform, and that is the level we are operating at.
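Mechanically, that routing posture reduces to a per model tool registry. A minimal sketch, with model family labels and tool names standing in for the real ones:

```typescript
// Per model tool routing: serve each model the dialect it was trained on,
// instead of translating everything to a common denominator.
// Model family names and tool names are placeholders.
type ModelFamily = "codex" | "claude" | "other";

interface ToolSpec {
  name: string;
  description: string;
}

const commonTools: ToolSpec[] = [
  { name: "read_file", description: "Read a file with an explicit line range" },
  { name: "run_shell", description: "Run a shell command" },
];

function toolsFor(family: ModelFamily): ToolSpec[] {
  switch (family) {
    case "codex":
      // Patch-style editing only for models trained on it.
      return [...commonTools, { name: "apply_patch", description: "Apply a patch format edit" }];
    case "claude":
      // String replacement editing plus a deferred tool discovery loop.
      return [
        ...commonTools,
        { name: "str_replace", description: "Replace an exact string in a file" },
        { name: "tool_search", description: "Discover additional tools on demand" },
      ];
    default:
      return commonTools;
  }
}

// The honest cost: the same harness is now two products. The two families see
// different tool lists and behave differently.
console.log(toolsFor("codex").map((t) => t.name));
console.log(toolsFor("claude").map((t) => t.name));
```

That covers the static half of the problem, routing at session start. The harder case is a model change in the middle of a live conversation.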
The single sharpest concrete demonstration of model harness fit comes from what happens when a user switches models mid conversation. Cursor's research team describes this carefully in their April 30 post, and the failure surface is worth walking through because every assumption that breaks here is an assumption a single model harness pair quietly relies on. Three things break at the moment of a model switch.

First, the conversation history itself is now out of distribution. The previous model produced tool calls in its native vocabulary: patch blocks, citation tags, six or eight verb subagent dispatches. The new model was trained against a different vocabulary and now has to reason about a transcript full of tool calls it would not have emitted. Cursor handles this by injecting a custom instruction explicitly telling the model "you are taking over mid chat from another model" plus steering it away from the prior model's tools. That mitigates but does not eliminate the cost. The model is still reading a transcript that does not match its instincts.

Second, the prompt cache breaks. Caches are provider and model specific, which means a switch is a guaranteed cache miss. For a long session, this turns the first turn after the switch into a full price re entry of every byte of system prompt and conversation history. Cursor's mitigation is to summarize the conversation at switch time, which yields a shorter clean transcript that costs less to re cache, at the price of losing details that the summary did not preserve.

Third, the tools themselves change shape. The new model's harness loads its native tool set. If the user was deep into a subagent dispatch flow with one set of verbs, the next turn presents a different set. The model has to figure out whether the prior tools are still valid (they are not) and which of its own tools maps to the user's apparent intent.

Cursor's recommendation, after building the mitigations, is honest: "we generally recommend staying with one model for the duration of a conversation, unless you have a reason to switch." The cleanest workaround they describe is to spawn a subagent with a different model rather than switch the main conversation. A subagent starts with a fresh context window, no transcript bias, no cache to break, and the new model's native tool surface from the first turn.

Each of these failure modes maps directly back to the thesis. The transcript, the cache prefix, and the tool surface are all parts of the wire format the model was trained against. Change the model and you change the contract on all three sides at once. A model switch is not a model swap. It is a harness swap, a tool swap, and a cache invalidation, all at once.

The model harness fit framing is no longer a subterranean observation. Two of the labs publishing the most interesting agent work in 2026 say it openly, and the AI infrastructure community has converged on a clean one line definition.

Cursor's Stefan Heule and Jediah Katz describe their harness work as "obsessively stacking small optimizations" specifically because a step change is rare and the gains compound only inside a matched pair. Their team builds in custom prompting per provider and per model version, citing OpenAI's literal precision versus Claude's tolerance for imprecise instructions as concrete differentiators that flow back into prompt design. They report driving unexpected tool call errors down by an order of magnitude in one focused sprint. Tool call reliability is not a model property. It is a harness property, and one that compounds every turn the agent stays alive.
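One concrete flavor of that harness side work is plain argument validation with a structured retry path, sketched here under assumed names; none of the vendors' actual repair loops are public in this form:

```typescript
// Harness-side tool call reliability: validate arguments before dispatch and
// hand malformed calls back to the model as a structured error it can repair,
// instead of failing the turn. Names and schema shape are illustrative.
interface ToolCall {
  name: string;
  args: unknown;
}

type Validator = (args: unknown) => string | null; // null = valid, string = error message

const validators: Record<string, Validator> = {
  str_replace: (args) => {
    const a = args as { file_path?: unknown; old_string?: unknown; new_string?: unknown };
    if (typeof a?.file_path !== "string") return "file_path must be a string";
    if (typeof a?.old_string !== "string") return "old_string must be a string";
    if (typeof a?.new_string !== "string") return "new_string must be a string";
    return null;
  },
};

// Either dispatch the call or return a tool-result-shaped error the model can
// read on the next turn and correct without human intervention.
function checkToolCall(call: ToolCall): { ok: true } | { ok: false; error: string } {
  const validate = validators[call.name];
  if (!validate) return { ok: false, error: `unknown tool: ${call.name}` };
  const problem = validate(call.args);
  return problem ? { ok: false, error: `${call.name}: ${problem}` } : { ok: true };
}

console.log(checkToolCall({ name: "str_replace", args: { file_path: "a.ts", old_string: "x" } }));
// { ok: false, error: "str_replace: new_string must be a string" }
```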
Anthropic's Prithvi Rajasekaran ran a related experiment in his March 24 post on long running application development. The architecture: a planner, a generator, and an evaluator agent, modeled on Generative Adversarial Networks. The evaluator uses Playwright MCP to actually click through the running application as a user would, then grades against a rubric.

Out of the box, Rajasekaran reports, "Claude is a poor QA agent" — it identifies legitimate issues and then talks itself into approving the work anyway. Tuning the evaluator prompt over multiple rounds is what turns it into a reliable judge. The harness creates the judgment surface; the model alone does not.

The deeper lesson from Rajasekaran's work is about how harnesses should evolve as models improve. He built one harness against Claude Sonnet 4.5, which exhibited "context anxiety" strongly enough that compaction alone was not sufficient. The harness needed full context resets between sessions, with structured handoff artifacts to carry state across the boundary. When Opus 4.6 shipped, that behavior was largely gone. Rajasekaran dropped the entire context reset machinery and ran one continuous session for over two hours. Every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions go stale. The matched pair is not static. It moves as the model matures, and the harness has to retire scaffolding that is no longer load bearing.

LangChain's Vivek Trivedy has the cleanest framing I have seen: "Agent = Model + Harness. If you're not the model, you're the harness." The harness in this view is every piece of code, configuration, and execution logic that is not the weights themselves. System prompts, tool descriptions, bundled infrastructure, orchestration logic, hooks, middleware. Working backwards from the desired agent behavior, every harness primitive earns its place by patching a specific model gap. Filesystems for durable state, bash for arbitrary action, sandboxes for safe execution, memory for continual learning, planning and self verification for long horizons. Each primitive started life as a workaround for a specific deficiency the model had at training time. Some of those primitives will get absorbed back into the model over time. Others will compound.

Trivedy also names the mechanism that makes model harness fit so durable: a co-evolution feedback loop. "Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in." This is the pipeline that hardens the matched pair over generations. A new harness primitive ships in week one. By month three, it shows up in millions of agent traces. By month six, those traces are training data for the next model. By month twelve, the next model has the primitive baked into its instincts and the harness can lean on it. The loop is what makes "swap to a foreign harness" not just clumsy but compounding clumsy. The model's habits got shaped by the previous generation of its own harness, which itself was shaped by the generation before. Move sideways and you skip every cycle of that compounding.

Trivedy is honest about the cost of this loop, and I want to flag the counter argument cleanly. Quoting him: "A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting." If the model's tool format preference is overfit to its training harness, you could argue that the right long term move is to train against a more diverse set of harnesses so the model generalizes. That argument has merit. The labs that ship one model and one harness as a pair are buying near term performance at the cost of the model's portability.
Whether that trade is the right one depends on whether portability is something the customer values, and right now the customer mostly values the leaderboard.

Three independent posts published within weeks of each other, all converging on a single thesis: the model is only half of the system, the harness is the other half, the matched pair is the proper unit of analysis, and the vendors that ship the matched pair as a single product are the ones currently sitting at the top of the leaderboards.

The harness side of the contract has converged on a markdown file per concern, and the file names are now load bearing across the ecosystem. A model trained on one harness recognizes the file names and knows which one carries which kind of authority. The key observation: the file names are now part of the wire format. A model that has been trained to look for a block under a specific heading will hunt for that exact heading on every turn. A model trained against one identity file will look for it and miss the others. A model trained against another will load personality from its own file and ignore the same content if you put it somewhere else.

This is why the AGENTS.md feature request against Anthropic's repo matters. It is not a docs migration. It is a request for the model's training distribution to expand its file recognition vocabulary. Until Anthropic post trains Claude to read AGENTS.md, that file is invisible to Claude Code even if it sits next to CLAUDE.md in the repo.

The SOUL.md ecosystem is a stress test of this thesis. SOUL.md is not yet recognized by any major harness's default loader. So the SOUL.md repo's installation instructions are revealing: copy your directory into the project, then add a few lines to an already recognized file pointing the model at it. That is a manual bridge from a non-recognized convention to a recognized one. The SOUL.md authors understand that the bytes do not work unless the model knows where to look, and "where to look" is a habit fixed in post training.

The same routing problem shows up in the open. GitHub Copilot CLI v1.0.4 added: "Read .claude/settings.json and .claude/settings.local.json as additional repo config sources." v1.0.36 walked some of it back: "Custom agents, skills, and commands from ~/.claude/ are no longer loaded by the Copilot CLI." That is a router that tried to be permissive about file names, then narrowed when the user surface got confusing. The lesson sits underneath the changelog: even the harness that runs Claude models cannot treat those files as authoritative without negotiating with the user about which conventions count.

Pick the convention. Ship the post training to match. Or ship a router that explicitly maps each file to the model that recognizes it. The middle path of "be permissive and load anything that looks plausible" loses every time.
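The router option does not need to be elaborate. At minimum it is an explicit map from file convention to the models post trained to recognize it; the recognition sets below are illustrative guesses, not vendor data:

```typescript
// Explicit identity-file routing: each convention is only surfaced to models
// that were post trained to treat that file name as authoritative.
const recognizedBy: Record<string, string[]> = {
  "CLAUDE.md": ["claude"],
  "AGENTS.md": ["codex", "copilot"],
  "SOUL.md": [],                    // not yet in any major harness's default loader
  ".claude/settings.json": ["claude"],
};

function filesToLoad(modelFamily: string, filesInRepo: string[]): string[] {
  return filesInRepo.filter((f) => (recognizedBy[f] ?? []).includes(modelFamily));
}

// A repo can carry every convention at once; the router decides which files the
// active model actually gets to see, instead of loading anything plausible.
console.log(filesToLoad("claude", ["CLAUDE.md", "AGENTS.md", "SOUL.md"])); // ["CLAUDE.md"]
console.log(filesToLoad("codex", ["CLAUDE.md", "AGENTS.md", "SOUL.md"]));  // ["AGENTS.md"]
```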
After months of running these three harnesses side by side, reading the open source code, and tracking the Terminal-Bench leaderboard: the harness is no longer a wrapper around the model. The harness is part of the model's effective parameters. The post training process embeds the harness's tool surface, schema shapes, memory rituals, citation contracts, and system prompt structure into the model's instinct set. You can take the weights to a different harness, but you cannot take the instincts. The instincts only fire when the harness presents the world the way the post training presented it. This has three consequences worth naming.

For agent platform builders: pick a harness, pick a model, ship them as a pair. Do not pretend the model is portable. Do not pretend the harness is neutral. The frontier labs are publishing model harness pairs whether they say so or not, and the per pair performance is the only number that matters. Copilot CLI's "different tools for different models" approach is the honest version of this. The dishonest versions ship a common denominator and underperform on every model they serve.

For model labs: the harness is product strategy, not infrastructure. The harness is where the lab's post training investment compounds. Anthropic's injection model, the typed memory taxonomy, the verification on every body read: these are not infrastructure choices. They are the surface the model was sculpted against, and they are the moat that makes the model less interchangeable than it would otherwise be. Same for Codex's two phase memory pipeline, the citation tag, the strict JSON schema. Same for Copilot CLI's ten section system prompt skeleton. The harness is where the model becomes irreplaceable.

For users: the cost of switching is higher than it looks, and lower than vendors would like you to think. Higher because the model and the harness fused over months of training and you cannot pull them apart cleanly. Lower because the simple stack underneath is shared, and the conventions on top are documentable. An honest port — replicate the tool surface, replicate the citation contract, replicate the system prompt structure, replicate the memory ritual — would close most of the gap. It just costs as much as the original post training did to set up.

The matched pair is not static. It shifts as the model matures. This is the most useful nuance from Rajasekaran's Anthropic post. A harness component that was load bearing for Sonnet 4.5 (context resets, sprint decomposition, aggressive compaction) became dead weight on Opus 4.6 because the model started doing that work natively. The right harness for a model in March is not the right harness for that model's successor in October. The discipline is to read the traces, identify which components are still earning their place, and retire the ones that are now patches over solved problems. Cursor's blog says the same thing in different words: "Every component in a harness encodes an assumption about what the model cannot do on its own, and those assumptions go stale."

So back to the question I started with. Why does the same prompt produce visibly different output across three harnesses running the same model? Because the model running on three harnesses is effectively three different models, even though the weights on disk are byte for byte identical. The instincts that fire at runtime are not stored only in the weights; they are conditioned by the harness the weights were trained against, and the instincts turn out to be most of what shows up in the assistant's output on any given turn.

The interesting design move now is not a better model. It is not a better harness either. It is the matched pair, designed end to end, where the post training and the runtime reinforce each other turn after turn until the model becomes legibly better at the things this specific harness rewards.

You can see the major builders converging on this idea from three different starting points. Anthropic shipped Claude Code as the canonical Claude harness, with the post training and the runtime co-designed as a single product. OpenAI shipped Codex CLI as the canonical Codex harness, with the same vertical integration on the OpenAI side of the house.
At GitHub and Microsoft we shipped Copilot CLI with explicit per model routing because multi model is crucial: customers run every frontier model they can get their hands on, and our job is to make each one perform at its best inside a harness designed to serve all of them well. The result is the most pragmatically honest harness in the open or semi open set today, and the one positioned to compound across model generations rather than locking to any single lab. Three different theories of what to do about model harness fit, all three coherent, and all three paying a real engineering price for the choice they made.

The frontier work in 2026 is not about new model architectures. It is about new harness primitives. Ralph Loops, where a hook intercepts the model's exit attempt and reinjects the original prompt in a clean context window, forcing the agent to keep grinding against the goal. Just-in-time harness assembly, where the tool surface and the system prompt get composed per task instead of pre-configured per session. Self-tracing agents that read their own logs to find harness-level failure modes and patch them without human intervention. Each one of these is a primitive that some model will eventually be post trained against, and that pairing will show up at the top of the next leaderboard. The Terminal-Bench leaderboard tells you who is paying the price right. Look at it again in six months.

The Evidence:

- Terminal-Bench 2.0: what the leaderboard actually shows about model harness pairs
- Three Harnesses, Three Bets: SQ/EQ vs typed conversation loop vs JSON RPC supervisor
- The Tool Surface: where post training is most visible
- Skills Carry Tool Specs: why "same SKILL.md format" does not mean "interchangeable"
- The Memory Layer: synchronous live writes vs deferred batch vs server side, and why the citation tag matters
- The Citation Discipline: how the model talks back to the harness
- The System Prompt Skeleton: ten section IDs is a contract
- The Routing Reality: what GitHub Copilot CLI is actually doing about all this
- Mid-Chat Model Switching: the cleanest concrete failure mode
- What the Labs Are Saying: Cursor, Anthropic, and LangChain all converging on the same framing
- The Identity File Convention: CLAUDE.md, AGENTS.md, SOUL.md, USER.md, and what each one is for
- What This Means: the model is no longer the moat alone, and the matched pair shifts as the model matures

The tool surfaces, tool by tool:

- Codex's custom diff format, in two flavors: a freeform Lark grammar and a JSON variant. The model was trained to emit patches in this format. It is not interchangeable with Claude Code's string replacement edit tool.
- The bash family, plus companions for long lived processes that the model can drive with stdin writes after the fact.
- The plan/todo tool. A model not trained on this tool will use a different convention to track work.
- A verb that lets the model request expanded permissions mid turn. Codex is the only harness with this exact verb.
- Multi agent orchestration, eight verbs. The model knows all eight.
- Tools that find other tools. Codex's answer to deferred tool loading.
- Three related tools, tied to a migration.
- Lower case names internally, surfaced to the model as CamelCase. The model was trained on the CamelCase variant.
- A tool that requires two named fields plus an optional one. Not the same shape as Codex's equivalent.
- The tool with the deepest sandbox surface of the three, a long list of parameters. The model knows when to set them and which tool to pair them with.
- A pair of lazy load primitives.
- A single tool for subagent dispatch, taking a couple of required arguments and a couple of optional ones. The post training has the model emit short imperative descriptions for these.
- A pair of verbs, both permission gated, toggling a worktree local override.
- Wrappers for subagent isolation.
- Streams stdout from a background process; pairs with its background counterpart. The model knows this pattern; Codex does not have it.
- The workflow scaffolding tool. The model writes triplets in a particular pattern.
- File reading and search, including a bundled ripgrep, with explicit range params.
- Built in (v0.0.374). Rejects some URLs.
- Three verb interactive shell control.
- Subagent dispatch with depth and concurrency limits.
- Multi turn subagent control. A different shape from Codex's eight verb agent surface.
- Interactive clarification.
- Persistent memory tied to a remote backend. Memory is not local files here.
- Included specifically when serving Codex models. A different patch toolchain than Codex's own.
- Plus a handful of others.