Unfortunately, my company has decided to stop using Anthropic models and products to create code as a result of the Defense Department’s ongoing conflict with Anthropic. While larger, more influential companies decided to stick with the tools they know best, I found myself in a situation where I almost had to start from scratch with my workflows: finding a new agentic harness, pointing it to new models, and reworking all of my existing skills and guidance.

In the process, I found myself thinking a lot about open source models, the importance of the agentic harness when evaluating models, and what makes Claude Code feel so good (and bad) to work with.

Harness Engineering

Let’s be honest, “harness engineering” is a very trendy buzzword these days.

I don’t really want to retread everything covered by everyone else, so I’ll focus on why choosing a coding harness is as important as choosing the model you use: a model on its own is not good enough - models are trained on the capabilities of the harnesses they are meant to run in. Providing the wrong tools can decrease a model’s performance, and OpenAI engineers say that the Codex models are trained specifically for the Codex harness. Model-agnostic harnesses, meanwhile, need custom prompts and/or toolsets per provider.
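To make that concrete, here’s a minimal (and entirely hypothetical) sketch of what per-provider adaptation can look like inside a model-agnostic harness - the provider names are real, but the tool names, descriptions, and helper functions are illustrative, not any particular harness’s API:

```typescript
// Hypothetical sketch of per-provider tool adaptation in a model-agnostic harness.
// Tool names and descriptions are illustrative only - each harness defines its own.

type Provider = "anthropic" | "openai" | "google";

interface ToolSpec {
  name: string;
  description: string;
}

// The same underlying capability (editing a file) gets exposed under the name and
// shape each model family was trained to expect.
const editTool: Record<Provider, ToolSpec> = {
  anthropic: { name: "str_replace_edit", description: "Replace an exact string in a file." },
  openai: { name: "apply_patch", description: "Apply a patch to the workspace." },
  google: { name: "edit_file", description: "Edit a file by describing the change." },
};

function toolsFor(provider: Provider): ToolSpec[] {
  // A real harness would also swap system prompts and full tool schemas here.
  return [editTool[provider]];
}
```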

It’d make sense, then, that native harnesses like Claude Code and Codex would be the best options for their respective models. Looking at Terminal Bench, though, they’re surprisingly far down the list. According to ForgeCode (currently in the top slot), it ultimately comes down to the biases of the benchmark. Their initial implementation scored 25% with Gemini 3.1 Pro. Disabling interactive mode and adjusting their tool names to match what the model expects got them to ~40%. Forcing their todo list planning tool to be used more consistently pushed them up to ~65%. The final push was actually optimizing for speed (namely parallel agents, tool routing, and downshifting from slower high reasoning to faster low reasoning after planning), since the benchmark has a strict wall clock - slow and great is strictly worse than fast but good enough.

Gemini CLI doesn’t have any entries for that model, but ForgeCode recently added two scores for Opus 4.6 and GPT-5.4 at ~80%, while the native harnesses sit around 60%. I suspect that optimizing for speed accounts for the majority of the gains. Even if Opus is fast, automatically downshifting from High to Medium reasoning for execution would likely still help. It could very well be per-model tweaks or some of the further harness improvements they mention, but my guess is that neither Anthropic’s nor OpenAI’s devs are optimizing for the TB 2.0 benchmark at all - hence the ~20-point gap.
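For what it’s worth, the “downshift after planning” idea is simple enough to sketch. This is my own illustrative TypeScript, not ForgeCode’s implementation - runStep and ReasoningEffort are made-up names:

```typescript
// Illustrative sketch of downshifting reasoning effort after the planning phase.
// Not ForgeCode's code - runStep and ReasoningEffort are made-up names.

type ReasoningEffort = "low" | "medium" | "high";

interface Step {
  prompt: string;
  isPlanning: boolean;
}

async function runTask(
  steps: Step[],
  runStep: (prompt: string, effort: ReasoningEffort) => Promise<string>,
): Promise<string[]> {
  const results: string[] = [];
  for (const step of steps) {
    // Planning gets the slow, careful setting; execution gets the faster one,
    // which matters when the benchmark (or your patience) has a strict wall clock.
    const effort: ReasoningEffort = step.isPlanning ? "high" : "medium";
    results.push(await runStep(step.prompt, effort));
  }
  return results;
}
```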

Personally, though, I found myself missing things from Claude Code:

Codex CLI has skill support, and plugins are coming soon per the release notes, but it isn’t clear how they plan to support distribution since these are hardcoded so far. Subagents are still experimental - they’ve fixed issues like recursion and the subagent calling itself Batman - and I suspect they’ll launch officially, or at least as a beta, soon. The Github Action is practically abandoned and hasn’t been updated since November 2025. Codex models still feel unbearably slow, but it might just be that Claude Code relies on more back and forth while Codex thinks deeply and expects to be left alone.

OpenCode feels like it should be the next best thing - open source, betting on the fact that you will want multiple models, and built by a team experienced in building dev tools. The story around skills/plugins is basically the same as current Codex - you can use skills (even those from global claude folders), but there’s no story yet around plugin distribution and no support for skills installed in Claude Code by plugins. Subagents exist and can work in parallel, but the experience isn’t on par with Claude Code or Codex CLI. It looks like some people are trying to rebuild Claude Code swarms in OpenCode, but none of the PRs have landed yet. While OpenCode does have a Github Action, it doesn’t have the niceties of the Claude Code one, like sticky comments or the inline comment MCP.

Other harnesses like LettaCode or ForgeCode (or even Cursor) rely on accounts to gate access to server-side features. They push cloud-based products that turn me off from even trying them - I already begrudgingly give my money and data to Anthropic and OpenAI. Why add a middleman who might also upsell later? OpenCode isn’t in the clear here either, given the subscription service, but they aren’t gating the whole product behind login.

Ultimately, it is extremely hard to give up how comfortable Claude Code is. The first-mover advantage and switching costs are real, and no other harness has the rest of the ecosystem (Remote Control/Dispatch/Claude Code on the Web + Github CI/CD).

Open Source Models

While using OpenCode, I accidentally ended up planning with Opus 4.6 and building with GPT-5.4, and found myself remembering the old opusplan model from Claude Code, which used Opus for planning and then switched to Sonnet for execution. It was removed with the launch of Sonnet 4.5 because Anthropic believed the new model was better than Opus 4.1, but my experience with Sonnet 4.5 was so poor it pushed me to actively switch to GPT-5 and Codex CLI for work. Opus 4.5 was a return to form, but opusplan never came back. These days, both Claude Code and Codex CLI encourage the use of a single model at a time. While it is possible to use different models in custom subagents (which is how OpenCode does it too - they just allow customizing the built-in subagents as well), you generally choose one model and use it for everything. It makes sense that you may not want to move context as you switch models, especially now that 1M context is the default. It’s honestly strange that Claude Code encourages clearing context between planning and execution phases; I would have expected the opposite.
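The opusplan pattern itself is trivial to express. Here’s a rough sketch under the assumption of a generic callModel function - the function and the model identifiers are placeholders for whatever your harness actually exposes:

```typescript
// A minimal opusplan-style sketch: one model plans, another executes.
// callModel and the model identifiers are placeholders, not a real harness API.

type ModelCall = (model: string, prompt: string) => Promise<string>;

async function planThenBuild(task: string, callModel: ModelCall): Promise<string> {
  // Spend the deeper-reasoning (and more expensive) model on the plan...
  const plan = await callModel("opus-4.6", `Write a step-by-step plan for: ${task}`);

  // ...then hand only the plan text to the execution model. Note how little context
  // actually needs to cross the model boundary.
  return callModel("gpt-5.4", `Implement the following plan:\n\n${plan}`);
}
```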

It was kind of fun to try a bunch of open source models too. It helped that they’re (relatively) dirt cheap - GLM 5 benchmarks near Opus 4.5 at less than 20% of the cost, and it’s one of the more expensive options. Using GLM 5 as an Opus replacement and Kimi K2.5 for the actual execution felt a bit like the purported 90% of the performance for 1/6 the cost. But I found them painfully lacking - unreliable parallelism, lower reasoning depth, slower inference, and mostly a worse experience than Codex CLI with the GPT models. Looking at the release notes, you’ll see that most open source models benchmark against models that came out ~6 months before them - they bring up the floor while Anthropic, OpenAI, and Google continually raise the ceiling. After the Deepseek R1 launch, Anthropic CEO Dario Amodei acknowledged a similar pattern. At some point we can expect a plateau where things will even out or converge, but right now the step changes from Opus 4.5 to 4.6 and from GPT-5 through 5.4 are big enough that you have to consider whether using a cheaper open source model is worth the step back in model performance or the increase in tokens spent. While the raw coding execution differences aren’t major, the differences in reasoning capabilities (for planning, debugging, etc.) will lag the SOTA and limit the size of the problem an agent can solve.

In many ways, which one you choose depends on how you look at scarcity. If you’re looking for the best performance per dollar today, the open source models are a clear winner. Subscriptions to the Chinese cloud vendors, or aggregated proxying like OpenCode Zen or OpenRouter, are extremely cheap and offer ~80% to 90% of the performance for a fraction of the cost. If you’re looking to stay on the bleeding edge or move the fastest, assuming the frontier labs will always hold that edge, then sticking to an American closed source frontier lab (Anthropic, OpenAI, or Google) with a more expensive subscription makes the most sense. If you care about uptime and access, you’ll likely shy away from subscriptions in favor of Enterprise/API pay-as-you-go access. And if you’re thinking that GPU and memory shortages combined with market capture/pressure will drive up subscription fees, self-hosting is the remaining option - but you give up the benefits of economies of scale: a large up-front capital expenditure for hardware, maintenance costs that climb as power prices drift upward, the usual downsides of open source models, and the performance loss from running smaller quantizations.

My best workflow still blends Claude models with GPT-5.x models - the GPT-5.x reviews catch things that Opus doesn’t, despite Opus leveraging more specific sub-agents. I can see a lot of value in a harness like OpenCode or ForgeCode that allows both models to co-exist in the same place. But I also get a lot of mileage from my subscriptions, and it’s easy enough to have the two work together without violating Anthropic’s subscription ToS. I understand why someone might want their subscription to work with other tools and be annoyed that Anthropic is creating a walled garden, but given the value of the Anthropic subscription features, I felt like I was getting my money’s worth. If OpenClaw or other unsanctioned uses are important, there’s always pay-as-you-go API key access. Being forced to use Codex as my primary harness for work has stung; I still use Claude Code at home for personal projects.
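For reference, my “have the two work together” step is roughly the following - a rough sketch that shells out to the Codex CLI for a second-opinion review. The exact `codex exec` invocation is an assumption about the CLI’s non-interactive mode, so check what your installed version actually supports:

```typescript
// Rough sketch: after one model finishes a change, ask the Codex CLI for a
// second-opinion review of the diff. The `codex exec` invocation is an assumption
// about the CLI's non-interactive mode - verify against your installed version.

import { execFileSync } from "node:child_process";

function reviewDiffWithCodex(diff: string): string {
  const prompt = `Review this diff for bugs, missing tests, and risky changes:\n\n${diff}`;
  return execFileSync("codex", ["exec", prompt], { encoding: "utf8" });
}

// Usage: pipe the working tree diff into the reviewer.
const diff = execFileSync("git", ["diff"], { encoding: "utf8" });
console.log(reviewDiffWithCodex(diff));
```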

So, personally, I’m picking the dual subscription lane for now. Obviously, if costs go up prohibitively or something else changes, I can change it later. It seems unlikely that an open source model will truly match the reasoning capabilities of a current SOTA frontier model, and the costs of maintaining compatibility with all of them (loss of parallelism, remote control, easy skill reuse, easy CI integrations, etc.) just don’t seem worth it right now. Multiple smaller niche models will encourage or require breaking down workflows further, risking the bitter lesson or just simple over-engineering. I still appreciate the effort of OpenCode. Given the sheer number of models, supporting and tracking the performance of each one is a Herculean task - an open source approach that leverages the community to fix what’s broken for each user seems like a winning strategy. But that comes at a cost - every change needs to be tested against every provider, and so on. And who knows, maybe distilled models are close enough to their “parents” that they don’t need too much unique work? I hope that if and when I’m forced to change lanes again, OpenCode will have caught up enough to be a straightforward switch.