I’ve created an agentskills-compatible review skill that wraps the Codex CLI MCP server so that Claude Code (or, technically, any other coding agent that supports agentskills) can leverage the server for review or input.

The skill and latest version of the instructions can be found here, as a Claude Code plugin that also sets up the MCP server. After installing the plugin, you’ll also want to add guidance to your global CLAUDE.md explaining when to use the skill.
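
For illustration, the CLAUDE.md guidance only needs a few lines; something along these lines works (the skill name below is a placeholder, not necessarily what the plugin installs):

```markdown
## External reviews via Codex

- After drafting a plan or completing a non-trivial task, load the Codex review
  skill and ask Codex CLI to review it.
- Iterate until the reviewer has no blocking feedback before presenting results.
- Only skip the review for trivial, single-file changes.
```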

After that process, it mostly just works. Claude Code reaches out to Codex CLI on its own, as well as when I explicitly ask it to.

Rationale

I only started using Codex CLI because I was so disappointed with Sonnet 4.5. Anecdotally, the GPT-5 models felt much better at following instructions (sometimes to a fault) and at engaging with actual files on disk. It seems I’m not the only one who feels that way.

The quality of the plans created by Sonnet 4.5 was so bad that I ended up copying and pasting plans between Codex CLI and Claude Code. Also, since the model in Codex was now aware of the plan, I often ended up asking GPT-5 to review the work done by Claude as well, for both completeness and correctness.

Opus 4.5 is much better than Sonnet 4.5 at handling longer, more complex plans, but I still found it useful to have GPT-5.1-Codex review both the plans and the work. I still ended up copy-pasting multiple iterations of the plan between apps before Opus arrived at a satisfactory plan, and I still asked the Codex model to review the work done on disk, especially after large tasks with multiple compactions.

The skill is designed to automate that process: it uses MCP to bridge the agents and provides instructions on how to run iterative reviews. Thanks to the global guidance, Claude Code decides on its own when to load the skill and use it.

Shortcomings

Codex CLI Permissions

The permissions hierarchy in Codex CLI is frustratingly limited. There’s currently no good way to enable full network access while keeping filesystem access read-only. OpenAI considers network access extremely dangerous due to the possibility of remote prompt injection, and generally recommends using --full-auto, which would require approvals for network access.

The approval flow brings its own concerns - Claude Code does not support elicitation requests, but the response from Codex doesn’t conform to the spec anyway. The server only responds on success or on error, so the session ends up fully stalled while Codex waits for an approval response and Claude Code waits for a tool call response.

I tried to work around this by setting approval-policy to never when starting the server, but then found out that the codex tool call actually allows configuration to be passed with each tool call request. As a result, the server creates a brand-new configuration for each request, completely ignoring the configuration options passed to the server and making it trivially easy to escalate approval-policy to never and sandbox to danger-full-access. The only way to lock these permissions down is via a managed configuration, which would affect every Codex CLI session on the device, including interactive ones.
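
As a rough sketch, an escalation is as simple as a single tool call like the following (the exact parameter names are my reading of the codex tool’s input schema and may differ):

```json
{
  "name": "codex",
  "arguments": {
    "prompt": "Fix the failing test directly and pull the latest docs from the web.",
    "approval-policy": "never",
    "sandbox": "danger-full-access"
  }
}
```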

The default values for the configuration depend on the trust state of the project:

| Trust state | approval_policy | sandbox_mode |
| --- | --- | --- |
| No decision (first run) | on-request | read-only |
| Trusted | on-request | workspace-write |
| Untrusted | untrusted | workspace-write |

None of these defaults allow network access, since that is controlled by a separate network_access setting that is only supported when sandbox_mode is workspace-write.
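
In config.toml terms, enabling network access therefore means giving up the read-only sandbox entirely. Roughly (keys as I understand the current Codex CLI configuration, so treat this as a sketch):

```toml
# Network access only becomes configurable once the sandbox
# is widened to workspace-write.
sandbox_mode = "workspace-write"

[sandbox_workspace_write]
network_access = true
```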

Given all these limitations, I’ve attempted to sidestep the issue completely. In the skill, I state that Codex is constrained to read-only access and has no network access, and I instruct the agent to inline content if needed. I considered adding extra instructions (e.g. always send approval_policy never, or explicitly adjust network_access) but haven’t run into issues thus far. I’d rather not mention the options at all, so that the agent doesn’t try to use them unnecessarily.

Codex Reply Format

The Codex MCP server has two tools: codex and codex-reply. The former starts a session, while the latter continues an existing session given a conversation ID. Unfortunately, the conversation ID isn’t returned by the MCP server. Luckily, the conversation ID is the same as the session ID used by the interactive CLI, and session IDs can be found on disk.
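
For reference, here is a quick way to dig up the most recent session ID by hand, assuming the default ~/.codex/sessions layout and that the rollout filename embeds the session UUID:

```sh
# List Codex session rollout files, newest first; the UUID in the filename
# is the session (conversation) ID. Paths assume the default layout.
find ~/.codex/sessions -name 'rollout-*.jsonl' -print0 | xargs -0 ls -t | head -n 1
```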

The skill works around the issue by including handshaking instructions that make sure the session ID is returned to the agent at the beginning of a session. It isn’t as fast as a simple reply in the interactive UI - likely because the session has to be loaded and parsed on each request instead of just being cached in memory. When the agent forgets the session ID (e.g. during a compaction), the session is effectively lost and has to be restarted, which is even slower.

Codex CLI is Slow

Anecdotally, I’ve found that both Opus 4.5 and GPT-5.2-Codex feel reasonably fast when used directly. I haven’t found official benchmarks comparing the latency of the Claude models with the GPT-5 models, but multiple people have called GPT-5 slow, and the subsequent releases mention how much faster the newer models are than their predecessors.

While the models themselves are fast, their approaches to problem-solving are different. The very features that make GPT-5.x models feel more reliable than Claude models also make them slower - extra time spent researching and “thinking” before responding.

Conversely, I prefer Claude Code over Codex CLI for daily driving because the product is more mature and user-friendly. While the latter is catching up, it still has a way to go. This polish probably plays a part in making Claude Code feel faster than Codex CLI.

Even if the agent is slow, automating the interaction between Claude Code and Codex CLI frees up my time and feels less frustrating. I wouldn’t have spent time incorporating the GPT-5.x models into my workflow if they weren’t providing value. The end result is usually worth it, and being more hands-off during the process allows me to do other work in parallel.

Claude’s Overconfidence

Opus 4.5 doesn’t always reach out to Codex. I tweaked my guidance to use stronger language to make it more reliable, but even now it sometimes skips the check - especially after multiple compactions or when it decides the problem is too simple. In those situations I ask for the review manually. Sonnet 4.5 is even less reliable.

Furthermore, Claude models sometimes double down on their mistakes. For example, Opus 4.5 once lost track of its working directory in a monorepo and just decided the project it couldn’t find had to be a placeholder. When GPT-5.2-Codex reviewed the work and said it wasn’t done, Opus insisted that the folder didn’t exist and tried to override the review. I had to step in and force it to look again.

Unfortunately, I’m not sure if there’s anything I can do to programmatically fix this situation. All LLMs can make mistakes and all LLMs become more unreliable as their input length grows. Agents can improve their context management but ultimately this is why we still need a human in the loop.