For anything that I don’t think can be done trivially, I use a plan, execute, and review loop. This isn’t too different from my standard problem-solving approach: breaking the problem into steps, federating tasks out to other team members where it makes sense, and reviewing work as it progresses to make sure it consolidates into a working solution. Nowadays, I’m just passing problems to subagents and trying to find ways to let them work as autonomously as possible.

Learning

While LLMs are getting better at tackling larger problems over time, it is still on us humans to deliver a solution that works. Even before planning, I hope to have some sense of the “shape” of a solution and what I expect the full loop to generate.

Without guidance, both Claude and Codex have a strong tendency to build their own solutions from scratch. While LLMs change the math, I still believe that

The best code is no code at all. Every new line of code you willingly bring into the world is code that has to be debugged, code that has to be read and understood, code that has to be supported.

Jeff Atwood

I quite like the way Dijkstra framed it:

...if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

Edsger W. Dijkstra

Therefore, it’s still on the human to investigate competing approaches: whether solutions already exist, what industry best practices look like, and whether to build or buy. Frontier models with web search aren’t bad at aggregating information, especially with purpose-built tools like Deep Research. They can speed up what would formerly have been multiple days of googling, pinging friends, and sifting through Stack Overflow, Medium, and Hacker News. Obviously, be mindful of the “[Your favorite AI product] can make mistakes.” disclaimers and click through to the source to make sure.

There’s no beating lived experience. But when you can’t find someone who knows “where to tap”, an LLM might connect you to online resources.

After doing research, I sketch out the project end to end, building out a high-level plan and rough milestones. Within each milestone, I carve out clear-cut deliverables that can be assigned to agents. These are the goals for my plan, execute, and review loops with agents.

Planning

When I have a sense of what I’d like an agent to solve, I shift my harness into planning mode and state my intention, with the details I think are relevant, and then iterate on the plan that comes back. I ask for explanations and web research as needed. The process works best with a strong reasoning model that can ask questions and think through the problem rather than focus on producing an answer. For plans that are complicated or that I don’t fully understand, I also ask a second model to review. With Claude Code, I instruct the model to ask Codex via my codex plugin. In other harnesses, I open a second session and initiate a manual review. Either way, I review the feedback and keep iterating until I feel good about the plan.

The plan needs to serve as a specification artifact. For simpler problems or open-ended exploratory work, the plan doesn’t need to be very concrete and sometimes I skip it entirely. But generally I err towards being overly expressive and detailed so that a reviewer can easily understand my intent later. If my harness doesn’t write the plan to disk, I make sure the first step of my plan is to write the plan down so that a reviewer can use it later.

For larger problems, the plan also needs to include how the agent should tackle the problem. Models generally want to complete one step across every part of the problem before moving to the next. I’ve heard this called horizontal slicing, but I prefer to call it “opening too many cans of worms.” I make sure the plan guides the agents towards vertical slicing, tackling the problem end to end in a narrow scope. It’s easier to verify for both humans and agents. Conveniently, it’s often easier to parallelize: each subagent can take a slice and run in its own worktree, leveraging version control as a safety net, without coordinating with the others.
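As a rough sketch of the one-worktree-per-slice setup, the snippet below builds a `git worktree add` command for each vertical slice. The slice names, the `agent/` branch prefix, and the worktree directory are hypothetical choices made for illustration; only `git worktree add -b <branch> <path>` itself is standard git.

```python
import subprocess
from pathlib import Path


def worktree_command(slice_name: str, root: Path) -> list[str]:
    """Build the git command that gives one slice its own branch and worktree."""
    branch = f"agent/{slice_name}"  # hypothetical branch naming scheme
    path = root / slice_name        # one directory per slice
    return ["git", "worktree", "add", "-b", branch, str(path)]


def create_worktrees(
    slices: list[str], root: Path, dry_run: bool = True
) -> list[list[str]]:
    """Return the commands for every slice, running them unless dry_run is set."""
    commands = [worktree_command(name, root) for name in slices]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failure, e.g. a branch that already exists
            subprocess.run(cmd, check=True)
    return commands
```

Calling `create_worktrees(["signup-flow", "billing-export"], Path("../worktrees"))` only returns the commands, which makes it easy to eyeball them before passing `dry_run=False` inside a real repository.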

I also verify that the plan clearly includes my expectations for testing. In smaller codebases global guidance is usually enough, but larger, legacy codebases often have too many competing patterns for a model to determine the right approach. The tests work as a simple sniff test for the agent to know whether it’s done.

In the past, I’d pack as much code as possible into the plan so the execution model could pick up where the spec left off. These days, one model usually handles both phases, but the trick still works when I use two different models.

Execution

Once I’m happy with the proposed plan, I let the model take over.

Trying to “pair-program” with a model means trying to keep pace with something that runs much faster than I do, which slows the model down or stresses me out. It’s easier to take a backseat while the model drives and spins up subagents, and step back in once it stops. These days I’ve disabled Claude Code’s notifications - I’ll check on it on my own time.

A good chunk of the work of giving agents real autonomy is harness setup. Configuring safe baseline permissions ahead of time heads off unnecessary permission requests. Hooks handle the boring automatable stuff like formatting and blocking commands that will run afoul of permissions. Project conventions live in skills and ADRs instead of crowding the primary guidance files, so that more of an agent’s context can be reserved for the problem. Providing agents with the right tools is as important as the plan.
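To make the hook idea a little more concrete, here is a minimal sketch of the kind of check a pre-execution hook might run before a shell command goes through. It is not any particular harness’s hook API: the deny list and the `is_blocked` name are assumptions for illustration, and a real hook would be wired in through whatever mechanism the harness provides.

```python
import shlex

# Hypothetical deny list: each entry is a command prefix, as a token sequence.
DENIED_PREFIXES = [
    ["rm", "-rf"],
    ["git", "push", "--force"],
    ["sudo"],
]


def is_blocked(command: str) -> bool:
    """Return True if the command starts with any denied token prefix."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess what was meant.
        return True
    return any(tokens[: len(prefix)] == prefix for prefix in DENIED_PREFIXES)
```

This only inspects the start of a single command, so pipelines and `&&` chains would need to be split first. The point is that a cheap, deterministic check can reject a command up front instead of leaving the agent stuck on a permission dialog.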

I’m still uncomfortable with yolo mode. As IBM told us all the way back in 1979, “A computer can never be held accountable.” Recently, I’ve been leaning on Claude Code’s auto mode as a middle ground between approving every shell call and throwing all caution to the wind. When auto mode isn’t available, I only run unattended inside a VM, but setting up a dev container just to make it safe is too much friction for daily work. Thankfully, most harnesses will “accumulate” granted permissions so I don’t come back to a permission dialog that often.

Review

I have general guidance that instructs my harnesses to verify the plan has been completed via self review before telling me that they’re done. Hopefully by the time I start to look over the work it’s in decent shape.

The first thing I do manually is ask a second session (usually with a GPT model) to verify that the work done on disk matches the plan. While that runs, I run a second pass of the reviewer framework to get a sense of how much polish will be needed.

If something has veered off course from the plan, I’ll ask the agent why and then try to course correct. Once I’m sure the plan is completed, I’ll let the reviewer framework handle missing polish. Three to five passes with the reviewer are usually enough to take care of most serious issues.
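The repeated reviewer passes can be sketched as a bounded loop. This is an assumption about the shape of the process, not the reviewer framework itself: `review` and `fix` are hypothetical callables standing in for one reviewer pass and for an agent acting on its findings, and the default cap of five mirrors the upper end of the range mentioned above.

```python
from typing import Callable


def review_until_clean(
    review: Callable[[], list[str]],
    fix: Callable[[list[str]], None],
    max_passes: int = 5,
) -> tuple[int, list[str]]:
    """Run reviewer passes until one comes back clean or the cap is hit.

    Returns the number of passes run and the findings from the last pass.
    If the cap is hit, those findings were handed to fix() but not re-reviewed.
    """
    findings: list[str] = []
    for passes in range(1, max_passes + 1):
        findings = review()
        if not findings:
            return passes, []
        fix(findings)
    return max_passes, findings
```

Capping the loop keeps a reviewer that never comes back clean from running indefinitely; anything still open after the last pass is a prompt for a human to step in.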

At that point, I push the branch up. Ideally, the review pass that runs in CI comes back clean. This is usually the first time I read the work line by line.

Humans in the Loop

I think the new name for this style is spec-driven development, but it’s really just what effective teams have been doing for years - writing down what needs to happen and federating it to whoever’s doing the work.

Generative AI makes it extremely easy to create code and has recently gotten to the point that it can review its own work effectively. It is still on humans to decide what to build. I expect that humans will have to stay in the loop directing the growth of a product as long as the product is for humans.

That said, the role of a software engineer has shifted considerably. Per HBR, AI won’t replace humans, but humans with AI will replace humans without AI. Yet

The heart of software is its ability to solve domain-related problems for its user.

Eric Evans

I’m probably a worse programmer than I was a year or two ago and definitely worse than I was when I was a senior engineer. As a staff engineer and director, my role evolved beyond code generation into setting engineering strategy and managing technical quality while staying in alignment with company leadership.

It’s hard not to see a parallel with directing agents. Karpathy says “The hottest new programming language is English.” I truly think that the best AI engineers are former managers.