It's Not the Model

A colleague at work recently proposed a series of experiments to "see if AI can do X yet, or if we need to wait for the technology to mature." He was sincere. He thought he was proposing a principled de-risking strategy. He wasn't — he was de-risking the wrong variable. I already knew without question that AI could do the thing. We just hadn't built the software around the model to let it do the thing yet. No experiment required; a straightforward software engineering problem.

Most engineers I know who've spent real time with Claude Code come away genuinely impressed. The model is good. The interesting conversation isn't whether the technology works — it does — but what happens when someone hits a specific wall. "AI isn't good at frontend work." "AI can't handle really large codebases." I've heard both in the last few months from sharp engineers. Both are wrong: Agent GTD has a slick React frontend Claude built end-to-end; Personal KB is a substantial Python codebase — multi-backend storage, hybrid search, knowledge graph, ReAct retrieval loop — that Claude has shipped over hundreds of sessions. The model can do these things. The work to let it do them, in the specific contexts where those engineers tried, hadn't been done.

When AI fails on a task you actually needed it to do, the reflex is to reach for "the model isn't ready yet." Sometimes that's correct. Usually it isn't. Frontier models are already capable enough for most normal knowledge work. The failures you actually see fall into two buckets — and once you can name them, you stop blaming the model and start fixing the layer that's actually broken.

The two failures

Context failures. The model has the capability to do the task but doesn't know what it needs to know. A common shape: a confident, plausible answer that's subtly wrong because the model is reasoning from training-data memory instead of current reality. A few weeks ago I watched Claude recommend brew install aws-sam-cli. AWS dropped the Homebrew tap in 2023. The model wasn't incapable of answering correctly — it just didn't have current docs in scope, and its priors were three years stale.
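
The fix for that class of failure is mostly plumbing: put the current source of truth in front of the model before it answers. A minimal sketch of the pattern, in Python; the docs URL is hypothetical and call_model stands in for whatever client you already use:

```python
# Minimal sketch of "put the source of truth in scope" (context engineering).
# The docs URL and call_model() are placeholders, not real endpoints or APIs.
import urllib.request

def fetch_current_docs(url: str, max_chars: int = 8000) -> str:
    """Pull the live docs so the model reasons from them, not from stale priors."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")[:max_chars]

def grounded_prompt(question: str, docs_url: str) -> str:
    docs = fetch_current_docs(docs_url)
    return (
        "Answer using ONLY the current documentation below. "
        "If the docs contradict what you remember, trust the docs.\n\n"
        f"<current_docs>\n{docs}\n</current_docs>\n\n"
        f"Question: {question}"
    )

# Usage (hypothetical URL): the stale-priors failure disappears once the
# current install page is actually in the context window.
# prompt = grounded_prompt("How do I install the AWS SAM CLI on macOS?",
#                          "https://docs.example.com/sam-cli/install")
# answer = call_model(prompt)  # call_model is whatever client you already use
```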

Orchestration failures. The model has the capability but the workflow won't let it use it. A common shape: a feature broken into six obvious sub-tasks gets implemented sequentially in a single conversation over four hours, when six headless agents could have shipped the same work in thirty minutes of parallel execution. The capability was there. The workflow couldn't reach it.
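
For contrast, here is roughly what the parallel version looks like. It assumes a headless invocation along the lines of claude -p; the sub-task strings are invented for illustration, and you would substitute whatever non-interactive entry point your setup actually has:

```python
# Minimal sketch of the parallel version: independent sub-tasks dispatched to
# headless agents at once instead of one long serial conversation.
# Assumes a headless CLI entry point like `claude -p "<prompt>"`; substitute
# whatever non-interactive invocation you actually use.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SUB_TASKS = [
    "Add the API endpoint for X",
    "Write the migration for Y",
    # ...the other sub-tasks of the feature
]

def dispatch(task: str) -> str:
    """Run one headless agent on one sub-task and return its output."""
    result = subprocess.run(
        ["claude", "-p", task],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

with ThreadPoolExecutor(max_workers=len(SUB_TASKS)) as pool:
    outputs = list(pool.map(dispatch, SUB_TASKS))  # all sub-tasks run concurrently
```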

Almost everything I see attributed to "the model can't" turns out to be one of these two. Capability is the most visible variable, so it's the one that gets blamed. It's almost always the wrong answer.

The diagnostic

When you hit a wall with AI, ask three questions in order:

  1. Did it lack the right context? Were the relevant docs in scope? Was a prior decision findable? Did it have to reason from training-data memory instead of source-of-truth?
  2. Did the workflow force the model to operate below its capacity? Were tasks done in series that could have been done in parallel? Was the human re-explaining things across sessions instead of capturing them once? Was dispatch capacity sitting unused because the planning surface didn't support it?
  3. Did the model genuinely lack the capability? This is the least likely answer. Reach for it last.

The three answers point in three different directions. Context failures mean engineer the information layer around the model. Orchestration failures mean engineer the workflow. Capability failures mean wait for the next model. Two of those three are within your reach today.

Two failures, two tools

Both of the systems I've spent the last few months building came from recognizing a recurring failure pattern, naming it, and engineering around it.

Personal KB came from a recurring context failure. Every session with Claude started cold. Every hard-won lesson — debugging insight, architectural decision, gotcha that cost me an hour — evaporated when the context window closed. The next session would re-derive it, sometimes correctly, sometimes not, always at a cost. The model could synthesize across history; nothing held the history. So I built a knowledge base the agent reads from and writes to as it works. Not a database. A memory layer. The model's capability stayed constant; the context surrounding it now compounds.
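
To make the shape concrete, here is an illustrative sketch of that read/write surface. It is not Personal KB's actual interface; the names are made up and the keyword search is a stand-in for the real hybrid-search machinery:

```python
# Illustrative sketch of a memory layer the agent writes to as it works and
# reads from at the start of the next session. Names (MemoryLayer, remember,
# recall) are invented for this example.
import json
from pathlib import Path

class MemoryLayer:
    def __init__(self, path: str = "kb.jsonl"):
        self.path = Path(path)

    def remember(self, topic: str, lesson: str) -> None:
        """Capture a hard-won lesson before the context window closes."""
        with self.path.open("a") as f:
            f.write(json.dumps({"topic": topic, "lesson": lesson}) + "\n")

    def recall(self, query: str, limit: int = 5) -> list[str]:
        """Crude keyword recall; a real layer would use hybrid search."""
        if not self.path.exists():
            return []
        hits = []
        for line in self.path.read_text().splitlines():
            note = json.loads(line)
            if query.lower() in (note["topic"] + " " + note["lesson"]).lower():
                hits.append(note["lesson"])
        return hits[:limit]

# kb = MemoryLayer()
# kb.remember("aws-sam-cli", "Homebrew tap was dropped; use the official installer.")
# kb.recall("sam-cli")  # the next session starts warm instead of cold
```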

Agent GTD came from a recurring orchestration failure — and from running into my own ceiling as a human. Before I built it, I was running six Claude Code sessions at once on my laptop, cycling through them, driving the build-test-learn-fix loop in real time across half a dozen projects. It was exhausting. I couldn't stop, because the features were right there, waiting to be shipped, and stopping felt like leaving money on the table. But I also couldn't keep going at that pace.

The fix wasn't more discipline. It was an interface that separated planning from doing. With Agent GTD (and its dispatch worker), I spend twenty minutes with a tech-lead agent fleshing out the next two or three hours of work, then dispatch it to headless agents while I go do the same thing on another project. The plan persists as state. The execution runs without me. My human capabilities don't scale; the AI's do; the tool bridges that gap.
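
Here is a rough sketch of the "plan persists as state" idea, with made-up names rather than Agent GTD's actual data model; run_agent stands in for whatever executes a headless agent:

```python
# Illustrative sketch: tasks written once during the planning session, then
# claimed and executed by a worker without the human in the loop.
# Not Agent GTD's actual schema; names are invented for this example.
import json
from dataclasses import dataclass, asdict
from pathlib import Path

PLAN = Path("plan.json")

@dataclass
class Task:
    id: str
    prompt: str
    status: str = "queued"  # queued -> running -> done / failed

def save_plan(tasks: list[Task]) -> None:
    PLAN.write_text(json.dumps([asdict(t) for t in tasks], indent=2))

def load_plan() -> list[Task]:
    return [Task(**t) for t in json.loads(PLAN.read_text())]

def work_one(run_agent) -> bool:
    """Claim the next queued task, run it headlessly, persist the result."""
    tasks = load_plan()
    for t in tasks:
        if t.status == "queued":
            t.status = "running"
            save_plan(tasks)
            t.status = "done" if run_agent(t.prompt) else "failed"
            save_plan(tasks)
            return True
    return False  # nothing left; the planning session was the only human step
```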

The bigger thing this is actually about

Step back from the diagnostic for a moment and look at what the solutions to these failures actually are.

A memory layer because human session-spanning attention can't hold what the AI can synthesize. A state machine because human planning can't be re-executed twenty times a day for twenty dispatches. A dispatch service because human supervision can't sit at a keyboard and watch six parallel work streams indefinitely.

These aren't productivity hacks. They're interface inventions. The desktop-era assumption — that the human is the rate-limiter and the machine waits — has flipped. The AI can run six conversations indefinitely. The human can't. The tools that bridge that asymmetry are doing fundamental HCI work, even when they look like task-management apps and MCP servers.

That's the layer most people building with Claude Code don't realize they're working at. They think they're improving their workflow. They're actually inventing the next generation of human-machine interface, garage by garage, ahead of any framework or design language. The interfaces that win in this era won't be the ones that copy the patterns of the desktop era — they assumed the wrong bottleneck. They'll be the ones that internalize the new asymmetry and design around it.

What to do Monday

Install my tools if they help you skip ahead — that's what I built them for. Adopt them, fork them, replace them with better ones. None of that is the point.

The point is the diagnostic. When AI fails on a task you genuinely needed it to do, don't reach for "the model isn't ready yet." Ask which failure this is. If it's context, engineer the information layer. If it's orchestration, engineer the workflow. If it's genuinely neither (and it rarely is), then yes, wait for the next model. The labs will keep pushing. They always do.

But pay attention to the kinds of failures you keep hitting. Patterns there are signal. Each of my repos started as a pattern I'd hit five times in a week and finally got tired of working around. The next generation of tools for AI-augmented engineering will be built by the people who recognized those patterns earliest, named them honestly, and decided not to wait for someone else to fix them.
