Give the Planner an Adversary

A couple of weeks ago, in "The Harness Is a Stack," I laid out the layers I ship features through: a lead session that does the expensive thinking, plan agents that groom ambiguous tasks into dispatch-ready specs, build agents that implement them, a manage agent that handles the merge dance. I described the plan agents as if they were a solved problem.

They weren't — and the honest version is that I didn't know it. I thought plan dispatch was good: it produced clean specs, build agents shipped against them, features landed. It turned out to be the weakest link in the stack, and I couldn't see it until dynamic workflows came along and groomed circles around it. This is the post about the gap I didn't know was there.

The problem with a planner that nobody checks

Agent GTD, my task system, has always had two ways to hand work to an agent — both executed by a dispatch worker on an always-on box at home. Build mode: implement this, push a branch. Plan mode: take this vague task and groom it — write the acceptance criteria, find the files to touch, draw the scope boundaries — so a build agent can one-shot it later.

Plan mode is one agent, one pass. It reads the task, reads some code, and writes a spec. And then nothing checks that spec. The agent that drafts it is also the only agent that ever looks at it. You find out whether the plan was any good when a build agent acts on it and comes back with the wrong thing — which is the most expensive possible place to discover a bad spec.

Plan agents fail in a specific, recurring way: they take the path of least resistance. They ground in the design doc instead of the code, because the doc is right there in the prompt and the code is work to go read. They name a structure without enumerating it: they write "the catalog supports five moves" and then stop, instead of listing all five with their contracts. They assume a prerequisite shipped because the doc says it should have. The spec reads clean and confident. It's wrong in ways that stay invisible until someone builds against it.

I've started calling this a design-miss factory: a spec that names a structure without committing to its members will manufacture the same class of bug over and over, because every build agent downstream fills the gap with a guess.

What changed

Last week Anthropic shipped dynamic workflows in Claude Code — a way to orchestrate subagents from a deterministic script instead of leaving the fan-out to a model's judgment. You don't write the JavaScript yourself — you hand the lead your intent and it writes the orchestration script: what runs in parallel, what pipelines, what gets retried, what shape the output has to be. The control flow is code; the cognition is agents. And the code itself gets written by an agent too.

That turned the plan step from a single dispatch into something I can give structure. My grooming workflow now runs three phases per task, and it pipelines them — so one task is being critiqued while the next is still being drafted:

Draft. One agent reads the cited design doc and the actual repo and returns a structured spec — acceptance criteria, files to modify, scope boundaries, an honest readiness flag.
Critic. Two agents, in parallel, whose entire job is to disagree. One is a code-grounding lens: open every file the draft named, confirm each path, symbol, and signature exists in the repo today. The other is a spec-rigor lens: every enum enumerated, every constant pinned to a literal, no "appropriate" or "as needed" hiding a decision nobody actually made.
Synthesize. Fold the critic findings back in, re-open whatever got flagged, and set a readiness that tells the truth: ready, or blocked on a named prerequisite, or needs a human decision.
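The pipelining is easier to see with plain promises. Below is a minimal sketch, not the Claude Code runtime: `pipeline`, `parallel`, and `fakeAgent` are hypothetical stand-ins I wrote to mimic the call shape used in the workflow script, and the delays are made up. Each item walks the stages on its own, so a fast item can already be in front of its critics while a slow one is still being drafted.

```javascript
// ASSUMPTION: local stand-ins for illustration only. The real workflow
// runtime provides its own pipeline/parallel/agent; their internals may differ.

// Each item flows through every stage independently of the other items.
async function pipeline(items, ...stages) {
  return Promise.all(items.map(async (item) => {
    let value = item
    for (const stage of stages) value = await stage(value, item)
    return value
  }))
}

// Start every thunk at once and collect the results in order.
const parallel = (thunks) => Promise.all(thunks.map((thunk) => thunk()))

// Fake "agent": records when it starts, then resolves after `ms` milliseconds.
const events = []
const fakeAgent = (label, ms, result) => {
  events.push(label)
  return new Promise((resolve) => setTimeout(() => resolve(result), ms))
}

const demo = pipeline(
  [{ slug: 'fast', draftMs: 5 }, { slug: 'slow', draftMs: 100 }],
  // Draft
  (item) => fakeAgent(`draft:${item.slug}`, item.draftMs, { spec: item.slug }),
  // Critic: two lenses in parallel on the same draft
  (draft, item) => parallel([
    () => fakeAgent(`critic:grounding:${item.slug}`, 5, { lens: 'grounding', findings: [] }),
    () => fakeAgent(`critic:rigor:${item.slug}`, 5, { lens: 'rigor', findings: [] }),
  ]).then((critiques) => ({ draft, item, critiques })),
  // Synthesize
  (bundle) => fakeAgent(`synth:${bundle.item.slug}`, 5, {
    item_id: bundle.item.slug,
    critiquesSeen: bundle.critiques.length,
  }),
)

demo.then((finals) => console.log(events, finals))
```

With these delays the fast item has been drafted, critiqued, and synthesized before the slow item's draft returns, which is the overlap the real workflow gets across tasks.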

The word that matters here is adversarial. Research agents in parallel do fact-finding and merge additively — more coverage, same direction. Critic agents produce sharpening through conflict. The drafter wants to ship a clean spec; the critic is paid to find the lie in it. You don't get that from one agent reviewing its own work, and you don't get it from agents that are all rowing the same way.

What it caught yesterday

Here's a real one. I was grooming two coupled tasks in a side project — an AI running coach. Both were about making the app's training-plan adaptation visible: when you log a workout that was too hard, the coach should adjust the rest of your week.

The draft specs looked fine. Then the code-grounding critic opened the files and found the thing I'd never have caught from the design doc: the adaptation engine the entire feature is designed around has never run in production. There are two engines in the codebase. The one the docs describe — the one I assumed was live — is wired into nothing. The one actually running is an older, simpler one that bypasses the whole design.

A single plan agent, grounding in the design doc, would have written a confident spec against an engine that doesn't run, and a build wave would have shipped against a phantom. The critic turned a falsely-confident "ready" into an honest "needs a decision" — and handed me, the layer that's supposed to make load-bearing calls, the actual architectural question: which engine is the real one, and what's the cutover? That's the best possible outcome. The expensive judgment landed on my desk, correctly scoped, instead of leaking into a build agent's guess.

The stack got a new move

The harness-is-a-stack argument was: each layer protects the layer above from toil. Plan agents burn their tokens so the lead doesn't burn its. That's still true — but it was only half the job, because a plan agent can protect me from toil while quietly failing to protect me from error, handing up clean specs that are confidently wrong.

Adversarial grooming adds the missing half. The stack doesn't just push toil downward now; it manufactures disagreement downward. It generates the conflict that surfaces a bad spec before it costs a build wave, and it routes the genuinely-human decisions — the ones the critics couldn't resolve against the code — back up to the one layer that should be spending judgment on them.

It's still agents all the way down. I just stopped letting the most important one work without a critic.

Steal this

One caveat up front: the wiring below is specific to my own tooling, which is Agent GTD (the task system) and the dispatch worker that runs the build agents. Both are open source. Grab them, use them, fork them, whatever.

But the pattern — groom with an adversarial workflow, then write the result through a thin CLI — ports to whatever task tracker you already have.

The mechanics are worth a paragraph, because they're where the leverage actually leaks if you're not careful. The grooming workflow returns its specs as JSON, and someone has to write them into the task system. That someone is the lead — the exact layer whose context I'm trying to protect. Originally the lead did the write through tool calls that pulled the full acceptance-criteria-and-files payload back through its own window: about 10KB an item, spent on nothing but a database write. That's the toil the whole stack exists to avoid, smuggled right back into the most expensive layer.

So I added two subcommands to the Agent GTD CLI — add-item and update-item — that take the spec as JSON on stdin or from a file, auto-fetch the optimistic-lock version so there's no read-modify-write round trip, and print the new ID. Now the workflow's output file pipes straight into the task system: groom writes finals.json → jq each spec out → agent-gtd update-item <id> --from-json <spec>.json --status ready. The big arrays never touch the lead's context. The groom-to-dispatch handoff costs the control plane almost nothing.
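For anyone who would rather do the split in Node than in jq, here is a rough sketch of the same handoff. It assumes the workflow's return value has the `{ finals: [...] }` shape and that `update-item` accepts the spec object as-is; `planHandoff` and the `<item_id>.json` file naming are hypothetical. The only CLI flags used are the ones named above. The function is pure: it plans the file writes and the argv arrays, and executes nothing.

```javascript
// Sketch: plan the groom-to-dispatch handoff without running anything.
// ASSUMPTION: planHandoff and the file naming are illustrative, not part of Agent GTD.
function planHandoff(result) {
  const writes = []   // { file, json } pairs a script would write to disk
  const commands = [] // argv arrays for specs that are ready to dispatch
  const held = []     // small summaries of specs that still need the lead

  for (const spec of result.finals) {
    if (spec.readiness !== 'ready') {
      // Keep only the short fields; the big arrays stay out of the lead's context.
      held.push({
        item_id: spec.item_id,
        readiness: spec.readiness,
        blocking_prereqs: spec.blocking_prereqs,
        open_questions: spec.open_questions,
      })
      continue
    }
    const file = `${spec.item_id}.json`
    writes.push({ file, json: JSON.stringify(spec, null, 2) })
    commands.push(['agent-gtd', 'update-item', spec.item_id, '--from-json', file, '--status', 'ready'])
  }
  return { writes, commands, held }
}

// Usage with two made-up specs.
const plan = planHandoff({
  finals: [
    { item_id: 'a1', readiness: 'ready', blocking_prereqs: [], open_questions: [] },
    { item_id: 'b2', readiness: 'needs-decision', blocking_prereqs: [], open_questions: ['Which engine is live?'] },
  ],
})
console.log(plan.commands.map((argv) => argv.join(' ')))
console.log(plan.held)
```

Returning argv arrays instead of a shell string keeps quoting out of the picture if the commands are later passed to a process spawner.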

Two pieces make it real. The first is the steering in CLAUDE.md that tells the lead when to reach for the adversarial groom — and to write the results through the CLI rather than inline:

## Grooming complex items — draft → critic → synthesize

A single plan-mode dispatch is one agent, one pass, no adversarial check. For a
small task that just needs a spec, that's fine. For anything multi-file,
cross-package, or where the design doc might be ahead of the code, that single
pass ships wrong specs — phantom file paths, stale signatures, enums left
un-enumerated. For those, groom with the `groom-to-ready` Workflow instead.

The shape, per item, pipelined (draft → critic → synthesize):
1. Draft — one agent reads the cited design docs AND the actual repo, and returns
   a structured spec (acceptance_criteria, files_to_modify, scope_out, readiness,
   blocking_prereqs, open_questions, grounding_notes).
2. Critic — two parallel lenses on the draft:
   - code-grounding / reality-check — open every named file; confirm each path,
     symbol, signature, and prerequisite exists in the repo TODAY.
   - concrete-by-default / spec-rigor — every enum enumerated, every constant
     pinned to a literal, no "appropriate"/"as needed" acceptance criteria.
3. Synthesize — fold the findings in, re-open any flagged file, set an honest
   readiness (ready vs blocked-on-prereq vs needs-decision).

Ground in current code, not the design doc — a doc claim is a claim to verify. An
honest "blocked-on-prereq" naming the exact missing code beats a confident
"ready" against a phantom API.

The workflow RETURNS the specs; the control plane reviews them and writes them
into the task system. The workflow itself does NOT write them directly (the
review is the checkpoint, and where you make any load-bearing call the critics
surfaced).

Write specs with the CLI, not inline tool calls: `agent-gtd add-item` /
`update-item` take the spec as JSON via --from-json/--stdin, auto-fetch the
version, and print the new ID — keeping the big AC/files arrays out of the
control-plane context.

The second is the workflow script itself. This is the actual thing I run — project-agnostic, so the grooming discipline is hardcoded and you pass project context and the items at call time:

/**
 * groom-to-ready.workflow.js — reusable Claude Code Workflow script.
 *
 * Grooms one or more GTD items to dispatch-ready via draft → critic → synthesize,
 * grounding every acceptance criterion in the CURRENT codebase. Project-agnostic:
 * the universal grooming discipline is hardcoded; you supply project context and
 * the items at call time.
 *
 * USAGE (from a Claude Code session):
 *   Workflow({
 *     scriptPath: "~/scripts/groom-to-ready.workflow.js",   // expand ~ to an abs path
 *     args: {
 *       context: "Project: <name>. Stack: <...>. WHERE TO LOOK: <key dirs/files/docs>. " +
 *                "Ground-truth facts the agents should treat as authoritative: <...>",
 *       items: [
 *         { id: "<gtd-uuid>", slug: "short-slug", title: "<item title>",
 *           seed: "<the design intent / current state to verify / scope>",
 *           // optional, per-item steering for the critics:
 *           criticalConstraint: "<e.g. 'do NOT touch fileX (a rollout owns it)'>" }
 *       ]
 *     }
 *   })
 *
 * RETURNS: { finals: [ <SPEC per item> ] }. The control plane reviews each spec,
 * then writes it into GTD (update_item AC/files/scope + status), or keeps it `new`
 * with the blockers recorded if readiness != "ready". Do NOT have the workflow write
 * GTD directly — the review is the checkpoint.
 *
 * COST: ~4 agents per item (1 draft + 2 critics + 1 synth). Pipelined across items.
 *
 * Notes: plain JS (no TS). No Date.now()/Math.random(). args may arrive stringified.
 */

export const meta = {
  name: 'groom-to-ready',
  description: 'Groom GTD items to dispatch-ready via draft→critic→synthesize, grounding every AC in the current codebase. Pass project context in args.context and the items (with seeds) in args.items.',
  phases: [
    { title: 'Draft', detail: 'read the cited docs + actual repo, draft a code-grounded spec per item' },
    { title: 'Critic', detail: 'code-grounding + concrete-by-default lenses per draft' },
    { title: 'Synthesize', detail: 'fold critiques into a final spec with an honest readiness verdict' },
  ],
}

const SPEC_SCHEMA = {
  type: 'object',
  additionalProperties: false,
  properties: {
    item_id: { type: 'string' },
    title: { type: 'string' },
    readiness: { type: 'string', enum: ['ready', 'blocked-on-prereq', 'needs-decision'] },
    blocking_prereqs: { type: 'array', items: { type: 'string' } },
    acceptance_criteria: { type: 'array', items: { type: 'string' }, minItems: 1 },
    files_to_modify: {
      type: 'array',
      items: {
        type: 'object',
        additionalProperties: false,
        properties: { path: { type: 'string' }, change: { type: 'string' } },
        required: ['path', 'change'],
      },
      minItems: 1,
    },
    scope_out: { type: 'array', items: { type: 'string' } },
    build_engine: { type: 'string' },
    open_questions: { type: 'array', items: { type: 'string' } },
    grounding_notes: { type: 'string' },
  },
  required: ['item_id', 'title', 'readiness', 'blocking_prereqs', 'acceptance_criteria', 'files_to_modify', 'scope_out', 'build_engine', 'open_questions', 'grounding_notes'],
}

const CRITIQUE_SCHEMA = {
  type: 'object',
  additionalProperties: false,
  properties: {
    lens: { type: 'string' },
    ready: { type: 'boolean' },
    findings: {
      type: 'array',
      items: {
        type: 'object',
        additionalProperties: false,
        properties: {
          severity: { type: 'string', enum: ['critical', 'major', 'minor'] },
          issue: { type: 'string' },
          fix: { type: 'string' },
        },
        required: ['severity', 'issue', 'fix'],
      },
    },
  },
  required: ['lens', 'ready', 'findings'],
}

const UNIVERSAL = [
  'You are GROOMING a GTD item to dispatch-ready. You are NOT implementing it. Do NOT modify any files; read only.',
  '',
  'CARDINAL RULE: ground every claim in the CURRENT codebase. Design docs may be ahead of (or behind) reality — a doc claim is a claim to VERIFY by opening the file. Before an AC says "replace X with Y", confirm BOTH that X exists where claimed AND that Y exists. If a prerequisite the item depends on does not exist yet, that is the single most important thing to report.',
  '',
  'HONEST READINESS (the readiness field):',
  '- "ready" — every prerequisite exists in the code today and a build agent could one-shot it.',
  '- "blocked-on-prereq" — depends on code that does NOT exist yet; list exactly what is missing in blocking_prereqs. Still produce the best AC you can, but be explicit it cannot dispatch until the prereq lands.',
  '- "needs-decision" — a real design ambiguity needs the human before grooming can finish.',
  'Do NOT inflate readiness. A confident "ready" against a phantom API is the worst possible outcome — it sends a build agent to implement against something that is not there.',
  '',
  'QUALITY BAR (concrete-by-default): enums are enumerated with members + semantics; formulas/constants are pinned to literals (a starting value, never "an appropriate value"); EVERY files_to_modify path and EVERY symbol/function/schema an AC names is verified by opening the file; no "appropriate"/"correct"/"as needed" acceptance criteria — each must be a testable assertion; end grounding_notes with a "soft values / deferrals" line listing anything not pinned.',
  '',
  'build_engine convention: prefer "claude-code-sonnet" for well-scoped non-trivial execution.',
].join('\n')

const ARGS = typeof args === 'string' ? JSON.parse(args) : args
const ITEMS = Array.isArray(ARGS) ? ARGS : (ARGS && ARGS.items)
if (!Array.isArray(ITEMS)) {
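  // String(...) guards the case where args is undefined and JSON.stringify returns undefined.
  throw new Error('groom-to-ready: expected args.items array, got: ' + String(JSON.stringify(args)).slice(0, 200))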
}
const CONTEXT = ARGS && ARGS.context ? String(ARGS.context) : ''
const SHARED = CONTEXT
  ? UNIVERSAL + '\n\n## Project context (where to look, conventions, ground-truth facts to treat as authoritative)\n' + CONTEXT
  : UNIVERSAL

const finals = await pipeline(
  ITEMS,
  // ---- Stage 1: DRAFT ----
  (item) => agent(
    [
      SHARED,
      '',
      `## Item to groom: ${item.title || item.slug || item.id}`,
      `GTD id: ${item.id}${item.priority ? ` | priority: ${item.priority}` : ''}`,
      '',
      'Seed (the design intent / current state to verify / scope — this is intent to STRUCTURE and VERIFY, not ground truth about the code):',
      item.seed || '(none provided)',
      item.criticalConstraint ? `\nItem-specific constraint:\n${item.criticalConstraint}` : '',
      '',
      'TASK: Read the cited docs AND the actual repo. First determine readiness (do the prerequisites exist in the repo today?). Then produce the groomed spec: testable acceptance_criteria tied to real code; files_to_modify each verified to exist (or marked NEW); scope_out boundaries; build_engine; blocking_prereqs (exact missing code, if any); open_questions (only genuine human decisions); grounding_notes citing the specific files you opened and what you found.',
    ].join('\n'),
    { label: `draft:${item.slug || item.id}`, phase: 'Draft', schema: SPEC_SCHEMA },
  ),
  // ---- Stage 2: CRITIC (two parallel lenses) ----
  (draft, item) => parallel([
    () => agent(
      [
        SHARED,
        '',
        `LENS: code-grounding / reality-check. Review this DRAFT groomed spec for "${item.title || item.slug || item.id}".`,
        '',
        'DRAFT SPEC:',
        JSON.stringify(draft, null, 2),
        '',
        'Open the actual files. Verify: (1) the readiness verdict is honest — if it says "ready" but a named prerequisite is NOT in the repo, downgrade it and say so; (2) every files_to_modify path exists (or is correctly marked NEW); (3) every symbol/function/type/schema the ACs name actually exists with the implied signature; (4) the blocking_prereqs list is complete and accurate. Flag phantom paths, wrong signatures, stale-design-doc assumptions. Set ready=true only if the spec is fully grounded AND its readiness verdict is honest.',
      ].join('\n'),
      { label: `critic:grounding:${item.slug || item.id}`, phase: 'Critic', schema: CRITIQUE_SCHEMA },
    ),
    () => agent(
      [
        SHARED,
        '',
        `LENS: concrete-by-default / spec-rigor. Review this DRAFT groomed spec for "${item.title || item.slug || item.id}".`,
        '',
        'DRAFT SPEC:',
        JSON.stringify(draft, null, 2),
        '',
        'Find every place the spec is soft where it must be concrete: an enum member missing or its semantics unstated; a constant/format left as prose instead of a literal; an AC using "appropriate"/"correct"/"as needed" instead of a testable assertion; a missing test for a stated behavior; a missing soft-values/deferrals note. For each finding give the EXACT fix. Set ready=true only if a build agent could one-shot this with zero design guesses (assuming prereqs are met).',
      ].join('\n'),
      { label: `critic:rigor:${item.slug || item.id}`, phase: 'Critic', schema: CRITIQUE_SCHEMA },
    ),
  ]).then((critiques) => ({ draft, item, critiques: critiques.filter(Boolean) })),
  // ---- Stage 3: SYNTHESIZE ----
  (bundle) => agent(
    [
      SHARED,
      '',
      `## Finalize the dispatch-ready spec for "${bundle.item.title || bundle.item.slug || bundle.item.id}" (id ${bundle.item.id})`,
      '',
      'DRAFT:',
      JSON.stringify(bundle.draft, null, 2),
      '',
      'CRITIC FINDINGS (apply every actionable one; re-open any file a critic flagged to confirm the correction):',
      JSON.stringify(bundle.critiques, null, 2),
      '',
      'Produce the FINAL spec. acceptance_criteria testable + tied to verified code; files_to_modify confirmed (mark NEW clearly); scope_out explicit; build_engine set; item_id = the GTD id above. Set readiness honestly per the rules: "ready" only if all prereqs exist in the repo today; otherwise "blocked-on-prereq" with the exact missing code in blocking_prereqs. open_questions empty unless a real human decision remains. grounding_notes: what you verified + the readiness rationale + a soft-values/deferrals line.',
    ].join('\n'),
    { label: `synth:${bundle.item.slug || bundle.item.id}`, phase: 'Synthesize', schema: SPEC_SCHEMA },
  ),
)

return { finals: finals.filter(Boolean) }
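
Since the lead's review is the checkpoint, a cheap mechanical pass over `finals` can flag specs whose readiness contradicts their own fields before anyone reads them closely. What follows is a minimal sketch based on my reading of the readiness rules in the prompt above: `readinessProblems` is a hypothetical helper, not part of the workflow script or the Agent GTD CLI, and the three checks are an interpretation rather than anything the schema enforces.

```javascript
// Sketch: flag specs whose readiness verdict disagrees with their own fields.
// ASSUMPTION: these rules are an interpretation of the prompt's readiness
// definitions; they are not enforced by SPEC_SCHEMA.
function readinessProblems(spec) {
  const problems = []
  const prereqs = spec.blocking_prereqs || []
  const questions = spec.open_questions || []

  if (spec.readiness === 'ready') {
    if (prereqs.length > 0) problems.push('ready but blocking_prereqs is not empty')
    if (questions.length > 0) problems.push('ready but open_questions is not empty')
  } else if (spec.readiness === 'blocked-on-prereq') {
    if (prereqs.length === 0) problems.push('blocked-on-prereq but no prerequisite is named')
  } else if (spec.readiness === 'needs-decision') {
    if (questions.length === 0) problems.push('needs-decision but no open question is recorded')
  } else {
    problems.push(`unknown readiness: ${spec.readiness}`)
  }
  return problems
}

// Usage with made-up specs: the first is consistent, the second is not.
const consistent = readinessProblems({
  readiness: 'blocked-on-prereq',
  blocking_prereqs: ['new engine is not wired into the request path'],
  open_questions: [],
})
const inconsistent = readinessProblems({
  readiness: 'ready',
  blocking_prereqs: ['new engine is not wired into the request path'],
  open_questions: [],
})
console.log(consistent, inconsistent)
```

A non-empty result does not mean the spec is wrong, only that the synthesizer's verdict and its supporting fields disagree, which is worth a second look before dispatch.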