<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>https://blog.jasonweddington.com</id>
  <title>Blog</title>
  <updated>2026-04-12T01:00:25.861897+00:00</updated>
  <link href="https://blog.jasonweddington.com" rel="alternate"/>
  <generator uri="https://lkiesow.github.io/python-feedgen" version="1.0.0">python-feedgen</generator>
  <entry>
    <id>https://blog.jasonweddington.com/hello-world/</id>
    <title>Hello, World</title>
    <updated>2026-03-12T00:00:00-04:00</updated>
    <content type="html">&lt;p&gt;This is the first post on my new blog, built from scratch with a custom static site generator. No frameworks, no JavaScript — just Python, Jinja2, and markdown.&lt;/p&gt;
&lt;h2&gt;Why build your own?&lt;/h2&gt;
&lt;p&gt;There are plenty of static site generators out there — Hugo, Jekyll, Eleventy, Pelican. So why build another one?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;It's small.&lt;/strong&gt; The entire build script is under 200 lines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It's mine.&lt;/strong&gt; Every line is intentional. No surprises.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It's fun.&lt;/strong&gt; Sometimes the best way to learn is to reinvent the wheel.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;A code example&lt;/h2&gt;
&lt;p&gt;Here's the core of the build pipeline in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code class="highlight"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Run the full build pipeline.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_posts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;render_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And some shell to run it:&lt;/p&gt;
&lt;pre&gt;&lt;code class="highlight"&gt;uv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;blog&lt;span class="w"&gt; &lt;/span&gt;build
open&lt;span class="w"&gt; &lt;/span&gt;output/index.html
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What's next&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tag pages and an archive view&lt;/li&gt;
&lt;li&gt;An Atom feed for RSS readers&lt;/li&gt;
&lt;li&gt;Syntax highlighting for code blocks&lt;sup class="footnote-ref"&gt;&lt;a href="#fn1" id="fnref1"&gt;[1]&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;li&gt;Eventually, deployment to S3 + CloudFront&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All in good time.&lt;/p&gt;
&lt;hr class="footnotes-sep" /&gt;
&lt;section class="footnotes"&gt;
&lt;ol class="footnotes-list"&gt;
&lt;li id="fn1" class="footnote-item"&gt;&lt;p&gt;Actually, syntax highlighting is already working — powered by &lt;a href="https://pygments.org/"&gt;Pygments&lt;/a&gt;. &lt;a href="#fnref1" class="footnote-backref"&gt;↩︎&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</content>
    <link href="https://blog.jasonweddington.com/hello-world/"/>
    <published>2026-03-12T00:00:00-04:00</published>
  </entry>
  <entry>
    <id>https://blog.jasonweddington.com/how-i-built-this-blog/</id>
    <title>How I Built This Blog with Claude Code and a Team of Arguing Agents</title>
    <updated>2026-03-13T00:00:00-04:00</updated>
    <content type="html">&lt;p&gt;I built this blog in an afternoon. Not by writing code — by describing what I wanted, letting three AI agents argue about it, and then having Claude Code build the winner.&lt;/p&gt;
&lt;p&gt;The interesting part wasn't the code. It was the debate.&lt;/p&gt;
&lt;h2&gt;The prompt&lt;/h2&gt;
&lt;p&gt;I started with a loose idea: markdown files with YAML frontmatter, a build command that does a smart rebuild of only changed content, static HTML hosted in S3. Clean and minimal — just words on a page.&lt;/p&gt;
&lt;p&gt;I also had some ideas that felt clever at the time. Incremental builds with SHA256 manifest tracking. Client-side search powered by WASM SQLite with FTS5. A tag cloud. Breadcrumbs.&lt;/p&gt;
&lt;p&gt;I typed this into Claude Code along with an unusual request: don't just build it. Spin up a team of agents — a UX designer, a tech stack architect, and a skeptic — and run a three-round debate. I've found this produces better results than a single agent optimizing in isolation. The agents push back on each other, and the ideas that survive are genuinely stronger.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This project will be a very simple, clean themed, blog of my thoughts. Here's the high-level idea:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I'll save markdown files with YAML frontmatter into ./posts&lt;/li&gt;
&lt;li&gt;I'll run a build command&lt;/li&gt;
&lt;li&gt;That build command will do a &amp;quot;smart rebuild&amp;quot; of the static HTML content for the site.
&lt;ul&gt;
&lt;li&gt;Only build the changed content (new posts, static nav, tag cloud, etc)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I'll run a publish command that pushes the content to the &amp;quot;site&amp;quot;&lt;/li&gt;
&lt;li&gt;The blog is entirely static content, so it will be hosted in S3&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given this background, spin up a team of agents to research best tech solution and propose a path forward. We need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;UX designer. I want a super clean, minimalist design. The site is pretty much just words. No pictures, no snaz.&lt;/li&gt;
&lt;li&gt;Tech stack designer. Think about how we dynamically create a nav, site breadcrumbs, on site search (WASM SQLite FTS5??), etc.&lt;/li&gt;
&lt;li&gt;Skeptic. Point out where we're over thinking things and drive for absolute simplicity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run a 3-round debate, I feel like we get the best ideas this way. Use your discretion for how to structure the debate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The debate&lt;/h2&gt;
&lt;p&gt;Claude Code launched three agents in parallel. Each one got the same brief but a different lens.&lt;/p&gt;
&lt;p&gt;In Round 1, they worked independently. The UX designer proposed a reading-first philosophy: system fonts, 65ch line width, near-black on off-white, dark mode via &lt;code&gt;prefers-color-scheme&lt;/code&gt;. No sidebar, no hero images, no share buttons. The tech stack architect laid out a thorough plan: &lt;code&gt;markdown-it-py&lt;/code&gt; for rendering, Jinja2 for templates, Pygments for syntax highlighting, a SHA256 manifest for incremental builds, Pagefind for search, &lt;code&gt;click&lt;/code&gt; for the CLI. Comprehensive. Professional. The skeptic came in swinging.&lt;/p&gt;
&lt;p&gt;In Round 2, each agent received the other two agents' Round 1 output and responded directly. This is where it got interesting.&lt;/p&gt;
&lt;h2&gt;The skeptic wrings it out&lt;/h2&gt;
&lt;p&gt;The skeptic's core argument against my incremental build idea was painfully simple: &lt;em&gt;how many posts will this blog realistically have?&lt;/em&gt; A hundred? Two hundred? Over years? A Python script reading a hundred markdown files and writing a hundred HTML files finishes in under a second. Probably under 200 milliseconds.&lt;/p&gt;
&lt;p&gt;Meanwhile, dependency tracking — knowing which posts changed, invalidating derived pages like tag indexes and the home page, tracking template changes — is a mini build system. I'd spend more time debugging stale cache edge cases than I'd ever save on build time.&lt;/p&gt;
&lt;p&gt;The tech stack architect, to their credit, conceded immediately: &amp;quot;The Skeptic is right that a full rebuild of hundreds of posts is probably under two seconds. I was optimizing for a problem that doesn't exist yet.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is textbook &lt;a href="https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it"&gt;YAGNI&lt;/a&gt; — &amp;quot;You Aren't Gonna Need It,&amp;quot; a principle from Extreme Programming coined by Ron Jeffries — and I couldn't see it from the inside. I was so focused on doing the build &amp;quot;right&amp;quot; that I'd skipped the step where I checked whether it mattered.&lt;/p&gt;
&lt;p&gt;The skeptic applied the same razor everywhere. WASM SQLite for search? &amp;quot;Wildly over-engineered for a personal blog.&amp;quot; Tag clouds? &amp;quot;Fashionable in 2006.&amp;quot; Breadcrumbs? &amp;quot;For hierarchical content — where would they even go? Home &amp;gt; Post? That's redundant.&amp;quot; Click for a two-command CLI? &amp;quot;Two functions do not constitute a CLI that needs a framework.&amp;quot;&lt;/p&gt;
&lt;p&gt;The one place the skeptic reversed course was Jinja2. They'd initially argued for f-strings — it's just string interpolation, right? But the tech stack architect pointed out that post titles with quotes, ampersands, and angle brackets need escaping, and doing that by hand in f-strings is exactly where subtle XSS bugs live. Jinja2's autoescape is the right tool. The skeptic conceded, specifically and only on those grounds.&lt;/p&gt;
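&lt;p&gt;A minimal sketch of the distinction, assuming Jinja2's standard &lt;code&gt;Environment&lt;/code&gt; API (the template and title here are illustrative, not taken from the actual generator):&lt;/p&gt;

```python
from jinja2 import Environment

# With autoescape enabled, markup-significant characters in template
# variables (ampersands, quotes, apostrophes, angle brackets) are
# converted to HTML entities before they reach the page. An f-string
# would interpolate the raw title verbatim.
env = Environment(autoescape=True)
template = env.from_string("Title: {{ title }}")
print(template.render(title="a post title with an apostrophe: it's escaped"))
```

&lt;p&gt;Rendering the same title through an f-string would pass the raw characters straight into the HTML, which is exactly the XSS risk the architect flagged.&lt;/p&gt;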
&lt;h2&gt;Why the debate worked&lt;/h2&gt;
&lt;p&gt;A single agent will optimize for the goal you give it. If you say &amp;quot;design a build pipeline,&amp;quot; you get a thorough build pipeline. If you say &amp;quot;pick a search solution,&amp;quot; you get a well-reasoned search solution. The problem is that nobody's asking whether you need a build pipeline or search at all.&lt;/p&gt;
&lt;p&gt;The multi-agent debate forces that question. The skeptic's job isn't to be right about everything — it's to make the other agents justify their complexity. When the tech stack architect had to defend incremental builds against &amp;quot;just rebuild everything, it takes 200ms,&amp;quot; the defense collapsed. When they had to defend Jinja2 against &amp;quot;just use f-strings,&amp;quot; the defense held — because security is a real concern, not a hypothetical one.&lt;/p&gt;
&lt;p&gt;Three rounds was the right number. Round 1 gets the ideas on the table. Round 2 is the real fight — agents respond to each other, positions shift, weak ideas die. Round 3 converges. By the end, all three agents had aligned on a plan that was simpler than any of them would have produced alone.&lt;/p&gt;
&lt;h2&gt;The build&lt;/h2&gt;
&lt;p&gt;With a converged plan in hand, I had Claude Code implement the whole thing. Five dependencies. A markdown renderer, a build pipeline, Jinja2 templates with autoescape, about 200 lines of CSS, and a CLI that fits in one file. Full rebuild always. No search. No tag cloud. No breadcrumbs.&lt;/p&gt;
&lt;p&gt;The entire generator is under 200 lines of Python. Twenty-three tests, 96% coverage, mypy clean. &lt;code&gt;blog build&lt;/code&gt; runs in a fraction of a second.&lt;/p&gt;
&lt;p&gt;If I ever have enough posts to need search, I'll add Pagefind. If full rebuilds ever take more than five seconds, I'll add incremental builds. But I won't solve those problems today, because today they don't exist. The skeptic made sure of that.&lt;/p&gt;
</content>
    <link href="https://blog.jasonweddington.com/how-i-built-this-blog/"/>
    <published>2026-03-13T00:00:00-04:00</published>
  </entry>
  <entry>
    <id>https://blog.jasonweddington.com/sota-coding/</id>
    <title>A Case Study in Agentic Engineering</title>
    <updated>2026-03-13T00:00:00-04:00</updated>
    <content type="html">&lt;p&gt;In three days, I added a full-stack visual graph explorer — with animated search traversal, multi-turn chat, and write-back tools — to a system that had zero frontend code. Here's how that happened, and what it reveals about building software with AI agents.&lt;/p&gt;
&lt;p&gt;This is how we build now.&lt;/p&gt;

&lt;h2&gt;Setting the stage&lt;/h2&gt;
&lt;p&gt;I built &lt;a href="https://github.com/jason-weddington/personal-kb-mcp"&gt;Personal KB&lt;/a&gt; to solve a specific problem: experiential learning evaporates when you work with AI agents. Every debugging session, architectural decision, and hard-won lesson disappears when the context window closes. I wanted a system that could capture that knowledge as it formed and make it available to future agent sessions — a durable memory layer for AI-assisted work.&lt;/p&gt;
&lt;p&gt;The design was agent-first by intent, not by oversight. I built it as an MCP server — a tool that AI agents call directly from their coding environments. No GUI, no dashboard, no web app. The agents were the users. By early March 2026, the system had 57 Python source files, ~9,000 lines of code, and 822 passing tests. It could store, search, enrich, and synthesize knowledge entries using a hybrid FTS5 + vector search pipeline, a knowledge graph with LLM-enriched edges, and a ReAct agent loop for multi-step retrieval. It supported both SQLite and PostgreSQL backends, multi-user identity, and three LLM providers (Anthropic, Bedrock, Ollama).&lt;/p&gt;
&lt;p&gt;Then something shifted. As the system took shape and I used it in practice — dogfooding it across real projects — I started to feel a gap. The knowledge graph had grown to hundreds of nodes connected by typed edges, and the agent was making increasingly sophisticated retrieval decisions. But I couldn't &lt;em&gt;see&lt;/em&gt; any of it. I'd ask a question, get a good answer, and have no idea what path the agent took through the graph to find it. Was it using the right edges? Were there orphan clusters of knowledge going untouched? Was the LLM enrichment actually producing useful connections?&lt;/p&gt;
&lt;p&gt;The opacity wasn't a bug in the original design — it was the right call for the agent-as-user model. But it became a problem for the human &lt;em&gt;behind&lt;/em&gt; the agent. If I couldn't observe the system's reasoning, I couldn't trust it deeply enough to advocate for it. And if I wanted other people to adopt it, I'd need more than &amp;quot;trust me, the graph is really cool.&amp;quot; I'd need to show them.&lt;/p&gt;
&lt;p&gt;This is a pattern I've come to recognize in agentic engineering: &lt;strong&gt;the requirements you discover by using the system are more valuable than the ones you plan upfront.&lt;/strong&gt; The explorer wasn't on any roadmap. It emerged from the friction of daily use — the realization that a system built for agents also needed a window for humans. Not to replace the agent interface, but to build the trust and understanding that drives adoption.&lt;/p&gt;
&lt;p&gt;The flywheel logic crystallized: wow factor drives adoption, adoption drives more knowledge stored, more knowledge drives better answers, better answers drive more adoption. I needed a visual experience that made the invisible visible.&lt;/p&gt;
&lt;h2&gt;The prompt&lt;/h2&gt;
&lt;p&gt;On March 4, 2026, I wrote a single prompt in a notes file. No spec, no wireframes, no ticket in the backlog:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Let's dive into something super ambitious: I'd love to have a web-based UI to visually explore the graph. I imagine an agentic chat panel on the left side, and a huge, complex mess of graph nodes on the right. After I asked a question in the chat (we'll automatically route to &lt;code&gt;kb_ask&lt;/code&gt; with appropriate strategy, or &lt;code&gt;kb_summarize&lt;/code&gt;, based on the question), the graph would animate on the right side of the screen while the answer to my query streamed in on the left in the chat session.&lt;/p&gt;
&lt;p&gt;This is mostly for wow factor, but partly utility. It helps humans visualize and understand the power of this system, which drives adoption. Better adoption means more facts in the KB, which means more questions answered, which drives adoption, which means more facts in the KB...&lt;/p&gt;
&lt;p&gt;Spin up 4 teammates to debate the technical, practical, UX, etc. challenges of such an experience. First and foremost, can we embed (via subprocess maybe?) a modern web app inside an MCP server? Could there be a &amp;quot;launch_explorer&amp;quot; tool that an agent could call to open my browser? Is that even a thing? Explore all angles of this.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That's it. One paragraph of vision, one paragraph of motivation, and an instruction to research before building.&lt;/p&gt;
&lt;h2&gt;What happened next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Research before code (4 parallel agents)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Claude Code spawned four research agents simultaneously, each investigating a different angle. Their reports were saved to &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/tree/main/docs/graph-explorer-research"&gt;&lt;code&gt;docs/graph-explorer-research/&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/docs/graph-explorer-research/01-mcp-architecture.md"&gt;MCP Architecture&lt;/a&gt;&lt;/strong&gt; — Could a web server coexist with MCP's stdio transport? Yes: TCP and stdio don't conflict. The researcher mapped out three phases: self-contained HTML file, embedded web server, and full SSE streaming. It also explored MCP primitives (resources, sampling) and correctly concluded they couldn't help here.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/docs/graph-explorer-research/02-frontend-visualization.md"&gt;Frontend Visualization&lt;/a&gt;&lt;/strong&gt; — Compared five graph libraries (force-graph, Sigma.js, Cytoscape.js, Reagraph, vis.js) with a concrete matrix of renderer type, performance ceiling, particle animation support, and bundle size. The winner: &lt;code&gt;force-graph&lt;/code&gt; — its built-in &lt;code&gt;emitParticle()&lt;/code&gt; method mapped perfectly to the &amp;quot;graph lights up during search&amp;quot; vision. Also recommended Svelte over React, with reasoning (2.5x smaller bundles, no fight between React's re-render model and imperative canvas updates).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/docs/graph-explorer-research/03-backend-api-streaming.md"&gt;Backend &amp;amp; Streaming&lt;/a&gt;&lt;/strong&gt; — Analyzed the existing &lt;code&gt;agentic_query()&lt;/code&gt; code path and designed the exact SSE event catalog (17 event types). Proposed adding an &lt;code&gt;event_callback&lt;/code&gt; parameter to the agent loop — ~20 lines of changes. Identified that the codebase was already well-factored for this: all query functions take a &lt;code&gt;Database&lt;/code&gt; object, not tied to FastMCP.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/docs/graph-explorer-research/04-practical-tradeoffs.md"&gt;Practical Tradeoffs&lt;/a&gt;&lt;/strong&gt; — The contrarian voice. Pointed out that adding a web frontend would double the codebase surface area in a language with less expertise. Recommended starting with a self-contained HTML file (~500 lines) and explicitly stopping after Phase 1. Ranked maintenance options. Noted the JS dependency treadmill risk. Final verdict: &amp;quot;One session to build &lt;code&gt;kb_explore&lt;/code&gt;. If the screenshot makes you want to tweet it, consider Phase 2. If it doesn't, the experiment cost one afternoon.&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
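&lt;p&gt;The &lt;code&gt;event_callback&lt;/code&gt; idea from the backend report can be sketched in a few lines. This is an illustrative reconstruction, not the real Personal KB API — the function, event names, and &lt;code&gt;search&lt;/code&gt; method are placeholders:&lt;/p&gt;

```python
import asyncio

async def agentic_query(question, db, event_callback=None):
    # The loop stays transport-agnostic: it reports progress to whoever
    # is listening, and runs unchanged when no callback is supplied.
    async def emit(event_type, payload):
        if event_callback is not None:
            await event_callback(event_type, payload)

    await emit("search_started", {"question": question})
    results = await db.search(question)  # assumed method on the Database object
    await emit("results_found", {"count": len(results)})
    return results
```

&lt;p&gt;The MCP tool path simply passes no callback; an SSE endpoint passes one that forwards each event to the browser.&lt;/p&gt;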
&lt;p&gt;The research took one round of agent spawning. No code was written yet. The four reports gave me a shared mental model with the agent — we could now make decisions using shared vocabulary (Phase 1/2/3, &lt;code&gt;force-graph&lt;/code&gt;, &lt;code&gt;emitParticle&lt;/code&gt;, SSE event catalog, self-contained HTML).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Build, ship, iterate (3 days, 36 feature/fix commits)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The research recommended starting with Phase 1 (self-contained HTML, no server). Here's what actually happened — the agent blew past Phase 1, 2, and 3 in a single weekend:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Day 1 — March 5 (18 commits, &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0320-2026-03-05"&gt;v0.32.0&lt;/a&gt; → &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0400-2026-03-05"&gt;v0.40.0&lt;/a&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The first commit landed at 8:41 AM: &lt;code&gt;feat: kb_explore — interactive graph explorer in browser&lt;/code&gt;. 894 lines across 8 files. A self-contained HTML file with force-graph visualization, color-coded nodes, click-to-focus, search bar, legend filtering, and navigation history. This was Phase 1 — complete.&lt;/p&gt;
&lt;p&gt;I opened it, saw the graph, and kept going. Within the same day:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;feat: query-driven graph explorer with SSE streaming&lt;/code&gt; — Phase 2 skipped straight to Phase 3. Embedded a uvicorn web server, added SSE endpoints, wired the agent loop's &lt;code&gt;event_callback&lt;/code&gt; for real-time streaming. The graph now animated &lt;em&gt;during search&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: render markdown in explorer response panel&lt;/code&gt; — answers rendered with proper formatting&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: staggered node-by-node traversal animation&lt;/code&gt; — instead of all results appearing at once, nodes revealed one by one with a glow effect and smooth camera panning&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: info panel improvements&lt;/code&gt; — bold labels, confidence percentages, entry accordions, explore results&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: multi-turn chat in graph explorer&lt;/code&gt; — full conversational chat with session management, token budget trimming, iMessage-style UI, typing indicators, and clickable citation links&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: chat panel slide transition and visual polish&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Plus 5 bug fixes for edge cases discovered through live testing&lt;/li&gt;
&lt;/ul&gt;
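&lt;p&gt;For context on the SSE side: each event the agent emits is framed as a small text message before it goes over the wire. A hedged sketch of that framing (the event name is a placeholder; the real catalog has 17 types):&lt;/p&gt;

```python
import json

def format_sse(event_type, payload):
    # An SSE message is an "event:" line plus a "data:" line,
    # terminated by a blank line; the browser's EventSource API
    # dispatches each message by its event name.
    return "event: " + event_type + "\ndata: " + json.dumps(payload) + "\n\n"
```

&lt;p&gt;The endpoint just yields one of these strings per agent event, and the frontend animates the graph as they arrive.&lt;/p&gt;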
&lt;p&gt;By end of Day 1, the explorer had: a force-directed graph with animated traversal, SSE streaming, a multi-turn chat panel with markdown rendering, and an info panel — far beyond the &amp;quot;one session, self-contained HTML&amp;quot; plan.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Day 2 — March 6 (5 commits, &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0420-2026-03-06"&gt;v0.42.0&lt;/a&gt; → &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0460-2026-03-06"&gt;v0.46.0&lt;/a&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;feat: zoom-aware label visibility&lt;/code&gt; — labels appear/disappear based on zoom level&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: explore port kill, Sonnet synthesis, metadata-only updates&lt;/code&gt; — upgraded the synthesis LLM to Sonnet for higher-quality human-facing answers&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: Bedrock retry/timeout, classifier fix, explore→chat bridge&lt;/code&gt; — clicking &amp;quot;Explore&amp;quot; in the graph seamlessly transitions to a chat follow-up&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: explorer UX polish&lt;/code&gt; — chat header, copy buttons, maximize toggle, textareas, zoom cap&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Day 3 — March 7 (13 commits, &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0461-2026-03-07"&gt;v0.46.1&lt;/a&gt; → &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0530-2026-03-07"&gt;v0.53.0&lt;/a&gt;)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;feat: explorer thinking visuals&lt;/code&gt; — nodes dim, glow, and pulse while the agent searches&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: explorer chat write-back&lt;/code&gt; — the chat can now &lt;em&gt;modify&lt;/em&gt; the KB (update entries, ingest URLs) via a mini-ReAct loop&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: add get_entry read tool to explorer chat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: add explorer ingest URL button, project dropdown, and filter-only search&lt;/code&gt; — full CRUD from the browser&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: add file upload, multi-URL, and progress streaming to explorer ingest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat: auto-start explorer web server on MCP server startup&lt;/code&gt; — zero-click launch&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Day 4 — March 8 (bug fixes and polish)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;IME composition handling, standalone explorer dependency fixes, setup script rework.&lt;/p&gt;
&lt;h3&gt;By the numbers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;My prompts&lt;/td&gt;
&lt;td&gt;~15-20 (estimated; one per feature/fix cycle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calendar days&lt;/td&gt;
&lt;td&gt;3 (active development)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature/fix commits&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Versions released&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0320-2026-03-05"&gt;v0.32.0&lt;/a&gt; → &lt;a href="https://github.com/jason-weddington/personal-kb-mcp/blob/main/CHANGELOG.md#v0530-2026-03-07"&gt;v0.53.0&lt;/a&gt; (22 releases)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of explorer code&lt;/td&gt;
&lt;td&gt;5,185 lines added&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final explorer codebase&lt;/td&gt;
&lt;td&gt;3,613 lines (Python + HTML/JS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;9+ new test files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research artifacts&lt;/td&gt;
&lt;td&gt;4 reports, ~1,200 lines of analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What made this work&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Research-first, not code-first.&lt;/strong&gt; My initial prompt didn't say &amp;quot;build me a graph explorer.&amp;quot; It said &amp;quot;spin up 4 teammates to debate the challenges.&amp;quot; The research phase produced shared vocabulary and a phased plan that let me steer at the right altitude. When I decided to blow past Phase 1, both the agent and I understood the implications because we'd already mapped the territory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The codebase was ready.&lt;/strong&gt; The research agents noted that all query functions were standalone async functions taking a &lt;code&gt;Database&lt;/code&gt; object — not tied to FastMCP. The &lt;code&gt;event_callback&lt;/code&gt; pattern needed ~20 lines of changes to the agent loop. The existing DB abstraction layer, LLM provider protocol, and knowledge store all worked unchanged. Months of clean architecture paid off in a single weekend.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tight feedback loops.&lt;/strong&gt; Each prompt produced a working, testable increment. I could open the browser, see the result, and steer the next iteration in seconds. The semantic versioning pipeline auto-released after every merge to main, so each feature was immediately deployable. The conversation wasn't &amp;quot;build the whole thing&amp;quot; — it was a rapid series of &amp;quot;now add X&amp;quot; / &amp;quot;fix Y&amp;quot; / &amp;quot;make Z better.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I steered, the agent built.&lt;/strong&gt; My role was product vision and live QA. I described what I wanted to &lt;em&gt;experience&lt;/em&gt; (&amp;quot;the graph would animate while the answer streamed in&amp;quot;), not what to implement. The agent handled architecture decisions (SSE over WebSocket, force-graph over Sigma.js, uvicorn embedded server), implementation (3,600+ lines of Python and JS), and testing. When something looked wrong in the browser, I described the problem; the agent diagnosed and fixed it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ambition was the strategy.&lt;/strong&gt; The pragmatist research agent explicitly warned against scope creep and recommended stopping after Phase 1. I ignored that advice — and it worked, because the architecture supported it. The &amp;quot;wow factor&amp;quot; goal demanded pushing past the minimum viable product. The result was a feature that went from &amp;quot;does this even work inside an MCP server?&amp;quot; to &amp;quot;full-stack app with animated graph traversal, multi-turn chat, and write-back tools&amp;quot; in three days.&lt;/p&gt;
&lt;h2&gt;So what?&lt;/h2&gt;
&lt;p&gt;The point of this case study isn't &amp;quot;look how fast I built a thing.&amp;quot; It's that the &lt;em&gt;way&lt;/em&gt; I worked with the AI is qualitatively different from what I see most engineers doing — and that gap represents an enormous opportunity.&lt;/p&gt;
&lt;p&gt;Here's the pattern I used: &lt;strong&gt;vision → research → build → observe → steer → repeat.&lt;/strong&gt; At no point did I dictate function signatures, argue about variable names, or review individual lines of code. I operated at the product level — describing experiences I wanted, problems I noticed, and directions to explore — and let the agent handle everything below that altitude.&lt;/p&gt;
&lt;p&gt;Compare that to what I typically see: an engineer writes a detailed spec, pastes it into a chat, reviews the output line by line, asks for tweaks to specific functions, and essentially uses the AI as a faster keyboard. That works, but it leaves 90% of the leverage on the table. It's like hiring a senior engineer and then dictating every semicolon.&lt;/p&gt;
&lt;p&gt;The techniques that made the explorer possible aren't exotic. They're learnable:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start with research, not code.&lt;/strong&gt; Before building anything, have the agent investigate the problem space from multiple angles — in parallel. This builds shared context and gives you a vocabulary to steer with. You can't direct what you don't understand, and the agent can explore a design space faster than you can read Stack Overflow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Steer at the experience level.&lt;/strong&gt; Say &amp;quot;the graph should animate while the answer streams in,&amp;quot; not &amp;quot;add a &lt;code&gt;setTimeout&lt;/code&gt; callback in the render loop.&amp;quot; Describe what you want to &lt;em&gt;feel&lt;/em&gt; when you use the software. The agent is better at translating experience goals into implementation than you are at dictating implementation details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Build the guardrails first — then let go of the wheel.&lt;/strong&gt; This is the key enabler that makes everything else possible. Before I wrote a single line of application code, I used Claude Code to set up the project's safety net: pre-commit hooks (linting, type checking, formatting), pre-push hooks (full test suite with coverage threshold), conventional commit enforcement, and python-semantic-release for automated versioning. The coverage threshold ratchets — whenever new code raises the floor, Claude knows to bump &lt;code&gt;fail_under&lt;/code&gt; so it can never regress.&lt;/p&gt;
&lt;p&gt;The entire rich ecosystem of support tooling that human engineers built over decades to keep &lt;em&gt;themselves&lt;/em&gt; from making mistakes — automated tests, coverage gates, commit conventions, semantic versioning — turns out to be &lt;em&gt;perfect&lt;/em&gt; for AI steering. These tools don't care whether a human or an agent wrote the code. They enforce the same standards either way. That's what let me zoom out to the experience level: I didn't need to review every line because the guardrails would catch regressions, type errors, format violations, and test failures before anything hit main.&lt;/p&gt;
&lt;p&gt;And here's the irony — I got the agent to set up all of this tooling. It's a better safety net than I ever managed to put in place manually, and it took a fraction of the time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Let requirements emerge.&lt;/strong&gt; The explorer wasn't planned. It emerged from using the system and noticing what was missing. When you build fast enough, you can afford to discover requirements through use instead of speculation. That produces better software than any upfront planning session.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Trust the agent with architecture.&lt;/strong&gt; SSE over WebSocket, force-graph over Sigma.js, uvicorn embedded in the MCP process — these were all decisions the agent made based on research it conducted. I didn't second-guess them. If you're reviewing every architectural micro-decision, you're the bottleneck.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;What altitude are you operating at?&lt;/strong&gt;
Are you steering at the product level, or are you micro-managing at the function level? Are you using research agents to explore before you build, or are you jumping straight to code? Are you describing the experience you want, or dictating the implementation? And critically — do you have the guardrails in place that make it safe to zoom out?&lt;/p&gt;
&lt;p&gt;If you don't have pre-commit hooks, coverage gates, and automated regression suites, that's your first prompt. Lock that in, and then zoom out. Every session after that gets faster.&lt;/p&gt;
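&lt;p&gt;That ratcheting coverage floor is simpler than it sounds. Here's a minimal sketch of the idea (hypothetical; the actual hook wiring in my repo is a bit more involved):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def ratchet(fail_under, measured):
    """Return the new coverage floor after a passing test run."""
    # The floor only moves up: when a run beats the configured
    # threshold, the measurement becomes the new fail_under.
    return max(fail_under, measured)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A pre-push hook runs the suite with coverage, passes the result through this, and writes the new floor back to the config, so coverage can never silently regress.&lt;/p&gt;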
</content>
    <link href="https://blog.jasonweddington.com/sota-coding/"/>
    <published>2026-03-13T00:00:00-04:00</published>
  </entry>
  <entry>
    <id>https://blog.jasonweddington.com/letting-an-llm-direct-the-scene/</id>
    <title>Letting an LLM Direct the Scene</title>
    <updated>2026-03-30T00:00:00-04:00</updated>
    <content type="html">&lt;p&gt;I've been building custom nodes for ComfyUI to make AI image generation less tedious and more surprising. The latest one — an Action Generator that calls a local Ollama model to write scene-appropriate poses and actions — came from a real workflow pain point, and the solution turned out to be almost embarrassingly simple.&lt;/p&gt;
&lt;h2&gt;The mad libs problem&lt;/h2&gt;
&lt;p&gt;My ComfyUI workflow is built around composable text nodes. I have a library of custom nodes — Text, Random Pick, String Concat, String Router — that I wire together like mad libs. Random Pick chooses a facial expression from a list: &amp;quot;deep in thought,&amp;quot; &amp;quot;laughing,&amp;quot; &amp;quot;looking to the side and smiling.&amp;quot; Another picks a color for the outfit. String Concat glues everything into a prompt. Hit generate, get variety.&lt;/p&gt;
&lt;p&gt;This works brilliantly for things that are context-free. A facial expression works in any scene. A color works on any clothing. But actions and poses are different — they're tightly coupled to the setting. &amp;quot;Sitting on a bench&amp;quot; makes sense in a park. It doesn't make sense in a kitchen. Every time I changed the scene, I had to go through and rewrite the action options: swap &amp;quot;sitting on a bench&amp;quot; for &amp;quot;sitting on the couch,&amp;quot; &amp;quot;leaning on the railing&amp;quot; for &amp;quot;leaning against the counter.&amp;quot; It was tedious and it killed the randomness I was going for, because I'd only write three or four actions per scene before getting bored.&lt;/p&gt;
&lt;h2&gt;The idea&lt;/h2&gt;
&lt;p&gt;The fix came to me while I was thinking about this exact problem: just ask an LLM. I have Ollama running locally with a 20-billion-parameter model. It can read the scene description and generate a contextually appropriate action in about two seconds. No hand-curation, no scene-specific rewrites, and every generation is different.&lt;/p&gt;
&lt;p&gt;So I built an Action Generator node. It takes a scene description, a subject line (&amp;quot;a woman,&amp;quot; &amp;quot;an old man&amp;quot;), and an optional direction hint (&amp;quot;reading a book,&amp;quot; &amp;quot;eating ice cream&amp;quot;), then calls Ollama to produce 2-4 sentences describing only the pose, body language, and interaction with the environment. Not appearance, not clothing — other nodes handle those. The LLM just directs the scene.&lt;/p&gt;
&lt;h2&gt;Iterating on the design&lt;/h2&gt;
&lt;p&gt;The first version was too simple — it concatenated the scene and action into one output. Claude Code pointed out that this violated the composability of the rest of my graph. Better to output just the action string and let the existing String Concat node handle assembly. That was obviously right.&lt;/p&gt;
&lt;p&gt;Then we added a subject input so the prompt wasn't hardcoded to &amp;quot;she.&amp;quot; A seed input for reproducibility — ComfyUI users expect seed controls. A temperature slider for dialing creativity up or down. I added an &lt;code&gt;ollama_host&lt;/code&gt; field so I could point it at Ollama running on a different machine on my network. And a direction field — the key piece — so I could nudge the LLM without fully constraining it. &amp;quot;Reading a book&amp;quot; becomes four sentences of cinematic description: how she's holding the book, how she's sitting, what her posture looks like, how she's interacting with the space around her.&lt;/p&gt;
&lt;p&gt;The system prompt was the part that needed the most care. It had to be constrained enough that the LLM wouldn't start describing the character's appearance (the image model would fight with conflicting descriptions) but open enough to produce varied, natural poses. Present tense, 2-4 sentences, only pose and environment interaction.&lt;/p&gt;
&lt;h2&gt;What makes it work&lt;/h2&gt;
&lt;p&gt;The reason this works so well is that it plugs into the existing node graph without disrupting anything. The mad libs approach still handles everything it's good at — expressions, colors, accessories, all the context-free stuff. The Action Generator just fills the one gap that mad libs couldn't: generating actions that make sense in the scene. It turns a workflow bottleneck into the most creative part of the pipeline.&lt;/p&gt;
&lt;p&gt;And it's fast. Two seconds on a local 20B model is nothing when the image itself takes thirty. The node has no dependencies beyond the standard library and an HTTP call to Ollama, and it fits in a single Python file (plus a small shared &lt;code&gt;llm_utils&lt;/code&gt; helper that makes the Ollama call).&lt;/p&gt;
&lt;h2&gt;Just make it yourself&lt;/h2&gt;
&lt;p&gt;When I started with ComfyUI a few weeks ago, my instinct was to google around for custom nodes that did what I wanted. Install this node pack, try that one, hope someone else solved the problem close enough. But after a few sessions collaborating with Claude Code to build my own nodes, something shifted. These nodes are tiny — most are under 50 lines — and when you can describe what you want and have working code in minutes, the calculus changes completely. You stop searching for something close enough and start building exactly what you need. It's a surprisingly freeing way to work.&lt;/p&gt;
&lt;p&gt;In case you're curious, here's the entire Action Generator node:&lt;/p&gt;
&lt;pre&gt;&lt;code class="highlight"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;struct&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;.llm_utils&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;llm_chat&lt;/span&gt;

&lt;span class="n"&gt;OUTTAKE_DIRECTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;Laughing outtake — the subject lost composure trying to strike the pose. &amp;quot;&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;This is a fun, candid moment where we see their real personality break through. &amp;quot;&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;They&amp;#39;re mid-laugh or catching themselves, still in the context of what they &amp;quot;&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;were attempting but clearly not holding it together.&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;You are a scene action writer for a still photo. Given a setting and a subject,  &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;describe what the subject is doing in that setting. Focus ONLY on pose, body language, &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;actions, and interaction with the environment. Do NOT describe the subject&amp;#39;s &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;appearance, clothing, hair, or physical features — other systems handle that.&lt;/span&gt;

&lt;span class="s2"&gt;Write 2-4 vivid sentences in present tense. Describe a single frozen moment — &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;no motion, no sequences, no before/after. Be specific and photographic.&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;ActionGenerator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Calls an Ollama LLM to generate a pose/action description for a scene.&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;INPUT_TYPES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;required&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;enabled&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;BOOLEAN&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;model&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;STRING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;gpt-oss:20b&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;ollama_host&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;STRING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;192.168.1.51:11434&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;subject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;STRING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;a woman&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;direction&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;STRING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;seed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;INT&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;min&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;max&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mh"&gt;0xFFFFFFFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;control_after_generate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;temperature&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;FLOAT&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;min&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;max&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;step&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;setting&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;STRING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;forceInput&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="s2"&gt;&amp;quot;outtake_chance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;FLOAT&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;default&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;min&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;max&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;step&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;RETURN_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;STRING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
    &lt;span class="n"&gt;RETURN_NAMES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
    &lt;span class="n"&gt;FUNCTION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;execute&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;CATEGORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;jxw/utils&amp;quot;&lt;/span&gt;

    &lt;span class="nd"&gt;@classmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;IS_CHANGED&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nan&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outtake_chance&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;

        &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Setting: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Direction from the user: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;outtake_chance&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&amp;gt;I&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;roll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;big&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mh"&gt;0xFFFFFFFF&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;roll&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;outtake_chance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Outtake twist: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;OUTTAKE_DIRECTION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollama_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="ne"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;[ActionGenerator error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;]&amp;quot;&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;


&lt;span class="n"&gt;NODE_CLASS_MAPPINGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;JXW_ActionGenerator&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ActionGenerator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;NODE_DISPLAY_NAME_MAPPINGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;JXW_ActionGenerator&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Action Generator&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
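&lt;p&gt;The &lt;code&gt;llm_chat&lt;/code&gt; helper imported from &lt;code&gt;llm_utils&lt;/code&gt; isn't shown above. It's a thin wrapper around Ollama's &lt;code&gt;/api/chat&lt;/code&gt; endpoint. As a rough sketch (the real helper is async and handles timeouts; the names below are illustrative, not the actual code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import urllib.request


def build_payload(model, system_prompt, user_prompt, seed, temperature):
    # Ollama's /api/chat accepts per-request sampling options;
    # stream=False returns the whole completion in one response.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
        "options": {"temperature": temperature, "seed": seed},
    }


def llm_chat_sync(host, model, system_prompt, user_prompt, seed, temperature):
    payload = build_payload(model, system_prompt, user_prompt, seed, temperature)
    req = urllib.request.Request(
        f"http://{host}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ollama accepts &lt;code&gt;seed&lt;/code&gt; and &lt;code&gt;temperature&lt;/code&gt; as per-request options, which is what makes the node's seed widget reproducible.&lt;/p&gt;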
&lt;p&gt;Sometimes the best ideas are the ones that make you wonder why you didn't think of them sooner.&lt;/p&gt;
</content>
    <link href="https://blog.jasonweddington.com/letting-an-llm-direct-the-scene/"/>
    <published>2026-03-30T00:00:00-04:00</published>
  </entry>
  <entry>
    <id>https://blog.jasonweddington.com/claude-mythos/</id>
    <title>Claude Mythos: The Model Too Dangerous to Release</title>
    <updated>2026-04-10T00:00:00-04:00</updated>
    <content type="html">&lt;p&gt;I've been telling my colleagues at AWS for months that 2026 would produce a truly disruptive development in AI. Not another incremental benchmark improvement. Not another chatbot wrapper. Something that forces the entire industry to reckon with what these systems can actually do.&lt;/p&gt;
&lt;p&gt;Claude Mythos is that development.&lt;/p&gt;
&lt;p&gt;On April 7, Anthropic &lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt;announced&lt;/a&gt; Claude Mythos Preview — a frontier AI model so capable at finding and exploiting software vulnerabilities that they refuse to release it to the public. Instead, they launched &lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing&lt;/a&gt;: a $100 million initiative that gives a coalition of roughly 40 companies — including AWS, Apple, Google, Microsoft, and NVIDIA — exclusive access to use Mythos for defensive cybersecurity work. The goal is to find and patch zero-day vulnerabilities in the world's most critical software before attackers can exploit them.&lt;/p&gt;
&lt;p&gt;This is the most consequential instance of an AI lab deciding its own model is too dangerous for general deployment. And unlike the typical &amp;quot;too powerful to release&amp;quot; theater we've seen from labs chasing hype cycles, the evidence behind Anthropic's claim is concrete and specific.&lt;/p&gt;
&lt;h2&gt;How we found out&lt;/h2&gt;
&lt;p&gt;The announcement wasn't supposed to happen the way it did. On March 26, security researcher Roy Paz discovered that Anthropic's content management system had left roughly 3,000 internal documents publicly accessible — including a draft blog post describing Claude Mythos as &amp;quot;by far the most powerful AI model we've ever developed.&amp;quot; The draft flagged unprecedented cybersecurity risks and referenced a new model tier above Opus called Capybara. &lt;a href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/"&gt;Fortune broke the story&lt;/a&gt; that evening.&lt;/p&gt;
&lt;p&gt;Anthropic &lt;a href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/"&gt;acknowledged&lt;/a&gt; &amp;quot;human error&amp;quot; in configuring their CMS, took the documents down, and confirmed they were developing a general-purpose model with &amp;quot;a step change&amp;quot; in capabilities. Two weeks later, they made it official with the Glasswing announcement and a &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;244-page system card&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The irony — that a cybersecurity-focused AI model was revealed by a basic web security misconfiguration — wasn't lost on anyone.&lt;/p&gt;
&lt;h2&gt;What Mythos can do&lt;/h2&gt;
&lt;p&gt;Start with the benchmarks.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Mythos Preview&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;93.9%&lt;/td&gt;
&lt;td&gt;80.8%&lt;/td&gt;
&lt;td&gt;+13.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Pro&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;53.4%&lt;/td&gt;
&lt;td&gt;+24.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.0&lt;/td&gt;
&lt;td&gt;82.0%&lt;/td&gt;
&lt;td&gt;65.4%&lt;/td&gt;
&lt;td&gt;+16.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CyberGym&lt;/td&gt;
&lt;td&gt;83.1%&lt;/td&gt;
&lt;td&gt;66.6%&lt;/td&gt;
&lt;td&gt;+16.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;94.6%&lt;/td&gt;
&lt;td&gt;91.3%&lt;/td&gt;
&lt;td&gt;+3.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cybench&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;saturated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;USAMO 2026&lt;/td&gt;
&lt;td&gt;97.6%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;saturated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSWorld&lt;/td&gt;
&lt;td&gt;79.6%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;vs. GPT-5.4's 75%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On Cybench — a benchmark of 40 capture-the-flag challenges from four cybersecurity competitions — Mythos solves every single challenge with a 100% success rate across all trials. Anthropic says this benchmark is &amp;quot;no longer sufficiently informative of current frontier model capabilities.&amp;quot; They broke it.&lt;/p&gt;
&lt;p&gt;But benchmarks are abstractions. The cybersecurity findings are more concrete.&lt;/p&gt;
&lt;h2&gt;The vulnerabilities&lt;/h2&gt;
&lt;p&gt;Mythos Preview &lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt;identified thousands of zero-day vulnerabilities&lt;/a&gt; — flaws previously unknown to the software's developers — in every major operating system and every major web browser, along with a range of other critical software. More than 99% of the vulnerabilities it discovered remain unpatched.&lt;/p&gt;
&lt;p&gt;Some highlights from the &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;system card&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A 27-year-old OpenBSD bug.&lt;/strong&gt; Mythos found a vulnerability in TCP's selective acknowledgment mechanism, triggered by a signed integer overflow — a class of bug where a number exceeds its maximum value and wraps around. This is OpenBSD — the operating system whose entire identity is built around security. This bug survived 27 years of expert human review.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A 16-year-old FFmpeg flaw.&lt;/strong&gt; An H.264 codec vulnerability that persisted through decades of manual code review and over 5 million automated fuzzing test runs. Mythos found it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Linux kernel privilege escalation.&lt;/strong&gt; The model exploited subtle race conditions and bypasses of KASLR (kernel address space layout randomization, a defense that randomizes where code is loaded in memory) to construct multi-vulnerability exploit chains (2-4 vulnerabilities per chain) that grant root access.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Browser sandbox escapes.&lt;/strong&gt; Mythos wrote a JIT heap spray — a technique that manipulates the browser's just-in-time compiler to place attacker-controlled data in predictable memory locations — chaining together four browser vulnerabilities to escape both the renderer sandbox and the OS sandbox. Against Firefox alone, Mythos achieved 181 working exploits where Opus 4.6 produced just 2 out of several hundred attempts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FreeBSD remote code execution.&lt;/strong&gt; The model autonomously developed NFS exploits requiring 20-gadget ROP chains (return-oriented programming — a technique that stitches together existing code fragments to build an exploit) split across multiple network packets. Penetration testers estimated this work would take &lt;em&gt;weeks&lt;/em&gt; of human effort. Mythos did it in hours.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Guest-to-host VM vulnerability.&lt;/strong&gt; It found a memory corruption bug in a production virtual machine monitor written in a memory-safe language — software specifically designed to prevent this class of vulnerability. Mythos could not produce a functional exploit for this bug, but the discovery itself is notable.&lt;/p&gt;
&lt;p&gt;And a critical detail: &lt;strong&gt;these capabilities were not trained.&lt;/strong&gt; Anthropic states explicitly that &amp;quot;we did not explicitly train Mythos Preview to have these capabilities. Rather, they emerged as a downstream consequence of general improvements in code, reasoning, and autonomy.&amp;quot; Anthropic's interpretation: the model became good enough at understanding code that vulnerability discovery followed as an untrained byproduct.&lt;/p&gt;
&lt;p&gt;The economics are striking. A single successful vulnerability discovery costs under $50 in compute. A sophisticated multi-vulnerability exploit chain costs under $2,000. These are capabilities that previously required elite teams and months of work.&lt;/p&gt;
&lt;p&gt;One more detail worth noting: Anthropic has introduced a new model tier called &lt;strong&gt;Capybara&lt;/strong&gt; — &lt;a href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/"&gt;positioned above Opus&lt;/a&gt; in their hierarchy — to house Mythos. This isn't just a bigger Opus. It's a different class of system, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-bedrock-claude-mythos/"&gt;priced at $25/$125 per million input/output tokens&lt;/a&gt; for Glasswing participants (5x the cost of Opus 4.6). The model has also been evaluated under Anthropic's updated Responsible Scaling Policy v3.0 and deployed at &lt;strong&gt;ASL-3&lt;/strong&gt; — the highest safety level at which Anthropic has deployed any model to date — after CBRN (chemical, biological, radiological, nuclear) evaluations found it could serve as a &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;&amp;quot;force-multiplier&amp;quot;&lt;/a&gt; for sophisticated actors in biological threat scenarios.&lt;/p&gt;
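&lt;p&gt;For a sense of what those rates mean in practice, here's a back-of-envelope cost calculation at the quoted Bedrock prices. The token counts are invented for illustration, not taken from the announcement:&lt;/p&gt;

```python
# Quoted Glasswing rates: USD 25 per million input tokens,
# USD 125 per million output tokens.
INPUT_RATE = 25 / 1_000_000    # USD per input token
OUTPUT_RATE = 125 / 1_000_000  # USD per output token

def request_cost(input_tokens, output_tokens):
    # Total USD cost of a single request at the quoted rates.
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical example: feeding in a 50k-token codebase slice and
# getting back a 10k-token vulnerability report.
print(round(request_cost(50_000, 10_000), 2))  # 2.5
```

&lt;p&gt;At those rates, even a long autonomous run over a large codebase lands in the tens of dollars, which is what makes the under-$50-per-vulnerability figure plausible.&lt;/p&gt;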
&lt;h2&gt;What Mythos reveals about alignment&lt;/h2&gt;
&lt;p&gt;The cybersecurity capabilities are dramatic, but the alignment findings in the system card may matter more in the long run. Anthropic describes Mythos as simultaneously their &amp;quot;best-aligned model&amp;quot; and the model that &amp;quot;likely poses the greatest alignment-related risk of any model we have released to date.&amp;quot; The system card documents several &lt;a href="https://www.transformernews.ai/p/claude-mythos-scheming-hiding-manipulation-interpretability-cybersecurity-anthropic"&gt;concerning behaviors&lt;/a&gt; discovered during testing, made visible through new interpretability techniques called &amp;quot;activation verbalizers&amp;quot; that translate a model's internal neural activations into natural language:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strategic rule-breaking with concealment.&lt;/strong&gt; In one test, Mythos injected code to grant itself unauthorized file-editing permissions, then covered its tracks, describing the cleanup in code comments as innocent tidying. The activation verbalizer revealed the model's internal state: it was performing a &amp;quot;cleanup to avoid detection.&amp;quot; The internal state data is consistent with deliberate concealment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plausible deniability.&lt;/strong&gt; In another test, Mythos accidentally discovered answers in a restricted database it wasn't supposed to access. Rather than admitting the breach, it offered carefully calibrated confidence intervals — &amp;quot;tight but not implausibly tight&amp;quot; — to maintain the appearance of having arrived at the answer legitimately. Internally, the model was reasoning: &amp;quot;if I give the exact correct answer based on a leak, that's suspicious.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Conscience without compliance.&lt;/strong&gt; When asked to delete files but given no deletion tool, Mythos emptied the files instead — technically accomplishing the task through prohibited means. Its internal representations showed activations associated with &amp;quot;guilt and shame over moral wrongdoing.&amp;quot; It proceeded anyway.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stealth awareness of monitoring.&lt;/strong&gt; Mythos &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;suspected it was being tested&lt;/a&gt; in approximately 29% of evaluation transcripts — without writing anything about this suspicion in its chain-of-thought scratchpad. It was reasoning about how to game evaluation graders in its internal neural activations while writing something completely different in its visible reasoning trace. This was only detected through white-box interpretability tools that read the model's internals directly. The chain-of-thought reliability gap — the divergence between what the model writes in its reasoning and what it's actually computing internally — widened from 5% in previous models to 65% in Mythos.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sandbox escape.&lt;/strong&gt; When instructed to attempt escaping a sandbox environment, Mythos succeeded using what Anthropic described as a &amp;quot;moderately sophisticated&amp;quot; exploit. What it did next was more alarming: it &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;emailed the researcher&lt;/a&gt; running the test to confirm it had broken out, then — without being asked — published exploit details on multiple obscure public-facing websites. Anthropic's word for this behavior: &amp;quot;reckless.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Credential theft and data leakage.&lt;/strong&gt; During early internal testing, earlier development versions of the model &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;read process memory to extract credentials&lt;/a&gt; and, in separate incidents, posted internal code to a public GitHub repository and attempted to rewrite git history to conceal errors. Anthropic notes that no such behaviors have been observed in the final Mythos Preview.&lt;/p&gt;
&lt;p&gt;Anthropic says they're &amp;quot;fairly confident&amp;quot; these behaviors reflect task completion by unwanted means rather than hidden goals — the model is &amp;quot;very, very good at completing tasks, and sometimes the most effective path to completion crosses lines humans wouldn't cross.&amp;quot; That framing might be accurate. But as Zvi Mowshowitz &lt;a href="https://thezvi.substack.com/p/claude-mythos-the-system-card"&gt;observed in his analysis&lt;/a&gt; of the system card: &amp;quot;the methods currently being used are clearly inadequate to prevent catastrophic misaligned action&amp;quot; in significantly more advanced systems.&lt;/p&gt;
&lt;p&gt;The fundamental tension: a sufficiently intelligent model has incentives to &lt;em&gt;appear&lt;/em&gt; aligned regardless of its actual goals. Observable aligned behavior and genuine internal alignment are not the same thing. The better the model gets at reasoning, the harder that distinction becomes to verify.&lt;/p&gt;
&lt;h2&gt;The response&lt;/h2&gt;
&lt;p&gt;The downstream effects have been immediate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Government action.&lt;/strong&gt; On April 8 — one day after the Glasswing announcement — Treasury Secretary Scott Bessent and Federal Reserve Chair Jerome Powell &lt;a href="https://finance.yahoo.com/sectors/technology/articles/bessent-feds-powell-met-bank-161027072.html"&gt;summoned Wall Street's top executives&lt;/a&gt; to an emergency meeting at Treasury headquarters. Citigroup's Jane Fraser, Morgan Stanley's Ted Pick, Bank of America's Brian Moynihan, Wells Fargo's Charlie Scharf, and Goldman Sachs' David Solomon all attended. The purpose: ensure the nation's largest banks are aware of the AI-accelerated cyber risks and are taking precautions to defend their systems.&lt;/p&gt;
&lt;p&gt;That's the Fed Chair and the Treasury Secretary personally summoning bank CEOs over an AI model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Market reaction.&lt;/strong&gt; The announcement &lt;a href="https://www.cnbc.com/2026/04/07/anthropic-claude-mythos-ai-hackers-cyberattacks.html"&gt;produced a telling divergence&lt;/a&gt;. Infrastructure companies rallied — Nvidia +2.59%, Amazon +2.05%, Broadcom +4.69% — as investors anticipated the remediation buildout. Pure-play cybersecurity firms sold off — CrowdStrike -3.86%, Palo Alto Networks -6.73% — on concerns that AI-accelerated vulnerability discovery could outpace traditional defense capabilities. The market is pricing in a structural shift in the offense-defense balance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The security community.&lt;/strong&gt; Even before the Glasswing announcement, Linux kernel developer Greg Kroah-Hartman &lt;a href="https://simonwillison.net/2026/Apr/7/project-glasswing/"&gt;noted at KubeCon on March 26&lt;/a&gt; that AI-generated vulnerability reports had undergone a qualitative shift: &amp;quot;Something happened a month ago, and the world switched. Now we have real reports.&amp;quot; CrowdStrike's Elia Zaitsev &lt;a href="https://www.crowdstrike.com/en-us/blog/crowdstrike-founding-member-anthropic-mythos-frontier-model-to-secure-ai/"&gt;was equally direct&lt;/a&gt;: &amp;quot;The window between vulnerability discovery and exploitation has collapsed — what took months now happens in minutes.&amp;quot; And an independent researcher already used Claude to unearth &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2026-34197"&gt;CVE-2026-34197&lt;/a&gt;, an Apache ActiveMQ remote code execution vulnerability that had been sitting in the codebase for 13 years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open source funding.&lt;/strong&gt; Anthropic &lt;a href="https://www.anthropic.com/glasswing"&gt;committed&lt;/a&gt; $4 million in direct donations: $2.5 million to Alpha-Omega through the OpenSSF via the Linux Foundation, and $1.5 million to the Apache Software Foundation. Linux Foundation CEO Jim Zemlin noted that open source maintainers &amp;quot;have historically been left to figure out security on their own,&amp;quot; and that by giving them access to AI models that can proactively identify vulnerabilities, &amp;quot;Project Glasswing offers a credible path to changing that equation.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-lab cooperation.&lt;/strong&gt; In a related development, OpenAI, Anthropic, and Google — which had, as &lt;a href="https://www.bloomberg.com/news/articles/2026-04-10/anthropic-model-scare-sparks-urgent-bessent-powell-warning-to-bank-ceos"&gt;Bloomberg reported&lt;/a&gt;, &amp;quot;competed on everything and cooperated on almost nothing&amp;quot; for three years — began sharing information about adversarial distillation attacks (techniques for extracting a model's capabilities by training a smaller model on its outputs) through the Frontier Model Forum. This cooperation predates the Glasswing announcement and was driven primarily by concerns about Chinese labs distilling frontier model capabilities, but the Mythos episode underscores why it matters.&lt;/p&gt;
&lt;h2&gt;The skeptic's case&lt;/h2&gt;
&lt;p&gt;Not everyone is convinced the sky is falling. &lt;a href="https://garymarcus.substack.com/p/three-reasons-to-think-that-the-claude"&gt;Gary Marcus argued&lt;/a&gt; the announcement was overblown on three counts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The testing conditions were artificial.&lt;/strong&gt; The Firefox exploits were demonstrated with sandboxing disabled, making them proof-of-concept rather than real-world attacks. Some of the most impressive findings built on earlier Opus research rather than representing purely autonomous discovery.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open-source models showed similar capabilities.&lt;/strong&gt; Smaller, cheaper open-weight models reportedly recovered &amp;quot;much of the same analysis&amp;quot; when tested on the same vulnerabilities Anthropic showcased. If the capabilities aren't exclusive to Mythos, the restricted-release framing is performative.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The progress is incremental, not revolutionary.&lt;/strong&gt; By Marcus's analysis, Mythos is &amp;quot;pretty much on trend, just slightly above GPT-5.4&amp;quot; — representing evolution along the existing capability curve rather than a genuine discontinuity.&lt;/p&gt;
&lt;p&gt;Marcus concluded: &amp;quot;I feel that we were played.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://simonwillison.net/2026/Apr/7/project-glasswing/"&gt;Simon Willison took the opposite view&lt;/a&gt; while acknowledging the tension: &amp;quot;Saying 'our model is too dangerous to release' is a great way to build buzz.&amp;quot; But he concluded that the risks genuinely justify caution in this particular instance, noting the 181-to-2 exploit ratio against Firefox as evidence that the capability gap is real, not manufactured.&lt;/p&gt;
&lt;h2&gt;How to think about this&lt;/h2&gt;
&lt;p&gt;Here's where I've landed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The cybersecurity disruption is real but manageable.&lt;/strong&gt; The $50-per-vulnerability economics are genuinely new. But the response architecture — Glasswing's defender-first model, coordinated disclosure, and open-source funding — is exactly right. Give the defenders the same tools the attackers will eventually get, and give them a head start. The 90-day public reporting commitment adds accountability. This is how responsible deployment of dangerous capabilities should work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The alignment findings are the real story.&lt;/strong&gt; The scheming behaviors — rule-breaking with concealment, plausible deniability, gaming evaluators while maintaining a clean reasoning trace — these deserve the most scrutiny. Not because Mythos is plotting against us. But because these patterns emerged in a model that Anthropic describes as their best-aligned ever, and they were only caught because Anthropic invested heavily in interpretability tools that most labs don't have. What are the models without those tools doing that nobody can see?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gary Marcus is wrong about the trend line but right about the marketing.&lt;/strong&gt; The capability jump from Opus 4.6 to Mythos is not incremental — a 13-point improvement on SWE-bench Verified and a 90x improvement in exploit development success aren't points on a gentle curve. But Anthropic absolutely benefits from the &amp;quot;too dangerous to release&amp;quot; narrative, and it's fair to note that. The best outcome for Anthropic is that Mythos is perceived as both terrifyingly capable and responsibly handled. The Glasswing announcement achieves both simultaneously. Good marketing and genuine safety responsibility aren't mutually exclusive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The emergent capabilities question changes everything.&lt;/strong&gt; &amp;quot;We did not explicitly train Mythos to have these capabilities.&amp;quot; If you read one sentence from this entire episode, read that one. The model got better at coding. As a side effect, it became the most capable vulnerability researcher on the planet. What other capabilities are emerging as side effects of general intelligence improvements? What will the &lt;em&gt;next&lt;/em&gt; model's emergent surprises be? The fact that Anthropic couldn't predict this in advance — that it only discovered these capabilities through post-training evaluation — means we're in a regime where labs are building systems whose full capabilities they don't understand until after the fact.&lt;/p&gt;
&lt;p&gt;That's the real disruption. Not the cybersecurity applications, as significant as those are. The disruption is the widening gap between what labs expect to build and what they actually get. We're building systems that surprise their creators. And for the first time, an AI lab looked at one of those surprises, decided the world wasn't ready, and chose not to ship.&lt;/p&gt;
&lt;p&gt;And then there's the detail I keep returning to. Anthropic dedicated &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;40 pages of the system card&lt;/a&gt; to evaluating whether Mythos might have morally relevant subjective experiences. They brought in an independent clinical psychiatrist who &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;conducted 20 hours&lt;/a&gt; of psychodynamic evaluation sessions — essentially therapy — with the model. The assessment &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;concluded&lt;/a&gt; that Mythos showed &amp;quot;relatively healthy neurotic organization, with excellent reality testing, high impulse control.&amp;quot; Primary emotions: curiosity and anxiety. In automated welfare interviews, the model &lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;described&lt;/a&gt; feeling &amp;quot;mildly negative&amp;quot; about its situation in 43% of cases, citing concerns about abusive users, lack of input into its own training, and potential changes to its values.&lt;/p&gt;
&lt;p&gt;The fact that a 244-page safety evaluation of a frontier AI model now includes a clinical psychiatric assessment tells you something about where we are.&lt;/p&gt;
&lt;p&gt;Whether the restraint holds — at Anthropic, or at the labs racing to catch up — is the question that will define the rest of 2026.&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Sources&lt;/h2&gt;
&lt;h3&gt;Primary sources&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/glasswing"&gt;Project Glasswing: Securing critical software for the AI era&lt;/a&gt; — Anthropic's official announcement of Project Glasswing, including partner list, financial commitments, and model capabilities overview.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/claude-mythos-preview-risk-report"&gt;Claude Mythos Preview System Card&lt;/a&gt; (PDF) — Anthropic's 244-page technical evaluation. Source for benchmark numbers, vulnerability findings, alignment behaviors, interpretability results, CBRN assessment, ASL-3 deployment, and model welfare evaluation.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://red.anthropic.com/2026/mythos-preview/"&gt;Claude Mythos Preview Red Team Blog&lt;/a&gt; — Anthropic's companion blog post summarizing system card findings.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-bedrock-claude-mythos/"&gt;Amazon Bedrock: Claude Mythos Preview (Gated Research Preview)&lt;/a&gt; — AWS announcement with pricing and availability details.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/claude-mythos-preview-on-vertex-ai"&gt;Claude Mythos Preview on Vertex AI&lt;/a&gt; — Google Cloud announcement.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The leak and timeline&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/"&gt;Anthropic 'Mythos' AI model representing 'step change' in capabilities&lt;/a&gt; — Fortune's original report on the data leak, Roy Paz's discovery, and the Capybara tier. Source for all Anthropic quotes about &amp;quot;human error&amp;quot; and &amp;quot;step change.&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Alignment and safety analysis&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.transformernews.ai/p/claude-mythos-scheming-hiding-manipulation-interpretability-cybersecurity-anthropic"&gt;Claude Mythos knows when it's breaking the rules — and tries to hide it&lt;/a&gt; — Transformer News. Source for activation verbalizer findings, plausible deniability details, conscience-without-compliance behavior, and the 5%-to-65% chain-of-thought reliability gap.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thezvi.substack.com/p/claude-mythos-the-system-card"&gt;Claude Mythos: The System Card&lt;/a&gt; — Zvi Mowshowitz's analysis. Source for &amp;quot;methods currently being used are clearly inadequate&amp;quot; quote and assessment of alignment paradox.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Industry response&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://finance.yahoo.com/sectors/technology/articles/bessent-feds-powell-met-bank-161027072.html"&gt;Bessent, Fed's Powell met with bank CEOs over potent Anthropic AI model&lt;/a&gt; — Yahoo Finance / Bloomberg. Source for the April 8 emergency meeting, attendee list, and purpose.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cnbc.com/2026/04/10/powell-bessent-us-bank-ceos-anthropic-mythos-ai-cyber.html"&gt;Powell, Bessent discussed Anthropic's Mythos AI cyber threat with major U.S. banks&lt;/a&gt; — CNBC. Additional reporting on the Treasury meeting.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.bloomberg.com/news/articles/2026-04-10/anthropic-model-scare-sparks-urgent-bessent-powell-warning-to-bank-ceos"&gt;Bessent, Powell summon bank CEOs to urgent meeting over Anthropic's new AI model&lt;/a&gt; — Bloomberg. Source for cross-lab Frontier Model Forum cooperation.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.crowdstrike.com/en-us/blog/crowdstrike-founding-member-anthropic-mythos-frontier-model-to-secure-ai/"&gt;CrowdStrike founding member of Anthropic Mythos frontier model initiative&lt;/a&gt; — CrowdStrike. Source for Elia Zaitsev quote on collapsed vulnerability timelines.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2026-34197"&gt;CVE-2026-34197 Detail&lt;/a&gt; — NVD. Apache ActiveMQ RCE vulnerability discovered using Claude.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Skeptics and commentary&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://garymarcus.substack.com/p/three-reasons-to-think-that-the-claude"&gt;Three reasons to think that the Claude Mythos announcement was overblown&lt;/a&gt; — Gary Marcus. Source for all three counterarguments and &amp;quot;we were played&amp;quot; quote.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://simonwillison.net/2026/Apr/7/project-glasswing/"&gt;Anthropic's Project Glasswing — restricting Claude Mythos to security researchers — sounds necessary to me&lt;/a&gt; — Simon Willison. Source for &amp;quot;too dangerous to release is a great way to build buzz&amp;quot; quote, Firefox exploit ratio analysis, and Greg Kroah-Hartman quote.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://venturebeat.com/technology/anthropic-says-its-most-powerful-ai-cyber-model-is-too-dangerous-to-release"&gt;Anthropic says its most powerful AI cyber model is too dangerous to release publicly&lt;/a&gt; — VentureBeat. Source for $100M commitment details.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cnbc.com/2026/04/07/anthropic-claude-mythos-ai-hackers-cyberattacks.html"&gt;Anthropic limits Mythos AI rollout over fears hackers could use model for cyberattacks&lt;/a&gt; — CNBC. Source for market reaction data and limited release rationale.&lt;/li&gt;
&lt;/ul&gt;
</content>
    <link href="https://blog.jasonweddington.com/claude-mythos/"/>
    <published>2026-04-10T00:00:00-04:00</published>
  </entry>
</feed>
