A Case Study in Agentic Engineering

In three days, I added a full-stack visual graph explorer — with animated search traversal, multi-turn chat, and write-back tools — to a system that had zero frontend code. Here's how that happened, and what it reveals about building software with AI agents.

This is how we build now

Setting the stage

I built Personal KB to solve a specific problem: experiential learning evaporates when you work with AI agents. Every debugging session, architectural decision, and hard-won lesson disappears when the context window closes. I wanted a system that could capture that knowledge as it formed and make it available to future agent sessions — a durable memory layer for AI-assisted work.

The design was agent-first by intent, not by oversight. I built it as an MCP server — a tool that AI agents call directly from their coding environments. No GUI, no dashboard, no web app. The agents were the users. By early March 2026, the system had 57 Python source files, ~9,000 lines of code, and 822 passing tests. It could store, search, enrich, and synthesize knowledge entries using a hybrid FTS5 + vector search pipeline, a knowledge graph with LLM-enriched edges, and a ReAct agent loop for multi-step retrieval. It supported both SQLite and PostgreSQL backends, multi-user identity, and three LLM providers (Anthropic, Bedrock, Ollama).

Then something shifted. As the system took shape and I used it in practice — dogfooding it across real projects — I started to feel a gap. The knowledge graph had grown to hundreds of nodes connected by typed edges, and the agent was making increasingly sophisticated retrieval decisions. But I couldn't see any of it. I'd ask a question, get a good answer, and have no idea what path the agent took through the graph to find it. Was it using the right edges? Were there orphan clusters of knowledge going untouched? Was the LLM enrichment actually producing useful connections?

The opacity wasn't a bug in the original design — it was the right call for the agent-as-user model. But it became a problem for the human behind the agent. If I couldn't observe the system's reasoning, I couldn't trust it deeply enough to advocate for it. And if I wanted other people to adopt it, I'd need more than "trust me, the graph is really cool." I'd need to show them.

This is a pattern I've come to recognize in agentic engineering: the requirements you discover by using the system are more valuable than the ones you plan upfront. The explorer wasn't on any roadmap. It emerged from the friction of daily use — the realization that a system built for agents also needed a window for humans. Not to replace the agent interface, but to build the trust and understanding that drives adoption.

The flywheel logic crystallized: wow factor drives adoption, adoption drives more knowledge stored, more knowledge drives better answers, better answers drive more adoption. I needed a visual experience that made the invisible visible.

The prompt

On March 4, 2026, I wrote a single prompt in a notes file. No spec, no wireframes, no ticket in the backlog:

Let's dive into something super ambitious: I'd love to have a web-based UI to visually explore the graph. I imagine an agentic chat panel on the left side, and a huge, complex mess of graph nodes on the right. After I asked a question in the chat (we'll automatically route to kb_ask with appropriate strategy, or kb_summarize, based on the question), the graph would animate on the right side of the screen while the answer to my query streamed in on the left in the chat session.

This is mostly for wow factor, but partly utility. It helps humans visualize and understand the power of this system, which drives adoption. Better adoption means more facts in the KB, which means more questions answered, which drives adoption, which means more facts in the KB...

Spin up 4 teammates to debate the technical, practical, UX, etc. challenges of such an experience. First and foremost, can we embed (via subprocess maybe?) a modern web app inside an MCP server? Could there be a "launch_explorer" tool that an agent could call to open my browser? Is that even a thing? Explore all angles of this.

That's it. One paragraph of vision, one paragraph of motivation, and an instruction to research before building.

What happened next

Step 1: Research before code (4 parallel agents)

Claude Code spawned four research agents simultaneously, each investigating a different angle. Their reports were saved to docs/graph-explorer-research/:

  1. MCP Architecture — Could a web server coexist with MCP's stdio transport? Yes: TCP and stdio don't conflict. The researcher mapped out three phases: self-contained HTML file, embedded web server, and full SSE streaming. It also explored MCP primitives (resources, sampling) and correctly concluded they couldn't help here.

  2. Frontend Visualization — Compared five graph libraries (force-graph, Sigma.js, Cytoscape.js, Reagraph, vis.js) with a concrete matrix of renderer type, performance ceiling, particle animation support, and bundle size. The winner: force-graph — its built-in emitParticle() method mapped perfectly to the "graph lights up during search" vision. Also recommended Svelte over React, with reasoning (2.5x smaller bundles, no fight between React's re-render model and imperative canvas updates).

  3. Backend & Streaming — Analyzed the existing agentic_query() code path and designed the exact SSE event catalog (17 event types). Proposed adding an event_callback parameter to the agent loop — ~20 lines of changes. Identified that the codebase was already well-factored for this: all query functions take a Database object, not tied to FastMCP.

  4. Practical Tradeoffs — The contrarian voice. Pointed out that adding a web frontend would double the codebase's surface area, in a language I had less expertise in. Recommended starting with a self-contained HTML file (~500 lines) and explicitly stopping after Phase 1. Ranked the maintenance options. Noted the JS dependency-treadmill risk. Final verdict: "One session to build kb_explore. If the screenshot makes you want to tweet it, consider Phase 2. If it doesn't, the experiment cost one afternoon."

The research took one round of agent spawning. No code was written yet. The four reports gave me a shared mental model with the agent — we could now make decisions using shared vocabulary (Phase 1/2/3, force-graph, emitParticle, SSE event catalog, self-contained HTML).

Step 2: Build, ship, iterate (3 days, 36 feature/fix commits)

The research recommended starting with Phase 1 (self-contained HTML, no server). Here's what actually happened — the agent blew through Phases 1, 2, and 3 in a single weekend:

Day 1 — March 5 (18 commits, v0.32.0 → v0.40.0)

The first commit landed at 8:41 AM: feat: kb_explore — interactive graph explorer in browser. 894 lines across 8 files. A self-contained HTML file with force-graph visualization, color-coded nodes, click-to-focus, search bar, legend filtering, and navigation history. This was Phase 1 — complete.
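
Phase 1 really can be this small at its core. Here is a hedged sketch of a kb_explore-style tool, with hypothetical names throughout (the template and function are illustrative, not the commit's actual 894 lines): serialize the graph to JSON, inline it into a self-contained HTML file that loads force-graph from a CDN, and open the browser.

```python
import tempfile
import webbrowser
from pathlib import Path

# Illustrative Phase 1 template: graph data is inlined at generation time,
# so the file works from a file:// URL with no server process at all.
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head><script src="https://unpkg.com/force-graph"></script></head>
<body>
<div id="graph"></div>
<script>
  const data = __GRAPH_DATA__;
  ForceGraph()(document.getElementById("graph")).graphData(data);
</script>
</body>
</html>
"""

def kb_explore(graph_json: str, open_browser: bool = True) -> Path:
    """Write a self-contained explorer HTML file and (optionally) open it."""
    html = HTML_TEMPLATE.replace("__GRAPH_DATA__", graph_json)
    out = Path(tempfile.mkdtemp()) / "kb_explorer.html"
    out.write_text(html, encoding="utf-8")
    if open_browser:
        webbrowser.open(out.as_uri())
    return out
```

Because the data is inlined, there is nothing to deploy and nothing to maintain — which is exactly why the pragmatist report pitched this as a one-afternoon experiment.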

I opened it, saw the graph, and kept going. Within the same day:

  • feat: query-driven graph explorer with SSE streaming — Phases 2 and 3 in one commit. Embedded a uvicorn web server, added SSE endpoints, wired the agent loop's event_callback for real-time streaming. The graph now animated during search.
  • feat: render markdown in explorer response panel — answers rendered with proper formatting
  • feat: staggered node-by-node traversal animation — instead of all results appearing at once, nodes revealed one by one with a glow effect and smooth camera panning
  • feat: info panel improvements — bold labels, confidence percentages, entry accordions, explore results
  • feat: multi-turn chat in graph explorer — full conversational chat with session management, token budget trimming, iMessage-style UI, typing indicators, and clickable citation links
  • feat: chat panel slide transition and visual polish
  • Plus 5 bug fixes for edge cases discovered through live testing
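
The SSE plumbing behind that streaming commit is conceptually tiny. A minimal sketch, assuming the agent loop pushes `(event, data)` tuples onto an `asyncio.Queue` with a `None` sentinel when done (the event names and sentinel convention are my assumptions, not the actual 17-event catalog):

```python
import asyncio
import json

def format_sse(event: str, data: dict) -> str:
    """Encode one server-sent event in the text/event-stream wire format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

async def event_stream(queue: "asyncio.Queue"):
    """Yield SSE-encoded chunks until a None sentinel marks the agent done."""
    while True:
        item = await queue.get()
        if item is None:               # sentinel: agent loop finished
            yield format_sse("done", {})
            return
        event, data = item
        yield format_sse(event, data)
```

An ASGI framework's streaming response (the real explorer embeds uvicorn) can consume a generator like this directly as a `text/event-stream` body, while the browser's `EventSource` dispatches each named event to the graph animation.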

By end of Day 1, the explorer had: a force-directed graph with animated traversal, SSE streaming, a multi-turn chat panel with markdown rendering, and an info panel — far beyond the "one session, self-contained HTML" plan.

Day 2 — March 6 (5 commits, v0.42.0 → v0.46.0)

  • feat: zoom-aware label visibility — labels appear/disappear based on zoom level
  • feat: explore port kill, Sonnet synthesis, metadata-only updates — upgraded the synthesis LLM to Sonnet for higher-quality human-facing answers
  • feat: Bedrock retry/timeout, classifier fix, explore→chat bridge — clicking "Explore" in the graph seamlessly transitions to a chat follow-up
  • feat: explorer UX polish — chat header, copy buttons, maximize toggle, textareas, zoom cap

Day 3 — March 7 (13 commits, v0.46.1 → v0.53.0)

  • feat: explorer thinking visuals — nodes dim, glow, and pulse while the agent searches
  • feat: explorer chat write-back — the chat can now modify the KB (update entries, ingest URLs) via a mini-ReAct loop
  • feat: add get_entry read tool to explorer chat
  • feat: add explorer ingest URL button, project dropdown, and filter-only search — full CRUD from the browser
  • feat: add file upload, multi-URL, and progress streaming to explorer ingest
  • feat: auto-start explorer web server on MCP server startup — zero-click launch
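
The auto-start pattern in that last commit can be sketched with nothing but the standard library: the web server lives on a daemon thread, so it never blocks MCP's stdio loop and dies with the process. The real explorer embeds uvicorn; stdlib `http.server` stands in here to keep the sketch dependency-free, and the handler is purely illustrative.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class ExplorerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<h1>KB Explorer</h1>"        # would serve the real app here
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep stdout/stderr quiet: the MCP transport owns stdio.
        pass

def start_explorer_server(port: int = 0) -> int:
    """Start the explorer on a daemon thread; returns the bound port.

    port=0 asks the OS for a free ephemeral port, so startup never
    collides with another instance.
    """
    server = ThreadingHTTPServer(("127.0.0.1", port), ExplorerHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]
```

This is the same coexistence the MCP Architecture report verified: a TCP listener and a stdio transport don't contend for anything, so "zero-click launch" reduces to starting the thread during server init.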

Day 4 — March 8 (bug fixes and polish)

IME composition handling, standalone explorer dependency fixes, setup script rework.

By the numbers

  • My prompts: ~15-20 (estimated; one per feature/fix cycle)
  • Calendar days: 3 (active development)
  • Feature/fix commits: 36
  • Versions released: v0.32.0 → v0.53.0 (22 releases)
  • Lines of explorer code: 5,185 lines added
  • Final explorer codebase: 3,613 lines (Python + HTML/JS)
  • Test files: 9+ new test files
  • Research artifacts: 4 reports, ~1,200 lines of analysis

What made this work

Research-first, not code-first. My initial prompt didn't say "build me a graph explorer." It said "spin up 4 teammates to debate the challenges." The research phase produced shared vocabulary and a phased plan that let me steer at the right altitude. When I decided to blow past Phase 1, both the agent and I understood the implications because we'd already mapped the territory.

The codebase was ready. The research agents noted that all query functions were standalone async functions taking a Database object — not tied to FastMCP. The event_callback pattern needed ~20 lines of changes to the agent loop. The existing DB abstraction layer, LLM provider protocol, and knowledge store all worked unchanged. Months of clean architecture paid off in a single weekend.

Tight feedback loops. Each prompt produced a working, testable increment. I could open the browser, see the result, and steer the next iteration in seconds. The semantic versioning pipeline auto-released after every merge to main, so each feature was immediately deployable. The conversation wasn't "build the whole thing" — it was a rapid series of "now add X" / "fix Y" / "make Z better."

I steered, the agent built. My role was product vision and live QA. I described what I wanted to experience ("the graph would animate while the answer streamed in"), not what to implement. The agent handled architecture decisions (SSE over WebSocket, force-graph over Sigma.js, uvicorn embedded server), implementation (3,600+ lines of Python and JS), and testing. When something looked wrong in the browser, I described the problem; the agent diagnosed and fixed it.

Ambition was the strategy. The pragmatist research agent explicitly warned against scope creep and recommended stopping after Phase 1. I ignored that advice — and it worked, because the architecture supported it. The "wow factor" goal demanded pushing past the minimum viable product. The result was a feature that went from "does this even work inside an MCP server?" to "full-stack app with animated graph traversal, multi-turn chat, and write-back tools" in three days.

So what?

The point of this case study isn't "look how fast I built a thing." It's that the way I worked with the AI is qualitatively different from what I see most engineers doing — and that gap represents an enormous opportunity.

Here's the pattern I used: vision → research → build → observe → steer → repeat. At no point did I dictate function signatures, argue about variable names, or review individual lines of code. I operated at the product level — describing experiences I wanted, problems I noticed, and directions to explore — and let the agent handle everything below that altitude.

Compare that to what I typically see: an engineer writes a detailed spec, pastes it into a chat, reviews the output line by line, asks for tweaks to specific functions, and essentially uses the AI as a faster keyboard. That works, but it leaves 90% of the leverage on the table. It's like hiring a senior engineer and then dictating every semicolon.

The techniques that made the explorer possible aren't exotic. They're learnable:

  1. Start with research, not code. Before building anything, have the agent investigate the problem space from multiple angles — in parallel. This builds shared context and gives you a vocabulary to steer with. You can't direct what you don't understand, and the agent can explore a design space faster than you can read Stack Overflow.

  2. Steer at the experience level. Say "the graph should animate while the answer streams in," not "add a setTimeout callback in the render loop." Describe what you want to feel when you use the software. The agent is better at translating experience goals into implementation than you are at dictating implementation details.

  3. Build the guardrails first — then let go of the wheel. This is the key enabler that makes everything else possible. Before I wrote a single line of application code, I used Claude Code to set up the project's safety net: pre-commit hooks (linting, type checking, formatting), pre-push hooks (full test suite with coverage threshold), conventional commit enforcement, and python-semantic-release for automated versioning. The coverage threshold ratchets — whenever new code raises the floor, Claude knows to bump fail_under so it can never regress.

     The entire rich ecosystem of support tooling that human engineers built over decades to keep themselves from making mistakes — automated tests, coverage gates, commit conventions, semantic versioning — turns out to be perfect for AI steering. These tools don't care whether a human or an agent wrote the code. They enforce the same standards either way. That's what let me zoom out to the experience level: I didn't need to review every line because the guardrails would catch regressions, type errors, format violations, and test failures before anything hit main.

     And here's the irony — I got the agent to set up all of this tooling. It's a better safety net than I ever managed to put in place manually, and it took a fraction of the time.

  4. Let requirements emerge. The explorer wasn't planned. It emerged from using the system and noticing what was missing. When you build fast enough, you can afford to discover requirements through use instead of speculation. That produces better software than any upfront planning session.

  5. Trust the agent with architecture. SSE over WebSocket, force-graph over Sigma.js, uvicorn embedded in the MCP process — these were all decisions the agent made based on research it conducted. I didn't second-guess them. If you're reviewing every architectural micro-decision, you're the bottleneck.
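
The guardrail setup from point 3 is mostly configuration, not code. A hedged sketch of the coverage ratchet, assuming coverage.py and pytest-cov (the paths and threshold are illustrative, not the actual Personal KB config):

```toml
# pyproject.toml (fragment): illustrative, not the actual project config.

[tool.pytest.ini_options]
# Pre-push runs the suite; the run fails if coverage drops below the floor.
addopts = "--cov=src --cov-fail-under=90"

[tool.coverage.report]
fail_under = 90   # the ratchet: bump this whenever new code raises the floor
```

Either gate alone enforces the floor; stating it in both places keeps a bare `coverage report` honest too. The point is that the agent edits this number upward as coverage improves, and CI makes regression impossible without a deliberate, visible change.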

What altitude are you operating at? Are you steering at the product level, or are you micro-managing at the function level? Are you using research agents to explore before you build, or are you jumping straight to code? Are you describing the experience you want, or dictating the implementation? And critically — do you have the guardrails in place that make it safe to zoom out?

If you don't have pre-commit hooks, coverage gates, and automated regression suites, that's your first prompt. Lock that in, and then zoom out. Every session after that gets faster.