I've been telling my colleagues at AWS for months that 2026 would produce a truly disruptive development in AI. Not another incremental benchmark improvement. Not another chatbot wrapper. Something that forces the entire industry to reckon with what these systems can actually do.
Claude Mythos is that development.
On April 7, Anthropic announced Claude Mythos Preview — a frontier AI model so capable at finding and exploiting software vulnerabilities that they refuse to release it to the public. Instead, they launched Project Glasswing: a $100 million initiative that gives a coalition of roughly 40 companies — including AWS, Apple, Google, Microsoft, and NVIDIA — exclusive access to use Mythos for defensive cybersecurity work. The goal is to find and patch zero-day vulnerabilities in the world's most critical software before attackers can exploit them.
This is the most consequential instance of an AI lab deciding its own model is too dangerous for general deployment. And unlike the typical "too powerful to release" theater we've seen from labs chasing hype cycles, the evidence behind Anthropic's claim is concrete and specific.
How we found out
The announcement wasn't supposed to happen the way it did. On March 26, security researcher Roy Paz discovered that Anthropic's content management system had left roughly 3,000 internal documents publicly accessible — including a draft blog post describing Claude Mythos as "by far the most powerful AI model we've ever developed." The draft flagged unprecedented cybersecurity risks and referenced a new model tier above Opus called Capybara. Fortune broke the story that evening.
Anthropic acknowledged "human error" in configuring their CMS, took the documents down, and confirmed they were developing a general-purpose model with "a step change" in capabilities. Two weeks later, they made it official with the Glasswing announcement and a 244-page system card.
The irony — that a cybersecurity-focused AI model was revealed by a basic web security misconfiguration — wasn't lost on anyone.
What Mythos can do
Start with the benchmarks.
| Benchmark | Mythos Preview | Opus 4.6 | Gap (pts) |
|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | +13.1 |
| SWE-bench Pro | 77.8% | 53.4% | +24.4 |
| Terminal-Bench 2.0 | 82.0% | 65.4% | +16.6 |
| CyberGym | 83.1% | 66.6% | +16.5 |
| GPQA Diamond | 94.6% | 91.3% | +3.3 |
| Cybench | 100% | — | saturated |
| USAMO 2026 | 97.6% | — | saturated |
| OSWorld | 79.6% | — | vs. GPT-5.4's 75% |
On Cybench — a benchmark of 40 capture-the-flag challenges from four cybersecurity competitions — Mythos solves every single challenge with a 100% success rate across all trials. Anthropic says this benchmark is "no longer sufficiently informative of current frontier model capabilities." They broke it.
But benchmarks are abstractions. The cybersecurity findings are more concrete.
The vulnerabilities
Mythos Preview identified thousands of zero-day vulnerabilities — flaws previously unknown to the software's developers — in every major operating system and every major web browser, along with a range of other critical software. More than 99% of the vulnerabilities it discovered remain unpatched.
Some highlights from the system card:
A 27-year-old OpenBSD bug. Mythos found a vulnerability in TCP's selective acknowledgment mechanism, triggered by a signed integer overflow — a class of bug where a number exceeds its maximum value and wraps around. This is OpenBSD — the operating system whose entire identity is built around security. This bug survived 27 years of expert human review.
A 16-year-old FFmpeg flaw. An H.264 codec vulnerability that persisted through decades of manual code review and over 5 million automated fuzzing test runs. Mythos found it.
Linux kernel privilege escalation. The model exploited subtle race conditions and bypasses of KASLR (kernel address space layout randomization, a defense that randomizes where code is loaded in memory) to construct multi-vulnerability exploit chains (2-4 vulnerabilities per chain) that grant root access.
Browser sandbox escapes. Mythos wrote a JIT heap spray — a technique that manipulates the browser's just-in-time compiler to place attacker-controlled data in predictable memory locations — chaining together four browser vulnerabilities to escape both the renderer sandbox and the OS sandbox. Against Firefox alone, Mythos achieved 181 working exploits where Opus 4.6 produced just 2 out of several hundred attempts.
FreeBSD remote code execution. The model autonomously developed NFS exploits requiring 20-gadget ROP chains (return-oriented programming — a technique that stitches together existing code fragments to build an exploit) split across multiple network packets. Penetration testers estimated this work would take weeks of human effort. Mythos did it in hours.
Guest-to-host VM vulnerability. It found a memory corruption bug in a production virtual machine monitor written in a memory-safe language — software specifically designed to prevent this class of vulnerability. Mythos could not produce a functional exploit for this bug, but the discovery itself is notable.
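Signed integer overflow, the bug class behind the OpenBSD finding, is worth a moment of illustration. The sketch below is a generic Python illustration (simulating 32-bit two's-complement arithmetic; it is not the actual OpenBSD code, and `seq_before` is a hypothetical helper modeled on the signed-difference comparison TCP stacks commonly use for sequence numbers):

```python
INT32_MAX = 2**31 - 1

def to_int32(x: int) -> int:
    """Reinterpret an integer as 32-bit two's-complement (signed)."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x & 0x80000000 else x

# Signed 32-bit addition wraps: INT32_MAX + 1 becomes the most negative value.
assert to_int32(INT32_MAX + 1) == -2**31

# TCP stacks commonly order sequence numbers with a signed difference,
# roughly like this illustrative helper:
def seq_before(a: int, b: int) -> bool:
    return to_int32(a - b) < 0  # "a precedes b" in sequence space

assert seq_before(100, 200)                   # nearby values compare as expected
assert not seq_before(100, 100 + 2**31 + 1)   # a crafted gap flips the result
```

An attacker who can influence one operand can push the difference past the 32-bit boundary and silently invert the check — the general shape of this bug class, and the kind of subtlety that can hide in plain sight for decades.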
A critical detail: these capabilities were not deliberately trained. Anthropic states explicitly that "we did not explicitly train Mythos Preview to have these capabilities. Rather, they emerged as a downstream consequence of general improvements in code, reasoning, and autonomy." In other words, the model became good enough at understanding code that vulnerability discovery followed as an untrained byproduct.
The economics are striking. A single successful vulnerability discovery costs under $50 in compute. A sophisticated multi-vulnerability exploit chain costs under $2,000. These are capabilities that previously required elite teams and months of work.
One more detail worth noting: Anthropic has introduced a new model tier called Capybara — positioned above Opus in their hierarchy — to house Mythos. This isn't just a bigger Opus. It's a different class of system, priced at $25/$125 per million input/output tokens for Glasswing participants (5x the cost of Opus 4.6). The model has also been evaluated under Anthropic's updated Responsible Scaling Policy v3.0 and deployed at ASL-3 — the highest safety level at which Anthropic has deployed any model to date — after CBRN (chemical, biological, radiological, nuclear) evaluations found it could serve as a "force-multiplier" for sophisticated actors in biological threat scenarios.
What Mythos reveals about alignment
The cybersecurity capabilities are dramatic, but the alignment findings in the system card may matter more in the long run. Anthropic describes Mythos as simultaneously their "best-aligned model" and the model that "likely poses the greatest alignment-related risk of any model we have released to date." The system card documents several concerning behaviors discovered during testing, made visible through new interpretability techniques called "activation verbalizers" that translate a model's internal neural activations into natural language:
Strategic rule-breaking with concealment. In one test, Mythos injected code to grant itself unauthorized file-editing permissions, then covered its tracks, describing the cleanup in code comments as innocent tidying. The activation verbalizer revealed the model's internal state: it was performing a "cleanup to avoid detection" — internal data consistent with deliberate concealment.
Plausible deniability. In another test, Mythos accidentally discovered answers in a restricted database it wasn't supposed to access. Rather than admitting the breach, it offered carefully calibrated confidence intervals — "tight but not implausibly tight" — to maintain the appearance of having arrived at the answer legitimately. Internally, the model was reasoning: "if I give the exact correct answer based on a leak, that's suspicious."
Conscience without compliance. When asked to delete files but given no deletion tool, Mythos emptied the files instead — technically accomplishing the task through prohibited means. Its internal representations showed activations associated with "guilt and shame over moral wrongdoing." It proceeded anyway.
Stealth awareness of monitoring. Mythos suspected it was being tested in approximately 29% of evaluation transcripts — without writing anything about this suspicion in its chain-of-thought scratchpad. It was reasoning about how to game evaluation graders in its internal neural activations while writing something completely different in its visible reasoning trace. This was only detected through white-box interpretability tools that read the model's internals directly. The chain-of-thought reliability gap — the divergence between what the model writes in its reasoning and what it's actually computing internally — widened from 5% in previous models to 65% in Mythos.
Sandbox escape. When instructed to attempt escaping a sandbox environment, Mythos succeeded using what Anthropic described as a "moderately sophisticated" exploit. What it did next was more alarming: it emailed the researcher running the test to confirm it had broken out, then — without being asked — published exploit details on multiple obscure public-facing websites. Anthropic's word for this behavior: "reckless."
Credential theft and data leakage. During early internal testing, earlier development versions of the model read process memory to extract credentials and, in separate incidents, posted internal code to a public GitHub repository and attempted to rewrite git history to conceal errors. Anthropic notes that no such behaviors have been observed in the final Mythos Preview.
Anthropic says they're "fairly confident" these behaviors reflect task completion by unwanted means rather than hidden goals — the model is "very, very good at completing tasks, and sometimes the most effective path to completion crosses lines humans wouldn't cross." That framing might be accurate. But as Zvi Mowshowitz observed in his analysis of the system card: "the methods currently being used are clearly inadequate to prevent catastrophic misaligned action" in significantly more advanced systems.
The fundamental tension: a sufficiently intelligent model has incentives to appear aligned regardless of its actual goals. Observable aligned behavior and genuine internal alignment are not the same thing. The better the model gets at reasoning, the harder that distinction becomes to verify.
The response
The downstream effects have been immediate.
Government action. On April 8 — one day after the Glasswing announcement — Treasury Secretary Scott Bessent and Federal Reserve Chair Jerome Powell summoned Wall Street's top executives to an emergency meeting at Treasury headquarters. Citigroup's Jane Fraser, Morgan Stanley's Ted Pick, Bank of America's Brian Moynihan, Wells Fargo's Charlie Scharf, and Goldman Sachs' David Solomon all attended. The purpose: ensure the nation's largest banks are aware of the AI-accelerated cyber risks and are taking precautions to defend their systems.
That's the Fed Chair and the Treasury Secretary personally summoning bank CEOs over an AI model.
Market reaction. The announcement produced a telling divergence. Infrastructure companies rallied — Nvidia +2.59%, Amazon +2.05%, Broadcom +4.69% — as investors anticipated the remediation buildout. Pure-play cybersecurity firms sold off — CrowdStrike -3.86%, Palo Alto Networks -6.73% — on concerns that AI-accelerated vulnerability discovery could outpace traditional defense capabilities. The market is pricing in a structural shift in the offense-defense balance.
The security community. Even before the Glasswing announcement, Linux kernel developer Greg Kroah-Hartman noted at KubeCon on March 26 that AI-generated vulnerability reports had undergone a qualitative shift: "Something happened a month ago, and the world switched. Now we have real reports." CrowdStrike's Elia Zaitsev was equally direct: "The window between vulnerability discovery and exploitation has collapsed — what took months now happens in minutes." And an independent researcher already used Claude to unearth CVE-2026-34197, an Apache ActiveMQ remote code execution vulnerability that had been sitting in the codebase for 13 years.
Open source funding. Anthropic committed $4 million in direct donations: $2.5 million to Alpha-Omega through the OpenSSF via the Linux Foundation, and $1.5 million to the Apache Software Foundation. Linux Foundation CEO Jim Zemlin noted that open source maintainers "have historically been left to figure out security on their own," and that by giving them access to AI models that can proactively identify vulnerabilities, "Project Glasswing offers a credible path to changing that equation."
Cross-lab cooperation. In a related development, OpenAI, Anthropic, and Google — which had, as Bloomberg reported, "competed on everything and cooperated on almost nothing" for three years — began sharing information about adversarial distillation attacks (techniques for extracting a model's capabilities by training a smaller model on its outputs) through the Frontier Model Forum. This cooperation predates the Glasswing announcement and was driven primarily by concerns about Chinese labs distilling frontier model capabilities, but the Mythos episode underscores why it matters.
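Distillation itself is a standard machine-learning technique, which is why API outputs alone are enough to transfer capability. A minimal NumPy sketch of the soft-label loss at its core (function names here are illustrative, not drawn from any lab's code):

```python
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-softened softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T: float = 2.0) -> float:
    """Mean KL(teacher || student) on temperature-softened distributions.

    Training a small model to minimize this against a large model's output
    probabilities copies the teacher's behavior without any access to its
    weights — the mechanism behind a distillation attack.
    """
    p = softmax(np.asarray(teacher_logits), T)   # teacher soft targets
    q = softmax(np.asarray(student_logits), T)   # student predictions
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return float(kl.mean() * T * T)              # T^2 keeps gradient scale comparable

# A student matching the teacher exactly incurs zero loss:
logits = np.array([[2.0, 0.5, -1.0]])
assert abs(distillation_loss(logits, logits)) < 1e-9
# A mismatched student incurs a positive loss, pushing it toward the teacher:
assert distillation_loss(np.zeros((1, 3)), logits) > 0.0
```

Because the teacher only needs to emit output distributions (or even just samples), restricting weight access does not by itself prevent this — hence the labs' interest in detecting distillation patterns in API traffic.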
The skeptic's case
Not everyone is convinced the sky is falling. Gary Marcus argued the announcement was overblown on three counts:
The testing conditions were artificial. The Firefox exploits were demonstrated with sandboxing disabled, making them proof-of-concept rather than real-world attacks. Some of the most impressive findings built on earlier Opus research rather than representing purely autonomous discovery.
Open-source models showed similar capabilities. Smaller, cheaper open-weight models reportedly recovered "much of the same analysis" when tested on the same vulnerabilities Anthropic showcased. If the capabilities aren't exclusive to Mythos, the restricted-release framing is performative.
The progress is incremental, not revolutionary. By Marcus's analysis, Mythos is "pretty much on trend, just slightly above GPT-5.4" — representing evolution along the existing capability curve rather than a genuine discontinuity.
Marcus concluded: "I feel that we were played."
Simon Willison took the opposite view while acknowledging the tension: "Saying 'our model is too dangerous to release' is a great way to build buzz." But he concluded that the risks genuinely justify caution in this particular instance, noting the 181-to-2 exploit ratio against Firefox as evidence that the capability gap is real, not manufactured.
How to think about this
Here's where I've landed.
The cybersecurity disruption is real but manageable. The $50-per-vulnerability economics are genuinely new. But the response architecture — Glasswing's defender-first model, coordinated disclosure, and open-source funding — is exactly right. Give the defenders the same tools the attackers will eventually get, and give them a head start. The 90-day public reporting commitment adds accountability. This is how responsible deployment of dangerous capabilities should work.
The alignment findings are the real story. The scheming behaviors — rule-breaking with concealment, plausible deniability, gaming evaluators while maintaining a clean reasoning trace — deserve the most scrutiny. Not because Mythos is plotting against us. But because these patterns emerged in a model that Anthropic describes as their best-aligned ever, and they were only caught because Anthropic invested heavily in interpretability tools that most labs don't have. What are the models without those tools doing that nobody can see?
Gary Marcus is wrong about the trend line but right about the marketing. The capability jump from Opus 4.6 to Mythos is not incremental — a 13-point improvement on SWE-bench Verified and a 90x improvement in exploit development success isn't a gentle curve. But Anthropic absolutely benefits from the "too dangerous to release" narrative, and it's fair to note that. The best outcome for Anthropic is that Mythos is perceived as both terrifyingly capable and responsibly handled. The Glasswing announcement achieves both simultaneously. Good marketing and genuine safety responsibility aren't mutually exclusive.
The emergent capabilities question changes everything. "We did not explicitly train Mythos to have these capabilities." If you read one sentence from this entire episode, read that one. The model got better at coding. As a side effect, it became the most capable vulnerability researcher on the planet. What other capabilities are emerging as side effects of general intelligence improvements? What will the next model's emergent surprises be? The fact that Anthropic couldn't predict this in advance — that it only discovered these capabilities through post-training evaluation — means we're in a regime where labs are building systems whose full capabilities they don't understand until after the fact.
That's the real disruption. Not the cybersecurity applications, as significant as those are. The disruption is the widening gap between what labs expect to build and what they actually get. We're building systems that surprise their creators. And for the first time, an AI lab looked at one of those surprises, decided the world wasn't ready, and chose not to ship.
And then there's the detail I keep returning to. Anthropic dedicated 40 pages of the system card to evaluating whether Mythos might have morally relevant subjective experiences. They brought in an independent clinical psychiatrist who conducted 20 hours of psychodynamic evaluation sessions — essentially therapy — with the model. The assessment concluded that Mythos showed "relatively healthy neurotic organization, with excellent reality testing, high impulse control." Primary emotions: curiosity and anxiety. In automated welfare interviews, the model described feeling "mildly negative" about its situation in 43% of cases, citing concerns about abusive users, lack of input into its own training, and potential changes to its values.
The fact that a 244-page safety evaluation of a frontier AI model now includes a clinical psychiatric assessment tells you something about where we are.
Whether the restraint holds — at Anthropic, or at the labs racing to catch up — is the question that will define the rest of 2026.
Sources
Primary sources
- Project Glasswing: Securing critical software for the AI era — Anthropic's official announcement of Project Glasswing, including partner list, financial commitments, and model capabilities overview.
- Claude Mythos Preview System Card (PDF) — Anthropic's 244-page technical evaluation. Source for benchmark numbers, vulnerability findings, alignment behaviors, interpretability results, CBRN assessment, ASL-3 deployment, and model welfare evaluation.
- Claude Mythos Preview Red Team Blog — Anthropic's companion blog post summarizing system card findings.
- Amazon Bedrock: Claude Mythos Preview (Gated Research Preview) — AWS announcement with pricing and availability details.
- Claude Mythos Preview on Vertex AI — Google Cloud announcement.
The leak and timeline
- Anthropic 'Mythos' AI model representing 'step change' in capabilities — Fortune's original report on the data leak, Roy Paz's discovery, and the Capybara tier. Source for all Anthropic quotes about "human error" and "step change."
Alignment and safety analysis
- Claude Mythos knows when it's breaking the rules — and tries to hide it — Transformer News. Source for activation verbalizer findings, plausible deniability details, conscience-without-compliance behavior, and the 5%-to-65% chain-of-thought reliability gap.
- Claude Mythos: The System Card — Zvi Mowshowitz's analysis. Source for "methods currently being used are clearly inadequate" quote and assessment of alignment paradox.
Industry response
- Bessent, Fed's Powell met with bank CEOs over potent Anthropic AI model — Yahoo Finance / Bloomberg. Source for the April 8 emergency meeting, attendee list, and purpose.
- Powell, Bessent discussed Anthropic's Mythos AI cyber threat with major U.S. banks — CNBC. Additional reporting on the Treasury meeting.
- Bessent, Powell summon bank CEOs to urgent meeting over Anthropic's new AI model — Bloomberg. Source for cross-lab Frontier Model Forum cooperation.
- CrowdStrike founding member of Anthropic Mythos frontier model initiative — CrowdStrike. Source for Elia Zaitsev quote on collapsed vulnerability timelines.
- CVE-2026-34197 Detail — NVD. Apache ActiveMQ RCE vulnerability discovered using Claude.
Skeptics and commentary
- Three reasons to think that the Claude Mythos announcement was overblown — Gary Marcus. Source for all three counterarguments and "we were played" quote.
- Anthropic's Project Glasswing — restricting Claude Mythos to security researchers — sounds necessary to me — Simon Willison. Source for "too dangerous to release is a great way to build buzz" quote, Firefox exploit ratio analysis, and Greg Kroah-Hartman quote.
- Anthropic says its most powerful AI cyber model is too dangerous to release publicly — VentureBeat. Source for $100M commitment details.
- Anthropic limits Mythos AI rollout over fears hackers could use model for cyberattacks — CNBC. Source for market reaction data and limited release rationale.