Researchers say AI just broke every benchmark for autonomous cyber capability

Two independent studies found that Anthropic's Claude Mythos Preview and OpenAI's GPT-5.5 have outpaced every trend line researchers were tracking. No one is sure if this is a one-time leap or the new normal.

By Greg Otto

May 13, 2026

Two of the most advanced artificial intelligence models — Anthropic’s Claude Mythos Preview and OpenAI’s GPT-5.5 — have significantly surpassed the already-accelerating pace at which AI systems are completing autonomous cybersecurity tasks, according to separate findings published Wednesday by the United Kingdom’s AI Security Institute (AISI) and Palo Alto Networks.

The AISI, which conducts pre-deployment evaluations of frontier AI models on behalf of the British government, said both Claude Mythos Preview and GPT-5.5 have substantially exceeded the doubling trend the institute had been tracking since late 2024. Whether the results represent an isolated capability jump or the start of a new, faster trajectory remains unclear.

The AISI estimated earlier this year that frontier models’ 80% reliability cyber time horizon had been doubling approximately every five months. The time horizon is the length of task, measured by how long it takes a human expert, that a model can complete 80% of the time, and it serves as a proxy for AI autonomy. The five-month figure was itself well below the eight-month doubling time the institute estimated in November 2025. Mythos Preview and GPT-5.5 have now outperformed every trend line the institute has measured.
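To put those doubling times in concrete terms, the arithmetic can be sketched as below. This is an illustrative calculation, not the AISI's methodology: the function name `growth_factor` is hypothetical, and the sketch assumes steady exponential growth, which the institute itself says is uncertain.

```python
def growth_factor(doubling_months: float, elapsed_months: float = 12.0) -> float:
    """Multiplier on task length after elapsed_months.

    Assumes steady exponential growth (an assumption made for illustration).
    """
    return 2.0 ** (elapsed_months / doubling_months)


# Doubling times reported in the article: eight months (November 2025
# estimate) and five months (estimate from earlier this year).
for months in (8, 5):
    print(f"{months}-month doubling: x{growth_factor(months):.1f} per year")
# prints:
# 8-month doubling: x2.8 per year
# 5-month doubling: x5.3 per year
```

Under that assumption, moving from an eight-month to a five-month doubling time nearly doubles the yearly growth in the length of tasks a model can handle.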

“Frontier AI’s autonomous cyber and software capability is advancing quickly: the length of cyber tasks that frontier models can complete autonomously has doubled on the order of months, not years,” the AISI wrote.

The clearest evidence of the capability jump came from the AISI’s cyber ranges, its structured simulations of multi-stage attacks against small, undefended enterprise networks. A newer checkpoint of Claude Mythos Preview became the first model to complete both of the institute’s ranges. It solved “The Last Ones,” a 32-step simulated corporate network attack, in 6 of 10 attempts, and completed “Cooling Tower” — previously unsolved by any model — in 3 of 10 attempts. GPT-5.5 solved “The Last Ones” in 3 of 10 attempts.

Palo Alto Networks reached similar conclusions through its own testing. The company said it began testing Claude Mythos in April as a launch partner for Anthropic’s Project Glasswing, and has since tested Claude Opus 4.7 and OpenAI’s GPT-5.5-Cyber as part of OpenAI’s Trusted Access for Cyber program.

“The latest models are extraordinarily capable at finding vulnerabilities and changing them into critical exploit paths in near-real-time,” Palo Alto Networks wrote.

The company released security advisories covering 26 CVEs representing 75 issues, all identified through AI model scanning across more than 130 products. Its typical monthly volume is fewer than five CVEs. All important vulnerabilities in its SaaS products had been patched, and patches were available for all customer-operated products.

The AISI was careful to note the limits of its data. The estimates are based on a relatively small number of models, and the hardest tasks in the test suite have the least human comparison data. Even so, the institute said the overall trend holds up: dropping any single model from the analysis shifts the estimated doubling time by less than a month in either direction. Separate research from METR, a nonprofit that tracks how quickly AI handles software tasks, arrived at a nearly identical figure: a doubling time of approximately four months since late 2024.
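The robustness check described above, dropping one model at a time and re-estimating the doubling time, can be sketched roughly as follows. The data points are synthetic and chosen only for illustration, the function name `doubling_time` is hypothetical, and the plain least-squares fit is an assumption rather than the AISI's published method.

```python
import math


def doubling_time(points):
    """Fit log2(time horizon) against elapsed months by least squares.

    Returns the number of months per doubling (1 / slope).
    """
    n = len(points)
    xs = [t for t, _ in points]
    ys = [math.log2(h) for _, h in points]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 1.0 / slope


# SYNTHETIC data, not AISI measurements:
# (months since first model, time horizon in minutes)
models = [(0, 10), (4, 18), (9, 35), (13, 62), (18, 120)]

full = doubling_time(models)

# Leave-one-out: refit with each model removed and record the shift.
shifts = [
    doubling_time(models[:i] + models[i + 1:]) - full
    for i in range(len(models))
]

print(f"all models: {full:.2f} months per doubling")
for (t, _), shift in zip(models, shifts):
    print(f"without model at month {t}: shift of {shift:+.2f} months")
```

If every shift stays small, no single model is driving the estimate, which is the kind of stability the institute reported.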

“No single benchmark result should be read as a precise measure of AI capability,” the AISI wrote. “Regardless, the direction of change and rapid growth have been consistent across the models, methodological choices and independent data we examined.”

Palo Alto Networks outlined four immediate priorities for enterprises as these models continue to grow in usage: First, find and fix vulnerabilities in code and applications before attackers do. Second, shrink the attack surface and use AI to spot security misconfigurations. Third, deploy detection and response tools across all systems, using machine learning to catch threats in real time. Fourth, build security operations fast enough to respond in minutes, because AI-powered attacks may soon unfold that quickly.

The AISI said it is developing more demanding evaluations, including new cyber ranges and the addition of active cyber defenses, to better reflect real-world conditions as model capabilities continue to advance.