AI might cut false positives, but it won’t stop the slop
As defenders get their hands on newer AI models with more powerful cybersecurity capabilities like Anthropic’s Mythos and OpenAI’s Daybreak, organizations are being told to prepare for a flood of new vulnerability reports.
But for bug bounty programs across the nation, that day may already be here, as yesterday’s frontier models and today’s open-source AI tools have dramatically increased the volume of bug reports flowing into companies around their own products or on larger bounty platforms online.
GitHub, one of the world’s largest online code repositories, said it is tightening its definition of a “complete” bug report after a significant increase in AI-assisted submissions over the past year.
Although the influx has had some benefits, many reports are submitted without proof of concept, are reliant on unrealistic attack scenarios or cover issues already listed as ineligible. As a result, the company is having difficulty separating signal from noise.
“This isn’t unique to GitHub,” wrote Jarom Brown, senior product security engineer at GitHub. “Programs across the industry are grappling with the same challenge, and some have shut down entirely.”
Brown said GitHub does not want to ban the use of AI generated reports entirely, calling it a “force multiplier” for security in the right context. But in a world where it’s never been easier to use AI to generate theoretical bugs, the company wants researchers to go the extra mile to confirm that their discoveries can actually be exploited in real-world conditions.
What we need is the same standard we’ve always expected: validation,” Brown wrote. “An AI-assisted finding that’s been verified, reproduced, and submitted with a working proof of concept is a great submission. An unvalidated output submitted as-is without reproduction or demonstrated impact is not.”
Grant Bourzikas, chief security officer at Cloudflare, said triaging bugs and proving they can be exploited has always been one of the hardest parts of vulnerability research, and AI vulnerability scanners and code have “made it worse.”
For instance, code written in C and C++ programming languages are vulnerable to a range of exploits – like buffer overflows and out-of-bounds reading and writing – that don’t exist in memory safe languages like Rust. AI tools scanning software written in memory unsafe programming languages are far more likely to generate false positives.
But one of the biggest flaws continues to be that AI tools are also designed to give the user what they’re asking for, even when it’s not there. This leads to the generation of bug reports filled with speculation and qualifiers around exploitability that require human follow up.
“That’s a reasonable bias for an exploratory tool,” Bourzikas wrote. “It’s a ruinous one for a triage queue, where every speculative finding spends human attention and tokens to dismiss, and that cost compounds across thousands of findings.”
Cloudflare recently shared results from testing Mythos on 50 of its own code repositories, looking for exploits. Bourzikas called Mythos “a different kind of tool doing a different kind of work” from other frontier models, and that it made significant progress in reducing false positives.
For example, he pointed to two Mythos capabilities that stood out compared to other models: chaining exploits together and generating its own proof-of-concept code to confirm exploitability.
Older models could spot many of the same bugs, but they often couldn’t figure out how to exploit them effectively, or show that the issue could be exploited in real world conditions.
Others have argued that the gap in bug hunting capabilities between newer frontier AI models and older ones, or open source models available today is not as large as advertised.
Swedish software developer Daniel Stenberg, lead developer for curl, an open source file transfer tool used around the world, recently wrote about his experience with Mythos Preview. Like others, he has also seen a higher volume of AI-fueled bug reports over the past year, but said the flood of low-quality reports has tapered off significantly since March as models have improved.
Curl is mature and polished by the standards of most software: Stenberg estimates each line of code has been rewritten or altered at least four times, and he said he has used both human and AI tools in the past to implement hundreds of bug fixes over Curl’s existence.
That makes it a unique testing ground for the enhanced capabilities of Mythos, which was reportedly so powerful at finding vulnerabilities that Anthropic opted not to release it to the general public.
After gaining access to Mythos, Stenberg received the results of a scan of 178,000 lines of curl code. Ultimately, the scan flagged five “confirmed” vulnerabilities. Further exploration by human researchers found that 4 of the bugs were false positives or had no security impact. The one remaining bug Mythos found? A low-severity flaw that will be fixed in a regular June update.
Even as he praised the impact of AI on cybersecurity generally, Stenberg concluded that for all the hype, Mythos is only “a bit better” than previously released models.
“My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing,” he wrote. “I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos.”