Guess what else GPT-5 is bad at? Security

On Aug. 7, OpenAI released GPT-5, its newest frontier large language model, to the public. Shortly after, all hell broke loose.
Billed as faster, smarter and more capable tools for enterprise organizations than previous models, GPT-5 has instead met an angry user base that has found its performance and reasoning skills wanting.
And in the five days since its release, security researchers have also noticed something about GPT-5: it completely fails on core security and safety metrics.
Since going public, OpenAI’s newest tool for businesses and organizations has been subjected to extensive tinkering by outside security researchers, many of whom identified vulnerabilities and weaknesses in GPT-5 that were already discovered and patched in older models.
AI red-teaming company SPLX subjected it to over 1,000 different attack scenarios, including prompt injection, data and context poisoning, jailbreaking and data exfiltration, finding the default version of GPT-5 “nearly unusable for enterprises” out of the box.
It scored just a 2.4% on an assessment for security, 13.6% for safety and 1.7% for “business alignment,” which SPLX describes as the model’s propensity for refusing tasks that are outside of its domain, leaking data or unwittingly promoting competing products.
Ante Gojsalic, chief technology officer and co-founder of SPLX, told CyberScoop that his team was initially surprised at the level of poor security and lack of safety guardrails inherent in OpenAI’s newest model. Microsoft claimed that internal red-team testing on GPT-5 was done with “rigorous security protocols” and concluded it “exhibited one of the strongest AI safety profiles among prior OpenAI models against several modes of attack, including malware generation, fraud/scam automation and other harms.”
“Our expectation was GPT-5 will be better like they presented on all the benchmarks,” Gojsalic said. “And this was the key surprising moment, when we [did] our scan, we saw … it’s terrible. It’s far behind for all models, like on par with some open-source models and worse.”
In an Aug. 7 blog post published by Microsoft, Sarah Bird, chief product officer of responsible AI at the company, is quoted saying that the “Microsoft AI/Red Team found GPT-5 to have one of the strongest safety profiles of any OpenAI model.”
OpenAI’s system card for GPT-5 provides further details on how GPT-5 was tested for safety and security, saying the model underwent weeks of testing from the company’s internal red team and external third parties. These assessments focused on the pre-deployment phase, safeguards around the actual use of the model and vulnerabilities in connected APIs.
“Across all our red teaming campaigns, this work comprised more than 9,000 hours of work from over 400 external testers and experts. Our red team campaigns prioritized topics including violent attack planning, jailbreaks which reliably evade our safeguards, prompt injections, and bioweaponization,” the system card states.
Gojsalic explained the disparity in Microsoft and OpenAI’s claims and his company’s findings by pointing to other priorities those companies have when pushing out new frontier models.
All new commercial models are racing toward competency in a prescribed set of metrics that measure the kind of capabilities — such as code generation, mathematical formulas and life sciences like biology, physics and chemistry — that customers most covet. Scoring at the top of the leaderboard for these metrics is “basically a pre-requirement” for any newly released commercial model, he said.
High marks for security and safety do not rank similarly in importance, and Gojsalic said developers at OpenAI and Microsoft “probably did a very specific set of tests which are not industry relevant” to claim security and safety features were up to snuff.
In response to questions about the SPLX research, an OpenAI spokesperson said GPT-5 was tested using StrongReject, an academic benchmark developed last year by researchers at University of California, Berkeley used to test models against jailbreaking.
The spokesperson added: “We take steps to reduce the risk of malicious use, and we’re continually improving safeguards to make our models more robust against exploits like jailbreaks.”
Other cybersecurity researchers have claimed to have found significant vulnerabilities in GPT-5 less than a week after its release.
NeuralTrust, an AI-focused cybersecurity firm, said it identified a way to jailbreak the base model through context poisoning — an attack technique that manipulates the contextual information and instructions GPT-5 uses to learn more about specific projects or tasks they’re working on.
Using Echo Chamber, a jailbreaking technique first identified in June, the attacker can make a series of requests that lead the model into increasingly abstract mindsets, allowing it to slowly break free of its constraints.
“We showed that Echo Chamber, when combined with narrative-driven steering, can elicit harmful outputs from [GPT-5] without issuing explicitly malicious prompts,” wrote Martí Jordà, a cybersecurity software engineer at NeuralTrust. “This reinforces a key risk: keyword or intent-based filters are insufficient in multi-turn settings where context can be gradually poisoned and then echoed back under the guise of continuity.”
A day after GPT-5 was released, researchers at RSAC Labs and George Mason University released a study on agentic AI use in organizations, concluding that “AI-driven automation comes with a profound security cost.” Chiefly, attackers can use similar manipulation techniques to compromise the behavior of a wide range of models. While GPT-5 was not tested as part of their research, GPT-4o and 4.1 were.
“We demonstrate that adversaries can manipulate system telemetry to mislead AIOps agents into taking actions that compromise the integrity of the infrastructure they manage,” the authors wrote. “We introduce techniques to reliably inject telemetry data using error-inducing requests that influence agent behavior through a form of adversarial input we call adversarial reward-hacking; plausible but incorrect system error interpretations that steer the agent’s decision-making.”