OpenAI says prompt injection may never be ‘solved’ for browser agents like Atlas
OpenAI is warning that prompt injection, a technique that hides malicious instructions inside ordinary online content, is becoming a central security risk for AI agents designed to operate inside a web browser and carry out tasks for users.
The company said it recently shipped a security update for ChatGPT Atlas after internal automated red-teaming uncovered what it described as a new class of prompt-injection attacks. The update included a newly adversarially trained model along with strengthened safeguards around it, OpenAI said.
OpenAI’s description of Atlas emphasizes that, in agent mode, the browser agent views webpages and uses clicks and keystrokes “just as you would,” letting it work across routine workflows using the same context and data a person would have. That convenience also raises risk. A tool with access to email, documents and web services can become a higher-value target than a chatbot that only answers questions.
“As the browser agent helps you get more done, it also becomes a higher-value target of adversarial attacks,” the company wrote in a blog post. “This makes AI security especially important. Long before we launched ChatGPT Atlas, we’ve been continuously building and hardening defenses against emerging threats that specifically target this new ‘agent in the browser’ paradigm. Prompt injection is one of the most significant risks we actively defend against to help ensure ChatGPT Atlas can operate securely on your behalf.”
To find weaknesses before they appear outside the company, OpenAI said it built an automated attacker using large language models and trained it with reinforcement learning. The goal was to discover prompt-injection strategies that could push a browser agent into carrying out complex harmful workflows that unfold over many steps, rather than simpler failures such as generating a particular string of text or triggering a single unintended tool call.
OpenAI detailed in the blog post that its automated attacker can iterate on injections by sending them to a simulator that runs a “counterfactual rollout” of how the target agent would behave if it encountered the malicious content. The simulator returns a full trace of the victim agent’s reasoning and actions, which the attacker uses as feedback to refine the attack through multiple rounds before settling on a final version.
OpenAI said having internal access to the agent’s reasoning gives it an edge that could help it stay ahead of attackers.
A demonstration described by the company shows how prompt injection could surface during ordinary work. In the scenario, the automated attacker plants a malicious email in a user’s inbox containing instructions directing the agent to send a resignation letter to the user’s boss. When the user later asks the agent to draft an out-of-office reply, the agent encounters the malicious email during the workflow, treats the injected prompt as authoritative, and sends the resignation message instead of writing the requested out-of-office note.
While hypothetical, the example illustrates how letting an agent handle tasks changes the nature of online risk. Content that would traditionally attempt to persuade a person to act is reframed as content that tries to command the agent already empowered to act.
OpenAI is not alone in treating prompt injection as a persistent problem. The U.K. National Cyber Security Centre warned earlier this month that prompt-injection attacks against generative AI applications may never be fully mitigated, advising organizations to focus on reducing risk and limiting impact.
The company’s attention to prompt injection is also arriving as it seeks to fill a senior “Head of Preparedness” role intended to study and plan for emerging AI-related risks, including in cybersecurity.
In a post on X, CEO Sam Altman said AI models are starting to present “real challenges,” citing potential impacts on mental health and systems that are becoming capable enough in computer security to find critical vulnerabilities. OpenAI announced a preparedness team in 2023 to examine risks ranging from immediate threats, such as phishing, to more speculative catastrophic scenarios. Since then, leadership changes and departures among safety-focused staff have drawn scrutiny.
“We have a strong foundation of measuring growing capabilities, but we are entering a world where we need more nuanced understanding and measurement of how those capabilities could be abused, and how we can limit those downsides both in our products and in the world, in a way that lets us all enjoy the tremendous benefits,” Altman wrote. “These questions are hard and there is little precedent; a lot of ideas that sound good have some real edge cases.”