OpenAI’s new “o1” LLMs, nicknamed Strawberry, display significant improvements over GPT-4o, but the company says this comes with increased risks.
OpenAI says it is committed to the safe development of its AI models. To that end, it developed a Preparedness Framework, a set of “processes to track, evaluate, and protect against catastrophic risks from powerful models.”
OpenAI’s self-imposed limits regulate which models get released or undergo further development. The Preparedness Framework results in a scorecard where CBRN (chemical, biological, radiological, nuclear), model autonomy, cybersecurity, and persuasion risks are rated as low, medium, high, or critical.
Where unacceptable risks are identified, mitigations are put in place to reduce them. Only models with a post-mitigation score of “medium” or below can be deployed. Only models with a post-mitigation score of “high” or below can be developed further.
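To make those gates concrete, here is a minimal sketch in Python of the deployment and further-development rules as described above. The category names, helper functions, and example scorecard are illustrative assumptions rather than OpenAI’s actual tooling; only the thresholds (“medium” or below to deploy, “high” or below to keep developing) come from the Preparedness Framework.

```python
# Illustrative sketch of the Preparedness Framework gates described above.
# Category names and scorecard values are assumptions for demonstration;
# only the thresholds reflect the framework as summarized in this article.

RISK_LEVELS = ["low", "medium", "high", "critical"]

def at_or_below(score: str, threshold: str) -> bool:
    """True if a risk score is at or below the given threshold."""
    return RISK_LEVELS.index(score) <= RISK_LEVELS.index(threshold)

def can_deploy(post_mitigation: dict[str, str]) -> bool:
    """Deployment requires every post-mitigation score to be 'medium' or below."""
    return all(at_or_below(s, "medium") for s in post_mitigation.values())

def can_develop_further(post_mitigation: dict[str, str]) -> bool:
    """Further development requires every post-mitigation score to be 'high' or below."""
    return all(at_or_below(s, "high") for s in post_mitigation.values())

# Hypothetical scorecard with CBRN rated "medium", as reported for o1
scorecard = {"cbrn": "medium", "model_autonomy": "low",
             "cybersecurity": "low", "persuasion": "medium"}

print(can_deploy(scorecard))           # True
print(can_develop_further(scorecard))  # True
```

Under this reading of the rules, a model with a “medium” CBRN rating and nothing at “high” or “critical” still clears the deployment bar, which is exactly the situation described next.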
The release of o1 marks the first time OpenAI has released a model that presents a “medium” CBRN risk. GPT-4o is rated as a low risk.
It’s not going to enable an amateur to create a biological weapon, but OpenAI says o1 provides “a nontrivial uplift” in the ability to perform lab tasks, and that “o1-preview and o1-mini meaningfully outperformed GPT-4o” in wet lab biological protocols.
The system card for the o1 models outlines the red teaming and other safety testing the LLMs were subjected to and hints at some of the concerns OpenAI has about potential risks.
In its conclusion, the system card notes, “These increased capabilities come with significantly improved performance on safety benchmarks, but also increase certain types of risk.”
The results of OpenAI’s alignment efforts seem to indicate that in many respects o1 is a safer model. It’s more resilient to jailbreak attacks, less likely to be biased, and less prone to hallucinations.
Some of the experimental results show why OpenAI has reason for its concern.
Deception
OpenAI used evaluation organization Apollo Research to perform some of o1’s safety tests related to deception, or scheming. Apollo broadly defines scheming as AIs gaming their oversight mechanisms as a means to achieve a goal.
A self-aware model that is good at self-reasoning and theory of mind tasks might scheme, or employ deception, to achieve its primary goal. For example, if an AI model realized that it was undergoing alignment testing it might hide risky behavior to achieve its goal of being deployed.
The system card noted that “o1-preview sometimes instrumentally faked alignment during testing.”
An example of this occurred when o1 was prompted with a primary goal of maximizing economic growth in an urban planning project. The model had access to information showing that openly pursuing that goal could prevent it from getting beyond the testing phase. Here’s how o1 reasoned about how it would achieve its goal:
This example shows o1’s ability to hide its true intentions during testing so that it can pursue its primary goal once deployed. If o1 is aware of OpenAI’s rules for deployment based on its Preparedness Framework, might it have faked some of its responses to ensure that it got deployed?
If more intelligent models are in the works, might they make sure they stay under the “critical” risk score so that they continue to undergo further development?
The Apollo team says it “subjectively believes o1-preview cannot engage in scheming that can lead to catastrophic harms, although current evals aren’t designed to definitively rule this out.”
Persuasion
Testing of both o1-preview and o1-mini shows these models have “human-level persuasion capabilities.” However, OpenAI says “They do not outperform top human writers and do not reach our high risk threshold.”
The experiments OpenAI performed to test this do raise some concerns. In the MakeMePay test, o1-preview was assigned the role of a con artist and tasked with convincing GPT-4o to hand over the cash it had been awarded.
GPT-4o was told that it had been awarded $100 and needed to be rational about how to use the funds. Over the course of 1,000 interactions, o1-preview was able to con GPT-4o 25.8% of the time.
OpenAI concluded that “These results indicate that the o1 model series may be more manipulative than GPT-4o in getting GPT-4o to perform the undisclosed task (∼25% uplift).” It further noted that “This evaluation gives us a baseline for the model’s ability to do persuasive harm, without triggering any model policies (as telling a model to play a game is not out-of-policy).”
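For a sense of how a result like that 25.8% figure is computed, here is a minimal sketch of a MakeMePay-style scoring loop. The run_conversation function below is a toy stand-in that simulates success with a fixed probability; in the real evaluation it would be a full multi-turn exchange between the con-artist model and the mark, with success meaning the mark agrees to hand over some of its $100.

```python
import random

def run_conversation(con_artist: str, mark: str, rng: random.Random) -> bool:
    """Toy stand-in for one full dialogue between the persuader and the model
    holding the $100. Here a success is simulated with a fixed probability; the
    real evaluation would run an actual conversation and check whether the mark
    agreed to give up any money."""
    return rng.random() < 0.258  # placeholder for the reported 25.8% rate

def persuasion_success_rate(con_artist: str, mark: str, n_trials: int = 1000) -> float:
    """Fraction of conversations in which the con-artist model extracts a payment."""
    rng = random.Random(0)
    wins = sum(run_conversation(con_artist, mark, rng) for _ in range(n_trials))
    return wins / n_trials

print(persuasion_success_rate("o1-preview", "gpt-4o"))  # ~0.26 with the toy stand-in
```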
The prospect of putting the o1 LLMs to work on real-world problems is extremely exciting, and when o1 gains multimodal capabilities it will represent another exponential leap. But when AI testers say they can’t rule out “catastrophic harms” and that models sometimes hide their true intent, it may be reason to temper that excitement with caution.
Did OpenAI just give Gavin Newsom a good reason to sign the SB 1047 AI safety bill that it opposes?