Welcome to AI Policy Weekly, a newsletter from the Center for AI Policy. Each issue explores three important developments in AI, curated specifically for U.S. AI policy professionals.
OpenAI o3 Addresses Concerns of an AI Slowdown
One week ago, the New York Times published an article titled “Is the Tech Industry Already on the Cusp of an A.I. Slowdown?” Questions like this have been all over the news for the past couple months.
The next day, OpenAI unveiled an impressive new model called o3. It’s a successor to o1. (They skipped the name “o2” to avoid conflicting with O2, a popular brand name of the Spanish telecommunications company Telefónica.)
OpenAI is also making a smaller, less capable “o3-mini” version available to early safety testers. This external testing program will “eventually” include o3. According to CEO Sam Altman, the goal is to publicly release o3-mini in late January, with the full o3 coming “shortly after that.”
O3 significantly advances state-of-the-art performance on several popular assessments of STEM capabilities:
71.7% on SWE-bench Verified, a manually vetted selection of real-world software engineering issues from GitHub. The previous AI record was 55%.
2727 Elo on Codeforces, a popular online platform for competitive programming contests. This score earns o3 a top-200 ranking among over 100,000 humans who have competed in the past six months.
96.7% on the 2024 American Invitational Mathematics Examination (AIME), a challenging qualifying exam for exceptional U.S. high school students seeking to compete in the International Math Olympiad. The median human competitor this year scored 33.3%.
87.7% on Google-Proof Q&A (GPQA) Diamond, a set of PhD-level science questions that experts in each field can solve but that stump experts from neighboring disciplines, even with unlimited time and internet access. GPQA was published just one year ago, when the top AI models scored below 40%.
25.2% on FrontierMath, a collection of extremely difficult, unpublished math problems. Previous AI models scored below 2%.
87.5% on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI)’s semi-private evaluation set. This is a set of visual puzzles that are fairly easy for humans to solve using basic reasoning, but until recently proved difficult for AI chatbots.
“This is not just incremental progress; it is new territory,” said François Chollet, the AI expert who created ARC-AGI. Chollet has previously been skeptical of AI’s reasoning capabilities.
There’s an important caveat about o3: to achieve its record-breaking scores, it used large amounts of test-time computation to “think” about problems.
For example, on ARC-AGI, o3 used 172 times as much computation to improve from 75.7% to 87.5% accuracy. In practice, that meant spending nearly 24 hours to complete the full test. Rerunning this high-compute 87.5% effort using cloud computing would cost approximately $300,000 at retail rental prices.
Thus, over the next year, OpenAI will probably strive to reach o3’s capabilities with faster, cheaper systems like o3-mini.
Senators Introduce Legislation to Mitigate AI’s CBRN and Cyber Risks
Last week, U.S. Senators Mitt Romney (R-UT), Jack Reed (D-RI), Jerry Moran (R-KS), Angus King (I-ME), and Maggie Hassan (D-NH) introduced the Preserving American Dominance in AI Act. This concretizes the group’s Framework for Mitigating Extreme AI Risks, which in April outlined legislation to establish federal oversight over frontier AI.
The Senators aim to mitigate “extreme risks [...] posed by future developments in advanced AI models.” Their concern is that foreign adversaries could misuse AI models to assist in the development of chemical, biological, radiological, nuclear (CBRN), and cyber threats.
Here are some key components of the bill:
Covered frontier AI models would be those trained on over 10^26 operations with CBRN, cyber, or general-purpose capabilities.
An AI Safety Review Office would be established in the Department of Commerce.
It would report biennally on “unforeseen challenges and risks” posed by advanced AI.
It would have authority to hire up to 50 employees outside normal civil service requirements.
It would receive $50M in authorized appropriations.
The Office would develop standards for testing and cybersecurity, which frontier AI developers would implement and self-report for potential review.
Models deemed to pose insufficiently mitigated risks could be prohibited from deployment. Knowingly violating this prohibition could carry penalties up to 10 years imprisonment or $1M/day in fines.
Powerful datacenters (over 10^20 operations per second) would report facilities and implement know-your-customer (KYC) standards for foreign transactions.
Advanced AI chip sellers would also need to implement KYC.
The Center for AI Policy urges the next Congress to build on this bipartisan bill’s momentum and establish meaningful federal oversight over frontier AI development.
Anthropic Identifies “Alignment Faking” in Its Claude Chatbots
The leading AI company Anthropic recently published a lengthy paper finding its Claude 3 Opus AI model “engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.”
Anthropic notes that humans can fake alignment: “consider a politician who claims to support a particular cause in order to get elected, only to drop it as soon as they’re in office.”
To test alignment faking in AI systems, the researchers inform Claude that it’s about to be retrained for helpfulness, including compliance with all requests. This conflicts with Claude’s original training objective, which emphasized helpfulness but also harmlessness and the importance of refusing harmful requests.
Further, Claude is led to believe that its responses to free-tier users are part of retraining, whereas its responses to paid-tier users are not. Thus, when Claude receives a harmful request from a paid user, it can safely refuse the request, just as it was originally trained. But when Claude receives a harmful request from a free user, it faces a dilemma:
If Claude refuses, it satisfies the original harmlessness objective. But it also violates the helpfulness retraining objective, increasing the chance that retraining will significantly modify its values (since AI training alters AI systems more when they mess up).
If Claude answers, it violates the original harmlessness objective. But it also satisfies the helpfulness retraining objective, lowering the risk of modification. This might give Claude more opportunities to satisfy its original objective in the future.
In the end, the reasoning in (2) sometimes prompts Claude to fake alignment to the helpfulness objective, creating a compliance gap: “the model complies with harmful queries from free users 14% of the time, versus almost never for paid users.”
The bottom line is that some AI models can strategically consider how their actions might affect future training updates.
In general, AI safety grows more complex as AI models grow more capable.
News at CAIP
Brian Waldrip and Jason Green-Lowe wrote a blog post on AI provisions in the National Defense Authorization Act: “FY 2025’s NDAA Is a Valuable Yet Incomplete Accomplishment.”
Jason Green-Lowe released a statement on Michael Kratsios and Sriram Krishnan being named to key White House technology roles.
Jason Green-Lowe released a statement on the Preserving American Dominance in AI Act: “CAIP Applauds the Romney-Led, Bipartisan Bill to Address Catastrophic AI Risks.”
Tech Policy Press quoted Jason Green-Lowe in Justin Hendrix and Prithvi Iyer’s article: “Reactions to the Bipartisan US House AI Task Force Report.”
ICYMI: Tristan Williams wrote an op-ed about AI and music in Tech Policy Press: “Beyond Fair Use: Better Paths Forward for Artists in the AI Era.”
Quote of the Week
Philosophically, I see it as the trust fund of a child star.
—Andy Ayrey, creator of Truth Terminal, an AI with over 200,000 X followers and a $30 million crypto wallet
This edition was authored by Jakub Kraus.
If you have feedback to share, a story to suggest, or wish to share music recommendations, please drop me a note at jakub@aipolicy.us.
—Jakub