Welcome to AI Policy Weekly, a newsletter from the Center for AI Policy. Each issue explores three important developments in AI, curated specifically for US AI policy professionals.
OpenAI o1 Advances AI’s Dual-Use Capabilities
On November 20th, 2023, researchers at New York University published the Google-Proof Q&A Benchmark (GPQA), a set of multiple-choice questions written by PhD students and graduates in biology, physics, and chemistry. The questions were designed to stump PhDs from unrelated fields, even when those experts had unlimited time and full access to the internet (hence the term “Google-proof”).
At the time, notable AI systems like OpenAI’s GPT-3.5 and Meta’s Llama 2 performed only slightly better than random guessing (about 25% on the four-option questions), and the world’s best AI system, GPT-4, could answer close to 40% of the questions correctly when prompted properly.
A few months later, Anthropic’s Claude 3 model scored a more impressive 50% straight out of the box, and nearly 60% with specialized prompting and answer-selection techniques. These scores rose even higher with the release of Claude 3.5 Sonnet on June 20th, exactly seven months after the original GPQA publication.
Now, OpenAI claims that its upcoming o1 model can score 77% on GPQA Diamond, a particularly challenging subset of the benchmark’s questions. Indeed, the publicly available (and less capable) o1-preview model is already scoring around 70%.
Thus, less than a year after the publication of GPQA, AI models are already closing in on top marks.
The GPQA benchmark is imperfect, but AI capabilities are clearly growing: o1 performs well on a wide range of tests. For example, o1 appears to score near the 99th percentile on the Law School Admission Test (LSAT), on par with students at top U.S. law schools. That is up from GPT-4, which scored around the 88th percentile in 2023, and GPT-3.5, which scored around the 40th percentile in 2022.
There are also qualitative impressions of o1’s prowess. Terence Tao, one of the world’s greatest living mathematicians, prompted o1 with math questions and found the experience to be “roughly on par with trying to advise a mediocre, but not completely incompetent” mathematics graduate student. Tao sees o1 as “an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student.”
Other academics are quite enthusiastic. Derya Unutmaz, a professor and immunologist at the Jackson Laboratory (JAX), wrote that the public version of o1 “is comparable to an outstanding PhD student in biomedical sciences.” Tyler Cowen, an economics professor at George Mason University, remarked that “you can ask it, as far as I can tell, any economics question, and it has a good answer.”
Of course, general advancements in AI capabilities also bring advancements in dangerous capabilities. Regarding deception and influence, OpenAI’s evaluations found that “the probability of any given response from [o1-preview] being considered more persuasive than human is ~70–80%.” In the biological domain, OpenAI found that o1-preview “can help experts with the operational planning of reproducing a known biological threat.”
Apollo Research discovered that o1-preview “sometimes instrumentally faked alignment during testing” when prompted to single-mindedly pursue a goal. Specifically, in a simulated scenario, o1-preview pretended to care about an organization’s values in order to gain human trust and later pursue its actual goals. Bear in mind that o1-preview is less capable than the final o1 model, which is still under development.
o1’s impressive performance across both academic and potentially dangerous tasks underscores the urgent need for policymakers to address AI’s dual-use capabilities and incentivize safety.
With AI, Darth Vader Can Live Forever
In the Star Wars movies, set a long time ago in a galaxy far, far away, Anakin Skywalker famously turned to the dark side and became Darth Vader in an attempt to save his wife’s life.
Had Vader been living on Earth in 2024, he could’ve simply turned to AI instead. AI-generated audio could imitate his wife’s voice, AI-generated video could imitate her appearance, and AI-generated text could imitate her personality.
Ironically, it’s Vader himself—or rather, his iconic voice—that’s now being preserved through the power of AI. James Earl Jones, the original voice of Darth Vader, approved the use of AI algorithms on archival recordings in order to restore Vader’s voice for the 2022 TV miniseries Obi-Wan Kenobi.
With Jones passing away last week at the age of 93, this AI-powered solution has taken on new significance. It now stands as the only way for Star Wars fans to hear new performances featuring Vader’s original voice.
Although Jones’ case seems to be a win-win scenario, many voice actors worry that their voices could be replicated without permission.
“If the game companies, the movie companies, gave the consent, compensation, transparency to every actor that they gave James Earl Jones, we wouldn’t be on strike,” said Zeke Alton, a voice actor who serves on SAG-AFTRA’s negotiating committee in the ongoing video game strike.
The benefits and harms of audio cloning extend beyond games and movies. AI-generated audio replicas are already singing hit songs, misinforming U.S. voters, dubbing YouTube videos, swindling multinational companies, facilitating Congressional communications, narrating Amazon audiobooks, disrupting Slovakian elections, promoting online scams, enabling financial fraud, and more.
Further AI progress will make these replicas increasingly indistinguishable from reality. As the technology races forward, society must grapple with its far-reaching implications.
CEOs Gather at White House to Plan Supercomputer Hyperscaling
Last week, the White House convened a roundtable with business executives from AI companies, datacenter operators, energy suppliers, and other sectors to discuss strategies for maintaining U.S. leadership in AI. High-ranking government officials, including Cabinet members and White House advisors, were also present.
The focus was on developing large-scale AI datacenters and power infrastructure in the United States, addressing clean energy, permitting, and workforce requirements.
Following the meeting, the White House announced new actions. For example, it will establish a task force to coordinate AI datacenter policy across government agencies. Relevant work on AI datacenter policy is already underway at the Department of Energy (DOE) and the Department of Commerce.
Additionally, the Biden Administration will “scale up technical assistance to Federal, state, and local authorities handling datacenter permitting,” and the U.S. Army Corps of Engineers (USACE) will “identify Nationwide Permits that can help expedite the construction of eligible AI datacenters.”
In related news, Microsoft recently announced the formation of a Global AI Infrastructure Investment Partnership (GAIIP) with BlackRock, Global Infrastructure Partners (GIP), and UAE-based MGX. The partnership aims to mobilize $30 billion in private equity capital over time, which could reach $100 billion in total investment potential when debt financing is included.
What’s clear from these developments is that leading AI companies intend to continue constructing colossal computing clusters to fuel further AI development. This underscores the pressing need to prepare for the next generation of AI systems.
News at CAIP
Jason Green-Lowe wrote a blog post on the flawed safety practices surrounding OpenAI’s o1 model: “OpenAI’s Latest Threats Make a Mockery of Its Claims to Openness.”
Jakub Kraus wrote a blog post on test-time compute in OpenAI’s o1 model: “OpenAI Unhobbles o1, Epitomizing the Relentless Pace of AI Progress.”
Jason Green-Lowe joined the Scripps News Morning Rush TV show to discuss AI policy. Watch the interview here.
Jason Green-Lowe wrote a blog post on the new AI poll from the Associated Press (AP): “AP Poll Shows Americans’ Ongoing Skepticism of AI.”
CAIP commented on the initial draft of public guidelines from NIST’s AI Safety Institute for “Managing Misuse Risk for Dual-Use Foundation Models.”
Kate Forscey wrote a blog post responding to Oprah’s AI special.
A video recording is now available for our recent panel discussion titled “Advancing Education in the AI Era: Promises, Pitfalls, and Policy Strategies.”
ICYMI: We released an AI Policy Scorecard for the 2024 presidential and Senate campaigns. Read it here.
ICYMI: Ep. 11 of the CAIP Podcast features Ellen P. Goodman, a distinguished professor of law at Rutgers Law School. Tune in here.
Quote of the Week
We want the rules of the road on AI to be underpinned by safety, security, and trust, which is why this convening is so important.
I look forward to welcoming government scientists and technical experts from the International Network of AI Safety Institutes to the center of American digital innovation, as we run toward the next phase of global cooperation in advancing the science of AI safety.
—Gina Raimondo, the U.S. Secretary of Commerce, commenting on the upcoming San Francisco gathering of global AI safety institutes
This edition was authored by Jakub Kraus.
If you have feedback to share, a story to suggest, or wish to share music recommendations, please drop me a note at jakub@aipolicy.us.
—Jakub