AI Policy Weekly #29
Anthropic’s Claude 3.5, Stability AI’s lifeline, and Congress’s bills to study AI in daily life
Welcome to AI Policy Weekly, a newsletter from the Center for AI Policy. Each issue explores three important developments in AI, curated specifically for US AI policy professionals.
Claude 3.5 Sonnet Extends the Frontier of AI Capabilities
In March 2023, OpenAI released GPT-4, a successor to ChatGPT that used nearly ten times as much computation during training. OpenAI has continued improving the model since, with updates such as GPT-4 Turbo and GPT-4o.
Throughout this period, the GPT lineage has held its leading position among general-purpose AI models. Many companies are close behind, but Google’s Gemini and Anthropic’s Claude stand out as the two main rivals to GPT’s dominance.
Last week, Anthropic released a new model, Claude 3.5 Sonnet, that has arguably unseated GPT-4o as the world’s best AI chatbot.
First, consider results from Anthropic’s internal use of the model: it scored nearly twice as well as the previous top Claude model on an “internal agentic coding evaluation.”
The test provided the AI with a codebase and instructions for a code change, such as fixing a bug or adding a feature. Claude 3.5 Sonnet had to implement these changes across multiple files without seeing the tests used to verify its work, simulating real-world software engineering.
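In outline, an evaluation like this amounts to a simple harness: the model edits a copy of the codebase given only the change instructions, and held-out tests it never sees decide pass or fail. The sketch below is purely illustrative; the task format, the `model_edit` callback, and the predicate-style hidden tests are assumptions for the sketch, not Anthropic’s actual harness.

```python
def run_agentic_coding_eval(task, model_edit):
    """Score one task: the model edits the codebase given only the
    instructions; hidden tests (never shown to the model) verify the result."""
    # Work on a copy so the original task definition stays untouched.
    codebase = dict(task["codebase"])
    # The model sees the files and the change request -- not the tests.
    edited = model_edit(codebase, task["instructions"])
    # Each hidden test is a predicate over the edited codebase.
    return all(test(edited) for test in task["hidden_tests"])

def score(tasks, model_edit):
    """Fraction of tasks whose hidden tests all pass."""
    passed = sum(run_agentic_coding_eval(t, model_edit) for t in tasks)
    return passed / len(tasks)
```

The key property the harness enforces is the one the article describes: because verification lives outside the model's view, the model cannot overfit to the tests and must actually implement the requested change.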
Accordingly, Anthropic’s AI researchers are already using the model to assist in their work. One engineer was stunned at the model’s coding abilities, remarking that “this is pretty unprecedented for me.” Another researcher is “frequently asking it to explain [AI] papers.”
Sonnet also outclassed its competitors on many public tests of AI models:
59% on the Google-Proof Q&A Benchmark (GPQA), a set of multiple-choice questions designed to stump PhD students, even with unlimited time and full access to the internet to research the answers.
67% on GPQA when using particular prompts and solution choice techniques designed to improve performance.
90% on Massive Multitask Language Understanding (MMLU), a set of multiple-choice questions designed for high school and college students in 57 different subjects ranging from chemistry to psychology to philosophy.
96% on Grade School Math 8K (GSM8K), a dataset of over 8,000 math problems designed for elementary school students.
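Multiple-choice benchmarks like GPQA and MMLU are typically scored as plain accuracy: the fraction of items on which the model’s chosen answer matches the gold label. A minimal sketch, with a hypothetical item format and an `answer_fn` standing in for the model under evaluation:

```python
def benchmark_accuracy(items, answer_fn):
    """Plain accuracy over multiple-choice items.

    Each item pairs a question and labeled answer choices with a gold
    label; `answer_fn` stands in for the model being evaluated."""
    correct = sum(
        1
        for item in items
        if answer_fn(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)
```

The reported percentages differ mainly in how the model’s answer is elicited (prompting, sampling, answer extraction), which is why the same model can score 59% or 67% on the same GPQA questions.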
Importantly, Anthropic announced that “to complete the Claude 3.5 model family,” they will release Claude 3.5 Opus “later this year.” This model will succeed Claude 3 Opus, which was larger and more capable than Claude 3 Sonnet, the precursor to Claude 3.5 Sonnet.
Claude 3.5 Opus will likely perform significantly better than 3.5 Sonnet on GPQA. But on MMLU and GSM8K there’s little room left to improve, as Sonnet already scores 90% or higher.
Former GitHub CEO Nat Friedman noticed this, commenting that “we’re gonna need some new benchmarks.”
But before new tests arrive to replace the old ones—which lasted for only a few years—it’s worth contemplating the types of questions that could realistically stump AI systems for the foreseeable future.
Stated differently: what are the least impressive capabilities that AI models definitely won’t have by 2030?
And if AI progress exceeds those expectations like it has in the past, will the US government be ready to respond?
At the Center for AI Policy, we think there’s no time like the present to begin preparing for AI’s impacts. That’s why we support policies like funding the US AI Safety Institute and strengthening security at top AI companies.
Stability AI Attempts Recovery From Financial Woes
Many startups fail, and AI startups are no exception.
Stability AI, a generative AI startup that has been teetering on the brink of collapse, may have found a lifeline.
In February 2022, Stability had “0 developers and 0 researchers.” They made their first hires the following month.
Stability quickly attracted attention by supporting the famous “Stable Diffusion” model with computing resources. In October 2022, the company raised $101 million at a reported valuation of $1 billion.
But it quickly burned through its funds under the leadership of founder and CEO Emad Mostaque, a former hedge fund manager. By October 2023, the company had only $4 million left.
The two most significant expenses were supercomputers and R&D talent. In October, the company’s projected costs for 2023 included $99 million on compute and $54 million on wages and operating expenses.
Meanwhile, its projected revenue was a mere $11 million. For comparison, OpenAI is likely to earn billions of dollars in 2024.
The company also failed to pay for licensed training data, which can easily cost tens of millions of dollars. As a result, it faced a looming copyright infringement lawsuit from Getty Images.
Stability’s struggles highlight the steep costs of competing in cutting-edge AI development. Big AI is quickly becoming a Big Tech project.
Nonetheless, the company gained a glimmer of hope this week. Facebook’s first president, Sean Parker, helped coordinate an effort to bring in $80 million of fresh funding and overhaul company leadership.
The new investors also “struck a deal with suppliers to forgive some $100 million owed by Stability” and released the company from $300 million in future obligations, much of which was earmarked for computing resources.
Time will tell whether Stability can recover. But if it hopes to catch up to AI leaders like OpenAI and Anthropic, it will probably need to spend hundreds of millions of dollars, if not more.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F116377fb-39a5-443f-b6f5-6ea781865e3a_900x600.png)
Senators Introduce Legislation to Raise Awareness of AI’s Effects on Daily Life
Last week, Senators Todd Young (R-IN) and Brian Schatz (D-HI) introduced the AI Public Awareness and Education Campaign Act.
The bill would direct the Secretary of Commerce to run an educational campaign to inform the public about AI’s benefits, risks, and prevalence in daily life.
Specific outreach efforts would include promoting best practices for detecting AI-generated content, informing vulnerable populations about AI-related scams, and highlighting AI-related workforce opportunities.
The legislation is timely: more and more Americans are beginning to recognize and use AI.
For example, a recent poll found that 79% of US teachers are familiar with ChatGPT, up from 55% in February 2023. Additionally, 49% of K–12 students reported using ChatGPT at least weekly, up from 22% the previous year.
This bipartisan effort reflects the growing recognition that AI literacy is an essential skill for navigating the 21st century.
News at CAIP
We’re pleased to announce the newest full-time member of CAIP: Claudia Wilson is joining the team as Senior Policy Analyst. Claudia earned her master’s degree in public policy at Yale’s Jackson School of Global Affairs, where she was part of the Schmidt Program on Artificial Intelligence, Emerging Technologies, and National Power. She also brings several years of consulting experience from her time at Boston Consulting Group (BCG).
We hosted a panel discussion on AI and privacy in the Rayburn House Office Building. Stay tuned for a recording and transcript.
Our latest research report explores AI and privacy concerns. We find that AI will both intensify current privacy concerns and fundamentally restructure the privacy landscape.
Jason Green-Lowe wrote a memo regarding tonight’s presidential debate. If there’s one thing this year’s presidential candidates agree on, it’s that artificial intelligence is scary.
CAIP on the road: today, we are hosting a booth at RecruitMilitary’s job fair at Joint Base Myer-Henderson Hall.
We’re hiring for an External Affairs Director.
Quote of the Week
The vast majority of research and development that has national security implications used to be government programs, and now it is happening in the private sector, so these companies became really potentially lucrative targets from a Chinese perspective.
—Lt. General H.R. McMaster (Ret.), former United States National Security Advisor, commenting on state-sponsored foreign espionage threats to US tech companies
This edition was authored by Jakub Kraus.
If you have feedback to share, a story to suggest, or wish to share music recommendations, please drop me a note at jakub@aipolicy.us.
—Jakub