
The week in AI: OpenAI's o3 surpasses human experts, sparks AGI debate

Plus: Google Search to receive dedicated 'AI mode'


Welcome to The Dispatch! We are the newsletter that keeps you informed about AI. Each Thursday, we aggregate the major developments in artificial intelligence - we pass along the news, useful resources, tools and services; we highlight the top research in the field as well as exciting developments in open source. Even if you aren’t a machine learning engineer, we’ll keep you in touch with the most important developments in AI.

We hope you had a very Merry Christmas! This is a shorter, o3-centric edition due to the holiday. Everyone has big opinions on the o3 release and what it signifies - please read our breakdown below and stay informed!

NEWS & OPINION

-------------------------

“Scaling has hit a wall” - an oft-repeated industry narrative in recent months. You simply can’t keep adding more compute, more data and expect better results, we were hearing. Even Bill Gates had chimed in about a plateau, so it must be true: “GPT-5?! More like GPT-4.1!” To some (maybe even most), it appeared that heading into 2025 there were only incremental improvements left for LLMs to make.

Psyche!

OpenAI bulldozed those claims with the unveiling of o3 and o3-mini, the latest iterations in their family of complex reasoning-focused models. These releases mark a new era in LLM capabilities, extending all the way to research-level tasks, with benchmarks surpassing human expert performance in several domains. Even Anthropic co-founder Jack Clark complimented OpenAI’s work in a blog post on Monday, citing o3 as evidence that “AI progress will be faster in 2025 than in 2024.”

So, how good is o3? The model put up record-breaking numbers across numerous frontier benchmarks in coding, mathematics, and PhD-level science. One standout was its performance on the ARC-AGI benchmark, which had been built specifically around tasks that LLMs perform poorly on; o3 achieved 87.5% on its high-power setting compared to o1's 32% (not to mention just 5% by GPT-4o back in May).

o3 also achieved a score of 87.7% on the GPQA Diamond test (biology, physics, and chemistry); for reference, a PhD holder with internet access typically scores between 34% and 81%. In programming, o3's Codeforces Elo rating is 2727, putting it in the 99.95th percentile of competitive programmers. And in mathematics, o3 attained 96.7% accuracy on the American Invitational Mathematics Exam (AIME), up from 83.3% with o1 and just 13.4% with GPT-4o.

Notably, o3 performed well on both the public and private ARC-AGI datasets, suggesting it wasn’t simply drawing on training data for performance but also able to solve difficult tasks it had not seen before. Additionally, ARC co-founder Francois Chollet confirmed that the model’s high cost was not simply due to brute force methodology to get the correct answer (although spending thousands of dollars on computation to solve one problem might be brutish enough for some!). Both of those things underscore that the model is actually reasoning - or something approximating reasoning - not solely exploiting its training data or raw computational power.

-------------------------

These high performance scores come with astronomical computational costs. o3 can cost a whopping $3,400 per task on its most intensive setting, versus $20 for high-efficiency mode. OpenAI even asked that ARC-AGI not disclose how much they spent running the full benchmark on the high-compute setting - but it was likely over $1 million. 

Worth noting is that these models still do not think the way a human does - and because they lack strong generalization, some problems that seem simple to humans remain beyond o3’s reach. How can it be AGI if it can’t even solve a ‘simple’ puzzle? It appears increasingly likely that the way we have conceived of AGI is overly anthropomorphized, and there won't be a single defining event that leads to a consensus AGI moment. We'll just gradually stop talking or even thinking about it as these systems get better at tackling individual types of tasks, domain by domain - the slow-creep alternative to the traditional "singularity" narrative.

There is much to unpack with the o3 release. LLMs can now perform to human expert standards at many tasks - and these breakthroughs were achieved at an accelerating pace. Will the inference-time compute scaling paradigm continue to deliver new generations every 3 months, versus the 1-2 years of the training-time scaling regime? How will these models perform in the real world beyond their benchmarks? Will o3 models rapidly begin to transform the global economy and disrupt huge numbers of jobs, or is the cost too large a bottleneck to adoption? On which tasks will it be worth spending 170x more compute for incrementally better performance (as with ARC-AGI)? Is this model AGI already? And, of course, might you need to look for a new career?

The o3 models have not yet been released, as the rollout is dependent on safety testing. o3-mini is likely due for release in late January 2025, while the larger o3 model will follow. Early access applications are currently being accepted through January 10. Pricing is still pending…

But it ain’t gonna be cheap.

MORE IN AI THIS WEEK

The future of presentations, powered by AI

Gamma is a modern alternative to slides, powered by AI. Create beautiful and engaging presentations in minutes. Try it free today.

VIDEOS, SOCIAL MEDIA & PODCASTS

  • o3 - wow [YouTube]

  • Marques Brownlee: OpenAI’s SORA vs Google’s VEO 2 AI Generated Videos [YouTube]

  • Building Anthropic - a conversation with their co-founders [YouTube]

  • OpenAI is offering 1 million free tokens for GPT-4o and o1 if you share your API usage with them for training [X]

  • Starting today, you can run your private GGUFs from the Hugging Face hub directly in ollama [X]

  • I built a multimodal AI medical image diagnosis agent using Gemini 2.0 [X]

  • ChatGPT’s rumored infinite memory roll-out - real? [Reddit]

  • PSA - Deepseek v3 outperforms Sonnet at 53x cheaper pricing (API rates) [Reddit]

TECHNICAL NEWS, DEVELOPMENT, RESEARCH & OPEN SOURCE

That’s all for this week! We’ll see you next Thursday.