
The week in AI: OpenAI's o3 surpasses human experts, sparks AGI debate

Plus: Google Search to receive dedicated 'AI mode'


Welcome to The Dispatch! We are the newsletter that keeps you informed about AI. Each Thursday, we aggregate the major developments in artificial intelligence - we pass along the news, useful resources, tools and services; we highlight the top research in the field as well as exciting developments in open source. Even if you aren’t a machine learning engineer, we’ll keep you in touch with the most important developments in AI.

We hope you had a very Merry Christmas! This is a shorter, o3-centric edition due to the holiday. Everyone has big opinions on the o3 release and what it signifies - please read our breakdown below and stay informed!

NEWS & OPINION

-------------------------

“Scaling has hit a wall” - an oft-repeated industry narrative in recent months. You simply can’t keep adding more compute, more data and expect better results, we were hearing. Even Bill Gates had chimed in about a plateau, so it must be true: “GPT-5?! More like GPT-4.1!” To some (maybe even most), it appeared that heading into 2025 there were only incremental improvements left for LLMs to make.

Psyche!

OpenAI bulldozed those claims with the unveiling of o3 and o3-mini, the latest iterations in their family of complex reasoning-focused models. These releases mark a new era in LLM capabilities, extending all the way to research-level tasks, with benchmarks surpassing human expert performance in several domains. Even Anthropic co-founder Jack Clark complimented OpenAI’s work in a blog post on Monday, citing o3 as evidence that “AI progress will be faster in 2025 than in 2024.”

So, how good is o3? The model put up record-breaking numbers across numerous frontier benchmarks in coding, mathematics, and PhD-level science. One standout was its performance on the ARC-AGI benchmark, which had been built specifically around tasks that LLMs perform poorly on; o3 achieved 87.5% on its high-power setting compared to o1's 32% (not to mention just 5% by GPT-4o back in May).

o3 also achieved a score of 87.7% on the GPQA Diamond test (biology, physics, and chemistry); for reference, a PhD holder with internet access typically scores between 34% and 81%. In programming, o3's Codeforces Elo rating is 2727, putting it in the 99.95th percentile of competitive programmers. And in mathematics, o3 attained 96.7% accuracy on the American Invitational Mathematics Exam (AIME), up from 83.3% with o1 and just 13.4% with GPT-4o.

Notably, o3 performed well on both the public and private ARC-AGI datasets, suggesting it wasn’t simply drawing on training data for performance but also able to solve difficult tasks it had not seen before. Additionally, ARC co-founder Francois Chollet confirmed that the model’s high cost was not simply due to brute force methodology to get the correct answer (although spending thousands of dollars on computation to solve one problem might be brutish enough for some!). Both of those things underscore that the model is actually reasoning - or something approximating reasoning - not solely exploiting its training data or raw computational power.

-------------------------

These high performance scores come with astronomical computational costs. o3 can cost a whopping $3,400 per task on its most intensive setting, versus $20 for high-efficiency mode. OpenAI even asked that ARC-AGI not disclose how much they spent running the full benchmark on the high-compute setting - but it was likely over $1 million. 

Worth noting is that these models still do not think the way a human does - and because they lack strong generalization, some problems that seem simple to humans remain beyond o3’s reach. How can it be AGI if it can’t even solve a ‘simple’ puzzle? It appears increasingly likely that the way we have conceived of AGI is overly anthropomorphized, and there won't be a single defining event that leads to a consensus AGI moment. We'll just gradually stop talking or even thinking about it as these systems get better at tackling individual types of tasks, domain by domain - the slow-creep alternative to the traditional "singularity" narrative.

There is much to unpack with the o3 release. LLMs can now perform to human expert standards at many tasks - and these breakthroughs were achieved at an accelerating pace. Will the inference-time compute scaling paradigm continue to deliver new generations every 3 months, versus the 1-2 years of the training-time scaling regime? How will these models perform in the real world beyond their benchmarks? Will o3 models rapidly begin to transform the global economy and disrupt huge numbers of jobs, or is the cost too large a bottleneck to adoption? On which tasks will it be worth spending 170x more compute for incrementally better performance (as with ARC-AGI)? Is this model AGI already? And, of course, might you need to look for a new career?

The o3 models have not yet been released, as the rollout is dependent on safety testing. o3-mini is likely due for release in late January 2025, while the larger o3 model will follow. Early access applications are currently being accepted through January 10. Pricing is still pending…

But it ain’t gonna be cheap.

MORE IN AI THIS WEEK

The future of presentations, powered by AI

Gamma is a modern alternative to slides, powered by AI. Create beautiful and engaging presentations in minutes. Try it free today.

VIDEOS, SOCIAL MEDIA & PODCASTS

  • o3 - wow [YouTube]

  • Marques Brownlee: OpenAI’s SORA vs Google’s VEO 2 AI Generated Videos [YouTube]

  • Building Anthropic - a conversation with their co-founders [YouTube]

  • OpenAI is offering 1 million free tokens for GPT-4o and o1 if you share your API usage with them for training [X]

  • Starting today, you can run your private GGUFs from the Hugging Face hub directly in ollama [X]

  • I built a multimodal AI medical image diagnosis agent using Gemini 2.0 [X]

  • ChatGPT’s rumored infinite memory roll-out - real? [Reddit]

  • PSA - Deepseek v3 outperforms Sonnet at 53x cheaper pricing (API rates) [Reddit]

TECHNICAL NEWS, DEVELOPMENT, RESEARCH & OPEN SOURCE

That’s all for this week! We’ll see you next Thursday.