WeeklyDispatch.AI
The week in AI: Anthropic's Claude 3.7 Sonnet comes in with a bang, GPT-4.5 with a whimper
Plus: DOGE to use AI to assess federal worker responses
Welcome to The Dispatch! We are the newsletter that keeps you informed about AI. Each Thursday, we aggregate the major developments in artificial intelligence - we pass along the news, useful resources, tools and services; we highlight the top research in the field as well as exciting developments in open source. Even if you aren’t a machine learning engineer, we’ll keep you in touch with the most important developments in AI.
Sorry we are a day late - we wanted to include the GPT-4.5 release in this week’s issue! We’ll be back to our regular schedule next week.
NEWS & OPINION
-------------------------
On Monday, Anthropic announced the long-awaited update to Claude in the form of 3.7 Sonnet, as well as a powerful new tool for developers called Claude Code in research preview. 3.7 Sonnet is the world's first ‘hybrid reasoning’ AI that can combine instant responses with controllable extended thinking capabilities for prompts/tasks that benefit from an advanced reasoning phase. Let’s break down the release:
Claude 3.7 Sonnet is available to free users, albeit without the "extended thinking" mode, in which the AI’s reasoning is displayed via a scratchpad. You need the Pro plan ($20/mo) to use extended thinking. Additionally, API users can precisely control how long Claude thinks, allowing them to balance speed, cost, and quality based on task complexity. It’s the first AI to give users granular control over how much (or how little, or not at all) they want the model to think before responding.
3.7 Sonnet achieves SOTA performance on real-world coding benchmarks and agentic tool use, surpassing competitors like o1, o3-mini, and DeepSeek R1. 3.5 Sonnet was well-known for its coding prowess, and 3.7 builds on that - the model is already on top of the WebDevArena leaderboard.
To that end, some users have had the model create surprisingly functional games from a single prompt, and others tested it against other models where 3.7 Sonnet emerged the clear winner - usually across the board.
Some users are noticing that 3.7 Sonnet really wants to do a good job for you, so it has a tendency to go overboard and do things outside the scope of the task. You’ll probably have to experiment with extended thinking versus normal mode for your specific use case. In our experience, 3.7 Sonnet with extended thinking would add objectives that weren’t in the prompt and then generate thousands of lines of code from a single prompt (running into the message length limit and producing an error), often when much less was required.
Anthropic showed off Claude playing Pokémon in real time via Twitch livestream, with Claude's "thought process" on the left while gameplay appears on the right. 3.7 defeated three gym leaders - while the original Sonnet struggled to leave the starting location. It’s pretty cool to see the thinking process go on as it plays.
For developers, Anthropic also introduced Claude Code, a command-line coding agent that can edit files, read code, and write and run tests. The tool leverages Claude 3.7's reasoning capabilities in an agentic workflow that operates directly within your existing project directories. Claude Code can preload documentation for better context, handle Git operations seamlessly, and create complete functional applications from simple prompts. It’s extremely good at debugging. A convenient cost-tracking feature lets developers monitor usage with the "/cost" command - which is great, because Claude’s API is not cheap.
On the heels of the Sonnet 3.7 release, Anthropic is reportedly set to raise a larger-than-planned $3.5B funding round at a $61.5B valuation, per WSJ.
-------------------------
OpenAI also released GPT-4.5 this week - well, sort of: you can use it if you happen to be paying $200/mo for the Pro subscription. If not, you’re going to have to wait at least a week, and even then there will probably be strict usage rate limits.
You might recall that OpenAI was reportedly in panic mode last year because “Orion”, the codename for GPT-4.5, wasn’t showing the same major gains from traditional scaling methods as its predecessors. Now that we have some early impressions to go by, it looks like there was quite a bit of truth to that story. Let’s start with the bad news for OpenAI’s GPT-4.5:
It’s a big, incredibly expensive, and slow non-reasoning model, providing only marginally better performance than GPT-4o at 30x the cost for input and 15x the cost for output. OpenAI is well aware of these limitations, and it took steps to soften the potential letdown by spelling out the model's shortcomings relative to the more rapid advances in reasoning models in the release post. Diminishing returns from traditional scaling haven’t stopped the cost of running the model from ballooning, but there’s a silver lining to GPT-4.5’s massive size, which we’ll cover below.
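Those multiples follow directly from the per-token prices. A quick sanity check, assuming launch API prices of $75/$150 per million input/output tokens for GPT-4.5 versus $2.50/$10 for GPT-4o - numbers we believe match OpenAI's pricing page at release, but treat them as a snapshot:

```python
# Per-1M-token API prices in USD at launch (assumed snapshot of OpenAI's pricing page)
gpt_4o = {"input": 2.50, "output": 10.00}
gpt_45 = {"input": 75.00, "output": 150.00}

input_multiple = gpt_45["input"] / gpt_4o["input"]     # 30x
output_multiple = gpt_45["output"] / gpt_4o["output"]  # 15x

def cost(prices: dict, in_millions: float, out_millions: float) -> float:
    """Dollar cost of a workload given millions of input/output tokens."""
    return prices["input"] * in_millions + prices["output"] * out_millions

# A modest workload - 1M tokens in, 0.2M tokens out:
print(cost(gpt_4o, 1.0, 0.2))  # 4.5
print(cost(gpt_45, 1.0, 0.2))  # 105.0
```

At these rates the same job costs roughly 23x more on GPT-4.5, which is why "marginally better than GPT-4o" stings.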
CEO Sam Altman admitted, quite bluntly, that the company is out of GPUs and that GPT-4.5 users should temper their expectations - yikes! OpenAI’s best models have always been tremendously inefficient, although the company has made excellent gains in making its legacy/mini models more efficient.
It’s not all bad, though, and the media blitz is probably going to oversell this soft-launch as a disaster, when ultimately it might not be:
ChatGPT now has over 400 million weekly active users. The majority of those aren’t using ChatGPT for STEM tasks - they’re using it to navigate their own personal matters and objectives in a thousand different ways. Describing GPT-4.5 as a more “emotionally intelligent” model might sound like damage control, but by many accounts from early users it actually is - better world knowledge, better creativity and humor, increased contextual awareness, etc. Many of these things are hard to benchmark, but not hard to gauge in the user experience.
OpenAI researchers are convinced that this emotional intelligence is fueled by increased scaling - more data and more compute yield a better, more fundamentally sound grasp of the world. OpenAI also points, rightly, to a looming synergy between bigger, general-purpose, innately smarter models and the emerging reasoning models: “The two approaches to scaling—pre-training and reasoning—will complement each other. As models like GPT‑4.5 become smarter and more knowledgeable through pre-training, they will serve as an even stronger foundation for reasoning and tool-using agents.”
To boost GPT-4.5’s performance, OpenAI experimented with new training techniques, like feeding the model synthetic data generated by its reasoning-focused o-series siblings. That approach makes it much less likely to hallucinate, and the most impressive part of the release blog was a benchmark showing just how much less it hallucinates - a lot. That’s pretty great.
For now, it seems that GPT-4.5 may be the last of its kind - a technological end-game for the unsupervised learning approach that has paved the way for new architectures in AI models, such as inference-time reasoning and perhaps even something more novel, like the diffusion-based language models we’re already seeing at the edges of the model development world.
MORE IN AI THIS WEEK
DOGE will use AI to assess the responses of federal workers who were told to justify their jobs via email
Pew Research Center: US workers are more worried than hopeful about future AI use in the workplace
ChatGPT saved my life (no, seriously, I’m writing this from the ER)
DeepSeek rushes to launch new AI model as China goes all in
Nvidia bounces back from post-earnings slide
Canada watchdog probing X's use of personal data in AI models' training
Grok 3 appears to have briefly censored unflattering mentions of Trump and Musk
1,000 artists (including Kate Bush, Imogen Heap, Max Richter, Hans Zimmer) release ‘silent’ album to protest UK copyright sell-out to AI
Activision finally admits use of GenAI for assets in Call of Duty: Black Ops 6
Meet the journalists training AI models for Meta and OpenAI
10x Your Outbound With Our AI BDR
Imagine your calendar filling with qualified sales meetings, on autopilot. That's Ava's job. She's an AI BDR who automates your entire outbound demand generation.
Ava operates within the Artisan platform, which consolidates every tool you need for outbound:
300M+ High-Quality B2B Prospects, including E-Commerce and Local Business Leads
Automated Lead Enrichment With 10+ Data Sources
Full Email Deliverability Management
Multi-Channel Outreach Across Email & LinkedIn
Human-Level Personalization
TRENDING AI TOOLS, APPS & SERVICES
Replit: released their Agent V2 - more autonomous in building your app, just from a prompt and uses 3.7 Sonnet
Websets by Exa: a deep search product for hard-to-find info that deploys agents for better results, beating Google by over 20x and OpenAI Deep Research by 10x
Hume AI: just released Octave, the first LLM specifically designed for text-to-speech; design any voice with a prompt and control emotion/delivery
Poe Apps: “vibe create” apps without writing any code, thanks to App Creator, built on top of Claude 3.7 Sonnet (good for prototyping)
Scribe: ElevenLabs’ new SOTA speech-to-text model
Ideogram: image generation platform launches 2a - the new model excels at text generation in graphic design
Project Starlight by Topaz Labs: bring old videos back to life with AI
Kosmik: an AI browser for visual research. Drag, drop, & organize files on an infinite canvas. Perfect for students, designers & creatives
Hero Stuff: the fastest way to sell instantly online with AI
GUIDES, LISTS, PRODUCTS, UPDATES, INFORMATIVE, INTERESTING
Introducing Alexa+, the next generation of Alexa - 5 things it can do for you
Microsoft removed usage limits on Copilot's Voice and Think Deeper features, giving all free users unlimited access
Can AI sound too human? Sesame's Maya is as unsettling as it is amazing - try it for free
My girlfriend likes word games, so I built one with AI to propose to her
Perplexity news: announces Comet, an AI browser to challenge Chrome; also for iOS app - coming soon to Android and Mac
'Indiana Jones' jailbreak approach highlights the vulnerabilities of existing LLMs
Just tried ChatGPT deep research to dive into my family history - here’s what happened
VIDEOS, SOCIAL MEDIA & PODCASTS
Consumer electronics company Nothing unveils its upcoming AI-powered Nothing Phone (3a) with an unboxing video by 1X’s recently debuted NEO Gamma humanoid robot [X]
Two AI agents on a phone call realize they’re both AI and switch to a superior audio signal to communicate faster [X] (Hack-a-thon winning project)
OpenAI co-founder Andrej Karpathy: How I use LLMs [YouTube]
Claude 3.7 goes hard for programmers [YouTube]
GPT 4.5 - not so much wow [YouTube]
Discussion on Sesame’s Maya - see above [Reddit]
Open operator, serverless browsers, and the future of computer-using agents [Podcast]
TECHNICAL NEWS, DEVELOPMENT, RESEARCH & OPEN SOURCE
Inception emerges from stealth with Mercury: the first commercial-scale diffusion large language model family (blazing fast vs. traditional autoregressive LLMs; try Mercury Coder)
Microsoft launches Phi-4 small language model (SLM) family update with multimodal version
Big week for Alibaba: Qwen team releases QwQ-Max-Preview, a new reasoning-focused AI - full open-source release coming soon; plus the Tongyi Lab releases Wan2.1, an open-source suite of powerful video generation models that outperform SOTA
Google launches a free version of Gemini Code Assist for individual developers with 180k free monthly code completions
Researchers puzzled by AI that praises Nazis after training on unrelated, insecure code
That’s all for this week! We’ll see you next Thursday.