The week in AI: xAI unveils Grok 3 & OpenAI is behind Anthropic in their own new benchmark
Plus: World's first musculoskeletal robot goes viral for all the wrong reasons
Welcome to The Dispatch! We are the newsletter that keeps you informed about AI. Each Thursday, we aggregate the major developments in artificial intelligence - we pass along the news, useful resources, tools and services; we highlight the top research in the field as well as exciting developments in open source. Even if you aren’t a machine learning engineer, we’ll keep you in touch with the most important developments in AI.
NEWS & OPINION
-------------------------
xAI launched their latest flagship model, Grok 3, in beta Monday night via a live stream on X (about 7 million views as of today) with CEO Elon Musk. There are four models in the Grok 3 family: Grok 3, Grok 3 mini, Grok 3 Thinking and Grok 3 mini Thinking. Musk claims it is the smartest AI on Earth, and Grok 3 is indeed state-of-the-art on some important benchmarks. But it’s getting difficult to properly assess the value of benchmarks in AI at this point: they’re often static, quickly become dated, and are typically narrowly focused on a single capability - a model’s factuality in a single domain, say, or its ability to solve reasoning-based multiple-choice questions.
In the case of Grok 3, xAI also highlighted benchmarks for Grok 3 Thinking in a “best of 64 answers by consensus” (cons@64) fashion, which distorts the numbers since none of the competitors’ cons@64 scores were listed. Unsurprisingly perhaps, this led to a cons@64 X/Twitter war between OpenAI and xAI researchers (we hate to burst your bubble, Igor from xAI, but o3’s benchmarks showed that its pass@1, or single attempt, beat o1’s cons@64 - essentially the exact opposite of how xAI used cons@64).
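If the notation is unfamiliar, here’s a minimal sketch of the difference (ours, not xAI’s or OpenAI’s actual evaluation code): pass@1 grades a single sampled answer, while cons@64 draws 64 samples and grades only the majority-vote pick - which is why quoting your own cons@64 against a competitor’s pass@1 flatters your numbers:

```python
from collections import Counter

def pass_at_1(samples, correct):
    """pass@1: grade a single sampled answer
    (in practice, averaged over many single draws)."""
    return samples[0] == correct

def cons_at_k(samples, correct, k=64):
    """cons@k: draw k answers, take the majority (consensus)
    answer, and grade only that one."""
    consensus, _ = Counter(samples[:k]).most_common(1)[0]
    return consensus == correct

# Hypothetical model outputs on one benchmark question
samples = ["42"] * 40 + ["41"] * 24        # 64 samples, 62.5% say "42"
print(pass_at_1(samples, correct="42"))    # graded on one attempt
print(cons_at_k(samples, correct="42"))    # graded on the majority of 64 attempts
```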
Other notes on Grok 3:
You can try it for free at the moment - “until our servers melt”, xAI says - but you will need an X Premium subscription ($8/month) to use the base models and a Premium+ subscription ($40/month) to use the reasoning models, at least to start. Leaks suggest a new SuperGrok plan will also become available at $30 monthly or $300 annually, giving subscribers additional features including unlimited image generation. Beyond the current free period, Grok 3 (with usage limits) will eventually roll out to all X users.
Aside from the benchmark caveats above, the early indicators are almost universally positive. Smartest AI on Earth? Debatable, but doubtful. OpenAI co-founder Andrej Karpathy, who is widely respected in the field, had mostly good things to say - though he noted some blips, and tasks other models could handle that Grok couldn’t - and highlighted how impressive it is that Grok 3 is this good when xAI essentially started from scratch just over a year ago. Grok also has a deep research tool, though it isn’t quite up to par with OpenAI’s yet.
Grok 3 launches as the top model on Chatbot Arena, where the public blind-ranks model outputs head to head. Chatbot Arena is not a perfect ranking system by any means - it’s known to favor models that rarely refuse requests (an advantage for Grok, which is less likely to censor political or controversial content), as evidenced by Claude 3.5 Sonnet sitting lower on the leaderboard than its real-world utility would suggest. But overall it’s a strong evaluation, and xAI/Musk’s stated goal of a “based” model plays well to it.
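For the curious, leaderboards like Arena turn those pairwise votes into ratings. Here’s a minimal Elo-style update sketch - illustrative starting ratings and K-factor, not Arena’s actual parameters - showing how a single blind vote moves two models:

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one blind head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# One vote: a user prefers the newcomer's answer over the incumbent's
newcomer, incumbent = elo_update(1200.0, 1250.0, a_won=True)
print(round(newcomer), round(incumbent))  # 1218 1232 - winner gains, loser drops
```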
xAI also revealed in the livestream that a Voice Mode for Grok will go live in the coming days, and Musk announced the launch of xAI’s new gaming studio.
-------------------------
We mentioned some of the issues with current AI benchmarks above - but what about a benchmark that measures how well an AI system can perform tasks that skilled humans are being paid to do right now?
OpenAI’s new SWE-Lancer benchmark aims to do precisely that, putting major LLMs to the test on 1,488 software engineering jobs from Upwork worth $1 million in total - ranging from $50 bug fixes to $32,000 feature implementations.
The models faced actual software engineering tasks, complete with real-world issue descriptions, access to the codebase, and even a "user tool" that lets the model interact with the application and observe the issue. Performance was then judged by human-verified end-to-end tests and, for management tasks, benchmarked against real engineering manager decisions. This approach ties model performance to real-world dollar value rather than just another benchmark number, and it speaks directly to an ongoing debate about AI’s role in the workforce: can AI models actually replace human workers in skilled jobs? Here are the takeaways (a quick sketch of the dollar-weighted scoring follows the list):
Anthropic took the crown: Claude 3.5 Sonnet emerged as the top earner on the SWE-Lancer benchmark, raking in over $400k in simulated freelance earnings. This outpaced GPT-4o ($300k) and even OpenAI’s high-reasoning-effort model, o1 ($380k). It’s worth noting that the research team did not test DeepSeek R1 or their own o3 model, which, as we noted last week, won’t be released standalone but instead rolled into GPT-5.
Managerial SWE AI is more ready than ‘individual contributor’ coding AI: The models demonstrated a significantly higher aptitude for SWE manager tasks - choosing the best of several submitted proposals, the same decision human engineering managers made when the job was originally posted - than for individual contributor SWE tasks that demanded actual coding solutions. This hints that AI might be closer to augmenting or even automating technical oversight roles before it becomes a truly reliable independent coder on complex projects.
Bug fixing is AI's sweet spot (relatively speaking): While still far from perfect, AI models showed the most success in tackling bug fixes compared to other coding tasks. Claude 3.5 Sonnet managed to solve over 28% of bug-fixing challenges. Current foundation LLMs are much better equipped for well-defined, localized coding problems than they are for more creative software development.
New features are still a mountain to climb: Developing entirely new features proved to be a major hurdle for all models tested. Even the best performer, Claude 3.5 Sonnet, only managed to complete 14.3% of new feature implementation tasks, while GPT-4o completely failed on these. This underscores the significant gap that remains before AI can independently handle the more complex and innovative aspects of software engineering.
System-wide quality/reliability tasks = virtually zero success: When it came to tasks involving system-wide quality, reliability, or debugging - work that requires broad codebase understanding and reasoning about complex interactions - LLMs essentially hit a wall, achieving near-0% success across the board. This starkly reveals the limits of current AI in grasping and resolving issues that span entire systems.
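To make that scoring concrete: per the paper, a model only “earns” a task’s real Upwork payout if its patch passes all the human-verified end-to-end tests. Here’s a minimal sketch of that dollar-weighted metric (the task values and results below are made up for illustration, not from the benchmark):

```python
from dataclasses import dataclass

@dataclass
class Task:
    payout_usd: int      # the real price the job paid on Upwork
    tests_passed: bool   # did the model's patch pass every end-to-end test?

def simulated_earnings(tasks):
    """All-or-nothing scoring: a task pays out only if all its
    human-verified end-to-end tests pass."""
    return sum(t.payout_usd for t in tasks if t.tests_passed)

# Illustrative results: two small fixes solved, a $32,000 feature failed
results = [Task(50, True), Task(500, True), Task(32_000, False)]
total = sum(t.payout_usd for t in results)
print(f"${simulated_earnings(results):,} earned of ${total:,} available")
# -> $550 earned of $32,550 available
```

This is why a model can solve a decent fraction of tasks by count yet earn a small fraction of the money: the dollar weighting punishes failure on the big, open-ended jobs.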
OpenAI and Anthropic are both on the cusp of releasing their next major models; it will be interesting to see how results for SWE-Lancer evolve around GPT-4.5/5 and whatever Anthropic has been cooking up.
MORE IN AI THIS WEEK
OpenAI aims for less censored ChatGPT, considers special board voting powers to prevent Elon Musk takeover
Meta announces plans to build the world’s longest undersea cable to support its AI infrastructure needs
Palantir stock soared 600% on deportation policies, AI, and its contracts with the U.S. military - but Trump sent it spiraling after a call to slash costs at the Pentagon
Why AWS CEO Matt Garman is playing the long game on AI
Thinking Machines Lab is ex-OpenAI CTO Mira Murati’s new startup
New York Times goes all-in on internal AI tools
South Korea latest to ban China’s DeepSeek from app stores over privacy concerns
Won't vs. Can't - analyzing sandbagging-like behavior from Claude models
TRENDING AI TOOLS, APPS & SERVICES
Perplexity’s Deep Research: use reasoning AI and search to create detailed reports - free with daily limits
Career Dreamer from Google: a new way to explore career possibilities
Lovable: full-stack app development, now with instant visual edits
Concierge: connected AI assistant that talks to your apps in real time. Supports Gmail, Jira, Notion, Slack, Linear, Confluence, HubSpot, Attio, Salesforce, Airtable, and more
Fiverr Go: empowering freelancers to scale their business with AI
Reflect: track anything with beautiful, data-driven self-improvement visualizations
Nugget: scale support effortlessly with an AI-native, no-code platform
Supavec: flexible, open-source platform to build powerful RAG apps
GUIDES, LISTS, PRODUCTS, UPDATES, INFORMATIVE, INTERESTING
The hottest AI models, what they do, and how to use them
Apple introduced the iPhone 16e as its cheapest device ($599) offering Apple Intelligence
OpenAI published a guide to prompting its o-series reasoning models (emphasizing simpler, more direct approaches over traditional instructions)
Humane has killed its Ai Pin less than a year after its release
LlamaCon: a new open source AI developer conference hosted by Meta
What is Perplexity Deep Research, and how do you use it?
How to use Opera's built-in AI chatbot (and why you should)
VIDEOS, SOCIAL MEDIA & PODCASTS
Dwarkesh Patel interviews Microsoft CEO Satya Nadella on the company’s AGI plan & quantum breakthrough [YouTube]
How to build full-stack apps with OpenAI o1 pro - Part 1 [YouTube]
Robotics company Figure unveils Helix, a generalist Vision-Language-Action (VLA) model that unifies perception, language understanding, and learned control [YouTube]
‘Protoclone’ from Clone Robotics went viral - for being absolute nightmare fuel [X]
Neuralink director Shivon Zilis calls a conversation with Grok 3’s soon-to-be-released voice mode “one of the most unexpectedly rewarding hours of my life” [X]
OpenAI CEO Sam Altman polls X/Twitter about what their next open source project should be [X]
Discussion on Figure’s new robot, Helix [Reddit]
TECHNICAL NEWS, DEVELOPMENT, RESEARCH & OPEN SOURCE
Alibaba’s Qwen team publishes the technical report for their SOTA vision-language model, Qwen2.5-VL
DeepSeek publishes Native Sparse Attention: a natively trainable sparse attention mechanism that cuts training costs while outperforming full-attention models on long-context tasks
Google research: Accelerating scientific breakthroughs with an AI co-scientist (built with Gemini 2.0)
Microsoft’s Muse: their first generative AI model designed for gameplay ideation
Microsoft’s OmniParser V2: turning any LLM into a computer use agent
Sakana AI creates an AI CUDA Engineer that can automatically produce highly optimized CUDA kernels for AI workloads
Mistral unveils Saba: a 24B parameter model trained on meticulously curated datasets from across the Middle East and South Asia
That’s all for this week! We’ll see you next Thursday.