
The week in AI: xAI unveils Grok 3 & OpenAI is behind Anthropic in their own new benchmark

Plus: World's first musculoskeletal robot goes viral for all the wrong reasons

Welcome to The Dispatch! We are the newsletter that keeps you informed about AI. Each Thursday, we round up the major developments in artificial intelligence - the news, useful resources, tools and services - and highlight the top research in the field as well as exciting developments in open source. Even if you aren’t a machine learning engineer, we’ll keep you up to speed on the most important developments in AI.

NEWS & OPINION


-------------------------

xAI launched their latest flagship model, Grok 3, in beta on Monday night via a live stream on X (about 7 million views as of today) hosted by CEO Elon Musk. There are four models in the Grok 3 family: Grok 3, Grok 3 mini, Grok 3 Thinking and Grok 3 mini Thinking. Musk claims it is the smartest AI on Earth, and Grok 3 is indeed state-of-the-art on some important benchmarks. But it’s getting difficult to properly assess the value of AI benchmarks at this point: they’re often static, quickly become dated, and are typically focused narrowly on a single capability - like a model’s factuality in a single domain, or its ability to solve reasoning-based multiple-choice questions.

In the case of Grok 3, xAI also highlighted benchmarks for Grok 3 Thinking in a “best of 64 answers by consensus” fashion, which distorts the numbers since none of the competitors’ cons@64 scores were listed. Unsurprisingly perhaps, this led to a cons@64 X/Twitter war between OpenAI and xAI researchers (we hate to burst your bubble, Igor from xAI, but o3’s benchmarks showed that its pass@1 - a single attempt - beat o1’s cons@64, essentially the exact opposite of how xAI used cons@64).
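
If the notation is unfamiliar: pass@1 grades one sampled answer, while cons@64 samples 64 answers and grades only the majority vote - which flatters a model whenever its errors are scattered across many different wrong answers. Here is a minimal Python sketch of the difference (the toy model and numeric answers are our own stand-ins; real evaluations grade free-form responses):

    from collections import Counter
    import random

    def pass_at_1(sample_answer, correct) -> bool:
        # pass@1: draw a single answer and grade it directly.
        return sample_answer() == correct

    def cons_at_64(sample_answer, correct, k: int = 64) -> bool:
        # cons@64: draw k answers, grade only the most common one.
        votes = Counter(sample_answer() for _ in range(k))
        majority, _ = votes.most_common(1)[0]
        return majority == correct

    # Toy model: correct 40% of the time, otherwise a scattered guess.
    def noisy_model():
        return 42 if random.random() < 0.4 else random.choice([41, 43, 44])

    random.seed(0)
    print(pass_at_1(noisy_model, 42))   # a single draw fails more often than not
    print(cons_at_64(noisy_model, 42))  # majority voting smooths out the noise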

Other notes on Grok 3:

  • You can try it for free at the moment - “until our servers melt”, xAI says - but you will need an X Premium subscription ($8/m) to use the base models and an X Premium+ subscription ($40/m) to use the reasoning models, at least to start. Leaks suggest a new SuperGrok plan will also become available at $30 monthly or $300 annually, giving subscribers additional features including unlimited image generation. After the free period ends, Grok 3 (with usage limits) will eventually roll out to all X users.

  • Aside from the aforementioned benchmarks, the early indicators are almost universally positive. Smartest AI on Earth? Debatable, but doubtful. OpenAI co-founder Andrej Karpathy, who is widely respected in the field, had mostly good things to say - though he noted some blips, and tasks other models could handle that Grok couldn’t - and highlighted how impressive it is that Grok 3 is this good when xAI essentially started from scratch just over a year ago. Grok also has a deep research tool, though it isn’t quite up to par with OpenAI’s yet.

  • Grok 3 launches as the top model on Chatbot Arena, where the public blind-ranks model outputs head to head (a simplified sketch of how pairwise votes become ratings follows this list). Chatbot Arena is by no means a perfect ranking system - it is known to reward models that rarely refuse requests, which benefits Grok, since it is less likely to censor political or controversial content; Claude 3.5 Sonnet’s low leaderboard position relative to its practical utility shows the same effect. But overall it’s a strong evaluation, and xAI/Musk’s stated goal of a “based” model lines up well with it.

  • xAI also revealed in the livestream that a Voice Mode for Grok will go live in the coming days, and Musk announced the launch of xAI’s new gaming studio.
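
Chatbot Arena turns those head-to-head votes into a leaderboard with an Elo-style rating. Its actual statistics are more sophisticated (the site has moved toward a Bradley-Terry model), so treat this as a minimal sketch of the idea rather than their implementation:

    # Minimal Elo-style update from one blind head-to-head vote.
    # k controls how far a single vote moves the ratings.
    def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
        # Expected score of model A given the current rating gap.
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        s_a = 1.0 if a_wins else 0.0
        # Winner moves up and loser moves down by the same amount.
        delta = k * (s_a - e_a)
        return r_a + delta, r_b - delta

    # An upset win by the lower-rated model moves ratings the most.
    print(elo_update(1200.0, 1300.0, a_wins=True))  # ≈ (1220.5, 1279.5)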

-------------------------

We mentioned some of the issues with current AI benchmarks above - but what about a benchmark that measures how well an AI system performs tasks that skilled humans are being paid to do right now?

OpenAI’s new SWE-Lancer benchmark aims to do precisely that by putting major LLMs to the test on 1,488 software engineering jobs from Upwork worth $1m in total - ranging from $50 bug fixes to $32,000 feature implementations.

The models faced actual software engineering tasks, complete with real-world issue descriptions, access to the codebase, and even a “user tool” that let the model interact with the application and observe the issue. Performance was then judged by human-verified end-to-end tests and, for management tasks, benchmarked against real engineering manager decisions. This approach attaches a real-world dollar value to model performance rather than just another benchmark number, and it speaks directly to an ongoing debate about AI’s role in the workforce: can AI models actually replace human workers in skilled jobs? Here are the takeaways (a sketch of the payout-weighted scoring follows the list):

  • Anthropic took the crown: Claude 3.5 Sonnet emerged as the top earner on the SWE-Lancer benchmark, raking in over $400k in simulated freelance earnings. This outpaced GPT-4o ($300k) and even OpenAI’s high-reasoning-effort model, o1 ($380k). It’s worth noting that the research team did not test DeepSeek R1 or their own o3 model, which, as we noted last week, will not be released standalone but rather rolled into GPT-5.

  • Managerial SWE AI is more ready than ‘individual contributor’ coding AI: The models demonstrated a significantly higher aptitude for SWE manager tasks - decisions made by human engineering managers when the job was originally posted - than for individual contributor SWE tasks that demanded actual coding solutions. This hints that AI may augment or even automate technical oversight roles before it becomes a truly reliable independent coder on complex projects.

  • Bug fixing is AI's sweet spot (relatively speaking): While still far from perfect, AI models showed the most success in tackling bug fixes compared to other coding tasks. Claude 3.5 Sonnet managed to solve over 28% of bug-fixing challenges. Current foundation LLMs are much better equipped for well-defined, localized coding problems than they are for more creative software development.

  • New features are still a mountain to climb: Developing entirely new features proved to be a major hurdle for all models tested. Even the best performer, Claude 3.5 Sonnet, only managed to complete 14.3% of new feature implementation tasks, while GPT-4o completely failed on these. This underscores the significant gap that remains before AI can independently handle the more complex and innovative aspects of software engineering.

  • System-wide quality/reliability tasks = virtually zero success: On tasks involving system-wide quality, reliability, or debugging - work that requires broad codebase understanding and reasoning about complex interactions - LLMs essentially hit a wall, achieving near-0% success across the board. This starkly reveals how limited current AI remains at grasping and resolving issues that span entire systems.
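
Because SWE-Lancer scores models in dollars, the headline metric is straightforward: as we read the setup, a task pays out only if the model’s patch passes the human-verified end-to-end tests, with no partial credit. A minimal sketch of that payout-weighted scoring (the data structure is our own illustration, not the paper’s code):

    from dataclasses import dataclass

    @dataclass
    class TaskResult:
        payout_usd: float   # the real Upwork price attached to the task
        tests_passed: bool  # did the patch pass every end-to-end test?

    def total_earnings(results: list[TaskResult]) -> float:
        # A task pays out only on a full pass; partial fixes earn $0.
        return sum(r.payout_usd for r in results if r.tests_passed)

    # Toy example: a $50 bug fix solved, a $32,000 feature missed.
    print(total_earnings([TaskResult(50, True), TaskResult(32_000, False)]))  # 50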

OpenAI and Anthropic are both on the cusp of releasing their next major models; it will be interesting to see how results for SWE-Lancer evolve around GPT-4.5/5 and whatever Anthropic has been cooking up.

MORE IN AI THIS WEEK


TRENDING AI TOOLS, APPS & SERVICES

  • Perplexity’s Deep Research: use reasoning AI and search to create detailed reports - free with daily limits

  • Career Dreamer from Google: a new way to explore career possibilities

  • Lovable: full-stack app development platform adds instant visual edits

  • Concierge: connected AI assistant that talks to your apps in real time. Supports Gmail, Jira, Notion, Slack, Linear, Confluence, HubSpot, Attio, Salesforce, Airtable, and more

  • Fiverr Go: empowering freelancers to scale their business with AI

  • Reflect: track anything with beautiful, data-driven self-improvement visualizations

  • Nugget: scale support effortlessly with an AI-native, no-code platform

  • Supavec: flexible, open-source platform to build powerful RAG apps

GUIDES, LISTS, PRODUCTS, UPDATES, INFORMATIVE, INTERESTING

VIDEOS, SOCIAL MEDIA & PODCASTS

  • Dwarkesh Patel interviews CEO Satya Nadella on Microsoft’s AGI plan & quantum breakthrough [YouTube]

  • How to build full-stack apps with OpenAI o1 pro - Part 1 [YouTube]

  • Robotics company Figure unveils Helix, a generalist Vision-Language-Action (VLA) model that unifies perception, language understanding, and learned control [YouTube]

  • ‘Protoclone’ from Clone Robotics went viral - for being absolute nightmare fuel [X]

  • Neuralink director Shivon Zilis calls a conversation with Grok 3’s soon-to-be-released voice mode “one of the most unexpectedly rewarding hours of my life” [X]

  • OpenAI CEO Sam Altman polls X/Twitter about what their next open-source project should be [X]

  • Discussion on Figure’s new robot, Helix [Reddit]

TECHNICAL NEWS, DEVELOPMENT, RESEARCH & OPEN SOURCE

  • Alibaba’s Qwen team publishes the technical report for their SOTA vision-language model, Qwen2.5-VL

  • DeepSeek publishes Native Sparse Attention: a natively trainable sparse attention mechanism that cuts training costs while outperforming full-attention models on long-context tasks

  • Google research: Accelerating scientific breakthroughs with an AI co-scientist (built with Gemini 2.0)

  • Microsoft’s Muse: their first generative AI model designed for gameplay ideation

  • Microsoft’s OmniParser V2: turning any LLM into a computer use agent

  • Sakana AI creates an AI CUDA Engineer that can produce highly optimized CUDA kernels for AI engineers

  • Mistral unveils Saba: a 24B parameter model trained on meticulously curated datasets from across the Middle East and South Asia

That’s all for this week! We’ll see you next Thursday.