The week in AI: Sora is OpenAI's mind-blowing text to video generator
Plus: Google unveils new open source LLM
Welcome to The Dispatch! We are the newsletter that keeps you informed about AI. Each Thursday, we aggregate the major developments in artificial intelligence; we pass along the news, useful resources, tools and services, and highlight the top research in the field as well as exciting developments in open source. Even if you aren’t an engineer, we’ll keep you in touch with what’s going on in AI.
NEWS & OPINION
-------------------------
OpenAI has once again shocked the AI world with a stunning showcase of their text-to-video model, Sora. Given a brief text prompt or even a still image, Sora (the Japanese word for “sky”) can generate 1080p movie-like scenes with multiple characters, different types of motion and background details in a range of styles, up to a minute long.
While even OpenAI’s hand-picked videos have irregularities, the overall quality and frame-to-frame consistency are far beyond previous state-of-the-art models, such as Runway ML’s Gen-2, Meta’s Emu Video, and even Google’s Lumiere. Whereas competitor models feel like stop-motion sequences of AI images, Sora’s videos are surprisingly - and perhaps disturbingly - realistic.
Sora is based on a diffusion transformer architecture, and it has drawn criticism from OpenAI’s competitors for not having a true understanding of the physical world. Given that, it’s perhaps not a coincidence that on the same day Sora was announced, Meta released V-JEPA - its own video model architecture, which uses latent-space abstractions (rather than pixel-space prediction like Sora) to detect and understand the physics behind interactions between objects in videos.
It’s possible most people will care less about how an AI model makes the videos than about how good the videos are; and even though there’s still a long way to go (some of the videos are comically erroneous or bizarre), the AI buzz phrase “this is the worst it will ever be” applies. Sora is currently being red-teamed for safety, and access has been granted to a number of visual artists, designers, and filmmakers to gather more feedback. There’s no word yet on a public release date or beta sign-up.
-------------------------
Barely two months after releasing Gemini to the public (and just a week after announcing Gemini Ultra 1.0), Google is already introducing Gemini 1.5. The new iteration comes with a massive expansion of its context window and the adoption of a "Mixture of Experts" (MoE) architecture, promising to make the AI both faster and more efficient. The new model also includes expanded multimodal capabilities (when given a 44-minute silent Buster Keaton movie, the model can accurately analyze various plot points and events, and even reason about small details in the movie that could easily be missed).
The ability to process up to 1 million tokens dwarfs the capabilities of its competitors (ChatGPT’s context window is 128,000 tokens). Google CEO Sundar Pichai highlighted the transformative potential of this feature, stating: "This allows use cases where you can add a lot of personal context and information at the moment of the query ... I view it as one of the bigger breakthroughs we have done."
Gemini 1.5 also shows impressive “in-context learning” skills, meaning that it can learn a new skill from information given in a long prompt. This can reduce the need to fine-tune a model to make it perform better at specific tasks. Gemini 1.5 is being rolled out to select developers and enterprise users now, and you can apply for early access. No word on the official release date other than “soon”.
-------------------------
Lenny Rachitsky, author of the popular weekly advice column Lenny’s Newsletter, has compiled 20 examples of how people are using custom GPTs to make their teams more productive at work.
Custom GPTs provide a unique way for non-developers to customize and leverage a cutting-edge AI model to fit their specific use cases. Maybe you could use some help drafting documentation for work, but you know ChatGPT doesn’t have enough context about what you do to create anything of value. With a little effort and experimentation, you can remedy that. Or maybe you want to converse with an “expert” on a particular subject or research paper - just upload the relevant documents, provide some guidelines for the GPT’s behavior, and you’ll get much better results for your inputs than with the default ChatGPT. Use cases are limited only by the imagination.
(Custom GPTs require a ChatGPT Plus subscription; machine learning platform Hugging Face offers a free, open source alternative to GPTs called Hugging Chat Assistants - not quite as capable, but free.)
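For the curious: under the hood, a custom GPT essentially amounts to bundling your instructions and reference material with every request. Here’s a minimal sketch of that idea in Python - the function name and sample text are illustrative, not OpenAI’s actual implementation:

```python
# Sketch of what a custom GPT does conceptually: combine your behavioral
# guidelines and uploaded context with each question before it reaches the model.
# (build_messages and the sample docs below are hypothetical illustrations.)

def build_messages(instructions: str, reference_docs: str, user_question: str) -> list:
    """Merge guidelines and reference material into a chat-style request."""
    system_prompt = (
        f"{instructions}\n\n"
        "Answer using only the reference material below.\n"
        f"--- REFERENCE ---\n{reference_docs}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

messages = build_messages(
    instructions="You are an expert on our internal deployment process.",
    reference_docs="Deploys run every Tuesday; rollbacks require VP approval.",
    user_question="When do deploys run?",
)

# These messages could then be sent to a chat-completion endpoint, e.g.:
# client.chat.completions.create(model="gpt-4-turbo-preview", messages=messages)
```

The custom GPT builder handles all of this through a point-and-click interface, which is what makes it accessible to non-developers.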
MORE IN AI THIS WEEK
Why The New York Times might win its copyright lawsuit against OpenAI
How much electricity does AI consume?
Study shows how AI is affecting the freelance job market
AI hiring tools may be filtering out the best job applicants
Elon Musk hints at possible X partnership with Midjourney
ChatGPT has gone berserk (OpenAI has released a postmortem on this issue)
Reddit signs AI content licensing deal with Google
AI can determine sex of person from brain scans
Why Google’s new AI Gemini is being accused of refusing to acknowledge the existence of white people
Your SOC 2 Compliance Checklist from Vanta
Are you building a business? Achieving SOC 2 compliance can help you win bigger deals, enter new markets and deepen trust with your customers — but it can also cost you real time and money.
Vanta automates up to 90% of the work for SOC 2 (along with other in-demand frameworks), getting you audit-ready in weeks instead of months. Save up to 400 hours and 85% of associated costs.
Download the free checklist to learn more about the SOC 2 compliance process and the road ahead.
TRENDING AI TOOLS & SERVICES
Groq: serving the fastest AI responses you’ve ever seen
Adobe’s AI assistant: builds on Acrobat Liquid Mode to further unlock document intelligence with new capabilities in Reader and Acrobat
Made With Sora: curated gallery of prompts and videos generated by Sora
Global AI Regulation Tracker: website that tracks AI regulations around the world
Glif: remix any image on the web with a Chrome extension
MagiScan: creates realistic 3D models with your phone - downloadable in various file formats (like STL for 3D printing or USDZ for augmented reality applications)
KippyAI: your personal AI language tutor
KardsAI: turn any text into ready-to-use flashcards
GUIDES, LISTS, USEFUL INFO
Strategies for an accelerating future - four questions to ask your organization
New ways Google Workspace customers can use Gemini
How to use NotebookLM (Google’s new AI tool)
I’ve gotten 100 free McDonald’s meals - thanks to this ChatGPT hack
iOS 18’s new AI features: everything we know so far
Adobe is introducing AI for PDFs that everyone should use
Google Gemini: everything you need to know about the new generative AI platform
VIDEOS, SOCIAL MEDIA & PODCASTS
Sora’s AI problems (and solutions) [YouTube]
Testing Gemini 1.5 and a 1 Million Token Window [YouTube]
How Sora works (a layman's explanation) [X]
Musk says Grok 1.5 releases in 2 weeks [X]
OpenAI/Sora is now on [TikTok]
Lex Fridman with Marc Raibert: Boston Dynamics and the future of robotics [Podcast]
(Discussion) Meta AI chief Yann LeCun doubles down on Sora opinion [Reddit]
(Discussion) Apple readies AI tool to rival Microsoft’s GitHub Copilot [Reddit]
TECHNICAL, RESEARCH & OPEN SOURCE
-------------------------
Google has just released Gemma, their first family of open models built from the same research and technology used to create Gemini. Gemma models are lightweight, coming in two sizes, Gemma 2B and Gemma 7B. They’re both available as pre-trained and instruction-tuned variants. Alongside these models, Google is releasing a Responsible Generative AI Toolkit and offering support across major frameworks like JAX, PyTorch, TensorFlow, and Hugging Face Transformers.
Gemma models share infrastructure components with Gemini models - Gemma 2B and 7B models are trained on 2T and 6T tokens of text, respectively. The data for training is primarily English from web documents, mathematics, and code. The models are designed for CPU and on-device applications (2B model) and GPU and TPU deployment (7B model). They’re compatible across laptops, desktops, IoT, mobile and cloud.
Gemma 7B appears to outperform both Llama-2 7B and Mistral 7B on several benchmarks, is available for commercial use, and has been optimized for Nvidia GPUs and (obviously) Vertex AI/Google Cloud.
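Since Gemma ships with Hugging Face Transformers support, trying the instruction-tuned variant locally is straightforward. A minimal sketch - treat the exact model IDs and chat markers as assumptions to verify against the model card:

```python
# Sketch of prompting instruction-tuned Gemma. The turn markers below follow
# the published Gemma chat format; confirm them on the Hugging Face model card.

def format_gemma_turn(user_prompt: str) -> str:
    """Wrap a prompt in Gemma's instruction-tuned chat markers."""
    return (
        f"<start_of_turn>user\n{user_prompt}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_turn("Explain mixture-of-experts in one sentence.")

# With the weights downloaded (and a GPU for the 7B model), generation
# would look roughly like this:
# from transformers import AutoTokenizer, AutoModelForCausalLM
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=128)
```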
-------------------------
Groq, an emerging AI chip company, is making headlines with its impressive Language Processing Units (LPUs) that are setting new benchmarks for speed in language model applications, notably outperforming Nvidia’s best-in-class GPUs. Through a series of viral demonstrations, Groq has showcased its capability to dramatically accelerate the performance of large language models, making even the most advanced/turbo versions of many popular AI chatbots appear slow in comparison. The company asserts that its technology offers the fastest processing for large language models, a claim that has been supported by third-party evaluations (scroll down to throughput).
Behind Groq's innovation are its LPUs, designed to overcome the computational density and memory bandwidth bottlenecks commonly faced by GPUs and CPUs in AI model training and execution. Founded by Jonathan Ross, who previously helped create Google’s AI chips, Groq provides an inference engine that enhances the speed of existing AI models rather than replacing them. The technology has shown potential to run AI chatbots over 13 times faster than current standards. Despite the buzz, the scalability of Groq's AI chips remains an open question.
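The headline metric in these comparisons is throughput, measured in tokens generated per second. A back-of-envelope sketch - the figures below are illustrative placeholders, not measured benchmarks:

```python
# Tokens-per-second is the standard throughput metric in LLM serving
# benchmarks. The numbers here are made up for illustration only.

def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    """Throughput of a generation run: output tokens divided by wall time."""
    return total_tokens / elapsed_seconds

gpu_rate = tokens_per_second(500, 10.0)  # hypothetical GPU run: 50 tok/s
lpu_rate = tokens_per_second(500, 1.0)   # hypothetical LPU run: 500 tok/s
speedup = lpu_rate / gpu_rate            # 10x in this made-up scenario
```

For interactive chatbots, throughput translates directly into perceived responsiveness, which is why the Groq demos feel so striking.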
MORE IN T/R/OS
(Video tutorial) Andrej Karpathy apparently left OpenAI to do things like this: building the GPT Tokenizer
(Also linked above) Meta research: V-JEPA - the next step toward Yann LeCun’s vision of advanced machine intelligence
Google DeepMind research: Chain-of-Thought reasoning without prompting
Amazon’s largest text-to-speech AI model yet shows ’emergent abilities’
LLM Fine-tuning: the biggest collection of colab-based LLM fine-tuning notebooks
That’s it for this week! We’ll see you next Thursday.