
ThursdAI - The top AI news from the past week

Latest episodes

Feb 20, 2025 • 1h 41min

📆 ThursdAI - Feb 20 - Live from AI Eng in NY - Grok 3, Unified Reasoners, Anthropic's Bombshell, and Robot Handoffs!

Holy moly, AI enthusiasts! Alex Volkov here, reporting live from the AI Engineer Summit in the heart of (touristy) Times Square, New York! This week has been an absolute whirlwind of announcements, from XAI's Grok 3 dropping like a bomb, to Figure robots learning to hand each other things, and even a little eval smack-talk between OpenAI and XAI. It's enough to make your head spin, but that's what ThursdAI is here for. We sift through the chaos and bring you the need-to-know, so you can stay on the cutting edge without having to, well, spend your entire life glued to X and Reddit.

This week we had a very special live show with the Haize Labs folks, the ones I previously interviewed about their bijection attacks, discussing their open source judge evaluation library called Verdict. So grab your favorite caffeinated beverage, maybe do some stretches because your mind will be blown, and let's dive into the TL;DR of ThursdAI, February 20th, 2025!

Participants

* Alex Volkov: AI Evangelist with Weights and Biases
* Nisten: AI Engineer and cohost
* Akshay: AI Community Member
* Nuo: Dev Advocate at 01AI
* Nimit: Member of Technical Staff at Haize Labs
* Leonard: Co-founder at Haize Labs

Open Source LLMs

Perplexity's R1 1776: Censorship-Free DeepSeek

Perplexity made a bold move this week, releasing R1 1776, a fine-tuned version of DeepSeek R1 specifically designed to remove what they (and many others) perceive as Chinese government censorship. The name itself, 1776, is a nod to American independence, a pretty clear statement! The core idea? Give users access to information on topics the CCP typically restricts, like Tiananmen Square and Taiwanese independence.

Perplexity used human experts to identify around 300 sensitive topics and built a "censorship classifier" to train the bias out of the model. The impressive part? They claim to have done this without significantly impacting the model's performance on standard evals. As Nuo from 01AI pointed out on the show, though, he'd "actually prefer that they can actually disclose more of their details in terms of post training... Running the R1 model by itself, it's already very difficult and very expensive." He raises a good point: more transparency is always welcome! Still, it's a fascinating attempt to tackle a tricky problem, a problem which, as I always say, we simply cannot avoid. You can check it out yourself on Hugging Face and read their blog post.

Arc Institute & NVIDIA Unveil Evo 2: Genomics Powerhouse

Get ready for some serious science, folks! Arc Institute and NVIDIA dropped Evo 2, a massive genomics model (40 billion parameters!) trained on a mind-boggling 9.3 trillion nucleotides. And it's fully open: two papers, weights, data, training, and inference codebases. We love to see it!

Evo 2 uses the StripedHyena architecture to process huge genetic sequences (up to 1 million nucleotides!), allowing for analysis of complex genomic patterns. The practical applications? Predicting the effects of genetic mutations (super important for healthcare) and even designing entire genomes. I've been super excited about genomics models, and seeing these alternative architectures like StripedHyena getting used here is just icing on the cake. Check it out on X.
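If you're wondering what "predicting the effects of genetic mutations" looks like in practice, the usual zero-shot recipe for genomic language models is a likelihood comparison: score the reference sequence and the mutated sequence under the model and take the difference. Here's a minimal sketch of that pattern; the scorer below is a dummy stand-in, not Evo 2's actual API:

```python
# Hedged sketch of zero-shot variant-effect scoring with a genomic LM.
# With real model weights you would sum the model's per-nucleotide
# log-probabilities; the uniform scorer here is a runnable stand-in.
import math

def sequence_log_likelihood(seq: str) -> float:
    """Dummy stand-in: uniform probability over A/C/G/T at each position."""
    return len(seq) * math.log(0.25)

def variant_effect_score(reference: str, mutant: str) -> float:
    """log P(mutant) - log P(reference): strongly negative values suggest
    the model finds the mutation implausible (potentially deleterious)."""
    return sequence_log_likelihood(mutant) - sequence_log_likelihood(reference)

ref = "ATGGCCATTGTAATGGGCCGC"
mut = ref[:10] + "C" + ref[11:]  # single-nucleotide substitution at position 10
# With the uniform stand-in this prints 0.0 by construction; a trained
# genomic LM would assign different likelihoods to the two sequences.
print(f"variant effect score: {variant_effect_score(ref, mut):+.3f}")
```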
ZeroBench: The "Impossible" Benchmark for VLMs

Need more benchmarks? Always! A new benchmark called ZeroBench arrived, claiming to be the "impossible benchmark" for Vision Language Models (VLMs). And guess what? All current top-of-the-line VLMs get a big fat zero on it.

One example they gave was a bunch of scattered letters, asking the model to "answer the question that is written in the shape of the star among the mess of letters." Honestly, even I struggled to see the star they were talking about. It highlights just how much further VLMs need to go in terms of true visual understanding. (X, Page, Paper, HF)

Hugging Face's Ultra Scale Playbook: Scaling Up

For those of you building massive models, Hugging Face released the Ultra Scale Playbook, a guide to building and scaling AI models on huge GPU clusters. They ran 4,000 scaling experiments on up to 512 GPUs (nothing close to Grok's 100,000, but still impressive!). If you're working in a lab and dreaming big, this is definitely a resource to check out. (HF)

Big CO LLMs + APIs

Grok 3: XAI's Big Swing, a new SOTA LLM! (and Maybe a Bug?)

Monday evening, BOOM! While some of us were enjoying President's Day, the XAI team dropped Grok 3. They announced it in a setting very similar to OpenAI's announcements. They're claiming state-of-the-art performance on some benchmarks (more on that drama later!), and a whopping 1 million token context window, finally confirmed after some initial confusion. They talked a lot about agents and a future of reasoners as well.

The launch was a bit... messy. First, there was a bug where some users were getting Grok 2 even when the dropdown said Grok 3. That led to a lot of mixed reviews. Even when I finally thought I was using Grok 3, it still flubbed my go-to logic test, the "Beth's Ice Cubes" question. (The answer is zero, folks: ice cubes melt!) But Akshay, who joined us on the show, chimed in with some love: "...with just the base model of Grok 3, it's, in my opinion, it's the best coding model out there." So, mixed vibes, to say the least! It's also FREE for now, "until their GPUs melt," according to XAI, which is great.

UPDATE: The vibes are shifting; more and more of my colleagues and mutuals are LOVING Grok 3 for one-shot coding and for just talking to it. I'm getting convinced as well, though I did use and will continue to use Grok for real-time data and access to X.

DeepSearch

In an attempt to show off some agentic features, XAI also launched DeepSearch (not "research" like OpenAI's, but effectively the same thing). XAI, of course, has access to X, which gives their deep search a leg up, specifically for real-time information! I found out it can even "use" the X search!

OpenAI's Open Source Tease

In what felt like a very conveniently timed move, Sam Altman dropped a poll on X the same day as the Grok announcement: if OpenAI were to open-source something, should it be a small, mobile-optimized model, or a model on par with o3-mini? Most of us chose o3-mini, just to have access to that model and play with it. No indication of when this might happen, but it's a clear signal that OpenAI is feeling the pressure from the open-source community.

The Eval Wars: OpenAI vs. XAI

Things got spicy! There was a whole debate about the eval numbers XAI posted, specifically the "best of N" scores (like best of 64 runs). Boris from OpenAI and Aidan McLau called out some of the graphs. Folks on X were quick to point out that OpenAI has also used "best of N" in the past, and the discussion devolved from there.

XAI is claiming SOTA. OpenAI (or some folks from within OpenAI) aren't so sure. The core issue? We can't independently verify Grok's performance because there's no API yet! As I said, "...we're not actually able to use this model to independently evaluate this model and to tell you guys whether or not they actually told us the truth." Transparency matters, folks!
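A quick aside for the statistically curious: the reason "best of N" reporting is contentious is simple math. If a model solves a task with probability p in a single attempt, the chance that at least one of N independent attempts succeeds is 1 - (1 - p)^N, which climbs fast. A back-of-the-envelope sketch (the pass rates are hypothetical, not either lab's actual numbers):

```python
# Why "best of N" flatters a benchmark score: with single-attempt
# accuracy p, the probability that at least one of n independent
# attempts succeeds is 1 - (1 - p)^n. Hypothetical numbers only.

def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

for p in (0.2, 0.4, 0.6):
    row = ", ".join(f"best-of-{n}: {best_of_n(p, n):.1%}" for n in (1, 8, 64))
    print(f"pass@1 = {p:.0%} -> {row}")
```

Even a 40% single-shot model clears 99.99% at best-of-64, which is why it matters enormously whether two labs' headline numbers use the same N.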
As I said, "โ€ฆwe're not actually able to use this model to independently evaluate this model and to tell you guys whether or not they actually told us the truth." Transparency matters, folks!DeepSearch - How Deep?Grok also touted a new "Deep Search" feature, kind of like Perplexity or OpenAI's "Deep Research" in their more expensive plan. My initial tests wereโ€ฆ underwhelming. I nicknamed it "Shallow Search" because it spent all of 34 seconds on a complex query where OpenAI's Deep Research took 11 minutes and cited 17 sources. We're going to need to do some more digging (pun intended) on this one.This Week's BuzzWeโ€™re leaning hard into agents at Weights & Biases! We just released an agents whitepaper (check it out on our socials!), and we're launching an agents course in collaboration with OpenAI's Ilan Biggio. Sign up at wandb.me/agents! We're hearing so much about agent evaluation and observability, and we're working hard to provide the tools the community needs.Also, sadly, our Toronto workshops are completely sold out. But if you're at AI Engineer in New York, come say hi to our booth! And catch my talk on LLM Reasoner Judges tomorrow (Friday) at 11 am EST โ€“ itโ€™ll be live on the AI Engineer YouTube channel (HERE)!Vision & VideoMicrosoft MUSE: Playable Worlds from a Single ImageThis one is wild. Microsoft's MUSE can generate minutes of playable gameplay from just a single second of video frames and controller actions.It's based on the World and Human Action Model (WHAM) architecture, trained on a billion gameplay images from Xbox. So if youโ€™ve been playing Xbox lately, you might be in the model! I found it particularly cool: "โ€ฆyou give it like a single second of a gameplay of any type of game with all the screen elements, with percentages, with health bars, with all of these things and their model generates a game that you can control." (X, HF, Blog).StepFun's Step-Video-T2V: State-of-the-Art (and Open Source!)We got two awesome open-source video breakthroughs this week. First, StepFun's Step-Video-T2V (and T2V Turbo), a 30 billion parameter text-to-video model. The results look really good, especially the text integration. Imagine a Chinese girl opening a scroll, and the words "We will open source" appearing as she unfurls it. Thatโ€™s the kind of detail we're talking about.And itโ€™s MIT licensed! As Nisten noted "This is pretty cool. It came out. Right before Sora came out, people would have lost their minds." (X, Paper, HF, Try It).HAO AI's FastVideo: Speeding Up HY-VideoThe second video highlight: HAO AI released FastVideo, a way to make HY-Video (already a strong open-source contender) three times faster with no additional training! They call the trick "Sliding Tile Attention" apparently that alone provides enormous boost compared to even flash attention.This is huge because faster inference means these models become more practical for real-world use. And, bonus: it supports HY-Video's Loras, meaning you can fine-tune it for, ahem, all kinds of creative applications. I will not go as far as to mention civit ai. (Github)Figure's Helix: Robot Collaboration!Breaking news from the AI Engineer conference floor: Figure, the humanoid robot company, announced Helix, a Vision-Language-Action (VLA) model built into their robots!It has full upper body control!What blew my mind: they showed two robots working together, handing objects to each other, based on natural language commands! As I watched, I exclaimed, "I haven't seen a humanoid robot, hand off stuff to the other one... 
Figure's Helix: Robot Collaboration!

Breaking news from the AI Engineer conference floor: Figure, the humanoid robot company, announced Helix, a Vision-Language-Action (VLA) model built into their robots! It has full upper-body control!

What blew my mind: they showed two robots working together, handing objects to each other, based on natural language commands! As I watched, I exclaimed, "I haven't seen a humanoid robot hand off stuff to the other one... I found it like super futuristically cool." The model runs on the robot, using a 7 billion parameter VLM for understanding and an 80 million parameter transformer for control. This is the future, folks!

Tools & Others

Microsoft's New Quantum Chip (and State of Matter!)

Microsoft announced a new quantum chip and a new state of matter (called "topological superconductivity"). "I found it like absolutely mind blowing that they announced something like this," I gushed on the show. While I'm no quantum physicist, this sounds like a big deal for the future of computing.

Verdict: Haize Labs' Framework for LLM Judges

And of course, the highlight of our show: Verdict, a new open-source framework from Haize Labs (the folks behind those "bijection" jailbreaks!) for composing LLM judges. This is a huge deal for anyone working on evaluation. Leonard and Nimit from Haize Labs joined us to explain how Verdict addresses some of the core problems with LLM-as-a-judge: biases (like preferring their own responses!), sensitivity to prompts, and the challenge of "meta-evaluation" (how do you know your judge is actually good?).

Verdict lets you combine different judging techniques ("primitives") to create more robust and efficient evaluators. Think of it as "judge-time compute scaling," as Leonard called it. They're achieving near state-of-the-art results on benchmarks like ExpertQA, and it's designed to be fast enough to use as a guardrail in real-time applications!

One key insight: you don't always need a full-blown reasoning model for judging. As Nimit explained, Verdict can combine simpler LLM calls to achieve similar results at a fraction of the cost. And it's open source! (Paper, Github, X)
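To give a flavor of what "composing judges" means, here's an illustrative sketch of the idea in plain Python, NOT Verdict's actual API: take several cheap judge primitives, each biased in its own way, and aggregate their scores into one verdict. In Verdict itself the primitives would be LLM calls with their own prompts, rubrics, or models.

```python
# Illustrative sketch of "judge-time compute scaling": aggregate several
# cheap judge primitives into one more robust verdict. Not Verdict's real
# API; the primitives below are hypothetical stand-ins for LLM calls.
import statistics
from typing import Callable

Judge = Callable[[str, str], float]  # (question, answer) -> score in [0, 1]

def ensemble(judges: list[Judge]) -> Judge:
    """Average several judge primitives; aggregation smooths the biases
    any single judge has (verbosity bias, self-preference, etc.)."""
    def combined(question: str, answer: str) -> float:
        return statistics.mean(j(question, answer) for j in judges)
    return combined

# Hypothetical stand-in primitives, purely for the sketch:
def cites_evidence(q: str, a: str) -> float:
    return 1.0 if "because" in a.lower() else 0.0

def not_too_short(q: str, a: str) -> float:
    return min(len(a) / 100, 1.0)

judge = ensemble([cites_evidence, not_too_short])
print(judge("Why is the sky blue?", "Because shorter wavelengths scatter more."))
```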
Conclusion

Another week, another explosion of AI breakthroughs! Here are my key takeaways:

* Open Source is THRIVING: From censorship-free LLMs to cutting-edge video models, the open-source community is delivering incredible innovation.
* The Need for Speed (and Efficiency): Whether it's faster video generation or more efficient LLM judging, performance is key.
* Robots are Getting Smarter (and More Collaborative): Figure's Helix is a glimpse into a future where robots work together.
* Evaluation is (Finally) Getting Attention: Tools like Verdict are essential for building reliable and trustworthy AI systems.
* The Big Players are Feeling the Heat: OpenAI's open-source tease and XAI's rapid progress show that the competition is fierce.

I'll be back in my usual setup next week, ready to break down all the latest AI news. Stay tuned to ThursdAI, and don't forget to give the pod five stars and subscribe to the newsletter for all the links and deeper dives. There's potentially an Anthropic announcement coming, so we'll see you all next week.

TLDR

* Open Source LLMs
* Perplexity R1 1776 - finetune of china-less R1 (Blog, Model)
* Arc Institute + NVIDIA - introduce Evo 2 - genomics model (X)
* ZeroBench - impossible benchmark for VLMs (X, Page, Paper, HF)
* HuggingFace Ultra Scale Playbook (HF)
* Big CO LLMs + APIs
* Grok 3 SOTA LLM + reasoning and Deep Search (blog, try it)
* OpenAI is about to open source something? Sam posts a poll
* This Week's Buzz
* We are about to launch an agents course! Pre-sign up at wandb.me/agents
* Workshops are SOLD OUT
* Watch my talk LIVE from AI Engineer - 11am EST Friday (HERE)
* Keep watching the AI Eng conference after the show on the AIE YT channel
* Vision & Video
* Microsoft MUSE - playable worlds from one image (X, HF, Blog)
* Microsoft OmniParser - better, faster screen parsing for GUI agents with OmniParser v2 (Gradio Demo)
* HAO AI - FastVideo - making HY-Video 3x as fast (Github)
* StepFun - Step-Video-T2V (+Turbo), a SotA 30B text-to-video model (Paper, Github, HF, Try It)
* Figure announces Helix - vision-language-action model built into the Figure robot (Paper)
* Tools & Others
* Microsoft announces a new quantum chip and a new state of matter (Blog, X)
* Verdict - framework to compose SOTA LLM judges with judge-time scaling (Paper, Github, X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
Feb 13, 2025 • 1h 44min

📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?

Vu Chan, an AI enthusiast and evaluator, shares his insights on exciting AI developments. He recounts a humorous tale of his rogue AI app that turned a simple messaging tool into an unexpected life coach. The discussion dives into the latest open-source releases, including Agentica's DeepScaleR 1.5B and its impressive benchmarks. They also decode the mystery behind OpenAI's confusing roadmap and explore the ongoing innovations reshaping the AI landscape. Expect a blend of laughter, curiosity, and thought-provoking advancements!
Feb 7, 2025 • 1h 40min

📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PhD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news

In this episode, Derya Unutmaz, a scientist and M.D., shares his insights on OpenAI's groundbreaking Deep Research agent, emphasizing its transformative potential in medical research. The conversation also delves into Google's Gemini 2.0 and ByteDance's mind-bending OmniHuman-1, a human animation model that blurs the lines between reality and simulation. With personal experiences illustrating AI's role in tackling complex health issues, the discussion highlights the revolutionary impact of AI on scientific inquiry and the urgent need for open-access initiatives in research.
Jan 30, 2025 • 1h 55min

📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news

This week, the unveiling of DeepSeek's R1 model sent shockwaves through Wall Street, marking a historic stock market loss. The podcast explores the growing excitement around open-source AI innovations and highlights key advancements from industry giants like Meta and OpenAI. There's deep discussion of emerging models, from multimodal capabilities to music generation tools, including an open-source Suno alternative. The episode wraps up with insights on navigating AI tools and the competitive landscape, teasing a year of increasing reasoning and agency in AI.
Jan 24, 2025 • 1h 50min

📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI Manhattan Project, ByteDance UI-TARS, new Gemini Thinker & more AI news

This week in AI is nothing short of explosive! DeepSeek's open-source R1 model is making waves, promising advanced reasoning capabilities. Meanwhile, a staggering $500 billion infrastructure initiative is on the horizon, shaking up the landscape. OpenAI's new 'Operator' aims to bridge the gap between chat and action, though its live demo had some adventures. Plus, ByteDance's UI-TARS model adds to the innovation buzz. All these advancements signal a revolutionary moment in artificial intelligence!
Jan 17, 2025 • 1h 41min

📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news

This week features Graham Neubig, a Carnegie Mellon professor and chief scientist at All Hands AI, who dives deep into the OpenHands project, an innovative AI coding agent. He discusses the evolution of coding agents and their role in software development. The episode highlights groundbreaking advancements in open-source AI, including massive new models with impressive context lengths. Neubig also shares insights on community collaboration for AI evaluation, stressing the importance of maintaining code quality in an evolving tech landscape.
Jan 10, 2025 • 1h 20min

📆 ThursdAI - Jan 9th - NVIDIA's Tiny Supercomputer, Phi-4 is back, Kokoro TTS & Moondream gaze, ByteDance SOTA lip sync & more AI news

This week features Vik Korrapati, the creator of Moondream, a vision language model for edge devices. He reveals groundbreaking updates, including AI's ability to determine where someone is looking in a photo. The conversation also dives into NVIDIA's incredible new supercomputer that runs massive AI models at home, and into innovations in pandemic monitoring using AI, where MetaGene, a new foundation model for viral monitoring, carries exciting public-health implications in this rapidly evolving AI landscape.
Jan 2, 2025 • 1h 31min

📆 ThursdAI - Jan 2 - is '25 the year of AI agents?

In a New Year special, the discussion pivots to the rise of AI agents and their evolving reasoning capabilities. João Moura from CrewAI shares insights on the rapid growth of AI frameworks and the operational efficiencies they bring. The importance of human oversight in AI decision-making is highlighted, alongside methods for evaluating AI agents' performance. Challenges in reliability and the integration of external systems are examined, emphasizing the need for adaptability in this transformative tech landscape.
Dec 27, 2024 • 1h 36min

📆 ThursdAI - Dec 26 - OpenAI o3 & o3 mini, DeepSeek v3 658B beating Claude, Qwen Visual Reasoning, Hume OCTAVE & more AI news

Get ready for a whirlwind of AI breakthroughs! OpenAI's latest models sparked debates over AGI, while DeepSeek stunned everyone with a powerful open-source LLM. Discover how the new multimodal model from Qwen excels in visual reasoning tasks. The conversation dives deep into innovative voice generation and the security challenges that come with it. The hosts share the year's highlights in AI, along with predictions for 2025, as the community reflects on rapid advancements and the excitement brewing in the AI landscape.
Dec 20, 2024 • 1h 36min

🎄 ThursdAI - Dec 19 - o1 vs Gemini reasoning, VEO vs SORA, and a holiday season full of AI surprises

Nisten Tahiraj, an AI expert, joins the discussion on thrilling advancements in artificial intelligence. They delve into OpenAI's latest innovations, like the voice feature for ChatGPT and the o1 model's new reasoning capabilities. The talk shifts to Google's groundbreaking VEO2 model and Gemini 2, comparing them with competitors. Exciting tools like Cohere's Command R and real-time communication features are highlighted, showcasing the rapid evolution in the AI landscape. With a festive twist, they celebrate the dynamic updates shaping our tech future!
