
ThursdAI - The top AI news from the past week
ThursdAI - Mar 20 - OpenAI's new voices, Mistral Small, NVIDIA GTC recap & Nemotron, new SOTA vision from Roboflow & more AI news
Podcast summary created with Snipd AI
Quick takeaways
- Mistral has re-entered the open-source arena with its innovative Mistral Small model, showcasing impressive performance against larger models like Gemma.
- OpenAI launched the O1 Pro model with enhanced reasoning capabilities and a premium pricing structure, catering primarily to enterprise users.
- Significant advancements in voice recognition were showcased by OpenAI, with new speech-to-text models outperforming Whisper and emotional nuance introduced in text-to-speech generation.
- The introduction of new evaluation benchmarks like Roboflow's RF100-VL vision-language benchmark aims to improve assessment methods for modern AI models, addressing real-world application gaps.
Deep dives
Introduction of Open Source Developments
Recent open-source developments in artificial intelligence have been notable, particularly the return of Mistral, which has re-entered the open-source arena with its new multimodal model, Mistral Small, released under the Apache 2 license. This 24 billion parameter model, remarkably able to run on a single GPU, has shown significant promise in benchmarks, particularly excelling in comparison to models such as Gemma 3. Additionally, LG has unveiled its entry into the LLM space with the introduction of EXAONE Deep, a 32 billion parameter model. The sentiment around the AI community is one of excitement at the increased tempo of model releases and innovations, suggesting a vibrant landscape of competitive experimentation and progress in open-source AI.
Recent Releases by Major Tech Companies
Major tech firms have been active in the AI sector with several noteworthy releases. ByteDance has introduced DAPO, a reinforcement learning training method that reportedly reaches 50% accuracy on AIME 2024 with half the training steps of prior approaches. Nvidia made headlines at their GTC event by unveiling the Nemotron models, distilled and pruned versions of the Llama series that promise increased accuracy and reasoning capabilities. Meanwhile, Google is stepping up its game by making advanced tools accessible within its Gemini application, particularly the newly free Deep Research feature and the Canvas feature that enhances application building for developers.
OpenAI's New High-Speed Offerings
OpenAI has broadened its offerings significantly with the introduction of O1 Pro, a model priced at $600 per million output tokens that promises enhanced reasoning capabilities. This model is now available through their API, offering a more advanced compute tier for users, albeit with a premium pricing structure that may restrict access for smaller developers. Notably, they also went live with a stream introducing even more updates, which generated considerable anticipation within the AI community. The complexity and cost structure of these new offerings suggest that OpenAI is targeting enterprises and users requiring robust capabilities in processing and application development.
Voice and Speech Recognition Advances
Recent advances in voice recognition and synthesis have notably changed the landscape. OpenAI has released two new speech-to-text models, which outperform their previous offering, Whisper, showcasing significant gains in versatility and accuracy across various languages. Furthermore, they have introduced a text-to-speech model capable of expressing emotional cues specified in prompts, allowing developers to specify not only what is said but also how it is articulated. This evolution speaks to a broader movement towards more human-like interaction with AI systems, raising expectations for future applications.
New Evaluation Standards for AI Models
The introduction of new evaluation benchmarks like RF100-VL from Roboflow aims to address the inefficiencies of previous measures such as the COCO dataset, which has become somewhat outdated. The new benchmark covers diverse applications like infrared imaging and aerial photography, prompting much-needed adaptations in evaluation methods. Furthermore, observations of various models' performance indicate a perceived gap in their effectiveness on real-world challenges, exposing areas for further development and improvement. This shift towards versatile benchmarking reflects a growing understanding within the community that traditional standards may not fully capture the capabilities of modern AI models.
3D Model Generation Innovations
Innovations in 3D generation models, particularly those by Tencent with their release of Hunyuan 3D, have shown incredible advancements in the fidelity and speed of 3D object generation. The model's capacity to create detailed 3D representations from one or more images and its new turbo feature allowing rapid generation speaks volumes about the technological progress being made. The ability to create refined 3D elements in seconds, paired with the feature to provide multi-view images, promises potential applications across various sectors including gaming and design. This surge in 3D capabilities suggests a significant shift towards seamless integration of AI-generated content into creative workflows.
Community Response and Collaboration
The ongoing dialogue and collaboration amongst developers and AI practitioners within the community have generated a supportive network aimed at driving further advancements in AI technologies. Projects like Smart Turn from Daily and collaborative efforts to improve semantic voice activity detection showcase the potential success of combining efforts towards common goals in technology. By engaging openly with suggestions and critiques, developers strive to enhance model performance while making AI solutions more practical for real-world applications. This community-driven approach emphasizes the shared belief that collective input is critical to addressing the complexities of AI in holistic and innovative ways.
Ethics and Challenges Ahead
As the AI landscape proliferates with new advancements, the ethical implications of deploying such technologies remain a crucial area of discussion. Companies are urged to prioritize responsible AI practices, ensuring they understand the effects of their developments on society and the environment. The increasing sophistication of models, while promising significant benefits, could lead to misuse or unintended consequences if not managed properly. Scholars and practitioners alike advocate for a balanced approach that encourages innovation while maintaining a focus on the societal responsibilities of those within the tech industry.
Hey, it's Alex, coming to you fresh off another live recording of ThursdAI, and what an incredible one it's been!
I was hoping that this week would be chill on the releases front because of NVIDIA's GTC conference, but no, the AI world doesn't stop, and if you blinked this week, you may have missed 2 or 10 major things that happened.
From Mistral coming back to OSS with the amazing Mistral Small 3.1 (beating Gemma from last week!) to OpenAI dropping a new voice generation model and 2 (!) new Whisper-killer ASR models, with breaking news during our live show (there's a reason we're called ThursdAI), which we watched together and then dissected with Kwindla, our amazing AI voice and real-time expert.
Not to mention that we also had dedicated breaking news from friend of the pod Joseph Nelson, who came on the show to announce a SOTA vision model from Roboflow + a new benchmark on which even the top VL models get around 6%! There's also a bunch of other OSS, a SOTA 3D model from Tencent, and more!
And last but not least, Yam is back 🎉 So... buckle up and let's dive in. As always, TL;DR and show notes at the end, and here's the YT live version. (While you're there, please hit subscribe and help me hit that 1K subs on YT 🙏 )
Voice & Audio: OpenAI's Voice Revolution and the Open Source Echo
Hold the phone, everyone, because this week belonged to Voice & Audio! Seriously, if you weren't paying attention to the voice space, you missed a seismic shift, courtesy of OpenAI and some serious open-source contenders.
OpenAI's New Voice Models - Whisper Gets an Upgrade, TTS Gets Emotional!
OpenAI dropped a suite of next-gen audio models: gpt-4o-mini-tts (text-to-speech) plus gpt-4o-transcribe and gpt-4o-mini-transcribe (speech-to-text), all built on top of their GPT-4o architecture.
To unpack this voice revolution, we welcomed back Kwindla Kramer from Daily, the voice AI whisperer himself. The headline news? The new speech-to-text models are not just incremental improvements; they're a whole new ballgame. As OpenAI's Shenyi explained, "Our new generation model is based on our large speech model. This means this new model has been trained on trillions of audio tokens." They're faster, cheaper (Mini Transcribe is half the price of Whisper!), and boast state-of-the-art accuracy across multiple languages. But the real kicker? They're promptable!
"This basically opens up a whole field of prompt engineering for these models, which is crazy," I exclaimed, my mind officially blown. Imagine prompting your transcription model with context – telling it you're discussing dog breeds, and suddenly, its accuracy for breed names skyrockets. That's the power of promptable ASR! I recorded a live reaction aftder dropping of stream, and I was really impressed with how I can get the models to pronounce ThursdAI by just... asking!
But the voice magic doesn't stop there. gpt-4o-mini-tts, the new text-to-speech model, can now be prompted for… emotions! "You can prompt [it] to be emotional. You can ask it to do some stuff. You can prompt the character [of] a voice," as the team put it, and OpenAI even demoed a "Mad Scientist" voice! Captain Ryland voice, anyone? This is a huge leap forward in TTS, making AI voices sound… well, more human.
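Here's roughly what prompting the TTS model looks like in code, a small sketch with the OpenAI SDK where the voice name and the instructions are my own illustrative picks:

```python
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Welcome back to ThursdAI, the top AI news from the past week!",
    # `instructions` controls HOW it's said, not just what is said
    instructions="Speak like an excitable mad scientist: fast, gleeful, theatrical.",
)
response.write_to_file("thursdai_intro.mp3")
```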
But wait, there’s more! Semantic VAD! Semantic Voice Activity Detection, as OpenAI explained, "chunks the audio up based on when the model thinks the user's actually finished speaking." It’s about understanding the meaning of speech, not just detecting silence. Kwindla hailed it as "a big step forward," finally addressing the age-old problem of AI agents interrupting you mid-thought. No more robotic impatience!
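For the tinkerers, here's a hedged sketch of what enabling semantic VAD looks like in a Realtime API session update; treat the exact option names (like "eagerness") as a shape to verify against the current docs rather than gospel:

```python
import json

session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",  # end-of-turn based on meaning, not silence
            "eagerness": "auto",     # how quickly the model decides you're done
        },
        # the noise reduction option also announced this week
        "input_audio_noise_reduction": {"type": "near_field"},
    },
}

# send it over an already-open websocket to the Realtime API:
# ws.send(json.dumps(session_update))
print(json.dumps(session_update, indent=2))
```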
OpenAI also threw in noise reduction and conversation item retrieval, making these new voice models production-ready powerhouses. This isn't just an update; it's a voice AI revolution, folks.
They also built a super nice website to test out the new models: openai.fm!
Canopy Labs' Orpheus 3B - Open Source Voice Steps Up
But hold on, the open-source voice community isn't about to be outshone! Canopy Labs dropped Orpheus 3B, a "natural sounding speech language model" with open-source spirit.
Orpheus, available in multiple sizes (3B, 1B, 500M, 150M), boasts zero-shot voice cloning and a glorious Apache 2 license. Wolfram noted its current lack of multilingual support but remained enthusiastic. I played with the models a bit and they do sound quite awesome, but alas, I wasn't able to finetune them on my own voice due to "CUDA OUT OF MEMORY".
I did a live reaction recording for this model on X.
NVIDIA Canary - Open Source Speech Recognition Enters the Race
Speaking of open source, NVIDIA surprised us with Canary, a speech recognition and translation model. "NVIDIA open sourced Canary, which is a 1 billion parameter and 180 million parameter speech recognition and translation, so basically like whisper competitor," I summarized. Canary is tiny, fast, and CC-BY licensed, allowing commercial use. It even snagged second place on the Hugging Face speech recognition leaderboard! Open source ASR just got a whole lot more interesting.
Of course, this won't get to the level of the new SOTA ASR OpenAI just dropped, but this can run locally and allows commercial use on edge devices!
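If you want to kick Canary's tires locally, here's a rough sketch via NVIDIA's NeMo toolkit (assuming `pip install "nemo_toolkit[asr]"`); exact class and argument names follow the Hugging Face model card at release time and may shift:

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# pull the released checkpoint from Hugging Face
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

# Canary handles both transcription and translation tasks
results = model.transcribe(["meeting_recording.wav"], batch_size=1)
print(results[0])
```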
Vision & Video: Roboflow's Visionary Model and Video Generation Gets Moving
After the voice-apalooza, let's switch gears to the visual world, where Vision & Video delivered some knockout blows, spearheaded by Roboflow and StepFun.
Roboflow's RF-DETR and RF100-VL - A New Vision SOTA Emerges
Roboflow stole the vision spotlight this week with their RF-DETR model and the groundbreaking RF100-VL benchmark. We were lucky enough to have Joseph Nelson, Roboflow CEO, join the show again and give us the breaking news (they published the Github 11 minutes before he came on!)
RF-DETR is Roboflow's first in-house model, a real-time object detection transformer that's rewriting the rulebook. "We've actually never released a model that we've developed. And so this is the first time where we've taken a lot of those learnings and put that into a model," Joseph revealed.
And what a model it is! RF-DETR is not just fast; it's SOTA on real-world datasets and surpasses the 60 mAP barrier on COCO. But Joseph dropped a truth bomb: COCO is outdated. "The benchmark that everyone uses is the COCO benchmark… hasn't been updated since 2017, but models have continued to get really, really, really good. And so they['ve] saturated the COCO benchmark," he explained.
Enter RF100-VL, Roboflow's revolutionary new benchmark, designed to evaluate vision-language models on real-world data. "We introduced a benchmark that we call RF 100 vision language," Joseph announced. The results? Shockingly low zero-shot performance on real-world vision tasks, highlighting a major gap in current models. Joseph's quiz question about Qwen2.5-VL's zero-shot performance on RF100-VL revealed a dismal 5.8% accuracy. "So we as a field have a long, long way to go before we have zero shot performance on real world context," Joseph concluded. RF100-VL is the new frontier for vision, and RF-DETR is leading the charge! Plus, it runs on edge devices and is Apache 2 licensed! Roboflow, you legends! Check out the RF-DETR Blog Post, the RF-DETR Github, and the RF100-VL Benchmark for more details!
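Roboflow also shipped a pip package alongside the release, so you can try RF-DETR in a few lines. A quick sketch following their README (`pip install rfdetr`); the API may have evolved since, and the image path is mine:

```python
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()  # downloads the pretrained COCO checkpoint

image = Image.open("street_scene.jpg")
detections = model.predict(image, threshold=0.5)

# detections follow the `supervision` library's format:
# per-object class ids, confidences, and xyxy bounding boxes
print(detections.class_id, detections.confidence, detections.xyxy)
```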
StepFun's Image-to-Video TI2V - Animating Images with Style
Stepping into the video arena, StepFun released their image2video model, TI2V. TI2V boasts impressive motion controls and generates high-quality videos from images and text prompts, especially excelling in anime-style video generation. Dive into the TI2V HuggingFace Space and TI2V Github to explore further.
Open Source LLMs: Mistral's Triumphant Return, LG's Fridge LLM, NVIDIA's Nemotron, and ByteDance's RL Boost
Let's circle back to our beloved Open Source LLMs, where this week was nothing short of a gold rush!
Mistral is BACK, Baby! - Mistral Small 3.1 24B (Again!)
Seriously, Mistral AI's return to open source with Mistral Small 3.1 deserves another shoutout! "Mistral is back with open source. Let's go!" I cheered, and I meant it. This multimodal, Apache 2 licensed model is a powerhouse, outperforming Gemma 3 and ready for action on a single GPU. Wolfram, ever the pragmatist, noted, "We are in right now, where a week later, you already have some new toys to play with," referring to Gemma 3, which we covered just last week!
Not only did we get a great new update from Mistral, they also cited our friends at Nous Research and their Deep Hermes (released just last week!) as the reason for releasing the base models alongside the finetuned ones!
Mistral Small 3.1 is not just a model; it's a statement: open source is thriving, and Mistral is leading the charge! Check out their Blog Post, the HuggingFace page, and the Base Model on HF.
NVIDIA Nemotron - Distilling, Pruning, and Making Llamas Better
NVIDIA finally dropped Llama Nemotron, and it was worth the wait!
Nemotron Nano (8B) and Super (49B) are here, with Ultra (253B) on the horizon. These models are distilled, pruned, and, crucially, designed for reasoning with a hybrid architecture allowing you to enable and disable reasoning via a simple on/off switch in the system prompt!
Beating other reasoners like QwQ on GPQA tasks, this distilled and pruned Llama-based reasoner seems very powerful! Congrats to NVIDIA.
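Here's a sketch of that reasoning toggle: per NVIDIA's model cards it's literally a system prompt, shown here against a self-hosted OpenAI-compatible endpoint (the base URL and serving setup are my assumptions for illustration):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    response = client.chat.completions.create(
        model="nvidia/Llama-3_3-Nemotron-Super-49B-v1",
        messages=[
            # the documented switch: "detailed thinking on" / "detailed thinking off"
            {"role": "system", "content": f"detailed thinking {'on' if reasoning else 'off'}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("What is 17 * 24?", reasoning=True))   # emits a full reasoning trace
print(ask("What is 17 * 24?", reasoning=False))  # answers directly
```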
Chris Alexius (a friend of the pod), who co-authored the announcement, told me that FP8 is expected, and when that drops, this model will also fit on a single H100 GPU, making it really great for enterprises that host on their own hardware.
And yes, it’s ready for commercial use. NVIDIA, welcome to the open-source LLM party! Explore the Llama-Nemotron HuggingFace Collection and the Dataset.
LG Enters the LLM Fray with EXAONE Deep 32B - Fridge AI is Officially a Thing
LG, yes, that LG, surprised everyone by open-sourcing EXAONE Deep 32B, a "thinking model" from the fridge and TV giant. "LG open sources EXAONE and EXAONE Deep 32B thinking model," I announced, still slightly amused by the fridge-LLM concept. This 32B parameter model claims "superior capabilities" in reasoning, and while my live test in LM Studio went a bit haywire (quantization could be the culprit). It's non-commercial, but hey, fridge-powered AI is now officially a thing. Who saw that coming? Check out my Reaction Video, the LG Blog, and the HuggingFace page for more info.
ByteDance's DAPO - Reinforcement Learning Gets Efficient
From the creators of TikTok, ByteDance, comes DAPO, a new reinforcement learning method that's outperforming GRPO. DAPO promises 50% accuracy on AIME 2024 with 50% fewer training steps. Nisten, our RL expert, explained it's a refined GRPO, pushing the boundaries of RL efficiency. Open source RL is getting faster and better, thanks to ByteDance! Dive into the X thread, Github, and Paper for the technical details.
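For the RL nerds: as I read the paper, the headline trick is "clip-higher": DAPO decouples PPO/GRPO's symmetric clipping range so low-probability tokens get more room to grow. A toy sketch of that objective, with the epsilon values reported in the paper, purely illustrative:

```python
import torch

def dapo_clip_objective(logp_new, logp_old, advantages,
                        eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate with decoupled bounds ("clip-higher")."""
    ratio = torch.exp(logp_new - logp_old)  # per-token importance ratio
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    # maximize this surrogate (minimize its negative) in the training loop
    return torch.min(ratio * advantages, clipped * advantages).mean()
```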
Big CO LLMs + APIs: Google's Generosity, OpenAI's Oligarch Pricing, and GTC Mania
Switching gears to the Big CO LLM arena, we saw Google making moves for the masses, OpenAI catering to the elite, and NVIDIA… well, being NVIDIA.
Google Makes DeepResearch Free and Adds Canvas
Google is opening up DeepResearch to everyone for FREE! DeepResearch, Gemini's advanced search mode, is now accessible without a Pro subscription. I really like its revamped UI, where you can see the thinking and the sources! I used it live on the show to find out what we talked about in the latest episode of ThursdAI, and it did a pretty good job!
Plus, Google unveiled Canvas, letting you "build apps within Gemini and actually see them." Google is making Gemini more accessible and more powerful, a win for everyone. Here's a Tetris game it built for me, and here's a markdown-enabled word counter I rebuild every week before I send ThursdAI (making sure I don't send you 10K words every week 😅)
OpenAI's O1 Pro API - Pricey Power for the Few
OpenAI, in contrast, released O1 Pro API, but with a price tag that's… astronomical. "OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)," I quipped, highlighting the exclusivity. $600 per million output tokens? "If you code with this, if you vibe code with this, you better already have VCs backing your startup," I warned. O1 Pro might be top-tier performance, but it's priced for the 0.1%.
NVIDIA GTC Recap - Jensen's Hardware Extravaganza
NVIDIA GTC was, as always, a hardware spectacle. New GPUs (Blackwell Ultra, Vera Rubin, Feynman!), the tiny DGX Spark supercomputer, the GR00T robot foundation model, and the Blue robot – NVIDIA is building the AI future, brick by silicon brick. Jensen is the AI world's rockstar, and GTC is his sold-out stadium show. Check out Rowan Cheung's GTC Recap on X for a quick overview.
Shoutout to our team at GTC and this amazingly timed logo shot I took from the live stream!
Anthropic adds Web Search
We had a surprise at the end of the show, with Anthropic releasing web search. It's a small thing, but for folks who use Claude, it's very important.
You can now turn on web search directly in Claude, which makes Anthropic... the last frontier lab to enable this feature 😂 Congrats!
AI Art & Diffusion & 3D: Tencent's 3D Revolution
Tencent Hunyuan 3D 2.0 MV and Turbo - 3D Generation Gets Real-Time
Tencent updated Hunyuan 3D to 2.0 MV (MultiView) and Turbo, pushing the boundaries of 3D generation. Hunyuan 3D 2.0 surpasses SOTA in geometry, texture, and alignment, and the Turbo version achieves near real-time 3D generation – under one second on an H100! Try out the Hunyuan3D-2mv HF Space to generate your own 3D masterpieces!
MultiView (MV) is another game-changer, allowing you to input 1-4 views for more accurate 3D models. "MV allows to generate 3d shapes from 1-4 views making the 3D shapes much higher quality," I explained. The demo of generating a 3D mouse from Gemini-generated images showcased the seamless pipeline from thought to 3D object. I literally just asked Gemini with native image generation to generate a character and then fed those images straight into Hunyuan 3D.
Holodecks are getting closer, folks!
Closing Remarks and Thank You
And that's all she wrote, folks! Another week, another AI explosion. From voice to vision, open source to Big CO, this week was a whirlwind of innovation. Huge thanks again to our incredible guests, Joseph Nelson from Roboflow, Kwindla Kramer from Daily, and Lucas Atkins from Arcee! And of course, massive shoutout to my co-hosts, Wolfram, Yam, and Nisten – you guys are the best!
And YOU, the ThursdAI community, are the reason we do this. Thank you for tuning in, for your support, and for being as hyped about AI as we are. Remember, ThursdAI is a labor of love, fueled by Weights & Biases and a whole lot of passion.
Missed anything? thursdai.news is your one-stop shop for the podcast, newsletter, and video replay. And seriously, subscribe to our YouTube channel! Let's get to 1000 subs!
TL;DR and Show Notes:
* Guests and Cohosts
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-hosts - @WolframRvnwlf @yampeleg @nisten
* Sponsor - Weights & Biases Weave (@weave_wb)
* Joseph Nelson - CEO Roboflow (@josephofiowa)
* Kwindla Kramer - CEO Daily (@kwindla)
* Lucas Atkins - Labs team lead at Arcee (@LukasAtkins7)
* Open Source LLMs
* Mistral Small 3.1 24B - Multimodal (Blog, HF, HF base)
* LG open sources EXAONE and EXAONE Deep 32B thinking model (Alex Reaction Video, LG BLOG, HF)
* ByteDance releases DAPO - better than GRPO RL Method (X, Github, Paper)
* NVIDIA drops Llama-Nemotron (Super 49B, Nano 8B) with reasoning and data (X, HF, Dataset)
* Big CO LLMs + APIs
* Google makes DeepResearch free, Canvas added, Live Previews (X)
* OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)
* NVIDIA GTC recap - (X)
* This week's Buzz
* Come visit the Weights & Biases team at GTC today!
* Vision & Video
* Roboflow drops RF-DETR a SOTA vision model + new eval RF100-VL for VLMs (Blog, Github, Benchmark)
* StepFun dropped their image2video model TI2V (HF, Github)
* Voice & Audio
* OpenAI launches a new voice model and 2 new transcription models (Blog, Youtube)
* Canopy Labs drops Orpheus 3B (1B, 500M, 150M versions) - natural sounding speech language model (Blog, HF, Colab)
* NVIDIA Canary 1B/180M Flash - CC-BY speech recognition and translation models (HF)
* AI Art & Diffusion & 3D
* Tencent updates Hunyuan 3D 2.0 MV (MultiView) and Turbo (HF)
* Tools
* ARCEE Conductor - model router (X)
* Cursor ships Claude 3.7 MAX (X)
* Notebook LM teases MindMaps (X)
* Gemini Co-Drawing - using Gemini native image output for helping drawing (HF)