Expect More, Grounding Matters | 1min snip from ThursdAI

📅 ThursdAI - Sep 26 - 🔥 Llama 3.2 multimodal & meta connect recap, new Gemini 002, Advanced Voice mode & more AI news

ThursdAI - The top AI news from the past week

Expect More, Grounding Matters

A clear distinction exists between the general capabilities of vision-language models and the specific improvements needed in OCR and document understanding. While a performance score of 65 is deemed satisfactory, expectations remain for advancements in OCR comprehension. The grounding aspect, which pertains to the model's ability to understand the context of its decisions, is also crucial. Notably, the size of the vision component in these models is significant; the Qwen 2 VL paper indicates that adding a vision encoder contributes 700 million parameters, showing the impact of increased complexity in model architecture. This insight underscores the importance of continual enhancements in both grounding and specialized visual components.

Episode notes

Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way and then Meta announced MultiModal LlaMas and on device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!)

From the Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, which was full of news and updates, and I will do a full recap here as well.

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) Get ready, folks, because Dev Day is going to be epic!

TL;DR of all topics covered:

* Open Source LLMs

* Meta Llama 3.2 Multimodal models (11B & 90B) (X, HF, try free)

* Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)

* Allen AI releases MOLMO - open SOTA multimodal AI models (X, Blog, HF, Try It)

* Big CO LLMs + APIs

* OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI

* Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (Blog)

* This week's Buzz

* Our free course is LIVE - more than 3000 already started learning how to build advanced RAG++

* Sponsoring tonight's AI Tinkerers in Seattle; if you're in Seattle, come through for my demo

* Voice & Audio

* Meta also launches voice mode (demo)

* Tools & Others

* Project ORION - holographic glasses are here! (link)

Meta gives us new LLaMas and AI hardware

Llama 3.2 Multimodal 11B and 90B

This was by far the biggest OpenSource release of this week (tho see below, it may not be the "best"), as a rumored release finally came out, and Meta has given our Llama eyes! Coming in 2 versions (well, 4 if you count the base models, which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same and placing a vision encoder, which was trained and finetuned separately, on top.

Llama 90B is among the best open-source multimodal models available

— Meta team at launch

These new vision adapters were trained on a massive 6 Billion images, including synthetic data generation by 405B for questions/captions, and finetuned with a subset of 600M high quality image pairs.

Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today beat it on several benchmarks)

With text-only inputs, the Llama 3.2 Vision models are functionally the same as the Llama 3.1 Text models; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities.
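To make the "drop-in replacement" idea concrete, here is a toy sketch in Python of an adapter-style design. Everything in it is hypothetical: `text_model`, `VisionAdapter`, `vision_model` and the `gate` weight are illustrative stand-ins and not Meta's actual architecture or code. It only shows the property described above: the text path is left untouched, so text-only inputs produce exactly the same output as the base text model.

```python
from dataclasses import dataclass
from typing import List, Optional


def text_model(tokens: List[float]) -> List[float]:
    """Stand-in for the frozen text model (a fixed toy transform)."""
    return [2.0 * t + 1.0 for t in tokens]


@dataclass
class VisionAdapter:
    """Toy adapter that mixes one pooled image feature into each text state."""
    gate: float = 0.5  # hypothetical mixing weight, assumed for illustration

    def __call__(self, hidden: List[float], image: List[float]) -> List[float]:
        pooled = sum(image) / len(image)  # crude average "image embedding"
        return [h + self.gate * pooled for h in hidden]


def vision_model(
    tokens: List[float],
    adapter: VisionAdapter,
    image: Optional[List[float]] = None,
) -> List[float]:
    """Text model plus an optional vision adapter applied on top."""
    hidden = text_model(tokens)  # the text weights are never modified
    if image is None:
        return hidden  # text-only path: identical to the base text model
    return adapter(hidden, image)


adapter = VisionAdapter()
text_only = vision_model([1.0, 2.0], adapter)
with_image = vision_model([1.0, 2.0], adapter, image=[2.0, 4.0])
print(text_only == text_model([1.0, 2.0]))
print(with_image)
```

The design choice this illustrates is that adding vision on top, instead of retraining the text weights, keeps existing text behavior stable for anyone swapping the model in.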

Seems like these models don't support multi-image or video either (unlike Pixtral, for example), nor tool use with images.

Meta will also release these models on meta.ai and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps 🤯 which marks them as the leading AI services provider in the world now.

Llama 3.2 Lightweight Models (1B/3B)

The additional and maybe more exciting thing that we got from Meta was the introduction of the small/lightweight models of 1B and 3B parameters.

Trained on up to 9T tokens, and distilled / pruned from larger models, these are aimed for on-device inference (and by device here we mean from laptops to mobiles to soon... glasses? more on this later)

In fact, Meta released an iOS demo that runs these models, takes a group chat, summarizes it and calls the calendar tool to schedule based on the conversation, and all this happens on device without the info leaving to a larger model.

They have also been able to prune down the Llama Guard safety model they released to under 500MB and have had demos of it running on the client side and hiding user input on the fly as the user types something bad!

Interestingly, here too, the models were not SOTA, even in the small category, with tiny models like Qwen 2.5 3B beating these models on many benchmarks, but they are outlining a new distillation / pruning era for Meta as they aim for these models to run on device, eventually even glasses (and some said Smart Thermostats)

In fact they are so tiny that the community quantized and released them, and I was able to download these models, all while the keynote was still going! Here I am running the Llama 3B during the developer keynote!

Speaking AI - not only from OpenAI

Zuck also showcased a voice based Llama that's coming to Meta AI (unlike OpenAI it's likely a pipeline of TTS/STT) but it worked really fast and Zuck was able to interrupt it.

And they also showed a crazy animated AI avatar of a creator, that was fully backed by Llama, while the human creator was on stage, Zuck chatted with his avatar and reaction times were really really impressive.

AI Hardware was glasses all along?

Look, we've all seen the blunders of this year, the Humane AI Pin, the Rabbit R1 (which sits on my desk and I haven't recharged in two months) but maybe Meta is the answer here?

Zuck made a bold claim that glasses are actually the perfect form factor for AI: they sit on your face, see what you see and hear what you hear, and can whisper in your ear without disrupting the connection between you and your conversation partner.

They haven't announced new Meta Raybans, but did update the lineup with a new set of transition lenses (to be able to wear those glasses inside and out) and a special edition clear case pair that looks very sleek + new AI features like memories to be able to ask the glasses "hey Meta where did I park" or be able to continue the conversation. I had to get me a pair of these limited edition ones!

Project ORION - first holographic glasses

And of course, the biggest announcement of the Meta Connect was the super secret decade old project of fully holographic AR glasses, which they called ORION.

Zuck introduced these as the most innovative and technologically dense set of glasses in the world. They always said the form factor will become just "glasses" and they actually did it (a week after Snap Spectacles) tho those are not going to get released to anyone any time soon, hell they only made a few thousand of these and they are extremely expensive.

With 70 deg FOV, cameras, speakers and a compute puck, these glasses pack a full day battery with under 100 grams of weight, and have custom silicon, custom displays with a MicroLED projector and just... tons more innovation in there.

They also come in 3 pieces, the glasses themselves, the compute wireless pack that will hold the LLaMas in your pocket and the EMG wristband that allows you to control these devices using muscle signals.

These won't ship as a product tho, so don't expect to get them soon, but they are real, and will allow Meta to build the product that we will get on top of these by 2030.

AI use cases

So what will these glasses be able to do? Well, they showed off a live translation feature on stage that mostly worked, where you just talk and listen to another language in near real time, which was great. There are a bunch of mixed reality games, you'd be able to call people and see them in your glasses on a virtual screen and soon you'll show up as an avatar there as well.

The AI use-case they showed beyond just translation was MultiModality stuff, where they had a bunch of ingredients for a shake, and you could ask your AI assistant, which shake you can make with what it sees. Do you really need

I'm so excited for these to finally come to people that I screamed in the audience 👀👓

OpenAI gives everyone* advanced voice mode

It's finally here, and if you're paying for ChatGPT you know this, the long announced Advanced Voice Mode for ChatGPT is now rolled out to all Plus members.

The updates since the beta are 5 new voices (Maple, Spruce, Vale, Arbor and Sol) and, finally, access to custom instructions and memory, so you can ask it to remember things and also to know who you are and your preferences (try saving your jailbreaks there)

Unfortunately, as predicted, by the time it rolled out to everyone, this feels way less exciting than it did 6 months ago, the model is way less emotional, refuses to sing (tho folks are making it anyway) and generally feels way less "wow" than what we saw. Less "HER" than we wanted for sure. Seriously, they nerfed the singing! Why OpenAI, why?

Pro tip of mine that went viral: you can set your action button on the newer iPhones to immediately start the voice conversation with 1 click.

*This new mode is not available in EU

This week's Buzz - our new advanced RAG++ course is live

I had an awesome time with my colleagues Ayush and Bharat today, after they finally released a FREE advanced RAG course they've been working so hard on for the past few months! Definitely check out our conversation, but better yet, why don't you enroll in the course? It's FREE and you'll get to learn about data ingestion, evaluation, query enhancement and more!

New Gemini 002 is 50% cheaper, 2x faster and better at MMLU-pro

It seems that every major lab (besides Anthropic) released a big thing this week to try and get under Meta's skin?

Google announced an update to their Gemini Pro/Flash models, called 002, which is a very significant update!

Not only are these models 50% cheaper now (Pro price went down by 50% on <128K context lengths), they are 2x faster on outputs with 3x lower latency on first tokens. It's really quite something to see.
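To get a feel for what those multipliers mean in practice, here is a small back-of-the-envelope calculation in Python. The baseline price, first-token latency and output speed below are made-up placeholder numbers, not Google's published figures; only the three ratios (50% cheaper, 2x faster output, 3x lower first-token latency) come from the announcement as described above. The simple model assumed here is: total response time = time to first token + output tokens / output speed.

```python
def response_time(first_token_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest."""
    return first_token_s + output_tokens / tokens_per_s


# Hypothetical baseline values, chosen only to make the arithmetic easy to follow.
old_price_per_m = 4.0   # dollars per 1M tokens (placeholder)
old_first_token = 1.5   # seconds to first token (placeholder)
old_speed = 50.0        # output tokens per second (placeholder)

# Ratios taken from the announcement as summarized in the text.
new_price_per_m = old_price_per_m * 0.5   # 50% cheaper
new_first_token = old_first_token / 3     # 3x lower first-token latency
new_speed = old_speed * 2                 # 2x faster output

old_time = response_time(old_first_token, old_speed, output_tokens=500)
new_time = response_time(new_first_token, new_speed, output_tokens=500)

print(f"price per 1M tokens: {old_price_per_m} -> {new_price_per_m}")
print(f"500-token response:  {old_time}s -> {new_time}s")
```

Under these placeholder numbers, a 500-token answer goes from 11.5 seconds to 5.5 seconds, so the combined effect is a bit better than 2x for responses of this length, because the first-token wait shrinks by more than the streaming time does.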

The new models have also improved scores, with the Flash models (the super cheap ones, remember) from September, now coming close to or beating the Pro scores from May 2024!

Definitely a worthy update from the team at Google!

Hot off the press, the folks at Google Labs also added a feature to the awesome NotebookLM that allows it to summarize over 50h of YouTube videos in the crazy high quality Audio Overview feature!

That's it for the week, we of course chatted about way way more during the show, so make sure to listen to the podcast this week, but otherwise, signing off for this week, as I travel back home for a weekend, before returning to SF for the OpenAI dev day next week!

Expect full Dev Day coverage live next Tuesday and a recap on the newsletter.

Meanwhile, if you've already subscribed, please share this newsletter with one or two people who are interested in AI 🙇‍♂️ and see you next week.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
