📅 ThursdAI - Gemma 2, AI Engineer 24', AI Wearables, New LLM leaderboard

ThursdAI - The top AI news from the past week

Focus on MMLU Pro and GPQA benchmarks

The discussion highlighted the importance of MMLU Pro, a refined version of the MMLU dataset, for evaluating model performance, and anticipated that the industry will align around it. The GPQA benchmark was also mentioned: it is a hard knowledge dataset whose questions are written by PhD-level experts in fields like biology, physics, and chemistry. Notably, human experts average around 65% on GPQA, and an AI model has surpassed that by achieving 67%, which shows how difficult this evaluation is.

Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco. In fact, I'm writing this from the incredible floor 32 presidential suite that the team here got for interviews, media and podcasting. And hey to all the new folks I've just met during the last two days!

It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! The list honestly is too long, but I got to meet friends of the pod Maxime Labonne, Wing Lian, Joao Moura (CrewAI), Vik from Moondream and Stefania Druga, not to mention the countless folks who came up, gave high fives and introduced themselves. It was honestly a LOT of fun. (And it's still not over: if you're here, please come and say hi, and let's take an LLM judge selfie together!)

On today's show, we recorded extra early because I had to run and play dress up, and boy am I relieved now that both the show and the talk are behind me, and I can go and enjoy the rest of the conference 🔥 (which I will bring you here in full once I get the recording!)

On today's show, we had the awesome pleasure to have Surya Bhupatiraju who's a research engineer at Google DeepMind, talk to us about their newly released amazing Gemma 2 models! It was very technical, and a super great conversation to check out!

Gemma 2 came out in 2 sizes, a 9B and a 27B parameter model, both with 8K context (we addressed this on the show). The 27B model's incredible performance beats Llama 3 70B on several benchmarks, and even beats Nemotron 340B from NVIDIA!

This model is also now available on the Google AI studio to play with, but also on the hub!

We also covered the renewal of the HuggingFace Open LLM Leaderboard, with new benchmarks in the mix and normalized scores, and how Qwen 2 is again the best of the models tested!

It was a very insightful conversation, so if you're interested in benchmarks, definitely give it a listen.
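For the curious, here is a minimal sketch of what "normalization of scores" can look like, assuming the rescaling works by mapping the random-guess baseline to 0 and a perfect score to 100. The function name and the baseline values are my own illustrative assumptions (25% for a 4-choice benchmark like GPQA, 10% for a 10-choice one like MMLU-Pro), not the leaderboard's actual code.

```python
def normalize_score(raw: float, baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw accuracy (in percent) so that random guessing maps to 0
    and a perfect score maps to 100.

    Hypothetical helper for illustration only; the baseline values passed in
    below are assumptions, not numbers taken from the leaderboard itself.
    """
    if not 0.0 <= baseline < max_score:
        raise ValueError("baseline must be in [0, max_score)")
    # Scores at or below random chance carry no signal, so clamp them to 0.
    if raw <= baseline:
        return 0.0
    return (raw - baseline) / (max_score - baseline) * 100.0


if __name__ == "__main__":
    # Assumed baselines: 4 answer choices -> 25%, 10 answer choices -> 10%.
    print(f"GPQA-style, raw 67%: {normalize_score(67.0, baseline=25.0):.1f}")
    print(f"MMLU-Pro-style, raw 67%: {normalize_score(67.0, baseline=10.0):.1f}")
```

Under these assumptions the same raw 67% ends up noticeably lower on the 4-choice benchmark than on the 10-choice one, which is the point of normalizing: a benchmark where guessing already gets you a quarter of the way no longer inflates the average.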

Last but not least, we had a conversation with Ethan Sutin, the co-founder of Bee Computer. At the AI Engineer speakers dinner, all the speakers received a wearable AI device as a gift, and I onboarded (cause Swyx asked me) and kinda forgot about it. On the way back to my hotel I walked with a friend and chatted about my life.

When I got back to my hotel, the app prompted me with "hey, I now know 7 new facts about you" and it was incredible to see how much of the conversation it was able to pick up, extracting facts and even TODOs!

So I had to have Ethan on the show to try and dig a little bit into the privacy and the use-cases of these hardware AI devices, and it was a great chat!

Sorry for the quick one today. If this is the first newsletter you're getting after meeting me and registering, there's usually a deeper dive here. Expect more in-depth write-ups in the next editions, as now I have to run down and enjoy the rest of the conference!

Here's the TL;DR and my RAW show notes for the full show, in case it's helpful!

* AI Engineer is happening right now in SF

* Tracks include Multimodality, Open Models, RAG & LLM Frameworks, Agents, AI Leadership, Evals & LLM Ops, CodeGen & Dev Tools, AI in the Fortune 500, GPUs & Inference

* Open Source LLMs

* HuggingFace - LLM Leaderboard v2 - (Blog)

* Old Benchmarks sucked and it's time to renew

* New Benchmarks

* MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)

* GPQA (Google-Proof Q&A Benchmark, paper). GPQA is an extremely hard knowledge dataset

* MuSR (Multistep Soft Reasoning, paper).

* MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)

* IFEval (Instruction Following Evaluation, paper)

* 🤝 BBH (Big Bench Hard, paper). BBH is a subset of 23 challenging tasks from the BigBench dataset

* The community will be able to vote for models, and we will prioritize running models with the most votes first

* Mozilla announces Builders Accelerator @ AI Engineer (X)

* Theme: Local AI

* 100K non dilutive funding

* Google releases Gemma 2 (X, Blog)

* Big CO LLMs + APIs

* UMG, Sony, Warner sue Udio and Suno for copyright (X)

* were able to recreate some songs

* sue both companies

* have 10 unnamed individuals who are also on the suit

* Google Chrome Canary has Gemini nano (X)

* Super easy to use window.ai.createTextSession()

* Nano 1 and 2, 4-bit quantized at 1.8B and 3.25B parameters, have decent performance relative to Gemini Pro

* Behind a feature flag

* Most text gen under 500ms

* Unclear re: hardware requirements

* Someone already built extensions

* someone already posted this on HuggingFace

* Anthropic Claude share-able projects (X)

* Snapshots of Claude conversations shared with your team

* Can share custom instructions

* Anthropic has released new "Projects" feature for Claude AI to enable collaboration and enhanced workflows

* Projects allow users to ground Claude's outputs in their own internal knowledge and documents

* Projects can be customized with instructions to tailor Claude's responses for specific tasks or perspectives

* "Artifacts" feature allows users to see and interact with content generated by Claude alongside the conversation

* Claude Team users can share their best conversations with Claude to inspire and uplevel the whole team

* North Highland consultancy has seen 5x faster content creation and analysis using Claude

* Anthropic is committed to user privacy and will not use shared data to train models without consent

* Future plans include more integrations to bring in external knowledge sources for Claude

* OpenAI voice mode update - not until Fall

* AI Art & Diffusion & 3D

* Fal open sourced AuraSR - a 600M upscaler based on GigaGAN (X, Fal)

* Interview with Ethan Sutin from Bee Computer

* We all got Bees as gifts

* AI Wearable that extracts TODOs, knows facts, etc.

This is a public episode. If you’d like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
