Gemini Ultra is Google's answer to GPT-4, released roughly 10 months and 3 weeks after it. It offers comparable language generation capabilities and is heavily moderated. While its coding performance is good, it may struggle with longer context lengths. Google has also released iOS and Android apps, integrating Gemini with Google Assistant on Android devices. Pricing is $20 per month with a two-month free trial period.
OpenAI has added provenance metadata to DALL-E images to indicate they were generated with DALL-E 3; the information is embedded in the image metadata. API keys now allow per-key restrictions, improving security and control: developers can limit each key to specific usage, a welcome developer-experience improvement.
Bria AI has released RMBG v1.4, a background removal model that makes it easy to strip backgrounds from images right in the browser. This model, together with the efforts behind Transformers.js, lets users remove backgrounds directly on their own devices, providing an efficient and accessible tool for image editing.
Weights & Biases recently held a build week for NLP and Vision projects. The company's growth ML team shared what they learned during the event. They discussed various models, including Gemini Ultra and OpenAI's offerings, highlighting the strengths and limitations of each. The team also explored the latest developments in project deployment, code analysis, and background removal. The live show is available on Weights & Biases' YouTube and LinkedIn channels.
DSPy is a framework for optimizing prompts and pipelines built on language models, while ColBERT is a retrieval model that enables improved document similarity matching. DSPy focuses on programmatically tuning prompts and model calls, while ColBERT enhances the retrieval step itself. Both offer significant benefits in performance and generalization, allowing for more efficient and accurate retrieval of relevant documents.
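To make the DSPy idea a bit more concrete, here is a minimal sketch of declaring a task as a signature and letting a module drive the prompting, following DSPy's documented pattern at the time; the exact class and configuration names may have shifted between versions, so treat this as illustrative rather than definitive.

```python
# pip install dspy-ai
import dspy

# Point DSPy at an LM; the model name here is just an example.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# A signature declares *what* the program should do, not the exact prompt wording.
class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short answer")

# ChainOfThought turns the signature into a prompting strategy; DSPy's
# optimizers can later tune the prompt and few-shot demos against a metric.
qa = dspy.ChainOfThought(AnswerQuestion)
print(qa(question="What does ColBERT store for each document?").answer)
```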
ColBERT provides advantages in retrieval speed and generalization compared to traditional single-vector approaches. It stores an embedding per token, enabling more accurate comparisons between query and document tokens. This leads to fast, near-instant retrieval, even for large document sets. ColBERT also generalizes better, particularly to specific domains, because it avoids squeezing all of a document's information into a single vector. Using ColBERT for retrieval can significantly enhance search performance and user experience.
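To illustrate the per-token idea, here is a minimal sketch (not ColBERT's actual code) of the late-interaction "MaxSim" scoring it is built on: every query token embedding is compared against every document token embedding, the best match per query token is kept, and the sum becomes the document's relevance score. The shapes and random embeddings below are stand-ins for illustration.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score, ColBERT-style.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    # For each query token, keep only its best-matching document token...
    best_per_query_token = sim.max(axis=1)   # (num_query_tokens,)
    # ...and sum those maxima to get the document's score for this query.
    return float(best_per_query_token.sum())

# Toy example with random stand-in embeddings.
rng = np.random.default_rng(0)

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = normalize(rng.normal(size=(8, 128)))    # 8 query tokens
doc_a = normalize(rng.normal(size=(120, 128)))  # 120 document tokens
doc_b = normalize(rng.normal(size=(80, 128)))

print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```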
To get started, RAGatouille is a user-friendly library that simplifies using ColBERT for retrieval. It provides code examples and notebooks for easy integration. Starting with re-ranking is a simple way to experiment with ColBERT before committing to indexing an entire document collection. RAGatouille can also be integrated with popular frameworks like LangChain, allowing for seamless adoption, and the RAGatouille library and community provide excellent resources and support for anyone venturing into ColBERT and document retrieval.
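Here is a minimal sketch of what "starting with re-ranking" could look like with RAGatouille, assuming the RAGPretrainedModel entry point and the public ColBERTv2 checkpoint; method names and arguments may differ between versions, so check the library's own examples before relying on this.

```python
# pip install ragatouille
from ragatouille import RAGPretrainedModel

# Load a pretrained ColBERT checkpoint (checkpoint name assumed here).
rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Candidates from any existing first-stage retriever (BM25, dense, etc.).
candidates = [
    "ColBERT stores one embedding per token instead of one per document.",
    "RAGatouille wraps ColBERT so it can be used in a few lines of code.",
    "Unrelated text about cooking ratatouille.",
]

# Re-rank the candidates against the query using late interaction.
results = rag.rerank(
    query="How does ColBERT represent documents?",
    documents=candidates,
    k=3,
)
for result in results:
    print(result)
```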
While DSPy and ColBERT serve different purposes, there is a connection between the two. DSPy can optimize the prompts and models used around retrieval, letting users tune pipelines built on retrieval models like ColBERT for maximum performance. By leveraging both, users can achieve better retrieval results and improve the overall effectiveness of their document similarity matching systems.
Hihi, this is Alex, from Weights & Biases, coming to you live, from Yosemite! Well, actually I'm writing these words from a fake virtual Yosemite that appears above my kitchen counter, as I'm now a Vision Pro user, and I will force myself to work inside this thing and tell you if it's worth it. I will also be on the lookout for anything AI related in this new spatial computing paradigm, like THIS for example!
But back to reality for a second, we had quite the show today! We had the awesome pleasure of having Junyang Justin Lin, a dev lead at Alibaba, join us and talk about Qwen 1.5 and QwenVL, and then we had a deep dive into quite a few acronyms I've been seeing on my timeline lately, namely DSPy, ColBERT and (the funniest one) RAGatouille, and we had a chat with Connor from Weaviate and Benjamin, the author of RAGatouille, about what it all means! Really really cool show today, hope you don't only read the newsletter but listen on Spotify, Apple or right here on Substack.
TL;DR of all topics covered:
* Open Source LLMs
* Alibaba releases a BUNCH of new QWEN 1.5 models including a tiny .5B one (X announcement)
* Abacus fine-tunes Smaug, now top of the HF leaderboard, based on Qwen 72B (X)
* LMsys adds more open source models, sponsored by Together (X)
* Jina Embeddings fine tune for code
* Big CO LLMs + APIs
* Google rebranding Bard to Gemini and launching Gemini Ultra (Gemini)
* OpenAI adds image metadata (Announcement)
* OpenAI keys are now restricted per key (Announcement)
* Vision & Video
* Bria - RMBG 1.4 - Open Source BG removal that runs in your browser (X, DEMO)
* Voice & Audio
* Meta voice, a new apache2 licensed TTS - (Announcement)
* AI Art & Diffusion & 3D
* Microsoft added DALL-E editing with "designer" (X thread)
* Stability AI releases update to SVD - video 1.1 launches with a webUI, much nicer videos
* Deep Dive with Benjamin Clavié and Connor Shorten show notes:
* Benjamin's announcement of RAGatouille (X)
* Connor chat with Omar Khattab (author of DSPy and ColBERT) - Weaviate Podcast
* Very helpful intro to ColBert + RAGatouille - Notion
Open Source LLMs
Alibaba releases Qwen 1.5 - ranges from .5 to 72B (DEMO)
With 6 sizes, including 2 novel ones, from as little as a .5B parameter model, to an interesting 4B, all the way to a whopping 72B, Alibaba open sources additional Qwen checkpoints. We had the honor of having friend of the pod Junyang Justin Lin on again; he talked to us about how these sizes were selected, noted that even though this model beats Mistral Medium on some benchmarks, it remains to be seen how well it performs on human evaluations, and shared a bunch of details about open sourcing it.
The models were released with all the latest and greatest quantizations, significantly improved context length (32K) and support for both Ollama and LM Studio (which I helped make happen and am very happy with the way the ThursdAI community is growing and connecting!)
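If you'd rather load a checkpoint directly, here is a hedged sketch of running the tiny 0.5B chat model with Hugging Face transformers; the checkpoint name and chat-template flow follow the usual pattern for these releases (Qwen 1.5 needs a fairly recent transformers, roughly 4.37+), so double-check the model card before copying this.

```python
# pip install "transformers>=4.37" accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-0.5B-Chat"  # the tiny chat checkpoint from the 1.5 release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat prompt using the model's own chat template.
messages = [{"role": "user", "content": "Give me one fun fact about Yosemite."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and print only the newly generated tokens.
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```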
We also had a chat about QwenVL Plus and QwenVL Max, their API-only versions of their best vision-enabled models (not open sourced yet), and had the awesome Piotr Skalski from Roboflow on stage to chat with Junyang about those models!
To me, a success of ThursdAI is when the authors of the things we talk about come on the show, and this is Junyang's second appearance. He joined at midnight, at the start of the Chinese New Year, so it's greatly appreciated, and def. give him a listen!
Abacus Smaug climbs to top of the hugging face leaderboard
Junyang also mentioned that Smaug, from Abacus, is now at the top of the leaderboards. It's a finetune of the previous Qwen-72B, not even this new one, and the first model to achieve an average score of 80. An impressive appearance from Abacus; they haven't released any new data yet, but they said they are planning to!
They also said that they are planning to finetune Miqu, which we covered last time, the leak from Mistral that was acknowledged by Arthur Mensch, the CEO of Mistral.
The techniques that Abacus used to finetune Smaug will be released in an upcoming paper!
Big CO LLMs + APIs
Welcome Gemini Ultra (bye bye Bard)
Bard is no more, get ready to meet Gemini. It's really funny, because we keep getting confusing naming from huge companies like Google and Microsoft. Just a week ago, Bard with Gemini Pro shot up the LMSYS charts, after the regular Gemini Pro API wasn't as close, and now we're supposed to forget that Bard ever existed? 🤔
Anyhow, here we are, big G's answer to GPT-4, exactly 10 months, 3 weeks, 4 days and 8 hours later, but who's counting?
So what do we actually get? A $20/month advanced tier called Gemini Advanced (which runs Ultra 1.0; the naming confusion continues). We get a longer context (how much?) plus iOS and Android apps (though I couldn't find it on iOS, maybe it hasn't rolled out yet).
Gemini now also replaces Google Assistant for Android users who opt in (MKBHD was somewhat impressed, but not super impressed), and Google is leaning into their advantages, including home support!
* Looks like Gemini is ONLY optimized for English as well
We had quite the conversation on stage with folks who upgraded and started using it, including noticing that Gemini is a better role-player and less bland, but also that it doesn't yet support uploading documents besides images, and that the context window is very limited; some said 8K and some 32K, but definitely on the lower side.
Also from Google: a llama.cpp wrapper called localllm (Blog)
OpenAI watermarks DALL-E images and adds per key API limits (finally) (Blog)
OpenAI's using something called C2PA for pictures made by DALL-E 3, whether you're chatting with ChatGPT or using their API. It's a way to show that DALL-E 3 actually created those images. But it's just for images right now, not for text or voice stuff. Adding this info can make the files up to 32% bigger, but it doesn't mess with the quality. The tags tell you if the source was DALL-E 3, ChatGPT, or the API by including special signatures and stuff. Just a heads up, though, this C2PA thing isn't perfect. The metadata could get wiped either on purpose or by mistake.
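Since the provenance data lives in the file's metadata, you can poke at it yourself. A minimal sketch, assuming you have exiftool installed, that dumps an image's metadata and looks for anything C2PA-ish; the exact tag names OpenAI writes are an assumption here, and a dedicated C2PA verifier will give a more reliable answer than grepping tags:

```python
import json
import subprocess

def inspect_provenance(path: str) -> None:
    # Dump every metadata tag exiftool can find, as JSON.
    raw = subprocess.run(
        ["exiftool", "-json", path], capture_output=True, text=True, check=True
    ).stdout
    tags = json.loads(raw)[0]

    # Keep anything that looks like C2PA / Content Credentials provenance data.
    hits = {key: value for key, value in tags.items()
            if any(s in key.lower() for s in ("c2pa", "jumbf", "claim", "credential"))}

    if hits:
        print("Possible provenance metadata found:")
        for key, value in hits.items():
            print(f"  {key}: {value}")
    else:
        print("No C2PA-looking metadata found (it may have been stripped).")

inspect_provenance("dalle_image.png")  # hypothetical file name
```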
They also released an update to the developer experience that allows you to track usage but also restrict usage per API key! Very very needed and helpful!
This week's Buzz (What I learned with WandB this week)
First part of the live series with the Growth ML team was live and AWESOME!
Vision
BRIA - Open-Source background removal (non commercial)
📷 Introducing Open-Source Background Removal by @BriaAI 📷 Now live on @huggingface, RMBG v1.4 excels in separating foreground from background across diverse categories, surpassing current open models. See demo [https://t.co/DDwncjkYqi] #BriaAI #OpenSource #AI @briaai
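The demo runs in the browser via Transformers.js, but the same checkpoint can be driven from Python too. A minimal sketch following the model card's pipeline usage (the trust_remote_code flag is needed because RMBG ships custom pipeline code; treat the exact arguments as an assumption and check the briaai/RMBG-1.4 card):

```python
# pip install transformers torch pillow
from transformers import pipeline

# RMBG-1.4 ships its own pipeline code, hence trust_remote_code=True.
remover = pipeline(
    "image-segmentation",
    model="briaai/RMBG-1.4",
    trust_remote_code=True,
)

# Returns the input image with the background removed.
no_bg = remover("portrait.jpg")   # hypothetical local file
no_bg.save("portrait_no_bg.png")

# Per the model card, you can also ask for just the foreground mask instead.
mask = remover("portrait.jpg", return_mask=True)
mask.save("portrait_mask.png")
```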
Voice
MetaVoice (hub)
1.2B parameter model. Trained on 100K hours of data. Supports zero-shot voice cloning. Short & long-form synthesis. Emotional speech. Best part: Apache 2.0 licensed. 🔥
Powered by a simple yet robust architecture: EnCodec (Multi-Band Diffusion) and a GPT + Encoder Transformer LM, with DeepFilterNet to clear up MBD artefacts.
That's it for us this week. This time I bring you both the news segment AND the deep dive in one conversation, hope it's not super long, see you here next ThursdAI! 👏
Full Transcript:
[00:00:00] Intro and housekeeping
[00:00:00]
[00:00:00] Alex Volkov: You're on ThursdAI, and I think it's time for us to get started with the recording and the introduction.
[00:00:26] Alex Volkov: Happy, happy Thursday everyone! Today is February 8th, 2024. I don't know, this is the second calendar year that ThursdAI is happening in, so I don't know if I need to mention the year or not, but we're well on our way into 2024 and you're here on ThursdAI. ThursdAI is the space, the newsletter, and the podcast to keep you up to date with all of the very interesting things that are happening in the very fast moving world of AI.
[00:00:58] Alex Volkov: Hopefully by now, all of you already have ThursdAI in your podcast app, wherever you get your podcasts, Spotify, recently YouTube as well, which is weird. But with this introduction, I will just say hello myself, basically. Hey everyone. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases.
[00:01:15] Alex Volkov: Weights & Biases is the reason why this comes to life to you. And there's going to be a little segment about Weights & Biases in the middle here as well, and I'm joined on stage. Often, and pretty much every week by great friends, experts in their fields. As we talk about everything AI related this week, especially we're going to have some interesting things.
[00:01:34] Alex Volkov: Those of you who come back week after week, thank you, and we love that you're part of the community, and it's great to see how many people just return, and those of you who are new, we're here every week and the community doesn't stop after we finish the space. There's a bunch of spaces. I think our friend AlignmentLab had a space that went on for the full week, I think.
[00:01:55] Alex Volkov: I don't know if he ever slept. That's maybe why he's not here on stage. But we're here every week for the two hours to give you updates for the first hour and definitely some very interesting deep dives that have been happening for the past few weeks, I want to say, so I just want to shout out some friends of ours that were recently featured in the deep dives.
[00:02:16] Alex Volkov: We've talked with Maxime Labonne, who trained the Beagle series and then also gave a deep dive with us about model merging. That was really fun. And on the last deep dive, we talked with the Lilac folks and they're building an open source tool that lets you peer into huge data sets, like imagine millions of rows, data sets, and they chunk and cluster this. And we've talked about the importance of data sets in the creation of LLMs or large language models.
[00:02:46] Alex Volkov: And they've taken the huge data sets of the folks who usually come up on ThursdAI. Teknium from Nous Research just released their Hermes dataset, for example. And the folks at Lilac talked to us about how that would be visualized and how you can see what parts it's comprised of.
[00:03:03] Alex Volkov: It's quite an interesting conversation about how to approach the training and fine tuning area. And we haven't often talked about dataset curation and creation, so that conversation was a very nice one. So we have deep dives. I will say that last weekend, I also interviewed, and that's probably going to come up as a separate episode.
[00:03:24] Alex Volkov: I interviewed Sasha Zhadan from Moscow, and this was a first for me. And I just want to like, highlight where this weird thing takes me, because that's not ThursdAI, and that's not about the news. That was just literally about AI stuff. So this guy from Moscow, and this will be dropping on ThursdAI podcast soon.
[00:03:42] Alex Volkov: This guy from Moscow built a bot that auto swipes for him on Tinder. And that bot started using GPT instruct, and then moved to ChatGPT, etc., and then moved to GPT-4. And he talks about how this bot kept improving with the improvement of AI. And then he autoswiped a wife, basically. And then this was, this took over the Russian X.
[00:04:08] Alex Volkov: I don't know if you guys are on the Russian side of X, but I definitely noticed that everybody, that's all they could talk about. This guy previously also did some shenanigans with OpenAI stuff. And so it was a very interesting conversation, unlike anything that I did previously on ThursdAI.
[00:04:21] Alex Volkov: And definitely that's coming more as a human interest story than anything else. But it's very interesting. And also his fiance also joined and we talked about the morality of this as well. And it was really fun. So if that kind of new type of content also interests you definitely check out.
[00:04:37] Alex Volkov: That's probably not going to end up on X.
[00:04:40] Alex Volkov: And I think with this, it's time to get started. , The usual way we get started here is I just run through everything that we have. Just so you know what we're going to talk about.
[00:04:52] Alex Volkov: And then we're going to start with segment by segment. So that's
[00:04:54] TL;DR and recap of the conversation
[00:04:54] Alex Volkov: Hey everyone, this is a recap of everything we talked about on ThursdAI for February 8th, 2024, and we had a bunch of breaking new stuff today, specifically around the fact that Google finally gave us something. But I'm gonna do this recap properly based on the categories. So let's go. So in the category of open source LLMs, we've talked about Alibaba releasing a bunch of new Qwen models, specifically under the numbering 1.5.
[00:05:33] Alex Volkov: And we had the great pleasure again to talk with Junyang Justin Lin from the Qwen team, the guy who's a tech lead there and pushes for open source. And he came up and talked about why this is a 1.5 model, not a 2 model. He also talked about the fact that they released a tiny 0.5
[00:05:51] Alex Volkov: billion one. This is like a very tiny large language model. I think it's really funny to say a tiny large language model, but this is the case. And he talked about multiple releases for Qwen. We also had, friend of the pod, Piotr Skalski from Roboflow, who's like a vision expert who comes up from time to time, and the author of I forget the name of the library.
[00:06:12] Alex Volkov: I will remember this and put this in the show notes as well. He came up and he had a bunch of plays with the vision part of the Qwen ecosystem, and we've talked about QwenVL Plus and QwenVL Max with Justin as well, and we've talked about their potential for open sourcing these models. They also released a 72 billion parameter model that's now part of the top of the Hugging Face leaderboard, which is super cool.
[00:06:34] Alex Volkov: So definitely a great conversation. And I love it when the authors of the things that we talk about come out and talk about them on ThursdAI. We then smoothly moved to the next topic, where Abacus, the company Abacus AI, has a finetune that's now top of the Hugging Face leaderboard, and that's based on Qwen 72B, and not even the new one, the previous one, so 1.0,
[00:06:54] Alex Volkov: and that's now the top model on the Hugging Face leaderboard, and that has an average score of over 80. And I think it's the first open source model to do that, and they haven't fully released the process of what they used in order to make this much better on different leaderboards. But they have mentioned that they're going to train this on top of Miqu or Mixtral.
[00:07:17] Alex Volkov: And it's very interesting. And they're also building some other stuff at Abacus as well. Very interesting. And then we moved to talk about LMSYS Arena. LMSYS Arena is the place that we send you to, to see which models users prefer, versus just the benchmarks and evaluations on Hugging Face.
[00:07:35] Alex Volkov: LMSYS Arena added a bunch of open source models, so shout out OpenChat again. They added another Hermes, the finetune that Teknium did for Hermes on top of Mixtral, and they also added a bunch of Qwen versions as well. LMSYS adds open source, so you can continuously see which models are better and don't have to judge for yourself, because sometimes it's not very easy.
[00:07:55] Alex Volkov: We also covered Jina embeddings that are fine tuned for code, from the company Jina AI; their representative Bo Wang has come on before and he's a friend of the pod, and we talked about their embeddings for code. Bo didn't show up this time, but maybe next time as well. Then we moved to big companies' LLMs and APIs, and definitely the conversation turned interesting, where multiple folks here on stage paid the new $20 tax, let's say, for the rebranded Bard, now called Gemini, and the launch of Gemini Ultra.
[00:08:25] Alex Volkov: And we've talked about how long we've waited for Google to actually give us something like this. And now we're getting Gemini Ultra and Bard is no more; Bard is essentially dead as a brand, and now we're getting the Gemini brand. So if you used to go to Bard, now you go to Gemini, but also the brain behind this also improved.
[00:08:41] Alex Volkov: So you get Gemini Pro by default for free, I think, and Gemini Ultra is going to cost you 20 bucks a month. It's free for the next two months, so you can sign up for a trial, and then you'll get Gemini Ultra. And you'll get it not only in the web interface, you also get it in iOS and Android apps. And if you're on Android, it also integrates with the Android Assistant.
[00:09:00] Alex Volkov: That's pretty cool. It has a context length of not very much, I think we said 8 or 16K or so, and some folks contested this in the comments, so we're still figuring out the context length, and it looks like the context length is restricted in the UI, less so on the API side, and Gemini Ultra did not release an API yet.
[00:09:17] Alex Volkov: So we've talked about Gemini Ultra and different things there. We also covered that OpenAI adds image metadata to all DALL-E generations, whether through the UI or through the API. This image metadata can be stripped, so it's not a watermark per se, but it's definitely helpful. And OpenAI also gives us a little bit of a developer experience thing where you can restrict,
[00:09:36] Alex Volkov: per API key, different possibilities. So if one key gets stolen, you can lock only that one, or you can restrict it to only like a specific use as well. In the vision and video category, we've talked about the new model for background removal called RMBG from Bria AI. It's not a fully commercial license, but you can play with this now.
[00:09:57] Alex Volkov: There's a demo I'm going to add to the show notes. And also it runs fully on your client via the efforts of friend of the pod Xenova from Transformers.js. And it's pretty cool to have a model that removes backgrounds, like, with two clicks, with no backend, no servers. And in the voice and audio category, we talked about MetaVoice, a new
[00:10:14] Alex Volkov: Apache 2 licensed text-to-speech model, not from Meta, even though it's called MetaVoice, and it's funny, it's pretty decent and has zero-shot voice cloning, which means that you can provide a piece of your voice and fairly quickly get your voice speaking back to you, generated. And we also talked about breaking news from NVIDIA AI, something called NeMo Canary 1B, which is an ASR model, Automatic Speech Recognition model, that's now top of the leaderboards on Hugging Face, and it beats Whisper on everything, including specifically for four languages.
[00:10:48] Alex Volkov: It's trained on 85,000 hours of annotated audio, and it's a very fast conformer encoder as well. We barely covered this, but Microsoft added DALL-E editing with the designer. So if you remember, Microsoft also did a rebrand. It used to be called Bing Chat, and now it's called Copilot.
[00:11:07] Alex Volkov: And that Copilot now adds capabilities that don't exist in other places, like ChatGPT with DALL-E. So Microsoft's DALL-E is now involving the Designer thing, and they have cool things where you can edit images on the fly; you can click on different segmented objects from your generated image and say, hey, redo this in a different style.
[00:11:27] Alex Volkov: The video for this is super cool. I'm going to add this in the show notes. And it's very interesting to see that Microsoft, with their Copilots, is moving beyond where the capabilities for ChatGPT exist. We also barely, briefly mentioned and glanced through this, but Stability AI released an update to Stable Video Diffusion, including a web UI that you can use now; it's not only a model, it's a web UI as well, and that web UI is pretty cool. If you didn't get access to it, I'll link it in the show notes, I think it's now possible to register. Much nicer videos, and obviously it's in the open source.
[00:11:59] Alex Volkov: as much as possible. So super cool. But the web UI shows you other people's video attempts. You can actually use their prompts to create videos of your own. They have some controls. It's very nice. Then I think we talked a little bit at the end there about Vision Pro and my experience with this as it comes to AI.
[00:12:15] Alex Volkov: We didn't dive into Vision Pro, even though this is my new, this is my new toy in life. And I'm very happy to participate in the renaissance of spatial computing. And we covered the intersection of AI and spatial computing. And I think the very interesting part of today's ThursdAI was thanks to two new guests, Benjamin Clavié and Connor from Weaviate, and we've talked about DSPy and ColBERT, and RAGatouille, which is a library to use ColBERT embeddings.
[00:12:43] Alex Volkov: And we talked about what they mean, and this was a great learning kind of experience for me. And if you see these concepts on your timeline and you have no idea what we talked about, I basically played the role of, hey, I'm the village dummy, let's say. I'm gonna re ask the question about what this means, why should we use this as well.
[00:13:01] Alex Volkov: And I think this is our show today, folks. This is the quick summary. If I missed anything super big and important, please let me know.
[00:13:08] Open source LLMs and AI news
[00:13:08] Alex Volkov: But otherwise, I think we'll start with open source. All right, welcome to the open source corner. And I guess because the tradition of ThursdAI is Something releases, I go in the comments and say, Hey, I'm going to talk about this on ThursdAI. Do you want to join? And sometimes people say yes. And this is how we met Justin or Junyang here on stage. Junyang is the dev lead for the Qwen team and welcome Junyang.
[00:13:50] Alex Volkov: It's very late where you are. So I really appreciate your time here. Please feel free to unmute and introduce yourself again. Some folks already know you, but if in case some new folks are listening to us, feel free to introduce yourself. And then let's talk about the stuff that you released.
[00:14:06] New Qwen models 1.4 from Alibaba
[00:14:06] Junyang Lin: Yeah. Thanks Alex. Nice to be at ThursdAI, it's a very great program for us to talk about AI. I am Junyang and you can call me Justin. I'm working in the team for the LLM and LMM. And we are now working on the new LLM, Qwen 1.5, and we are also upgrading our vision language model, QwenVL, to QwenVL Plus and Max.
[00:14:33] Junyang Lin: Plus and Max are not open sourced yet, but we have demos, so you can try them in our HuggingFace organization; you can find our demos and you can try Plus and Max. And Max is the best one, and I am very confident with the Max demo. And about our language model, today, actually this week, we are open sourcing Qwen 1.5.
[00:14:58] Junyang Lin: Maybe previously you have noticed the Qwen2 code inside Hugging Face Transformers. Yeah, we are moving to new code for you to use our Qwen models, because in the past few months I have been interviewing our users and they found some problems with using our code, the original Qwen code, so I'm moving a step forward.
[00:15:23] Junyang Lin: So this is why we had the Qwen2 code, but for the models themselves, actually, in our judgment we are still at 1.5, not 2 yet. We're still training the real Qwen 2, so this time we have Qwen 1.5. For Qwen 1.5 we are actually fixing a lot of problems, because there are some models, like 7 billion and 14 billion, that a lot of people are using, but they are actually quite old.
[00:15:50] Junyang Lin: They were released months ago. They have some problems; for Qwen 14 billion, it actually only supports around 2 to 4K context length, which is far from enough for a lot of users. So this time, we have upgraded all models to support 32,000 tokens. And for the sizes, we have released more sizes.
[00:16:15] Junyang Lin: Previously, we had 1.8, which is the smallest one. But this time, we have 0.5, only 0.5. I used to think this one is just for experimental usage, but there are some users on Twitter who found that 0.5 can still be used to do something, so if you have any comments on 0.5 you can share the comments with me. And we also have 4 billion, which is between 1.8
[00:16:46] Junyang Lin: and 7 billion. The reason why we have 4 billion is that actually when we first released 1.8 billion, it was actually popular, because they would like to deploy the small model to some devices like cell phones, but they found just 1.8 is not good enough for their applications.
[00:17:07] Junyang Lin: So they want something just smaller than 7 billion, but much better than 1.8. So we have 4 billion. Yeah. We have a wide range of sizes. These are for you to choose. And,
[00:17:19] Alex Volkov: six, six models overall, Junyang?
[00:17:22] Junyang Lin: Yeah. Six
[00:17:23] Alex Volkov: Six sizes overall, but definitely more models than this, because you also released, I think for the first time, you released quantized versions as well, correct?
[00:17:32] Junyang Lin: No, but previously we have released GPTQ,
[00:17:35] Alex Volkov: Oh yeah.
[00:17:35] Junyang Lin: as our convention, but this time I also have AWQ and also GGUF. Maybe GGUF is the new one; admittedly, previously I didn't know too much about AWQ and GGUF. This time I tried and everything is okay. So I just released the AWQ and GGUF.
[00:17:52] Junyang Lin: And GGUF is the new thing for me. But it is quite popular in the community, like LM Studio, like you introduced to me, and I found a lot of people using GGUF, they use it in Ollama. So I collaborated with Ollama, so you can now just run one line of code, like "ollama run qwen". So you can use the Qwen models with Ollama and you can also use them in LM Studio.
[00:18:15] Alex Volkov: I just wanna
[00:18:16] Junyang Lin: No
[00:18:16] Alex Volkov: just a tiny pause here, because I think, first of all, to highlight the importance of this community: you guys are releasing a bunch of great models in open source, and it's a great testament to the community, because you're listening to what folks have been saying, how they're reacting to your models, and as part of ThursdAI, I was able to just introduce you to LM Studio and you guys worked together.
[00:18:37] Alex Volkov: And now the second your model drops, not only are you guys already providing us quantized versions in 4-bit and GGUF, it's also very easy to start using. And I think, just a shout out to you guys for thinking about this, because a lot of models, when they release, they just release a weights file and then it's up to the community to figure out how to run them, where to run them, what the problems are.
[00:18:57] Alex Volkov: And this was the issue with Qwen before. It was, like, harder to use and maybe only on Hugging Face demos. And now you guys released it with support for the most popular open source runners out there. So Ollama, if folks haven't used Ollama by now, definitely check it out, there's a CLI, just like, Ollama, install this.
[00:19:14] Alex Volkov: And LM Studio, which we've talked about a bunch, so shout out LM Studio. Shout out JAGS. And I'm, I was very happy to introduce both of you. So it's been great. And I've used the small model, the baby model as well. How was the reception from the community? What have you seen people do? Have there been any fine tunes already that you're excited about?
[00:19:33] Junyang Lin: Yeah, this is a very great comment for helping us to improve. Yeah, previously, like us, a lot of people just drop open source models and they just let the community use them. But this may be not right, because we can do more for the community; maybe we can do things more easily than the community users.
[00:19:56] Junyang Lin: So this is why we are changing our style. We try to modify our code, try to adapt to the usages to make our models more popular. And recently I found them just gradually fine tuned our models. Previously fine tuned users are inside mainland China because they have chances to talk to us, so they will know more about our models so they, they can finally fine tune it.
[00:20:24] Junyang Lin: But with the support of Llama Factory and especially Axolotl, Wing Lian helped me a lot. Teknium just introduced Wing Lian to me, and I found some people are using Axolotl to do it. I don't know if Quan, I don't know if I pronounced his name, he's one of the users of Qwen, and he previously got the usage of our models and then he quickly fine-tuned a series of models, its name is Q-U-Y
[00:20:54] Alex Volkov: Oh, StableQuan. Yeah, I think I know who we're talking about. StableQuan, also from Nous Research
[00:20:59] Junyang Lin: yeah, StableQuan, I'm quite familiar with him, I just talked to him very much, and he just directly used our models, very quickly fine-tuning a series of models, and I find the quality quite good.
[00:21:12] Junyang Lin: So this is quite encouraging for me, because you can find people are interested in your models, they can fine-tune it at very fast speed, and I recently found Smaug by Abacus AI, but I got no chance to talk to them because I don't know who actually built the model, but I found Smaug 72 billion is built on Qwen 72 billion
[00:21:37] Alex Volkov: Oh, really?
[00:21:39] Junyang Lin: On the open leaderboard.
[00:21:40] Alex Volkov: Smaug is the next thing we're going to talk about, so you're taking us exactly there. I think, Nisten, you have a question just before, and then we're going to move to talk about Smaug. Just on the community part, just the names you mentioned. You mentioned StableQuan, definitely friend of the pod.
[00:21:52] Alex Volkov: You mentioned Teknium introduced you to Winglian, the guy from Axolotl. All of this happens in the ThursdAI community, and I love it. I'll just say that I see Robert in the audience here. Smaug is from Abacus AI, and I think Robert has some connection to Bindu, so Robert, if you can introduce Junyang to Bindu, that would be great, and then we'll figure out, like, how they use the 72B model.
[00:22:12] Alex Volkov: 72B model that you guys released is one of the more performant ones. I think it's even outperforming Mistral Medium, is that correct?
[00:22:21] Junyang Lin: Yeah, it's now, this version, Qwen 1.5 72 billion, for the chat model; for the base model, it is actually quite similar, some users have found that, I admit that. But for the chat models, we have some improvements, because this time we not only, actually, we not only SFT the model, but we also use DPO.
[00:22:40] Junyang Lin: We have some progress in DPO. So we've reached like 8.67 on MT-Bench. This is a relatively high score and we just did simple DPO and just improved the model. And we also sent our model to Chatbot Arena on LMSYS, supported by Together AI, because we have some friends at Together AI. They just built an API for us, and we have been in Chatbot Arena, so you can try it in Chatbot Arena to see how it really performs.
[00:23:18] Junyang Lin: Does it really perform just like the score on MT-Bench? I'm not quite sure, because I'm also dependent on the users' feedback.
[00:23:27] Alex Volkov: It depends on human preference. So first of all, Justin, you're taking over my job now, because you're also reporting on the stuff that I wanted to mention, but definitely a shout out for getting added to LMSYS. That's not super easy. Not every model out there on the Hugging Face leaderboard gets added there.
[00:23:41] Alex Volkov: So definitely super cool. Yeah, please go ahead. If you have anything else to
[00:23:46] Junyang Lin: As you have mentioned Mistral Medium, I'm not sure which one is better, Mistral Medium or Qwen 72 billion; from some reviews they might be similar, the Qwen 1.5 72 billion similar to Miqu. Some of my friends, like Blade, just tested on EQ-Bench, the scores are very similar, but I need some more reviews to let me really know how the 72 billion model really performs, whether it is better or worse than Miqu.
[00:24:20] Junyang Lin: They're all okay for me. I just want real reviews for me. Yeah,
[00:24:23] Alex Volkov: Yeah,
[00:24:24] Junyang Lin: it.
[00:24:25] Discussion about Qwen VL with Nisten and Piotr
[00:24:25] Alex Volkov: awesome. Junyang, thank you for joining us. And Nisten, go ahead. You have a few questions, I think, about the interesting things about VL.
[00:24:34] Nisten Tahiraj: Yeah, so one thing is that the 0.5Bs and the small models, I know Xenova in the audience was specifically looking for one around that size, or like a 0.3, to run on WebGPU, because then even at 32 bit, which older browsers will still support, it will still only take two gigs. So that, that would run anywhere.
[00:24:58] Nisten Tahiraj: But my question. So shout out to Xenova for all that. I know he's going to do something with it, but my question for you was more about the Max and the larger QwenVL chats; are those also based off of the 72B, and did you find more improvements in going with a larger LLM? And I also wanted to know your opinion on LLaVA.
[00:25:27] Nisten Tahiraj: The LLaVA 1.6 method, where they mosaic together four CLIP models on top to get a larger image, even though it slows down inference because now it's got an output of like 2,000 embeddings. So yeah, what do you think of LLaVA, and is there more stuff to share about the Qwen
[00:25:47] Junyang Lin: VL, Max. Yeah for Plus and Max it may be, sorry for me not ready to open source it.
[00:25:57] Junyang Lin: I cannot decide these things. Yeah, actually it's built on larger language models, much larger than the Plus, and you can guess whether it is 72 billion. It is not that important, and we have found that the scaling of the language model is really important for the understanding of the VL models.
[00:26:18] Junyang Lin: We have tested it on the MMMU benchmark and we have found that the Max model is highly more competitive and performs much better than QwenVL Plus. Although previously many people have thought that QwenVL Plus is strong enough, we found that the Max had much better reasoning capabilities; it can understand something like some reasoning games, like poker or things like that, some complex things that people can understand through the vision information, it can somehow understand.
[00:26:52] Junyang Lin: I think the performance might be a bit lower, approaching Gemini Ultra or GPT-4V, for QwenVL Max. We were just gathering some reviews. I'm not quite sure, but
[00:27:05] Alex Volkov: From the review perspective, I want to say hi to Piotr, our friend here on stage, from Roboflow. Piotr is one of the vision experts here on stage. Piotr, welcome. Feel free to introduce yourself briefly, but I definitely know that you got excited about some of the QwenVL Plus stuff, so definitely feel free to share some of your insights here.
[00:27:30] Piotr Skalski: Okay. Yeah. And first of all, awesome to meet somebody from the Qwen team. Yeah.
[00:27:36] Piotr Skalski: So yeah I'm from Roboflow, like you said and I'm responsible there for computer vision and growth. So it's like in between of being ML engineer and marketing something like this.
[00:27:49] Piotr Skalski: And yeah, I was experimenting with Qwen Plus and Max last week. Super impressed, in my opinion. I know that you try to be humble, maybe, but in my opinion, at least on the things that I test, it performs like the best compared to other models.
[00:28:09] Junyang Lin: Thank you very much. Thanks for the appreciation.
[00:28:14] Piotr Skalski: Yeah. And especially the fact, so the biggest game changer for me, and I know that there were models that were capable of that before, is the fact that you can ground those predictions and you can, for example, point to a specific element on the image. So it's not only that you can ask questions and get answers and do OCR, but you can straight up do zero shot detection if you would like.
[00:28:40] Piotr Skalski: Yeah. Which is, which is awesome. And that's something that none of the other popular models can do to that extent, at least on the things that I tested. My question is, do you plan to open source it? Because it's awesome that you can try it out through the API, and I highly appreciate the fact that you created the HF space and you can go there and try it.
[00:29:07] Piotr Skalski: But is there a chance that you will open source it, even with a limiting license? Commercial use is not necessary.
[00:29:16] Junyang Lin: Yeah, personally, I would like to open source some, but I cannot decide these things. But I think there's a chance; I'm still promoting these things inside the corp, but I cannot say too many things about this stuff, but we will try, because we have found out that we ourselves can also build a very good LMM.
[00:29:37] Junyang Lin: I think the gap between us and the big corps in LMM is very small. And we have found that our techniques, or our training, is quite effective. So maybe one day we'll share it with the community, but for now it is still APIs and demos, and I will try to think about these things.
[00:29:59] Junyang Lin: And also the question about the comparison between us and LLaVA. I have just tried LLaVA 1.6, not quite frequently, I just tried it. I think it's a very good model and it has very good performance in the benchmark results, but I think the limitation of these other open source models may be that they still lack sufficient pre-training. Skalski just said that Qwen can do OCR, and you can find that Qwen's reasoning capability is quite strong, because we have done a lot of pre-training work on it.
[00:30:39] Junyang Lin: We have done a lot of data engineering on pre-training, because we have capabilities of handling different resolutions and different aspect ratios, so that we can use the curated OCR data and put it in the pre-training. And when the vision language model can understand a lot of textual, like linguistic, information inside the images, it may do something like, like we said, reasoning, and you will find that really powerful, very impressive, or things like that.
[00:31:13] Junyang Lin: Yeah, I think the gap between other models and us, or also Gemini Ultra and GPT-4V, may be still the lack of large scale data for training. Yeah, this is my opinion.
[00:31:27] Alex Volkov: we're waiting for more data, but we're also waiting for you guys too. I just want to thank you for being the champion for open source from within the organization, and really appreciate all your releases as well. I think Piotr and Nisten, like everybody here on stage, definitely. It feels that, and thank you for coming and talking about this.
[00:31:45] Alex Volkov: Justin, feel free to stick around, because the next thing we're gonna talk about, you already mentioned, which is Smaug 72B, which is at the top of the leaderboard. And I just read through the thread from Bindu Reddy from Abacus AI, and it looks like they didn't even use 1.5. I think they used the previous Qwen
[00:32:02] Junyang Lin: yeah, they used the previous Qwen 72B. If they are really based on the base language model, there might not be a lot of differences, because 1.5 for the base language model, the 72B, is actually slightly better than the original 72B for the base language model. Yeah.
[00:32:22] Alex Volkov: for the base ones. And very interesting what they
[00:32:24] Junyang Lin: the base one.
[00:32:25] Alex Volkov: So they, they don't share any techniques, but they promised to open source their techniques. They're saying, like, our next goal will be to publish these techniques as a research paper and apply them to some of the best Mistral models, including Miqu.
[00:32:37] Alex Volkov: So I got confused. I thought that they already fine tuned Miqu, but no, they just fine tuned on top of Qwen. And now the top Hugging Face leaderboard model is based, is a fine tune of Qwen, which is like also super cool.
[00:32:50] Junyang Lin: Yeah, I'm very proud of it.
[00:32:52] Alex Volkov: Yeah, congrats.
[00:32:53] Junyang Lin: They are using our model to be the top model. I'm also really expecting their technical report, to see how they reached the top of the benchmark. But I think it is not that kind of difficult, because you have a lot of ways to improve your performance on the benchmark, so we'll still see how it really performs in real scenarios, especially for their chat models, yeah.
[00:33:18] Alex Volkov: Yeah, that's true, that's often the case. But I just want to shout out that the world is changing, like, super fast. We're definitely watching and monitoring the Hugging Face leaderboard. And performing better than Mistral Medium is impressive. And this looks, at least on the MMLU, this is 77. I think they said they broke the average score of 80; this is the first model that broke the average score of 80 on the open source leaderboard on Hugging Face, which is super cool, based on Qwen as well, and definitely worth it.
[00:33:46] Alex Volkov: I'm gonna add this link to the show notes and hopefully we'll find a way to connect you guys with the Bindu team there at Abacus to see how else this can be improved even for, and whether or not these techniques can be put on smaller models as well. I think in the open source, the last thing.
[00:34:00] Junyang Lin: expecting the chat. Yeah, I'm really expecting to chat with them. Yeah, continue,
[00:34:05] Alex Volkov: So definitely hoping that some of our friends can connect between these awesome teams and learn from each other, which I think is the benefit of speaking in the public and putting things in open source. Now, moving on, the last thing that you definitely mentioned is the update from LMSys, which is quite a few of our friends of the pod are now also part of the chatbot arena.
[00:34:24] Alex Volkov: They just announced this yesterday. They've added three of your versions, right? They added 1.5 72B, 1.5 7B, 1.5 4B, and they also added OpenChat. So shout out the folks from OpenChat and the Alignment Lab and some other friends of ours who, like, released OpenChat's latest release, and they also added the Nous Hermes finetune.
[00:34:47] Alex Volkov: So if you guys remember, we've talked about the Nous finetune on Mixtral, and that improved on the mixture of experts model from Mistral a little bit, based on DPO datasets. So now that's also in the LMSYS arena, and it's now powered by Together Compute, which I have no affiliation with besides the fact that they're awesome.
[00:35:04] Alex Volkov: They're sponsoring a bunch of stuff. And we did a hackathon together; Together is great. Like, you can easily fine tune stuff on their platform, but now they're also sponsoring the arena, at least to some extent, which is great, because we get more models and the arena keeps going. And if you guys remember, or you probably use it, the LMSYS arena is another great way for us to feel what human preference is in models.
[00:35:27] Alex Volkov: And for many of these models, that's what's more important than actual performance on evaluations, on leaderboards, et cetera. So definitely a great update from LMSYS as well. And I think that, I'm gonna ask my folks here on stage, but Nisten, Far El, if there is anything else in open source that's super interesting this week, I think that's mostly it.
[00:35:44] Alex Volkov: We can talk about Gemini.
[00:35:48] Nisten Tahiraj: There was a dataset from HackerNoon that they released, which I think is pretty huge. And oh, there was one more thing, HuggingFace made a GPT store.
[00:35:58] Alex Volkov: Oh,
[00:35:59] Nisten Tahiraj: they made their own GPT store. Yes. I think that's a big,
[00:36:03] Alex Volkov: I want to hear about this, for sure. I haven't used it yet, but I invite the Hug Face folks that are listening to this to come and tell us about this, because I haven't used it yet, so I don't actually have many opinions. But yeah, they released their own open source GPT store, which is super cool, and we're going to add this maybe in the show notes, but I don't have a lot to say about this.
[00:36:24] Alex Volkov: And I think, in the spirit of Yeah, go ahead.
[00:36:27] Nisten Tahiraj: Oh, sorry. Sorry. I'll quickly say that the HackerNoon dataset of tech articles, those are something, because they have a lot of guest developers, I remember over the years they had the best ones. Those articles, that dataset, is extremely great for any kind of coding or website or whatever work you're doing.
[00:36:50] Nisten Tahiraj: That's because it's step by step instructions on how to build something and all the code for it. It's pretty awesome, and it's at the very beginning on the Jumbotron, if you guys see it, from Daniel van Strien. And yeah, it's MIT licensed and it's 6.9 million articles, and you can do whatever you want with it.
[00:37:07] Nisten Tahiraj: That, shout out to them.
[00:37:09] Alex Volkov: We'll add this again to the show notes. And as you said something about articles and code, I remembered another thing that's definitely also worth mentioning: Jina Embeddings. If you guys remember, we had a chat with Bo Wang from Jina, a deep dive into embeddings a while ago, and Jina Embeddings released a finetune for code.
[00:37:25] Alex Volkov: So just a quick shout out that embeddings can be fine tuned, embedding models can be fine tuned for specific purposes, and definitely embeddings for code. And for those of us who follow from week to week, we talk about embeddings a lot. We've talked about Nomic Embeddings last week, the fully open source one, including the training datasets.
[00:37:42] Alex Volkov: We've talked about OpenAI changing embeddings and giving us new ones and cheaper ones. And Jina, we had a deep dive, and I definitely welcome you to go and check out that special episode with Bo Wang from Jina; they trained their own BERT model as the backbone, the LLM backbone that decides about embeddings, and they just released an update to their embeddings fine tuned for code retrieval specifically.
[00:38:03] Alex Volkov: And I think for many folks who are building RAG systems, that's something that they should be aware of, that embedding models can also be fine tuned for specific purposes like Q&A and obviously code as well. So if you haven't tried that yet and you're doing a bunch of retrieval on top of code, for example, using some of the datasets that Nisten just mentioned, that probably have code in there, definitely check this out.
[00:38:25] Alex Volkov: I think we're moving on to the big company thing, and I don't have a big company transition, I do have this one though.
[00:38:43] Google finally lanuches Gemini Ultra
[00:38:43] Alex Volkov: Just in, as we started the space, maybe an hour before, our friends from the big G, Google, finally answered the question that we've been asking since 10 months and three weeks ago: where is Google? So GPT-4 was released to us after ChatGPT released in, I want to say December, maybe December 1st, November 31st of 2022.
[00:39:06] Alex Volkov: Then GPT-4 was released in March of 2023. And throughout this time, there was this famous video of Satya Nadella asking where is Google, and where's this like 600 pound gorilla in the room of search? And we're going to make them dance. And they definitely made them dance. And we've been waiting.
[00:39:25] Alex Volkov: Where's Google? Where's Google? And Google has released quite a few things for us since then. Just for context, I think everybody knows this already: Google is the place of the birth of the transformer paper. So most of the recent Gen AI explosion can be attributed to the transformers architecture that came out from Google.
[00:39:43] Alex Volkov: Google had trained multiple models, including PaLM, and we've talked about PaLM and PaLM 2, and I don't even remember all the names of the models that they've released for us throughout the years. Google then also, at some point, gave us Bard, which is their interface, the chat interface that people used in order to play with their models, and I think some of this was powered by
[00:40:04] Alex Volkov: PaLM, something else as well. And recently, I think around December, they said, Hey, you know what? We're here and we have this thing called Gemini, after the unification of Google Brain and DeepMind under one org. And we're going to give you Gemini Pro right now, but we'll tell you that Gemini Ultra, that was back in December.
[00:40:23] Alex Volkov: In December they told us Gemini Ultra is coming and it's going to be better than GPT-4 and you're going to get it soon. And we've been, like, saying: when? And today is the day, is the answer to those questions. So today we're celebrating; congrats, folks at Google, who finally released an upgrade to their LLM capabilities.
[00:40:41] Alex Volkov: Not only an upgrade, so much an upgrade that they've killed the Bard brand completely. No more Bard. That's what I'm understanding. No more Bard, even though that's very confusing. If you guys remember, a few weeks ago we talked about the LMSYS changes, where Bard with Gemini Pro, I think, something confusing like this, shot up to the top of the charts and was just trailing GPT-4.
[00:41:05] Alex Volkov: So, like, the second best model in the LMSYS arena was Bard with GPT-4, or sorry, Bard with Gemini. See how confusing this is? And now there's no more Bard, but it's still in LMSYS. Anyway, this is like the whole naming-is-confusing thing, but Google, including a blog post from Sundar and everything, Google comes out with a new update and says, Hey, Bard is no more.
[00:41:25] Alex Volkov: It's now Gemini, and the models are also Gemini. So that's confusing. And the models are Gemini Ultra. We finally get access to Google's answer to GPT-4 today, which is incredible. That answer is Ultra 1.0. And we can get this as part of something like a paid premium tier that's called Gemini Advanced on Google.
[00:41:46] Alex Volkov: So you can actually go right now, you can sign up, it's 20 bucks a month, and it starts 20 bucks or 30 bucks? I think it's 20
[00:41:52] Nisten Tahiraj: It's two months free
[00:41:54] Alex Volkov: Yeah, and you get two months, a two month trial, because they have to prove themselves to you, because many people will decide whether or not they're going to go with Google or with ChatGPT.
[00:42:03] Alex Volkov: And we're going to talk about which one folks will prefer. I haven't tried it yet. Literally as I woke up, I had to prepare my notes for the space. I just want to say. Google, welcome to the party, we've been waiting for you, and I counted, it's been exactly 10 months and 3 weeks and 4 days since GPT 4 released that you came with the same level of, at least, based on benchmarks.
[00:42:24] Alex Volkov: And now we're gonna talk with some folks who actually tried it. Nisten, you tried it, I think Ray, you also tried it; let's talk about your first impressions from Bard, oh, or, sorry, Gemini.
[00:42:35] Nisten Tahiraj: One, it's heavily moderated. No one's surprised by that. It does answer and reason nicely, or at least the way it communicates, it's a lot more eloquent, I would say. It feels nicer in the way it reasons stuff out. However, compared to Mistral Medium, or Mixtral, it doesn't quite obey you. I tried my standard question, which is just like, lay out a schedule of building a city on Mars and write the code in C and JavaScript.
[00:43:10] Nisten Tahiraj: And that's a pretty complex question for, that only the best models get. And I needed to re prompt it in order for it to give the answer. And even then, it only wrote some JavaScript. But it was really good JavaScript. However, it didn't do the rest of the task. Okay, it's not bad. It is worth using. Again, very heavily moderated.
[00:43:33] Nisten Tahiraj: As for the vision side of it, it's extremely heavily moderated. I was even telling it to count out, I had an old gaming PC on the floor with two GPUs on the side, and I told it to make me a JSON of all the parts that it sees in the picture. It won't answer questions like, that have humans in them, or even if they're like Star Wars characters or whatever.
[00:43:58] Nisten Tahiraj: But this, I thought, would be something pretty simple, and even this one it refused to answer. Yes, it is good, I think. On the, as far as the vision side goes, the open source models might have it already beat, or will soon.
[00:44:19] Ray Fernando: Yeah, I wanted to add, Ankesh from Google DeepMind actually wrote because I've been posting some of this stuff, and he says, To preempt any confusion, multimodal queries don't go through Pro slash Ultra yet, but that is coming soon too.
[00:44:33] Ray Fernando: Which makes sense a little bit of why you're seeing some of that stuff. I've been seeing similar things when I've been doing some image analysis or even trying to generate images that have people. One of my examples I've just been posting on my my Twitter feed is like having to analyze a meme.
[00:44:48] Ray Fernando: So it's the hot girls meme or the hot ones meme and I was like, hey, this is very popular. Can you tell me what this meme is? And it's I'm sorry I can't because there's images of people. And then I had to do some other meme analysis with Elon Musk and it's the same type of queries. But to add to what Nisten was saying, I've been doing a lot of creative writing tasks, and the writing output has been actually really nice.
[00:45:10] Ray Fernando: And it doesn't have all that extra fluff that you normally would get from ChatGPT 4. And what I find with OpenAI's ChatGPT 4 is that I frequently have to say, Hey, don't use purple prose, which is all that extra fluffy stuff you read that makes people sound smart. It's, I just want a regular sounding piece.
[00:45:27] Ray Fernando: And usually ChatGPT would do that and then revert back to its normal state but I find that Gemini Advanced just keeps going through it and, continues with the writing pieces of things. And for coding stuff, it's really strange. You actually cannot upload any CSV or any text files.
[00:45:43] Ray Fernando: They only let you upload images right now. So you can only have a picture of a microphone and a picture of the little icon to upload an image. Because I wanted to just do a simple analysis on my tweets with a CSV file. And it's there's no place that I see to actually upload that. And I could probably upload so many lines, but there's also a character cutoff, too, that doesn't allow me to upload a lot of code for,
[00:46:03] Ray Fernando: A code base.
[00:46:04] Alex Volkov: What's the, I was about to say this next thing. Do we know the context length? Anybody have an idea of where Gemini Ultra is at, around? Because we know that GPT-4 is 128K, and I think they recently opened this up on the UI as well. I've been noticing less restrictions. I've been able to paste like a lot more code.
[00:46:21] Alex Volkov: My, my test is, you guys know my test is the transcription of the ThursdAI conversation that I paste in, and Claude with the 100K context definitely takes all of it. GPT-4, for the Pro kind of level, used to refuse, and now recently it's, okay, yeah, let me summarize this for you. Have you guys been able to sense the context length of Gemini Ultra?
[00:46:41] Alex Volkov: Is it any close? Akshay, go ahead. Welcome to the stage, buddy.
[00:46:46] Akshay Gautam: Hello, I just wanted to bring up that their official document mentions that it's 2k context length.
[00:46:53] Alex Volkov: Akshay, we don't get greetings of the day?
[00:46:57] Akshay Gautam: I see. Yeah. Yeah. Greetings of the day everybody. My name is Akshay Kumar Gautam and I'm an applied AI engineer. I was a data scientist before, but now I work with, modeling and stuff. And yeah, I was literally waiting for it, I tried it as it came out, I paid for it because why not? And, and a lot of stuff.
[00:47:14] Akshay Gautam: First of all, it's really good at coding. By the way, the context length is 32K at least that's what they say. Yeah, 32K. And and the model is not good at keeping context, like that is what I was here to talk about. It will lose sense for example, if you ask it to do multiple things in a single prompt, it will not.
[00:47:33] Akshay Gautam: Unlike chatGPT, but like with coding, it's better than chatGPT in my humble opinion.
[00:47:41] Alex Volkov: So I want to talk about some advantages that Google has, the big dog definitely, because an additional thing that they released, which ChatGPT doesn't have, well, ChatGPT has an app, but they released an iOS and Android app, and Android also has integration with the Google Assistant, right?
[00:47:56] Alex Volkov: So you can now join this advanced or ultra tier and use this from your Android device. Now, I'm not an Android user, but I definitely understand that the ecosystem is vast and many people just use this assistant and we're waiting for Apple. We don't have anything to say about Apple specifically today, besides the fact that, they released the, maybe the next era of computing.
[00:48:16] Alex Volkov: But there's nothing AI in Siri, still the same Siri from like 2019 with some examples, but Google has now moved everybody who wants to, who pays the 20 bucks a month and has an Android device, basically towards this level of intelligence, basically a GPT-4 level of intelligence. And I saw that Marques Brownlee, MKBHD on YouTube, like one of the best tech reviewers out there.
[00:48:38] Alex Volkov: He has been playing with the Android stuff, and he said that even the integration with Google Assistant even uses your home stuff. So you can actually ask this level of intelligence to turn on some lights, whatever, and probably better context. Akshay, you have any comments on this? Have you played with the Assistant version?
[00:48:54] Akshay Gautam: Two things first of all, Bing chat was already available on Android devices, right? The Copilot, now it's called. Copilot uses GPT 4, so it's already really good. And you can actually use a lot of voice stuff with Copilot as well, which was surprising. The Google Assistant to be honest, in terms of assistants among Siri and I have a Samsung device, so it has Bixby and, among all the AI systems, Google Assistant was the best one by far, in terms of how much you can, use it, and hoping to get access because I have paid for the Ultra, but I still don't have, access to everything.
[00:49:29] Akshay Gautam: Also, there's no API for Ultra, so you cannot actually test anything as well.
[00:49:34] Alex Volkov: We haven't gotten an API, developers. Sundar Pichai said the developer announcements are going to come next week. iOS hasn't updated yet. Yeah, go ahead Nisten.
[00:49:44] Nisten Tahiraj: I just really quickly tested it with the entire llama.cpp file. I am down to 15,000 tokens I cut it down to, and it's still too long. We know it's under 16,000 that you can paste in. I will know [00:50:00] exactly in a few minutes,
[00:50:03] Alex Volkov: So not super, super impressive in terms of like long context. I will also
[00:50:06] Nisten Tahiraj: at least not for the UI,
[00:50:08] Alex Volkov: for the UI. Usually, yeah, usually for some reason they restrict the UI or they forget to update this. And then the model itself is like way longer context, but for now not extremely impressive comparatively.
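If you want to run this kind of paste-limit probe yourself, a rough sketch is below. tiktoken implements OpenAI's tokenizers, so for Gemini the count is only an approximation, and the file path is just an example.

```python
# Rough sketch: estimate how many tokens a file is before pasting it into a chat UI.
# tiktoken is OpenAI's tokenizer, so for Gemini this is only an approximation.
import tiktoken

def approx_token_count(path: str) -> int:
    text = open(path, encoding="utf-8", errors="ignore").read()
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-style tokenizer as a stand-in
    return len(enc.encode(text))

print(approx_token_count("llama.cpp"))  # trim the file until this fits under the UI limit
```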
[00:50:18] Alex Volkov: And again, we're comparing the two like main flagship models, OpenAI's GPT-4 and now Google's Gemini Ultra. And I also want to say one thing, Gemini seems to be optimized only for English as well, even though it will answer most of the questions in other languages, but it looks like the optimization was focused on English.
[00:50:36] Alex Volkov: including some of the apps as well, which is, understandable, but we have to, as we're trying to compare apples to apples GPT 4 is incredibly versatile in multi language operations as well. LDJ, you have some comments? Welcome, buddy, to the stage and give us some Have you played with Ultra so far?
[00:50:55] LDJ: Yes I was actually wondering, does anybody know of plans for them to integrate this with Google Home? Because I just asked my Google Home right now are you Gemini? And it said, I'm a Virgo. And then I asked it, what AI model are you running right now? It said, sorry, I don't understand. So I don't think it's, at least mine, I don't think is running Gemini right now.
[00:51:16] LDJ: But
[00:51:17] Alex Volkov: No, so I think the announcement was
[00:51:18] Junyang Lin: to put it.
[00:51:19] Alex Volkov: The integration into Google Home will come from the Google Assistant. So if you have an Android device, you'll have Google Assistant there. That you can switch on like a smarter brain, and that you can ask it to integrate like with your home through the device. So you can ask it to do stuff in your home.
[00:51:34] Alex Volkov: But the Google Home itself, like the Google Home devices that you have, they're not talked about upgrading them, but maybe at some point, because why not? But I haven't seen anything on this yet. Anything else here?
[00:51:46] Junyang Lin: I think that'd be the perfect. Sorry. Yeah, go on.
[00:51:48] Alex Volkov: Yeah, no, that would be great. I agree with you. Being able to walk around your house and just talk with GPT 4 level intelligence to do operations, I definitely agree.
[00:51:55] Alex Volkov: That would be great. I gotta wonder anything else here on Ultra? We've talked about its code performance. We've talked about its inability to talk about people. Anything else interesting that we want to cover so far? And again, folks, it's been two hours and we're already giving you like a bunch of info, but we'll play with this going forward.
[00:52:12] Nisten Tahiraj: It's about 8,000, the context length that you
[00:52:14] Alex Volkov: Are you serious? Wow, that's not a lot at
[00:52:17] Nisten Tahiraj: that's as much as I was able to paste, like 7,500.
[00:52:20] Alex Volkov: So yeah, folks, you heard it here first. You'll get more context than you previously got probably, but it's not a lot comparatively. Even though it can probably, it's probably a consideration of compute for Google, right? How much context to give you the model probably gets more. And it's also a vision enabled model.
[00:52:36] Alex Volkov: But I think that we've covered this enough. Gemini Ultra. It's here, it's very impressive from Google, and yet, I want to say personally, maybe a little bit underwhelming because, they need to convince us to move, and it's going to be the same price, and I don't know, let me just ask this before we move on.
[00:52:55] Alex Volkov: Anybody here on stage who has access to both plans to pay for this and not GPT?
[00:53:03] Nisten Tahiraj: I haven't paid for anything since September But I'm
[00:53:08] Junyang Lin: not the right person for this question. My company pays for like my ChatGPT subscription. So I might keep both
[00:53:15] Alex Volkov: Interesting.
[00:53:16] Junyang Lin: I'm paying for mine out of pocket. I'm just going to keep both. I like the OpenAI app because it's just the multimodal picture thing on my phone.
[00:53:23] Junyang Lin: I'm on the go. For Google, I'm just curious because it's two months free. That just means that, they have me hooked. We'll see.
[00:53:30] Alex Volkov: Yeah, it's two months free. And then let's check in back in two months, and see how many of us kept paying. All right. So Google also releases a llama.cpp wrapper called localllm. I don't know if you guys saw this. It's pretty cool. It's an open source tool from Google that actually helps you run LLMs locally on CPUs and then also on the Google Cloud with a super easy integration.
[00:53:51] Alex Volkov: Very interesting choice. They also call out TheBloke, that you can download models from TheBloke with their tool. And I think it's very funny that if you go on the description of the blog of localllm, in the code snippets they tell you, Hey, install OpenAI.
[00:54:10] Alex Volkov: So I found it really funny. But yeah, they have a wrapper there that integrates with Google Cloud as well.
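That "install OpenAI" note is presumably because the tool serves an OpenAI-compatible endpoint, so the client pattern looks something like the sketch below; the URL, port, and model name are placeholders, not the tool's documented defaults.

```python
# Illustrative only: the usual pattern for talking to an OpenAI-compatible local server.
# The base_url, port, and model id below are placeholders, not localllm's documented defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed",                 # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="local-model",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize this repo in one sentence."}],
)
print(resp.choices[0].message.content)
```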
[00:54:15] OpenAI adds DALL-E watermarking and per API key restrictions
[00:54:15] Alex Volkov: Running through the big companies' areas like super quick. OpenAI added watermarks to DALL-E images. They use this new metadata standard called C2PA, and it embeds in the image metadata.
[00:54:27] Alex Volkov: And so basically what this means for us is not that much, but when you download images generated with DALL-E, and I assume that the same will come to Microsoft Copilot, they will now have in the metadata, where like the location is and everything else, the fact that they have been generated with DALL-E.
[00:54:43] Alex Volkov: This information will sit in the metadata. Now it's only images, not text or voice or anything else from OpenAI. This happens over the API and from the ChatGPT interface as well. This increases the file size a little bit because of some of the stuff, but it's not super interesting.
[00:55:00] Alex Volkov: This can be stripped. So the lack of this thing does not mean that an image was not generated with DALL-E. It's just, if it is there, it's definitely generated with DALL-E. And so this is an interesting attempt from OpenAI to say, Hey, we're doing as much as we can.
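As a very rough illustration of what "it can be stripped" means in practice, here's a crude check for the C2PA marker bytes in a downloaded file. A real verification would use an actual C2PA SDK and validate the signed manifest; the filename is just an example.

```python
# Crude heuristic only: look for the C2PA marker bytes in an image file.
# A proper check would use a real C2PA SDK and verify the signed manifest;
# if the metadata has been stripped, this marker disappears entirely.
def has_c2pa_marker(path: str) -> bool:
    with open(path, "rb") as f:
        data = f.read()
    return b"c2pa" in data.lower()

print(has_c2pa_marker("dalle_output.png"))  # True -> provenance metadata is present
```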
[00:55:15] Alex Volkov: It's not foolproof, but an interesting attempt. And also, I just want to mention that if, for those of us who develop with OpenAI, The API keys, they keep upgrading the developer experience there and the API keys part. And now you can restrict per API key. You can restrict its usage, which many people have been waiting for a long time.
[00:55:33] Alex Volkov: And that's really something many people have been wanting. You can create one API key for OpenAI for a specific purpose and restrict it to only DALL-E, for example. And I don't know if you can restrict based on credits, I don't think so, but you can definitely restrict the usage related stuff.
[00:55:49] Alex Volkov: That's, I think, all the updates from the big companies and the LLMs and APIs,
[00:55:53] Alex Volkov: This week's buzz is the corner and I stopped the music too prematurely. This week's buzz is the corner where I talk about the stuff that I learned in Weights & Biases this week. And I don't know how many of you were, had a chance to join our live segments, but we definitely had a build week. And I think I mentioned this before, but actually we had a live show on Monday.
[00:56:19] Alex Volkov: We're going to have another one, probably tomorrow. Yeah, tomorrow. I think it's Noon Pacific, where I interview my team, the GrowthML team in Weights & Biases, about the build week projects that we built last December, to try and see what's the latest and greatest in this world. So as we build tools for you in this world, we also wanna build internal tools to see what are the latest techniques and stuff like we just talked about.
[00:56:46] Alex Volkov: For example, it gives us a chance to play around with them. It's like an internal hackathon. And what happened is we built those tools and we presented them to the company, and then this was basically it. And I said, Hey, hold on a second. I learn the best publicly. I learn the best about, the way I just learned from Connor and Benjamin.
[00:57:02] Alex Volkov: I learned from Nisten and Far El and all the folks in the audience. And Luigi and I had a whole section where he taught me weights and biases before. I learned the best by being public and talking about what I'm learning as I'm learning this. And so I did the same thing with our folks from the GrowthML team.
[00:57:15] Alex Volkov: We just literally folks came up on stage and I asked them about what they built and what they learned. And we're going to summarize those learnings in the live show. And that live show, if you're interested, is all over our social, so on Weights & Biases YouTube and LinkedIn. Yes, LinkedIn, I now need to also participate in that whole thing.
[00:57:33] Alex Volkov: So if you have tips about LinkedIn, let me know. But it's live on LinkedIn, live on YouTube. I think we did X as well and nobody came. We'll probably try to send you to the live YouTube flow. But basically the second part of this is coming up tomorrow. We're interviewing three more folks and you get to meet the team, the incredible team that I'm part of.
[00:57:53] Alex Volkov: Very smart folks, like Kaggle Masters, and some of them came to Kano's show as well, which is super cool. And I find the first conversation super interesting and insightful for me. Definitely recommend if you're into understanding how to build projects that actually work within companies, and what the process was.
[00:58:11] Alex Volkov: We have folks who built something from scratch, we have somebody who runs an actual bot with retrieval and re-ranking and evaluations and like all these things and [00:58:20] has been running them for a year, basically in production. So you can actually try our bot in Discord right now and in Slack and on GPTs.
[00:58:28] Alex Volkov: If you want to hear about the difference between a mature, RAG-based bot that's in production for a professional AI company, and something that somebody can quickly build in a week, we've talked about those differences as well. So definitely worth checking out that live.
[00:58:46] Alex Volkov: Moving on from this week's buzz, and I learned a lot. Okay, so back from the this week's buzz, we're moving into vision.
[00:58:57] Alex Volkov: And Bria AI, like super quick, they released a new Background Segmentation Model, or Background Removal Model, that's live on Hugging Face, called RMBG v1.4, and I think the cool thing about this is that it now runs completely in the browser, thanks to the efforts of our friend Xenova, who is no longer in the audience, I think, from Hugging Face and Transformers.js.
[00:59:19] Alex Volkov: And it's super cool. You can like, remove backgrounds completely without sending any images anywhere, just straight from your browser. That model is called, again, RMBG, and it's not licensed for commercial use. So you cannot use this for professional stuff, but it's open for you to try and play with.
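For folks who'd rather try it server-side in Python than in the browser, the pipeline route looks roughly like the sketch below; treat the exact arguments as an assumption based on the model card, and mind the non-commercial license.

```python
# A minimal sketch, roughly following the RMBG-1.4 model card's pipeline example;
# exact arguments may differ, and the model's license is non-commercial.
from transformers import pipeline

remover = pipeline(
    "image-segmentation",
    model="briaai/RMBG-1.4",
    trust_remote_code=True,  # the model ships its own pre/post-processing code
)

no_bg = remover("photo.jpg")   # returns a PIL image with the background removed
no_bg.save("photo_no_bg.png")
```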
[00:59:39] Alex Volkov: In the voice and audio category, we don't have a lot of audio stuff lately, so I think the main audio stuff that we've talked about was, I want to say, Suno is like the latest and greatest, but we're still waiting for some cool music creation stuff from different labs. And definitely I know some of them are coming, but in the voice category, and you know that we've been talking about, my position in this, and Nisten and I share this position.
[01:00:01] Alex Volkov: I think personally, The faster models will come out that can clone your voice and the faster they're going to come out in open source, the better it is generally for society. I know it's a hot take, I know, but I know also, I cannot reveal the source, I know that voice cloning tech is going to be at open source like super, super quick.
[01:00:21] Alex Volkov: And I think it's like one of those break-the-dam type things, where the first kind of major lab will release a voice cloning and then everybody will see that nothing happened in the world, everybody else will release theirs, and we know everybody has one. We know for a long time that Microsoft has, I want to say VALL-E, was that VALL-E?
[01:00:38] Alex Volkov: That clones your voice in under three seconds. There's papers on this from every company in the world. We know that OpenAI has one. They collaborated with Spotify and they cloned Lex Fridman's voice and it sounds exactly like Lex Fridman. We know that companies like HeyGen, for example, I think they use ElevenLabs.
[01:00:54] Alex Volkov: ElevenLabs has voice cloning as well. None of this is open source, everything is proprietary. So we're still waiting for voice cloning in open source from a big company. But for now, we got something called MetaVoice from a smaller company. Not from Meta, it's just called MetaVoice, it's confusing.
[01:01:08] Alex Volkov: It's just like a tiny model, a 1.2 billion parameter model. It's trained on 100K hours of data, which is quite significant, but not millions of hours. And it supports zero-shot voice cloning. So basically from a few samples, like a basic sample of your voice, you're going to get a clone of your voice or somebody else's, which is what scares many people in this area.
[01:01:30] Alex Volkov: It has like long form synthesis as well. It's super cool. And it has emotional speech. If you guys remember, we've talked about. How important emotion is in voice cloning, because again, for those of you who follow ThursdAI for a while, you may remember myself voice cloned in kind of Russian, and I'm doing this with a lot of excitement, when the regular voice cloning thing for Alex speaks in a monotone voice, that's Very clearly not the same kind of person.
[01:01:56] Alex Volkov: So emotional speech is very important. And some of this is with prompt engineering and some of this happens in voice casting or voice acting. And the best part about this MetaVoice thing is Apache 2 license and it sounds pretty good. And so we've talked about multiple TTS models, and now this model is definitely out there.
[01:02:14] Alex Volkov: So if you're building anything and you want a TTS model for you with voice cloning, I think this is now the best. the best shot you have. It's called MetaVoice. I'm going to be adding this to the show notes as well. And I think we have a breaking news from a friend, VB with another model called Nemo.
[01:02:30] Alex Volkov: So let's take a look. Yeah, definitely a new model from NVIDIA. It's called Nemo. Let me actually use this. I want to use the sound as much as possible.
[01:02:50] Alex Volkov: So I'm gonna go and try and find this tweet for you, but basically we have breaking news, literally. Reach VB, who is the guy, friend of the pod, who's in charge of, like, all the cool voice related and TTS related tech at Hugging Face, he mentioned that NVIDIA AI released Nemo Canary.
[01:03:07] Alex Volkov: Nemo Canary is at the top of the Open ASR leaderboard. VB is also part of the folks who are running the leaderboard for us. ASR stands for Automatic Speech Recognition. No, I think I'm confusing this. Yes, automatic speech recognition. Cool. Thank you, Nisten. So basically, if you guys remember Whisper, we talked about Whisper a lot.
[01:03:25] Alex Volkov: This is the leaderboard, and Whisper has been on top of this leaderboard for a while. Recently, NVIDIA has done some stuff with things like Parakeet. And now we have a new contender in the ASR leaderboard called Nemo Canary 1B. 1B is not that much. The highest Whisper, Whisper large, I think it's 2.5B or something.
[01:03:44] Alex Volkov: This is now top of the ASR leaderboard. It beats Whisper and it beats Seamless from Meta as well. And I don't know about the license here. It supports four languages. Whisper obviously supports a hundred, which is, we know, the best for many kind of low resource languages as well. Trained on not that much annotated audio, only 85,000 hours or so, and it's super fast as well.
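For the curious, running Canary locally goes through NVIDIA's NeMo toolkit; the sketch below roughly follows the model card, and the exact argument names may differ between NeMo versions, so treat it as illustrative rather than definitive.

```python
# Rough sketch of transcribing with Canary via NVIDIA's NeMo toolkit.
# Argument names can differ between NeMo versions; check the model card.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# 16 kHz mono WAV is the usual expectation for NeMo ASR models.
predictions = model.transcribe(paths2audio_files=["clip.wav"], batch_size=1)
print(predictions[0])
```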
[01:04:10] Alex Volkov: It's very interesting that NVIDIA does multiple things in this area. We had Parakeet, now we have Canary as well. What else should we look at? I think it beats Whisper, and by a considerable margin, again, on these specific languages. Folks, we've been, I think, we've been on this trend for a while, and I think it's clear.
[01:04:28] Alex Volkov: Incredible automatic speech recognition comes on device very soon. Like this trend is very obvious and clear. I will add my kind of thoughts on this from somebody who used Whisper in production for a while. The faster it comes on device, the better. And specifically, I think this will help me talk about the next topic.
[01:04:47] Alex Volkov: Let's see what else I have to cover. Yeah, I think it's pretty much it. The next topic
[01:04:51] Nisten Tahiraj: I'm trying it right now, by the way. And it's pretty good.
[01:04:55] Alex Volkov: Are you transcribing me in real time or what are you doing?
[01:04:58] Nisten Tahiraj: yeah, I was transcribing your voice through the phone to my laptop but weirdly enough it doesn't output numbers, it only outputs words however
[01:05:06] Nisten Tahiraj: It seems pretty good, huh? I don't know, it seems good to
[01:05:09] Nisten Tahiraj: me, LGTM looks good to me.
[01:05:11] Alex Volkov: Yeah, looks good to me. Absolutely. The word error rate for Whisper is around 8%, I think, on average for these languages, and for Canary it's less, it's 5, I think. If I remember correctly, VB told us that word error rate is like how many mistakes per 100 words it does, and this does five mistakes versus eight, I think, on the general data sets.
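Word error rate is just edit errors (substitutions, insertions, deletions) divided by the number of reference words; a quick way to compute it yourself, assuming the jiwer package, looks like this (the sentences are made-up examples):

```python
# WER = (substitutions + insertions + deletions) / number of reference words.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print(jiwer.wer(reference, hypothesis))  # 2 errors over 9 words -> ~0.22, i.e. ~22%
```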
[01:05:36] Alex Volkov: Quite incredible. This is coming and I think I'll use this to jump to the next thing
[01:05:39] Alex finds a way to plug Vision Pro in spaces about AI
[01:05:39] Alex Volkov: The next thing, and briefly we'll cover this, is that I haven't used it for the show, but for the past, since last Friday, basically, I've been existing in reality and in augmented virtual spatial reality from Apple.
[01:05:52] Alex Volkov: And the reason I finally have a chance to connect these two things is because. I use a lot of the hand gestures within the Vision Pro from Apple, which was released on Friday and a lot of voice as well. And obviously we've talked about Siri, we've talked about finally Google stepping up with their assistant.
[01:06:08] Alex Volkov: Siri voice recognition and also typing is not that great. And I know because I used Whisper in production for a bunch. I also use Super Whisper, shout out Neil on my Mac to actually dictate a bunch. And all those tools, all the new tools, Whisper and now Canary and like all these things, they understand me and my accent very well.
[01:06:26] Alex Volkov: Whereas Siri is like on device. So Siri actually has two automatic speech recognition systems. They have the fast one on device, and they actually do send your voice to the cloud and return something. So you would [01:06:40] actually see a wrong transcription and then the right one replaces the wrong one. And the right one is actually generally okay, even though with my accent it doesn't get me as much, but the wrong one is very bad.
[01:06:50] Alex Volkov: It's like they stopped thinking about ASR, Automatic Speech Recognition, at Apple back in 2019, and that's what they shipped. However, there were quite a few papers from Apple on this topic, and I know for a fact that we're getting on-device stuff. And the reason I'm excited about this in the spatial context as well is because you can talk instead of using hands on a keyboard, and that's very cool. I think that's all I had to connect with the spatial computing, in addition to, I've tried all the AI tools and games and everything, and we're still not there.
[01:07:19] Alex Volkov: There has been one thing that I want to connect if you guys know from the diffusion model area There is a way to generate images in 360 around you and I thought this was super cool because this is essentially a holodeck moment where you can stand in full virtual embedded reality and just say, Hey, I want this thing to appear.
[01:07:39] Alex Volkov: And we have now models of text to 3D that are coming like super soon. We obviously have virtual friends, but embedding them in real space needs a robot. But now if you have this like spatial computing thing, you can actually put an AI friend in the corner that will always talk to you. So there's a few like attempts at this in the Apple thing,
[01:07:57] Alex Volkov: but not a lot. And also I will ping back to this like last thing where Apple is coming. We've talked about this. Apple is coming on Friday of release of Vision Pro, which was the day after last Thursday. Apple had their uh, shareholder meeting. And in there, Tim Cook said, Hey, we launched spatial computing.
[01:08:15] Alex Volkov: We're really happy. This is the next iteration of spatial stuff, blah, blah, blah. I definitely agree about all this. If you watch my feed for the past week, that's pretty much all I can talk about besides AI. However, going back to the AI, Tim Cook finally mentioned the word AI in the call, and he's not the only one.
[01:08:30] Alex Volkov: It's very clear where the thing is going. Every earnings call for every major company mentioned AI. Tim Cook specifically mentioned AI finally and said, Hey. We're very excited about this technology and we're going to show you something like soon. So I expect that this WWDC is going to be Spatial and AI related and I definitely think that Apple are thinking about both just because the way Siri looks in Spatial is just incredibly like nice.
[01:08:55] Alex Volkov: And I can see how embodying AI in your physical world, where you have spatial awareness, you can put something in the corner, it will sound like it's coming in the corner. And I'm waiting for the, for the point where that has a bot, like a Tesla Optimus bot with AI.
[01:09:11] Alex Volkov: But before that, we'll definitely get there with spatial computing. So I'm going to have embodied AI agents around me and I'm going to ask questions. For some reason, the ChatGPT interface within the headset is horrible. And specifically because we all know that the iPhone app you can talk to, but Vision Pro only has access to iPad apps, and you can install the ChatGPT iPad app, but you cannot talk to it, which is a miss, I think, on OpenAI's part.
[01:09:35] Alex Volkov: That's it for my segment about the Vision Pro. I tried as much as possible to connect these things to AI to bring this to you. But, separately from this, my full review of Vision Pro is, holy s**t, this device is the new category of computing, and I can talk about this in a different space if you're interested.
[01:09:50] Space reset
[01:09:50] Alex Volkov: and I think it's time for a reset the space, as we've gone up for an hour here, folks. A little bit more than an hour. I'm just gonna play some music, reset the space, and then we're gonna have a conversation with some folks here on stage.
[01:10:12] Deep dive into DSPy, ColBERT and RAGatouille with Ben Clavie and Connor Shorten
[01:10:12] Alex Volkov: Welcome, everyone, to the second hour of ThursdAI. Where we usually, we have a bunch of stuff to cover still from the news angle, like the Bria stuff and the MetaVoice stuff and the Arts in the Fusion. But, and also maybe you want to have some time to talk about Vision Pro, but for now, we have two guests here on stage that I want to welcome and introduce.
[01:10:31] Alex Volkov: And we're going to talk about very interesting things that maybe some of you who follow the Twitter and X AI ecosphere have been seeing around, and I really want to say thank you and welcome to Connor and Benjamin for joining us. Maybe let's unmute Connor first and then Benjamin, and just introduce yourself.
[01:10:49] Alex Volkov: Benjamin, I know you're going through some stuff, buddy. And as much as you can, Benjamin, feel free to talk to us, but we'll try to cover as much as possible. Connor, go ahead and then Benjamin.
[01:10:58] Connor Shorten: Hey Alex, are you able to hear me first?
[01:11:00] Alex Volkov: Yes, we can hear you loud and clear.
[01:11:03] Connor Shorten: Awesome, cool. I think I've been like refreshing the Twitter page and all that, but awesome. So I'm Connor. I'm a research scientist at Weaviate. I also host the Weaviate podcast. And yeah, I've just been so excited about DSPy and I'm, really excited to be diving
[01:11:15] Connor Shorten: into it further.
[01:11:16] Alex Volkov: That's awesome. And I think that the Weaviate podcast was the first podcast that I came on as a little bit of a guest, from NeurIPS. So we had a great conversation outside of the NeurIPS sign, if you guys want to check this out. But also on the Weaviate podcast, the folks from Weights & Biases had a great chat with you.
[01:11:29] Alex Volkov: That's where I know you from. I actually researched my position and my team based on the conversation you had with them. Very knowledgeable. And thank you for that content. It's really great. And folks definitely should check it out. And I want to also say hi to Benjamin Clavie. Welcome, Benjamin.
[01:11:44] Benjamin Clavie: Hey,
[01:11:45] Benjamin Clavie: thank you for having me. Can you hear me?
[01:11:47] Alex Volkov: Yes, you're coming through loud and clear.
[01:11:50] Benjamin Clavie: Yeah. Thank you. Yeah, I've made Ragatouille, which you might have seen if you're interested in retrieval at all, which is
[01:12:02] Benjamin Clavie: physically here, but not present in, but
[01:12:05] Alex Volkov: So, what's, in terms of background, could you give us a little bit of background? Like how you came up to build these things? What's your background? Is this AI? Give us maybe a few brief sentences there.
[01:12:15] Benjamin Clavie: I'll say. My background
[01:12:16] Benjamin Clavie: here is basically ai. I've done the stereotypical thing of dropping out of uni and immediately gone walking into NLP and I've been doing retrieval on NLP for 6 7 years now.
[01:12:25] Benjamin Clavie: Very standard background.
[01:12:27] Alex Volkov: So definitely related background. Okay. So we're here to talk about multiple, multiple interesting things. And Connor, I think maybe let's just start with, I think the guy behind some of this work, Omar Khattab, is not with us, right? But definitely some of the work that we're going to talk about is attributed to him.
[01:12:45] Alex Volkov: So maybe, can you, Connor, can you start us with an introduction to maybe DSPy and then Colbert, and then we're going to talk about Colbert and Ragatouille, and then just a brief one, then we're going to dive into what this means for retrieval stuff, definitely as it relates to you guys at Weaviate. RAG is everywhere, and like better RAG systems and better
[01:13:03] Alex Volkov: options to prompt these LLMs to better retrieve, everybody's looking for those. So let's start maybe there.
[01:13:12] Connor Shorten: Okay, so I'll try to keep the story going from intro to DSPy and then taking it into retrieval. So I think the first thing about DSPy that will like capture your interest is the programming model. It has this way of Writing initial prompts in a really succinct way, and then you can chain together or compose these graphs of several large language model calls with tool use in the middle, and we can come into retrieve a little bit there as well, but you start off with a really coarse description of what you want it to do, re rank these documents, and then it will optimize the, the whole description of the task as well as giving you a few shot examples to put in the prompt.
[01:13:50] Connor Shorten: So that's the first thing that is just super interesting I'm sure everyone listening has done this like manual tweaking of the prompt to try to, get it to do your task and how irritating that can be. And so that's probably the quickest value add is it automatically will come up with the prompts.
[01:14:03] Connor Shorten: And then when you want to switch your language model, you've been over there saying please output JSON, four exclamation marks performing better than one. And now you switch from GPT-4 to Gemini Ultra, or say you want to see if Qwen can be few-shot prompted to do this.
[01:14:17] Connor Shorten: You can now recompile the prompt by using DSPy, and you can switch your language model without having to then redo the prompt tuning.
[01:14:24] Alex Volkov: So I have to pause right here, Connor, because I'm coming to this as clean as possible with not a lot of understanding of these things . You said recompile the prompt.
[01:14:33] Alex Volkov: I'm definitely one of the folks who've tweaked prompts, tried again, saw, okay, it works for a GPT 4. I'm definitely one of those folks. What do you mean compile the prompt, recompile the prompt? Let's talk about the compilation part of this.
[01:14:44] Connor Shorten: I even, when I met Omar, I said, compile, it's overloaded. I think this kind of analogy started with calling LLMs the new operating system, and so I think that's the line of thinking to be calling it a compiler. Really we mean automated prompt [01:15:00] tuning.
[01:15:00] Connor Shorten: But the reason compiling, I think is the right way to think about it, is, let's say you have eight large language model programs eight parts of it that's what I think is the really exciting that's what I think makes LangChain so popular is people see this gallery of examples of chains where you first analyze some chunks of blog posts, extract the topics, then, You later on aggregate the topics into a description of the topic and then you maybe pass it to an editor prompt, and then you maybe have a council of reviewers, like there's this chain, and so with each component of the chain, or I think graph is now the more common abstraction.
[01:15:35] Connor Shorten: You have a prompt there. So let's say you have eight language, or however many, I imagine that as this, continues to evolve, we're going to see like super deep LLM the programs that will have so many LLMs in the middle of it. And so you have a prompt for each of those components.
[01:15:49] Connor Shorten: And so that's why compiling, I think the analogy is great because you're compiling the prompts for all of these prompts and yeah, so that's why I'll defend the compiling.
[01:16:01] Alex Volkov: So I'll just say like from a perspective of a tinkerer. That's something that maybe triggers me a little bit to say, Oh, I need to compile stuff. No, I just write Python code, but you're saying developers do not fret. Compiling is not that like crazy. It's specifically very helpful and useful for larger applications and very, is very helpful for when you want to replace the brain behind the stuff that you're doing or you want to do this in a structured way.
[01:16:24] Alex Volkov: Is that me understanding correctly of what we're talking about?
[01:16:28] Connor Shorten: Yeah, I agree completely with that.
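To make the "compiling" idea concrete, here's a minimal sketch roughly following DSPy's documented pattern at the time; the model name, metric, and toy trainset are placeholders added for illustration, not anything from the show.

```python
# Minimal DSPy-style sketch: declare a signature, wrap it in a module,
# then "compile" it, i.e. bootstrap few-shot demos against a metric.
import dspy
from dspy.teleprompt import BootstrapFewShot

lm = dspy.OpenAI(model="gpt-3.5-turbo")  # any supported LM; swap it and recompile
dspy.settings.configure(lm=lm)

class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField()

class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, question):
        return self.generate(question=question)

# Toy examples and metric; real programs use a proper trainset and evaluation.
trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

def exact_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

compiled_qa = BootstrapFewShot(metric=exact_match).compile(QA(), trainset=trainset)
print(compiled_qa(question="What is 3 + 3?").answer)
```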
[01:16:29] Alex Volkov: Awesome. So that's DSPy, and Omar Khattab, Late Interaction I think the nickname is. We're definitely going to add him to the show notes as well. He's the author of this. DSPy has been around for a while. I definitely know that he has been posting about this quite, quite a lot, but recently it has been on the pickup as well.
[01:16:46] Alex Volkov: And maybe Colbert is one of the reasons. Let's maybe, can you introduce Colbert as well, Connor? Or do we have some stuff about DSPy still to cover in the introduction phase?
[01:16:56] Connor Shorten: Okay, I can transition to Colbert.
[01:16:58] Alex Volkov: Colbert? Colbert? How do we, how do you even pronounce this thing?
[01:17:02] Connor Shorten: I was surprised when Omar pronounced it Colbert because it, it's Bert and then there's Stephen Colbert. I'd heard him on the podcast with I think Christopher Manning from Stanford who had, asked him about that.
[01:17:14] Alex Volkov: So if Omar, the creator of this, pronounced it Colbert, unfortunately, even though it's BERT models, I think Colbert is what we're talking about. But yeah, from Stephen Colbert. What is Colbert? Why is there excitement on my feed around this? Let's give us an introduction, Connor.
[01:17:31] Connor Shorten: So the, probably the right way to start thinking about it is in search, you typically have retrieval and then re ranking and retrieval is where you have like encodings of the documents. Like you put each of the documents into an embedding model and you get a vector embedding, and then you're doing just, dot product distances between the query vector and these document vectors.
[01:17:51] Connor Shorten: So there's no interaction between the query and the documents. The representations are encoded completely separately in retrieval. And then you'll typically pass that into a re-ranker. And so there are three kinds of re-rankers. There's point-wise re-rankers that take as input the query and the document and then output a relevance score, doing the interaction between just the query and this one document.
[01:18:12] Connor Shorten: Then there's pair-wise, where you take two documents and the query and have a tournament of two at a time. And then there's the list-wise re-rankers where you're taking all the documents as input at once. So the re-rankers are pretty effective, but you have this massive latency overhead by doing it like that.
[01:18:28] Connor Shorten: So what Colbert introduces is this late interaction. So you get the benefit of having this interaction between the query and the document, most similar to the point-wise cross-encoder re-ranker, where you keep the vectors for the documents and you have this kind of interaction between the inner token vectors.
[01:18:47] Connor Shorten: So let me, it's right now what we're doing mostly with vector search is, and this is why the BERT thing is actually really important, is because we're using these encoder only models that output that like a vector for each of the token. But then we pool all those vectors to represent the object with one vector.
[01:19:02] Connor Shorten: But with Colbert, you keep all the vectors for the query and the document. And then you have this inner, it's maybe a little hard to just talk you through the math behind this, but you take the maximum similarity of each of those query vectors with all the document vectors. So say you have 100 document vectors and you're at index 0 of the query vectors, you do the maximum similarity with those 100.
[01:19:22] Connor Shorten: Then you're at the first vector of the query, second, third, so on. And then you'll average that out. So you now have this late interaction of the vectors between the query and the document. I hope that maybe Benjamin can take the mic from here. I hope that gets the gist of it.
[01:19:37] Benjamin Clavie: Yeah, that was pretty good. So just to clarify, like max similarity is, when you're using normal vectors, or like a single-vector representation, you do have a single vector for the whole document.
[01:19:48] Benjamin Clavie: When you're using Colbert, like Connor said, you've got one vector per token, and at retrieval time, what you do is you compare every single one of your query tokens, so generally not a lot, like maybe 32, and you compare that with every single token in every single document, and you only keep the highest similarity, and then you sum that up. So you compare every token to every token, and you get this really fine grained comparison, instead of trying to slot everything into one massive vector, which would probably lose information.
[01:20:17] Benjamin Clavie: Because you're doing it at the token level, it's very clear. I call this like a bag of embeddings, because it's quite close to what we do with TF-IDF, but with embeddings instead of just a word count.
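For readers who want the scoring rule spelled out, here's a toy numpy illustration of that MaxSim idea (random vectors, not the official implementation): for each query token, take its best match over all document tokens, then sum those maxima.

```python
# Toy illustration of Colbert-style late interaction (MaxSim) scoring.
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim),
    # both assumed L2-normalized so dot products are cosine similarities.
    sims = query_vecs @ doc_vecs.T          # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())    # best document token per query token, summed

rng = np.random.default_rng(0)
def norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)

q = norm(rng.normal(size=(32, 128)))    # e.g. 32 query tokens, 128-dim vectors
d = norm(rng.normal(size=(300, 128)))   # a 300-token document
print(maxsim_score(q, d))
```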
[01:20:29] Alex Volkov: Wow. Okay. So let me try. So Connor said a bunch of stuff, then Benjamin, you simplified. Let me try to simplify from my understanding. Okay. Regular RAG system, regular basic, without even the re-ranking step, Connor? Like the basic stuff that people do in the Weaviate examples, for example, or whatever local embeddings you have, let's say a vector store of a bunch of information.
[01:20:49] Alex Volkov: You have a user asking a question, you want to augment the LLM's information because of the knowledge cutoff. And then you embed the user's query in some sort of embedding. We've talked about embeddings multiple times here on ThursdAI. You get some number back, and like Benjamin said, you get one embedding for the whole document or the whole query.
[01:21:08] Alex Volkov: You get like just one, not per token. You get one embedding and then you use that to compare, and the usual similarity score is the way to compare this. Then if you wanna go to advanced stuff, then you maybe do some re-ranking. Re-ranking is basically another LLM step, basically, right Connor?
[01:21:28] Alex Volkov: Or maybe some model that does re-ranking for you: you retrieve multiple examples, and you choose which one fits better. And you can do this based on several things. The downside of this is, the bigger the documents you embed, the less the concepts in this whole embedding are similar to your query.
[01:21:47] Alex Volkov: And we've all talked about this, this kind of similarity is very interesting, because an embedding definitely has dimensions, but it's hard to figure out how a huge document embeds into one, how should I say, averages with everything that happens in there. And the benefit here of Colbert,
[01:22:06] Alex Volkov: finally, I'm pronouncing this correctly, Colbert, is that instead of embedding one time, it embeds per token. And am I getting this correctly? That sounds to me like a lot of compute. Is that correct? Embedding per token sounds, okay, now we can compare each token from the query to each token of the document.
[01:22:24] Alex Volkov: But is it significant overhead in terms of computation time, compute? What's the downside? It sounds better on the surface.
[01:22:32] Benjamin Clavie: So yeah,
[01:22:33] Alex Volkov: Go ahead, Benjamin, please. Yeah.
[01:22:35] Benjamin Clavie: Your clarification was quite clear, in that, yeah, it's very clear, the problem with single vector representation is you've got a long document, and you're essentially asking the model to be like, I'm going to squeeze every single thing there is to know about this document into 500 floats or something, which is not a lot of space.
[01:22:54] Benjamin Clavie: But, Colbert takes more storage space, to answer your question, like you will need to store more tokens even though there are compression techniques, and we'll get into that later. But compute wise, it's essentially the same, because when you're using any sort of transformer model, you'll be attending to every token anyway.
[01:23:09] Benjamin Clavie: The only difference is Colbert actually stores those, instead of just averaging them at the end.
[01:23:15] Alex Volkov: Oh, so the, on the output of something like Colbert, you actually get all of the [01:23:20] embeddings per token and not just one embedding per the whole document. And then you can, it's like the storage is higher, but you can actually use those for more, better, higher quality comparisons. That's what we're talking about here.
[01:23:33] Alex Volkov: Is that correct?
[01:23:35] Benjamin Clavie: That's the gist of it, yeah. And then after Colbert you've got Colbert v2 and PLAID, which is essentially, Omar and team found out that, yeah, that does take a lot of space, but can we compress the embeddings? So most of the time when you see Colbert used in production, it actually compresses every single token vector to just one or two bits.
[01:23:56] Benjamin Clavie: So it doesn't take that much space.
[01:23:58] Alex Volkov: Oh, so Colbert v2 is, what, a 10x size reduction or something in comparison, right? Something like this. Connor, can you speak about this? Because obviously you're in the vector database space. The more folks host, the better it is for you guys, because you get paid per token. Can you just speak about the size of this and like the improvement as well?
[01:24:20] Connor Shorten: There's a couple ways you can do this quantization. The most common is just to run k-means on the segments: you divide the vectors, and every two contiguous values, you would then cluster that, and then reduce the precision to like eight bits. So when you quantize the token vectors, you can take down the storage overhead a lot. But yeah, I think Benjamin already said it all.
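As a toy illustration of the general idea (not the real Colbert v2/PLAID code): cluster the token vectors with k-means, then store each vector as a centroid id plus a coarsely quantized residual; the real system compresses much further, down to a bit or two per dimension.

```python
# Toy sketch of centroid-plus-residual compression for token vectors.
# The real Colbert v2 / PLAID implementation is far more aggressive (1-2 bits per dim).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(5_000, 128)).astype(np.float32)

kmeans = KMeans(n_clusters=64, random_state=0).fit(token_vecs)
centroid_ids = kmeans.predict(token_vecs)                    # 1 byte per vector
residuals = token_vecs - kmeans.cluster_centers_[centroid_ids]

scale = np.abs(residuals).max() / 127.0                      # one global scale (very coarse)
residuals_q = np.round(residuals / scale).astype(np.int8)    # 1 byte per dimension

# Approximate reconstruction at search time:
approx = kmeans.cluster_centers_[centroid_ids] + residuals_q * scale
print("mean squared reconstruction error:", float(np.mean((token_vecs - approx) ** 2)))
```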
[01:24:43] Alex Volkov: Okay, so now let me take this into the practical realm because Colbert, the original paper came out in 2020 and I don't remember this off the top of my head, but the way I'm reading, I have some mental documentation here that I'm using to ask you guys the proper questions. And then Colbert V2 came out and a significant compression of the data because they quantize the actual individual embeddings and performance is essentially the same, I assume.
[01:25:06] Alex Volkov: And then. It also came out a while ago, and then, Benjamin, I think you're in charge, single handedly, for the resurrection, or like the renewed interest, because all of what we're saying doesn't not, doesn't sound to me super easy, as somebody who just okay, it's super easy for me to use a vector database, like wavy, other competitors, local vector stores, they all have very simple tutorials for me to just embed the query, go do a regular the nearest neighbor can then search whatever, and then just do this for the user.
[01:25:34] Alex Volkov: Now, all of what we're talking about, embedding per token, like comparison, like all of these things sound complex to me, and then that's where Ragatouille comes in, correct? So can you talk about, you see all this happening, and then what's your library doing why is it in charge of the resurrection of this whole concept?
[01:25:53] Benjamin Clavie: Yeah, I don't know if I'll go as far as resurrection, but yeah, Colbert is basically used by everyone who is quite aware of search, like pretty much every search startup, people at Google, etc. are using Colbert, but it didn't get that big outside the power user area. And the reason, I think it's something that Omar mentioned the other day, is, I wouldn't say Colbert itself isn't usable, but it's not approachable.
[01:26:16] Benjamin Clavie: If you go look at the repo, it's scary. There's a lot of things. How do I store those vectors, et cetera. And the point of Ragatouille is trying to bridge that gap, because we are now at the point, I think, where AI has users that aren't like traditional AI power users, especially in IR. Vectors are complicated.
[01:26:33] Benjamin Clavie: Embeddings are complicated. And the point of Ragatouille was basically like, yeah, but what if you could use Colbert in just like four lines of code? And I tried to build that, and it turned out to be quite easy to build, so that's how it came to be.
[01:26:46] Alex Volkov: So you built it, it's quite easy for you. What is it? Is this like a library wrapper on top of the knowledge of how to run Colbert in production? What is the library like? Is this the LangChain for Colbert? Tell us what folks are to expect when they open it up and they say, okay, I need to use something like this.
[01:27:03] Alex Volkov: This is super interesting. This is higher quality retrieval. How do I start?
[01:27:09] Benjamin Clavie: Yeah, so I think there's two things here, that's where I would like it to be, and where it currently is. Where I would like it to be is to keep adding more stuff and basically bridge the gap between what's popular in IR research or retrieval, which is probably a few years ahead of what's actually popular in the mainstream because it's quite obscure.
[01:27:26] Benjamin Clavie: And then what it is right now, like when you open the tool, there's basically two main classes: one that you can use to fine tune and train Colbert models, and hopefully more late interaction models, but right now it's just Colbert. And it tries to abstract away all the hard stuff. There's a thing called hard negatives, when you're training for retrieval, and you need to mine for hard negatives, and that's done in the background.
[01:27:48] Benjamin Clavie: And then you've got the main one, which you can use to use Colbert as a re-ranker, or use Colbert to encode documents in memory, or use Colbert to create an optimized Colbert index, which does the compression, etc. So it's basically, yeah, give it your documents, it will process them, and then you end up with something you can play with.
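In practice that main class is a handful of lines; the sketch below roughly follows Ragatouille's quick-start as I understand it (check the repo for exact signatures), with made-up documents and queries.

```python
# Roughly the shape of Ragatouille's quick-start; exact signatures may differ.
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

docs = [
    "Colbert keeps one embedding per token and scores with MaxSim.",
    "Single-vector retrieval pools a whole document into one embedding.",
]  # a real collection would of course be much larger
RAG.index(collection=docs, index_name="demo_index")          # builds an optimized index
print(RAG.search(query="How does late interaction scoring work?", k=2))

# Re-ranking an existing candidate list is a lower-commitment way to try it:
print(RAG.rerank(query="late interaction", documents=docs, k=2))
```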
[01:28:04] Alex Volkov: Just from the perspective of somebody who hasn't used this model so far: let's say I already have an existing vector database. I need to re-embed everything in there to start using Colbert with Ragatouille. And that's what you mean by fine tune, or is there like an additional thing that's called fine tune?
[01:28:20] Alex Volkov: Because this is not like the LLM fine tune that we've talked about here on ThursdAI multiple times. This is a different fine tune. What are we fine tuning? How long does it take? Does it need GPUs? If you don't mind, walk us through how easy this is for the user to do.
[01:28:36] Benjamin Clavie: Yeah, that's a good question. So it's actually quite similar to LLM fine tunes, just on a much smaller scale, because you would actually be fine tuning the model itself. There's another paper by Omar and team, Omar is everywhere in this space, regardless. There's another paper by Omar and team called UDAPDR, which is actually a combination of using DSP, so the proto-DSPy,
[01:28:59] Benjamin Clavie: with Colbert, to fine tune Colbert to any unknown domain. So for any new domain, you could technically get a much better retrieval model using that. Right now there's only one implementation. That's something we would like to have in Ragatouille. But yeah, the other question is, can you use your existing vectors with this?
[01:29:17] Benjamin Clavie: The answer is no, and that's quite annoying. And when I say fine tune, I also mean like you can fine tune the model, but you can also just use Colbert off the shelf and use that to embed your documents and create a new index. But if I have to speak of the cons, I would say there's no VectorDB, except Vespa, which I don't think qualifies as a modern VectorDB as we probably mean here, that can use Colbert embeddings out of the box.
[01:29:41] Benjamin Clavie: I know there's interest, maybe Connor, you guys will support it at
[01:29:44] Connor Shorten: some point soon. Yeah, we're definitely working on it. I do think that you've maybe understated the contribution of Ragatouille. Before you did this, it was not easy to train your own Colbert model, and it definitely wasn't something that we saw as frequently.
[01:30:03] Connor Shorten: It was like, yeah, I think that you've definitely evangelized it. I don't necessarily agree that most people doing search were doing it this way. Maybe I've just opened a thing, but I think most people have been doing the kind of pooled vectors thing and this is very new. But, and yeah, we are working on adding it.
[01:30:22] Alex Volkov: I, from my perspective, just judging by the social feeds, I agree. Benjamin, without it I don't think I'd even have been interested. But I want to maybe ask Connor here as a follow up. So you see Ragatouille blowing up, like what piques your interest in how approachable this is?
[01:30:36] Alex Volkov: What's fine tuning a Colbert model mean for retrieval? You guys are like researching every retrieval technology out there as much as possible in order to bring this obviously to your users as well. Quality of retrieval is very high of a very high importance as well, but storing these like vectors in different vector databases.
[01:30:54] Alex Volkov: What do you see in Ragatouille like exploding, and how does this translate into people using RAG better, sorry, RAG better?
[01:31:05] Connor Shorten: Yeah, I guess it, yeah, it definitely is just, I think, what I opened with, this kind of retrieve and re-rank, collapsing it into the one thing. And I think Benjamin just explained it really well. I agree with you, Alex. I don't think I would have understood Colbert as well as I do now if it wasn't for Benjamin and Ragatouille.
[01:31:21] Connor Shorten: So that's what I think, but under the hood, it's I think it's still like this re ranking thing where we can still use, we still use the pooled vector and like an HNSW search to surface the candidates and then we'll now bring the, the other token vectors with it.
[01:31:35] Connor Shorten: And then, for Weaviate that just means opening up, like having a more generic type [01:31:40] for how we store vectors to, instead of just one vector now we have this, like an open interface. To, to let you still use the, because the pooled vector embedding search is still very popular as well.
[01:31:51] Connor Shorten: The OpenAI embedding. I think the Matryoshka thing, maybe we could talk about that as well. I think that has some flavors of this. I'm not sure if it still has the same kind of hierarchy to it. But I think there's also, maybe I'm going off topic, but there's also a paper from DeepMind about semantic IDs.
[01:32:06] Connor Shorten: And so semantic IDs, they're like this like hierarchical, discrete quantized things where it'd be like you Like at the, say you have three, three IDs and they're each eight bits and the first one would be like whether it's about sports or news or something like that. So there's definitely a, yeah, this is definitely like a newer thing, I would say.
[01:32:25] Connor Shorten: And I hope I answered the question. I think I just did like a circle around.
[01:32:28] Alex Volkov: No, with this article, definitely. I just want to touch about a concept that may be not familiar for folks here on the ThursdAI stage. Matryoshka embeddings came to my, on my radar just recently after OpenAI released their new embedding models. And one of the things they've added in their new embedding models is the ability to reduce dimensions like via API call.
[01:32:45] Alex Volkov: And people started thinking, hey, how did they do this? Usually, when you get an embedding model, you get a fixed number of dimensions. And then some folks started saying there was this paper called Matryoshka embeddings. Matryoshka, if you're not visualizing what this is, it's like the Russian dolls thing where one fits into another.
[01:33:00] Alex Volkov: And there's this paper, and I think the author of Matryoshka embeddings is on my radar as well, maybe we'll get him on ThursdAI, that actually allows for significantly smaller embeddings, correct me if I'm wrong. And I think folks from Jina definitely talked about trying to train Matryoshka with some other stuff.
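A minimal sketch of the trick Alex is describing, assuming a model trained with the Matryoshka objective: keep only the first k dimensions and re-normalize. OpenAI's newer embedding models expose an equivalent server-side option through a dimensions parameter; the local version looks roughly like this.

```python
import numpy as np

def shorten_matryoshka(embedding: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions of a Matryoshka-trained embedding and
    re-normalize so cosine similarity still behaves sensibly.

    This only works well if the model was trained with the Matryoshka
    objective (nested losses on prefixes of the vector); truncating an
    ordinary embedding this way degrades quality much faster.
    """
    truncated = embedding[:k]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)           # stand-in for a full-size embedding
small = shorten_matryoshka(full, 256)  # the "inner doll": a 256-dim version
```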
[01:33:17] Alex Volkov: So this is a new concept we haven't touched upon yet, but it could potentially be an additional competitor here. I want to scroll back real quick. We have Benjamin back. Benjamin, let's talk about the speed of this for larger document sets. That's definitely what I learned about RAGatouille, but also about ColBERT.
[01:33:36] Alex Volkov: I saw something, I think from Omar, about millions of rows or something being significantly faster. Could you speak about the speed of this whole thing? Are we getting a significant improvement in speed? Why would a person who already has a setup consider switching to something like this?
[01:33:51] Alex Volkov: And let's talk about the seconds it takes to run through a bunch of documents to find similarities.
[01:33:59] Benjamin Clavie: Okay, so I did miss a few things, so it might have been said already, but there's a trade-off here in that creating a ColBERT index, as in an optimized one using quantization, like Connor said, is quite slow, because it has to run k-means on all your embeddings, etc. But the flip side is that once your documents are in an optimized index, querying is pretty much constant time. It doesn't matter if you've got 100 million documents or billions, it will take about 50-60 milliseconds, and that's because the indexing optimization step creates a bunch of centroids that you can use as a gateway to documents, to simplify things.
[01:34:40] Benjamin Clavie: So querying is pretty much constant, and that's a big pro of optimized ColBERT indexes. The flip side is that adding to and deleting from a ColBERT index is very slow, because you need to recompute that. And I think there's space here for some sort of hybrid approach, also using HNSW for smaller collections, because you don't need that sort of optimization if you've got, like, 10,000 documents or something.
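A toy sketch of why query time stays roughly constant: the slow indexing step clusters token embeddings with k-means, and queries are then routed through the centroids to a small candidate set. The class and method names here are illustrative, not the actual ColBERT/PLAID implementation.

```python
import numpy as np
from collections import defaultdict

class CentroidIndex:
    """Toy centroid-routed index: expensive to build, cheap to query."""

    def __init__(self, centroids: np.ndarray):
        self.centroids = centroids         # (n_centroids, dim), from offline k-means
        self.postings = defaultdict(list)  # centroid id -> [(doc_id, token_vec), ...]

    def add(self, doc_id, token_vecs):
        # Indexing cost: every token vector is assigned to its nearest centroid.
        for vec in token_vecs:
            cid = int(np.argmax(self.centroids @ vec))
            self.postings[cid].append((doc_id, vec))

    def candidate_docs(self, query_vecs, n_probe: int = 4):
        # Query cost depends on the number of centroids and n_probe,
        # not on the total corpus size, which is why latency stays flat.
        docs = set()
        for qvec in query_vecs:
            nearest = np.argsort(-(self.centroids @ qvec))[:n_probe]
            for cid in nearest:
                docs.update(doc_id for doc_id, _ in self.postings[cid])
        return docs
```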
[01:35:04] Alex Volkov: Interesting. Just for my understanding, this is very similar to pre-compilation of some stuff versus runtime execution. You're saying you can basically offload the compilation part, and your users will not suffer from it, right?
[01:35:20] Alex Volkov: You don't have to go and call different APIs for this. If you're able to do this and precompile everything, the benefit here is larger indices, significantly larger document stores. You're talking about millions or a hundred million or so. But then retrieval is near-instant, on the order of milliseconds.
[01:35:41] Alex Volkov: That's, I think, a crazy benefit for folks, especially in enterprises and different places. I think it's a significant improvement over regular search and vector comparison. Connor, would you say so as well? Because you guys are in the business of vector comparison and bringing this to people.
[01:36:00] Alex Volkov: Are you seeing a significant improvement in retrieval speed here?
[01:36:08] Connor Shorten: Yeah, I think the latency probably isn't too bad, because the way that I understand ColBERT, or Colbert, sorry, I would agree on ColBERT, is that you still have the top-100 search with HNSW, and that latency is pretty low. It's going to be like five milliseconds at a million scale.
[01:36:25] Connor Shorten: That's like the most hand-wavy thing ever, but then you just bring these quantized vectors into memory to re-rank. It's way faster than the cross-encoder approach, where you take those top-100 results, append them with the query, and send them to an inference container to get back the scores and sort them.
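For contrast, the slower cross-encoder path Connor mentions looks roughly like this with sentence-transformers; the checkpoint name is a commonly used public re-ranker, so swap in whichever model you actually use.

```python
from sentence_transformers import CrossEncoder

# Every (query, document) pair goes through a full transformer forward pass,
# which is why this step is the slow one compared with late interaction.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does late interaction retrieval work?"
top_100 = ["first candidate passage ...", "second candidate passage ..."]

scores = reranker.predict([(query, doc) for doc in top_100])
reranked = sorted(zip(top_100, scores), key=lambda pair: pair[1], reverse=True)
```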
[01:36:39] Connor Shorten: So it's way faster than that. Maybe one thing out of what you just said that I'd want to parse is that I don't think it's the same analogy as compile-ahead versus compose-at-runtime. It's maybe more like an asynchronous kind of thing, where you can query the index that you currently have, and then in the background the index can start doing that k-means quantization.
[01:37:00] Connor Shorten: That's probably the slowest thing, as Benjamin just mentioned, quantizing the token vectors. I'm actually not familiar with the detail of exactly how many token vectors you're keeping per document, but let's say it's 512, right?
[01:37:14] Connor Shorten: And now you're going to be running k-means over each of those in parallel, and then you're also trying to multi-thread the per-segment codebook. So I think fitting that codebook is going to be your challenge, and then keeping it fresh. The thing about Matryoshka is maybe you can get the quantized vectors out of the box with one of the embedding models, but the quantization schemes are pretty dependent on your data particularly. It's not like the embedding models you get from the common APIs come with the codebooks.
[01:37:53] Connor Shorten: You have to fit these codebooks to your data. So I think the way to think about it would be that we can fit these codebooks asynchronously in the background, and you can query what you currently have, and then the updating and refreshing of the index can happen in a cyclical kind of way.
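A simplified sketch of the per-segment codebook (product quantization) step being discussed, fit offline on your own vectors; this is illustrative and not the exact scheme ColBERTv2 uses.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebooks(vectors: np.ndarray, n_segments: int = 8, n_codes: int = 256):
    """Split each vector into segments and fit one k-means codebook per segment.

    This is the slow, data-dependent step: the codebooks have to be fit on
    *your* vectors, which is why generic embedding APIs can't ship them.
    """
    seg_dim = vectors.shape[1] // n_segments
    codebooks = []
    for s in range(n_segments):
        segment = vectors[:, s * seg_dim:(s + 1) * seg_dim]
        codebooks.append(KMeans(n_clusters=n_codes, n_init=10).fit(segment))
    return codebooks

def quantize(vector: np.ndarray, codebooks):
    """Compress one vector down to n_segments small integer codes."""
    seg_dim = len(vector) // len(codebooks)
    return [int(cb.predict(vector[s * seg_dim:(s + 1) * seg_dim].reshape(1, -1))[0])
            for s, cb in enumerate(codebooks)]
```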
[01:38:10] Alex Volkov: All right. I want to maybe move towards, okay, let's say folks are interested in trying this. Benjamin, could you speak about how to get started? Is RAGatouille the right start? I think you mentioned this briefly, I just want to return to it. Is this only significantly better for a large set of documents?
[01:38:28] Alex Volkov: What are the steps to getting started here, and what should people know? And then I guess we'll ask where to find you guys and how to keep up to date as developments in this area happen.
[01:38:43] Benjamin Clavie: So if you want to get started, I think RAGatouille is probably the easiest way to try ColBERT. We've got a few example notebooks on the GitHub repository. If you want to contribute more, please do. That's the big thing: I need more documentation, more notebooks. But you can try re-ranking, or indexing in memory, or building your own index.
[01:39:01] Benjamin Clavie: We've got fine-tuning pretty much out of the box. So I'd say start there. In terms of retrieval performance, ColBERT is always a really strong performer in the existing IR literature, and we do have a re-ranker, so you can just try it out. Use it to re-rank before you commit to indexing your whole document set, just to see how it would perform for you.
[01:39:21] Benjamin Clavie: So that could be an easy way to slot it into any existing pipeline, basically: retrieve documents, re-rank them, and see what the re-ranker does for you.
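A minimal getting-started sketch along the lines Benjamin suggests, based on RAGatouille's RAGPretrainedModel interface; check the repo's notebooks for the exact current signatures, and note that retrieve_top_k_from_existing_pipeline is a hypothetical stand-in for whatever first-stage retriever you already run.

```python
from ragatouille import RAGPretrainedModel

# Load the off-the-shelf ColBERTv2 checkpoint.
rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

query = "what does late interaction buy me over pooled vectors?"
# Hypothetical helper: whatever retriever you already have in production.
docs = retrieve_top_k_from_existing_pipeline(query)

# Re-rank the candidates you already have; no index build required.
reranked = rag.rerank(query=query, documents=docs, k=10)
```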
[01:39:29] Alex Volkov: And in that case, I think integration with existing libraries also exists for folks who use LangChain or LlamaIndex. I saw that they also integrate at least some parts of this, correct?
[01:39:40] Benjamin Clavie: Yeah, and I do want to thank them for that, because they basically did this within 24 hours of me releasing RAGatouille. On LlamaIndex you can use ColBERT indexes, and on LangChain you can use ColBERT indexes and the ColBERT re-ranker as well. So if you already use LangChain you can add an extra ColBERT step using RAGatouille in three more lines of code, I think.
[01:40:02] Alex Volkov: Incredible. So for folks who are interested in trying out what the big dogs use for search, re-ranking, without committing, is a fairly easy way to get started with this and see if you get a significant performance boost. And Connor, we barely touched on DSPy.
[01:40:19] Alex Volkov: I do want to have a conversation about it, because that's also all over my feed, and basically Omar is all over my feed. Could you say how this all connects with DSPy, if at all? Because DSPy is for the prompts area and this is more for the retrieval area. Where's the connection point that I'm missing, besides Omar being everywhere?
[01:40:39] Connor Shorten: I think Omar being everywhere is maybe the biggest connection, because to me DSPy is optimizing the LLM program prompt part. And then to have the optimization loop connect between that and the retrieval model, there are works like Promptagator and InPars.
[01:40:59] Connor Shorten: Omar has, I think, UDAPDR, something like that, where you use the LM to generate synthetic queries, then you fine-tune the embedding model with that. So that would be where the connection is. DSPy is like a synthetic data framework: you tell it what you want it to do, and it will use the LLMs to generate successful executions of the task, and then you use that to distill it to smaller models, or to tune the prompts, or you could fine-tune an embedding model.
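A rough sketch of that synthetic-query idea expressed as a DSPy program; the signature and field names are my own, and the LM configuration call depends on your DSPy version, so treat this as an assumption-laden outline rather than the UDAPDR recipe itself.

```python
import dspy

# Configure whichever LM you use; the exact client class depends on your DSPy version.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class GenerateQuery(dspy.Signature):
    """Write a realistic search query that this passage would answer."""
    passage = dspy.InputField()
    query = dspy.OutputField()

generate_query = dspy.Predict(GenerateQuery)

def make_training_pairs(passages):
    # Turn unlabeled in-domain passages into (query, passage) pairs, which can
    # then be used to fine-tune a retriever such as a ColBERT model.
    return [(generate_query(passage=p).query, p) for p in passages]
```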
[01:41:25] Connor Shorten: I don't think it's quite there yet, but I think that would be pretty advantageous. Benjamin can take the mic from here.
[01:41:32] Benjamin Clavie: Yeah, I wouldn't say DSPy and ColBERT are directly related. They exist in the same space, but they're definitely very different tools. Like Connor mentioned, UDAPDR, which is actually the paper I mentioned, is where you use DSP, and hopefully soon DSPy, to fine-tune a ColBERT for any domain.
[01:41:50] Benjamin Clavie: Even a domain it's never been exposed to before, and get it to a state-of-the-art result on that domain. That's a really good application of DSPy to ColBERT. And likewise, you can use ColBERT as a retriever in your DSPy pipeline, but it's just a component, it's not quite the DSPy thing.
[01:42:08] Connor Shorten: I do have something, though, that is very related to retrieval generally.
[01:42:12] Connor Shorten: We've seen all these amazing LLM query router things. I want to give LlamaIndex credit for evangelizing most of this stuff. So one example is, say you have the LLM pick a metadata filter to put on the vector search. Say you have an index of podcast clips and you want to search only where the speaker is Omar Khattab, and you have an LLM predict that filter, and then that would be in the retrieval engine.
[01:42:38] Connor Shorten: And so you have a prompt behind that, same with text-to-SQL. There's a prompt behind how we put these things around retrieval. And so DSPy can optimize the prompts, or optimize the models that do that, to get the maximum performance out. I don't mean to say anything negative about the existing frameworks, but right now you're locked into the prompts they have built into the framework to do these things, whereas DSPy opens it up to optimize it for your thing.
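A hedged sketch of that query-router pattern as a DSPy module, so the filter-picking prompt becomes something DSPy can optimize rather than a string baked into a framework; the field names and the search_fn helper are assumptions, not Weaviate's or LlamaIndex's actual API.

```python
import dspy

class PredictFilter(dspy.Signature):
    """Choose a metadata filter for a podcast-clip search index."""
    question = dspy.InputField(desc="the user's natural-language question")
    speaker_filter = dspy.OutputField(desc="speaker name to filter on, or 'none'")

class RoutedRetrieval(dspy.Module):
    def __init__(self, search_fn):
        super().__init__()
        self.predict_filter = dspy.ChainOfThought(PredictFilter)
        self.search_fn = search_fn  # assumed helper wrapping your vector search

    def forward(self, question):
        decision = self.predict_filter(question=question)
        # Apply the predicted metadata filter before running the vector search.
        return self.search_fn(question, speaker=decision.speaker_filter)
```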
[01:43:06] Alex Volkov: Interesting. Yeah, I don't think it's negative necessarily. I think after using some of these frameworks, people understand that, and we've seen this from multiple folks. They could potentially start with something like LlamaIndex or LangChain and then quickly figure out that some more
[01:43:20] Alex Volkov: freedom is needed, and DSPy is a potential way to do that. Okay, Connor, anything else? Very interesting. First of all, you have a bunch of great content on this. You recently did something, I think it's at the top of your tweets, and I'll definitely add this to the show notes as well.
[01:43:32] Alex Volkov: You did a deep dive into DSPy, was that on the podcast or was it just a video? We'll definitely send folks there. Anything else you want to add, like how to find you, where to find your content? Folks should definitely follow you. First of all, we'll add your things.
[01:43:48] Connor Shorten: Thanks, Alex. Yes, I have two podcasts right now, with Omar, of course, and then with Karel D'Oosterlinck, who created this Infer-Retrieve-Rank program. It's one of the coolest examples of DSPy. And then I have one video out so far explaining the whole thing. Quickly, I wanted to point people to the update to DSPy Assertions.
[01:44:05] Connor Shorten: Because I think this is the most important thing with these prompting frameworks. And I think it's important to also understand Instructor from Jason Liu, which is where you use Pydantic to define the schema of the outputs that you want from the language model, and then you validate the outputs to make sure that it outputted JSON with the keys that you wanted.
[01:44:23] Connor Shorten: And so DSPy Assertions is in this similar category, and this is the most common discussion I'm seeing in the DSPy Discord: people looking to add Instructor to DSPy, and jointly looking to do this thing of structured outputs and have this retry mechanism. There's new work from Arnav Singhvi.
[01:44:43] Connor Shorten: We haven't met yet, but he knows a lot more about DSPy Assertions. I'm going to link it in the description of this chat, because I highly recommend people check it out.
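For folks who haven't seen Instructor, the pattern Connor describes looks roughly like this; instructor.patch and response_model follow Instructor's documented usage, but double-check against the version you install.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Takeaway(BaseModel):
    topic: str
    summary: str

# Patch the client so responses are parsed and validated against the Pydantic
# model; Instructor can retry when the model returns malformed output.
client = instructor.patch(OpenAI())

takeaway = client.chat.completions.create(
    model="gpt-4",
    response_model=Takeaway,
    messages=[{"role": "user", "content": "Summarize ColBERT in one takeaway."}],
)
print(takeaway.topic, "-", takeaway.summary)
```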
[01:44:50] Alex Volkov: Awesome. Nisten, just before I give you a question, I will shout out that Jason Liu from the Instructor library came to the Weights & Biases courses, and there's a course that he built with us as well that's free. You can just go to wandb.ai courses. I'll definitely add this in the link below. It's about structured output and how to force these LLMs to give us better structured output.
[01:45:09] Alex Volkov: It's funny that a person named Jason is building tools to get LLMs to output JSONs. But that's all I have. Just super quick, Nisten, go ahead. You had a question here.
[01:45:19] Nisten Tahiraj: I just want to say it's pretty amazing that the people we bring here are from the industry, and we actually use their stuff. Like from last week, I started using Lilac, and I might actually start running Ragatouille on that Hacker Neon dataset. And since some people asked in the comments what I have used, I forced myself to only use open source models.
[01:45:45] Nisten Tahiraj: Because I feel like that's the only way they're going to start getting better, if we restrict ourselves to them. I don't recommend you do it just yet, just wait another week or two maybe. But I wanted to ask: we see some limitations with retrieval augmentation systems, like in GPT-4 when people use it.
[01:46:07] Nisten Tahiraj: It only gives three points from the document, doesn't really summarize it and stuff. What are the benefits of going with ColBERT? Is it because it's much faster? Can you feed it many more documents? I'm talking from a practical point of view, not necessarily even from a tech person's point of view. As a business with a lot of customer data, why should they use this versus just putting it on pgvector and doing function calling?
[01:46:41] Nisten Tahiraj: Is it faster that way? And what limitations does using RAGatouille with ColBERT
[01:46:47] Benjamin Clavie: have? That is a good and open question. Limitations, we have a lot right now. The lack of cloud hosting offerings is a big one; there's not really anywhere you can host this except doing it yourself, which is a big problem.
[01:47:05] Benjamin Clavie: And the main reason to use it, I would say, is generalization, because the thing with any of the off-the-shelf embedding models is that they look good on benchmarks, and they tend to work quite well, but they've been optimized for those benchmarks. Whereas ColBERT, for instance ColBERTv2, has never been trained on the MTEB retrieval benchmark, etc.
[01:47:24] Benjamin Clavie: The reason it generalizes well is that working at the token level makes it a lot easier for your model to encode information. Whereas when you're trying to squeeze everything into a single vector, it might very well not work, say, for your custom domain. With ColBERT, you can assume it's going to be okay in every domain, and if it's not the best, you can always fine-tune it later.
[01:47:45] Benjamin Clavie: It's probably the biggest draw, I'd say.
[01:47:51] Alex Volkov: Awesome. So I definitely want to thank you guys for coming up and explaining these concepts that have been floating around, in very simple language. And I appreciate your patience with me re-asking things in the way that I understand, because I know this is my way to understand, but it also helps some folks in the audience.
[01:48:06] Alex Volkov: That's how we do here on ThursdAI, so you're more than welcome to rejoin. I now consider both of you friends of the pod, so I agree with Nisten. It's really cool to see the authors of the libraries and tools that we use come here to ThursdAI to talk about them, and obviously about upcoming features as well.
[01:48:22] Alex Volkov: Definitely welcome. Benjamin, thank you for doing a bunch of open source stuff, and for evangelizing the whole ColBERT thing to make it simpler for folks. Definitely, thank you. Anything you want to add here that I haven't touched yet? Please go ahead, Benjamin.
[01:48:36] Benjamin Clavie: I do have a few shoutouts, shall we say. One of them is that LangChain and DSPy are not mutually exclusive, and I shared that in the chat. There is now a LangChain x DSPy integration, where you can define your chains in LangChain and still use DSPy to optimize things, which is pretty cool.
[01:48:53] Benjamin Clavie: And in the embedding world, you mentioned Matryoshka embeddings, and we talked about ColBERT, and the people at Jina are actually training a ColBERT model right now using Matryoshka embeddings for compression, as a sort of let's-try-this-out, see how it works. And the final one is, you might have brought this up already, but the people at BAAI trained BGE M3, a really cool embedding model that in a single pass outputs
[01:49:19] Benjamin Clavie: a dense vector, a ColBERT-style multi-vector embedding, and a SPLADE-style sparse embedding. I won't go into too much detail about that.
[01:49:26] Alex Volkov: I'm sorry. I don't think I covered that. Who was that? Sorry. Could you repeat?
[01:49:31] Benjamin Clavie: The people at BAAI, the people who do the BGE
[01:49:34] Alex Volkov: Oh yeah, but yeah. We've talked about their model recently. They,
[01:49:37] Benjamin Clavie: BAAI, yeah,
[01:49:38] Alex Volkov: Oh, I did not know.
[01:49:39] Alex Volkov: So they now have a thing that outputs a regular embedding and also a ColBERT-style embedding.
[01:49:46] Benjamin Clavie: Yeah, the big thing last week was M3, which has a ColBERT-style embedding, a SPLADE-style embedding, which is a sparse embedding method, and a dense embedding, all from a single model, a total of three.
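A minimal sketch of pulling all three output types from BGE M3 via the FlagEmbedding package; the argument and key names follow the model card as I recall it, so verify them against the current release.

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["ColBERT keeps one embedding per token instead of one per document."],
    return_dense=True,         # single pooled vector
    return_sparse=True,        # SPLADE-style lexical weights
    return_colbert_vecs=True,  # ColBERT-style multi-vector output
)

dense_vec = out["dense_vecs"][0]
sparse_weights = out["lexical_weights"][0]
token_vecs = out["colbert_vecs"][0]
```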
[01:49:57] Alex Volkov: Oh, that's incredible. Okay, so we're adding some knowledge here, thank you. Let me just repeat the way that I hear this: we've talked about the BAAI BGE M3. M3 basically stands for multiple things, one of them being multi-linguality. So they upgraded their embeddings to cover not only English but also, I think, a hundred languages as well.
[01:50:14] Alex Volkov: So now, Benjamin, you're saying they're also implementing this step for us, outputting the dense embedding but also the ColBERT embedding, correct?
[01:50:25] Benjamin Clavie: yeah, yeah, one of the meanings of M, I think, is
[01:50:27] Alex Volkov: Multi-composability or something, yeah. Multi-functionality, yes, exactly.
[01:50:33] Benjamin Clavie: can use it to generate different kinds of embeddings. And I think it's the first non-ColBERT, actually strong multi-vector model. There are issues, as in the vectors are too big, etc.
[01:50:45] Benjamin Clavie: But it's a very nice thing to see happen. Definitely, like
[01:50:49] Alex Volkov: Oh, definitely a shoutout then. We need to get the folks from BAAI here to speak about this, so if you know them, definitely connect them to me. I would love to hear from the authors of BGE. Yeah, definitely a shoutout to Jina. I think Bo Wang, we've mentioned, is a friend of the pod.
[01:51:03] Alex Volkov: He came when Jina released their embeddings, and he often comes here and gives us insights about how embeddings work. Shout out to Bo and the team at Jina as well. Connor, your stage, if you want to add anywhere else folks can follow you, or shout out anything. And then we're going to continue with some more news.
[01:51:21] Connor Shorten: It's been so cool to be a part of the podcast. And I love how it's integrated into X because this is actually my favorite place to manage communication. So if you want to reach out, here would be great.
[01:51:31] Alex Volkov: Yeah, so definitely give Connor a follow, and the Weaviate podcast is incredible. We've, and by we I mean Weights & Biases, had a mutual video together, and Connor hosted our folks. I learned a bunch from it before I joined Weights & Biases as well. A great source of information from both of you.
[01:51:45] Alex Volkov: Thank you guys so much for coming up and explaining these concepts, complex on the surface and maybe complex implementation-wise as well, and making them simpler. I think talking about them is very important, and you are now considered friends of the ThursdAI community. Hopefully this will get more folks to learn about this, contribute, etc.
[01:52:05] Alex Volkov: And with that, I think we're a bit over the top, like two hours since I started the recording. We had a great show today. Thank you everybody for listening and coming. I just want to summarize this with a few notes: I really enjoy my time here every week, and I really enjoy learning from folks. Nisten, you mentioned today that it's so cool to have the authors of the things we talk about.
[01:52:25] Alex Volkov: So today we also had this benefit. We had Benjamin here and we had Connor who covered this. And we also had Justin again from the Qwen team to talk about the Qwen stuff that they released. And it's really cool that the community now connects different people.
[01:52:36] Alex Volkov: So I was able to connect Justin and the Qwen team with the LM Studio folks and the Ollama folks. No, I think only LM Studio. And they were able to work together so that their release is now supported in LM Studio the second they release something. So I love how this community comes together, and I encourage everybody who listens to this to also participate in it.
[01:52:55] Alex Volkov: Follow everybody who's on stage here, interact with our posts, and boost the signal a little bit. And if you're working with friends who don't listen to ThursdAI, and there's alpha in listening to ThursdAI like today, definitely tell your friends where this alpha can be found.
[01:53:10] Alex Volkov: And with that, I want to thank you all and have a nice Thursday. Bye bye, everyone.