Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0 cover image

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Latest episodes

undefined
May 20, 2023 • 1h 7min

MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

We are excited to be the first podcast in the world to release an in-depth interview on the new SOTA in commercially licensed open source models - MosiacML MPT-7B!The Latent Space crew will be at the NYC Lux AI Summit next week, and have two meetups in June. As usual, all events are on the Community page! We are also inviting beta testers for the upcoming AI for Engineers course. See you soon!One of GPT3’s biggest limitations is context length - you can only send it up to 4000 tokens (3k words, 6 pages) before it throws a hard error, requiring you to bring in LangChain and other retrieval techniques to process long documents and prompts. But MosaicML recently open sourced MPT-7B, the newest addition to their Foundation Series, with context length going up to 84,000 tokens (63k words, 126 pages):This transformer model, trained from scratch on 1 trillion tokens of text and code (compared to 300B for Pythia and OpenLLaMA, and 800B for StableLM), matches the quality of LLaMA-7B. It was trained on the MosaicML platform in 9.5 days on 440 GPUs with no human intervention, costing approximately $200,000. Unlike many open models, MPT-7B is licensed for commercial use and it’s optimized for fast training and inference through FlashAttention and FasterTransformer.They also released 3 finetuned models starting from the base MPT-7B: * MPT-7B-Instruct: finetuned on dolly_hhrlhf, a dataset built on top of dolly-5k (see our Dolly episode for more details). * MPT-7B-Chat: finetuned on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets.* MPT-7B-StoryWriter-65k+: it was finetuned with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. While 65k is the advertised size, the team has gotten up to 84k tokens in response when running on a single node A100-80GB GPUs. ALiBi is the dark magic that makes this possible. Turns out The Great Gatsby is only about 68k tokens, so the team used the model to create new epilogues for it!On top of the model checkpoints, the team also open-sourced the entire codebase for pretraining, finetuning, and evaluating MPT via their new MosaicML LLM Foundry. The table we showed above was created using LLM Foundry in-context-learning eval framework itself!In this episode, we chatted with the leads of MPT-7B at Mosaic: Jonathan Frankle, Chief Scientist, and Abhinav Venigalla, Research Scientist who spearheaded the MPT-7B training run. We talked about some of the innovations they’ve brought into the training process to remove the need for 2am on-call PagerDutys, why the LLM dataset mix is such an important yet dark art, and why some of the traditional multiple-choice benchmarks might not be very helpful for the type of technology we are building.Show Notes* Introducing MPT-7B* Cerebras* Lottery Ticket Hypothesis* Hazy Research* ALiBi* Flash Attention* FasterTransformer* List of naughty words for C4 https://twitter.com/code_star/status/1661386844250963972* What is Sparsity?* Hungry Hungry Hippos* BF16 FPp.s. yes, MPT-7B really is codenamed LLongboi!Timestamps* Introductions [00:00:00]* Intro to Mosaic [00:03:20]* Training and Creating the Models [00:05:45]* Data Choices and the Importance of Repetition [00:08:45]* The Central Question: What Mix of Data Sets Should You Use? [00:10:00]* Evaluation Challenges of LLMs [0:13:00]* Flash Attention [00:16:00]* Fine-tuning for Creativity [00:19:50]* Open Source Licenses and Ethical Considerations [00:23:00]* Training Stability Enhancement [00:25:15]* Data Readiness & Training Preparation [00:30:00]* Dynamic Real-time Model Evaluation [00:34:00]* Open Science for Affordable AI Research [00:36:00]* The Open Approach [00:40:15]* The Future of Mosaic [00:44:11]* Speed and Efficiency [00:48:01]* Trends and Transformers [00:54:00]* Lightning Round and Closing [1:00:55]TranscriptAlessio: [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, Swyx, writer and editor of Latent Space.Swyx: Hey, and today we have Jonathan and Abhi from Mosaic ML. Welcome to our studio.Jonathan: Guys thank you so much for having us. Thanks so much.Swyx: How's it feel?Jonathan: Honestly, I've been doing a lot of podcasts during the pandemic, and it has not been the same.Swyx: No, not the same actually. So you have on your bio that you're primarily based in Boston,Jonathan: New York. New York, yeah. My Twitter bio was a probability distribution over locations.Swyx: Exactly, exactly. So I DMd you because I was obviously very interested in MPT-7B and DMd you, I was like, for the 0.2% of the time that you're in San Francisco, can you come please come to a podcast studio and you're like, I'm there next week.Jonathan: Yeah, it worked out perfectly. Swyx: We're really lucky to have you, I'll read off a few intros that people should know about you and then you can fill in the blanks.So Jonathan, you did your BS and MS at Princeton in programming languages and then found your way into ML for your PhD at MiT where you made a real splash with the lottery ticket hypothesis in 2018, which people can check up on. I think you've done a few podcasts about it over the years, which has been highly influential, and we'll talk about sparse models at Mosaic. You have also had some side [00:01:30] quest. You taught programming for lawyers and you did some law and privacy stuff in, in DC and also did some cryptography stuff. Um, and you've been an assistant professor at Harvard before earning your PhD.Jonathan:  I've yet to start.Swyx: You, you yet to start. Okay. But you just got your PhD.Jonathan:. I technically just got my PhD. I was at Mosaic which delayed my defense by about two years. It was, I was at 99% done for two years. Got the job at Harvard, Mosaic started, and I had better things to do than write my dissertation for two years. Swyx: You know, you know, this is very out of order.Jonathan: Like, oh, completely out of order, completely backwards. Go talk to my advisor about that. He's also an advisor at Mosaic and has been from the beginning. And, you know, go talk to him about finishing on time.Swyx: Great, great, great. And just to fill it out, Abhi, you did your BS and MS and MIT, you were a researcher at Cerebras, and you're now a research scientist at Mosaic. Just before we go into Mosaic stuff, I'm actually very curious about Cereus and, uh, just that, that space in general. Um, what are they doing that people should know about?Abhinav: Yeah, absolutely. Um, I think the biggest thing about CEREUS is that they're really building, you know, kind of the NextGen computing platform beyond, like GPUs.Um, they're trying to build a system that uses an entire wafer, you know, rather than cutting up a wafer into smaller chips and trying to train a model on that entire system, or actually more recently on many such wafers. Um, so it's, and it's really extraordinary. I think it's like the first time ever that kind of wafer scale computing has ever really worked. And so it's a really exciting time to be there, trying to figure out how we can map ML workloads to work, um, on a much, much bigger chip.Swyx: And do you use like [00:03:00] a different programming language or framework to do that? Or is that like..Abhinav: Yeah, so I mean, things have changed a bit since I was there.I think, um, you can actually run just normal tensor flow and pie torch on there. Um, so they've built a kind of software stack that compiles it down. So it actually just kind of works naturally. But yeah.Jonathan : Compiled versions of Python is a hot topic at the moment with Mojo as well. Swyx: And then Mosaic, you, you spearheaded the MPT-7B effort.INTRO TO MOSAIC [00:03:20]Abhinav: Uh, yeah. Yeah, so it's kind of like, it's been maybe six months, 12 months in the making. We kind of started working on LMs sort of back in the summer of last year. Um, and then we came with this blog post where we kind of profiled a lot of LMs and saw, hey, the cost of training is actually a lot lower than what people might think.Um, and then since then, you know, being inspired by kind of, you know, meta’s release, so the LLaMA models and lots of other open source work, we kind of started working towards, well, what if we were to release a really good kind of 7 billion parameter model? And that's what MPT is. Alessio:You know, we mentioned some of the podcasts you had done, Jonathan, I think in one of them you mentioned Mosaic was not planning on building a  model and releasing and obviously you eventually did. So what are some of the things that got you there that maybe obviously LLaMA you mentioned was an inspiration. You now have both the training and like inference products that you offer. Was this more of a research challenge in a way, uh, that you wanted to do?Or how did the idea come to be?Jonathan: I think there were a couple of things. So we still don't have a first class model. We're not an open AI where, you know, our businesses come to use our one great model. Our business is built around customers creating their own models. But at the end of the day, if customers are gonna create their own models, we have to have the tools to help them do that, and to have the tools to help them do that and know that they work we have to create our own models to start. We have to know that we can do something great if customers are gonna do something great. And one too many people may have challenged me on Twitter about the fact that, you know, mosaic claims all these amazing numbers, but, you know, I believe not to, you know, call out Ross Whiteman here, but, you know, I believe he said at some point, you know, show us the pudding.Um, and so Ross, you know, please let me know how the pudding tastes. But in all seriousness, like I think there is something, this is a demo in some sense. This is to say we did this in 9.5 days for a really reasonable cost, straight through 200, an intervention. 200 K. Yep. Um, you can do this too.Swyx: Uh, and just to reference the numbers that you're putting out, this is the, the last year you were making a lot of noise for trading GPT 3 under 450 K, which is your, your initial estimate.Um, and then it went down to a 100 K and stable diffusion 160 k going down to less than 50 K as well.Jonathan: So I will be careful about that 100 K number. That's certainly the challenge I've given Abhi to hit. Oh, I wouldn't make the promise that we’ve hit yet, but you know, it's certainly a target that we have.And I, you know, Abhi may kill me for saying this. I don't think it's crazy. TRAINING AND CREATING THE MODELS [00:05:45] Swyx: So we definitely want to get into like estimation math, right? Like what, what needs to happen for those big order magnitude changes to in, in infrastructure costs. But, uh, let's kind of stick to the MPT-7B story. Yeah. Tell us everything.Like you have, uh, three different models. One of them. State of the art essentially on context length. Let's talk about the process of training them, the, uh, the decisions that you made. Um, I can go into, you know, individual details, but I just wanna let you let you rip.Abhinav: Yeah, so I mean, I think, uh, we started off with the base model, which is kind of for all practical purposes, a recreation of LLaMA 7B.Um, so it's a 7 billion perimeter model trained on the trillion tokens. Um, and our goal was like, you know, we should do it efficiently. We should be able to do it like, kind of hands free so we don't have to babysit the runs as they're doing them. And it could be kind of a, a launching point for these fine tune models and those fine tune models, you know, on, on the one hand they're kind of really fun for the community, like the story writer model, which has like a 65,000 length context window and you can even kind of extrapolate beyond that. Um, but they're, they're also kind of just tr inspirations really. So you could kind of start with an MPT-7B base and then build your own custom, you know, downstream. If you want a long context code model, you could do that with our platform. If you wanted one that was for a particular language, you could do that too.But yeah, so we picked kind of the three variance chat and instruct and story writer just kind of like inspirations looking at what people were doing in the community today. Yeah. Alessio: And what's the beginning of the math to come up with? You know, how many tokens you wanna turn it on? How many parameters do you want in a bottle? 7 billion and 30 billion seem to be kind of like two of the magic numbers going around right now. Abhinav: Yeah, definitely. Definitely. Yeah, I think like there's sort of these scaling laws which kind of tell you how to best spend your training compute if that's all you cared about. So if you wanna spend $200,000 exactly in the most efficient way, there'd be a recipe for doing that.Um, and that we usually go by the Chinchilla laws. Now for these models, we actually didn't quite do that because we wanted to make sure that people could actually run these at home and that they [00:07:30] were good for inference. So we trained them kind of beyond those chinchilla points so that we're almost over-training them.I think there's like a joke going on online that they're like long boy and that that came up internally because we were training them for really, really long durations. So that 7B model, the chinchilla point might be 140 billion tokens. Instead, we trained a trillion, so almost seven times longer than you normally would.Swyx: So longboi was the code name. So is it, is it the trading method? Is it the scaling law that you're trying to coin or is it the code name for the 64 billion?Jonathan: Uh, 64. It was just an internal joke for the, for training on way more tokens than you would via chinchilla. Okay. Um, we can coin it long boy and it, it really stuck, but just to, you know, long boys filled with two ELs at the beginning.Yeah. Cause you know, we wanted the lLLaMA thing in there as well. Jonathan: Yeah, yeah, yeah. Our darn CEO we have to rein him in that guy, you know, you can't, yeah. I'm gonna take away his Twitter password at some point. Um, but you know, he had to let that one out publicly. And then I believe there was a YouTube video where someone happened to see it mentioned before the model came out and called it the Long G boy or something like that.Like, so you know, now it's out there in the world. It's out there. It's like Sydnee can't put it back inSwyx: There's a beautiful picture which I think Naveen tweeted out, which, um, shows a long boy on a whiteboard.Jonathan: That was the origin of Long Boy. In fact, the legs of the lLLaMA were the two Ls and the long boy.DATA CHOICES AND THE IMPORTANCE OF REPETITION [00:08:45]Swyx: Well, talk to me about your data choices, right? Like this is your passion project. Like what can you tell us about it?Jonathan: Yeah, I think Abhi wanted to kill me by the end for trying to use all the GPUs on data and none of them on actually training the model. Um, at the end of the day, We know that you need to train these models and [00:09:00] lots of data, but there are a bunch of things we don't know.Number one is what kinds of different data sources matter. The other is how much does repetition really matter? And really kind of repetition can be broken down into how much does quality versus quantity matter. Suppose I had the world's best 10 billion tokens of data. Would it be better to train on that a hundred times or better to train on a trillion tokens of low quality, fresh data?And obviously there's, there's a middle point in between. That's probably the sweet spot. But how do you even know what good quality data is? And. So, yeah, this is, nobody knows, and I think the more time I spent, we have a whole data team, so me and several other people, the more time that we spent on this, you know, I came away thinking, gosh, we know nothing.Gosh, if I were back in academia right now, I would definitely go and, you know, write a paper about this because I have no idea what's going on.Swyx: You would write a paper about it. I'm interested in such a paper. I haven't come across any that exists. Could you frame the central question of such a paper?THE CENTRAL QUESTION: WHAT MIX OF DATA SETS SHOULD YOU USE? [00:10:00]Jonathan: Yeah. The central question is what mix of data sets should you use? Okay. Actually I've, you know, you had mentioned my law school stuff. I went back to Georgetown Law where I used to teach, um, in the midst of creating this model, and I actually sat down with a class of law students and asked them, I gave them our exact data sets, our data mixes, um, like how many tokens we had, and I said, Create the best data set for your model.Knowing they knew nothing about large language models, they just know that data goes in and it's going to affect the behavior. Um, and I was like, create a mix and they basically covered all the different trade-offs. Um, you probably want a lot of English language [00:10:30] text to start with. You get that from the web, but do you want it to be multilingual?If so, you're gonna have a lot less English text. Maybe it'll be worse. Do you wanna have code in there? There are all these beliefs that code leads to models being better at logical reasoning, of which I've seen zero evidence. Rep. It's not, um, I mean, really made a great code model, but code models leading to better chain of thought reasoning on the part of language or code being in the training set leading to better chain of thought reasoning.People claim this all the time, but I've still never seen any real evidence beyond that. You know, one of the generations of the GPT three model started supposedly from Code Da Vinci. Yes. And so there's a belief that, you know, maybe that helped. But again, no evidence. You know, there's a belief that spending a lot of time on good sources like Wikipedia is good for the model.Again, no evidence. At the end of the day, we tried a bunch of different data mixes and the answer was that there are some that are better or worse than others. We did find that the pile, for example, was a really solid data mix, but you know, there were stronger data mixes by our evaluation metrics. And I'll get back to the evaluation question in a minute cuz that's a really important one.This data set called c4, which is what the original T five model was trained on, is weirdly good. And everybody, when I posted on this on Twitter, like Stella Beaterman from Luther mentioned this, I think someone else mentioned this as well. C4 does really well in the metrics and we have no idea why we de-duplicated it against our evaluation set.So it's not like it memorized the data, it is just one web scrape from 2019. If you actually look at the T five paper and see how it was pre-processed, it looks very silly. Mm-hmm. They removed anything that had the word JavaScript in it because they didn't want to get like no JavaScript [00:12:00] warnings. They removed anything with curly braces cuz they didn't wanna get JavaScript in it.They looked at this list of bad words, um, and removed anything that had those bad words. If you actually look at the list of bad words, words like gay are on that list. And so there's, you know, it is a very problematic, you know, list of words, but that was the cleaning that leads to a data set that seems to be unbeatable.So that to me says that we know nothing about data. We, in fact used a data set called mc four as well, which is they supposedly did the same pre-processing of C4 just on more web calls. The English portion is much worse than C4 for reasons that completely escape us. So in the midst of all that, Basically I set two criteria.One was I wanted to be at least as good as mc four English, like make sure that we're not making things actively worse. And mc four English is a nice step up over other stuff that's out there. And two was to go all in on diversity after that, making sure that we had some code, we had some scientific papers, we had Wikipedia, because people are gonna use this model for all sorts of different purposes.But I think the most important thing, and I'm guessing abhi had a million opinions on this, is you're only as good as your evaluation. And we don't know how to evaluate models for the kind of generation we ask them to do. So past a certain point, you have to kinda shrug and say, well, my evaluation's not even measuring what I care about.Mm-hmm. So let me just make reasonable choices. EVALUATION CHALLENGES OF LLMs [0:13:00]Swyx: So you're saying MMLU, big bench, that kind of stuff is not. Convincing for youJonathan: A lot of this stuff is you've got two kinds of tasks. Some of these are more of multiple choice style tasks where there is a right answer. Um, either you ask the model to spit out A, B, C, or D or you know, and if you're more [00:13:30] sophisticated, you look at the perplexity of each possible answer and pick the one that the model is most likely to generate.But we don't ask these models to do multiple choice questions. We ask them to do open-ended generation. There are also open-ended generation tasks like summarization. You compare using things like a blue score or a rouge score, which are known to be very bad ways of comparing text. At the end of the day, there are a lot of great summaries of a paper.There are a lot of great ways to do open form generation, and so humans are, to some extent, the gold standard. Humans are very expensive. It turns out we can't put them into our eval pipeline and just have the humans look at our model every, you know, 10 minutes? Not yet. Not yet. Maybe soon. Um, are you volunteering Abhi?Abhinav: I, I, I just know we have a great eval team who's, uh, who's helping us build new metrics. So if they're listening,Jonathan:  But it's, you know, evaluation of large language models is incredibly hard and I don't think any of these metrics really truly capture. What we expect from the models in practice.Swyx: Yeah. And we might draw wrong conclusions.There's been a debate recently about the emergence phenomenon, whether or not it's a mirage, right? I don't know if you guys have opinions about that process. Abhinav: Yeah, I think I've seen like this paper and all and all, even just kind of plots from different people where like, well maybe it's just a artifact of power, like log scaling or metrics or, you know, we're meshing accuracy, which is this a very like harsh zero one thing.Yeah. Rather than kind of something more continuous. But yeah, similar to what Jonathan was saying about evals. Like there there's one issue of like you just like our diversity of eval metrics, like when we put these models up, even like the chat ones, the instruct ones, people are using 'em for such a variety of tasks.There's just almost no way we get ahead of time, like measuring individual dimensions. And then also particularly like, you know, at the 7B scale, [00:15:00] um, these models still are not super great yet at the really hard tasks, like some of the hardest tasks in MMLU and stuff. So sometimes they're barely scoring like the above kind of random chance, you know, like on really, really hard tasks.So potentially as we. You know, aim for higher and higher quality models. Some of these things will be more useful to us. But we kind of had to develop MPT 7B kind of flying a little bit blind on, on what we knew it was coming out and just going off of like, you know, a small set of common sensor reasoning tasks.And of course, you know, just comparing, you know, those metrics versus other open source models. Alessio: I think fast training in inference was like one of the goals, right? So there's always the trade off between doing the hardest thing and like. Doing all the other things quickly.Abhinav: Yeah, absolutely. Yeah, I mean, I think like, you know, even at the 7B scale, you know, uh, people are trying to run these things on CPUs at home.You know, people are trying to port these to their phones, basically prioritizing the fact that the small scale would lead to our adoption. That was like a big, um, big thing going on. Alessio: Yeah. and you mentioned, um, flash attention and faster transformer as like two of the core things. Can you maybe explain some of the benefits and maybe why other models don't use it?FLASH ATTENTION [00:16:00]Abhinav: Yeah, absolutely. So flash attention is this basically faster implementation of full attention. Um, it's like a mathematical equivalent developed by like actually some of our collaborators, uh, at Stanford. Uh, the hazy research. Hazy research, yeah, exactly.Jonathan: What is, what, what, what's the name hazy research mean?Abhinav: I actually have no idea.Swyx: I have no clue. All these labs have fun names. I always like the stories behind them.Abhinav: Yeah, absolutely. We really, really liked flash attention. We, I think, had to integrate into repo even as [00:16:30] as early as September of last year. And it really just helps, you know, with training speed and also inference speed and we kind of bake that into model architecture.And this is kind of unique amongst all the other hugging face models you see out there. So ours actually, you can toggle between normal torch attention, which will work anywhere and flash attention, which will work on GPUs right out of the box. And that way I think you get almost like a 2x speed up at training time and somewhere between like 50% to a hundred percent speed up at inference time as well.So again, this is just like, we really, really wanted people to use these and like, feel like an improvement and we, we have the team to, to help deliver that. Swyx: Another part, um, of your choices was alibi position, encodings, which people are very interested in, maybe a lot of people just, uh, to sort of take in, in coatings as, as a given.But there's actually a lot of active research and honestly, it's a lot of, um, it's very opaque as well. Like people don't know how to evaluate encodings, including position encodings, but may, may, could you explain, um, alibi and, um, your choice?Abhinav: Yeah, for sure. The alibi and uh, kind of flash attention thing all kind of goes together in interesting ways.And even with training stability too. What alibi does really is that it eliminates the need to have positional embeddings in your model. Where previously, if you're a token position one, you have a particular embedding that you add, and you can't really go beyond your max position, which usually is like about 2000.With alibies, they get rid of that. Instead, just add a bias to the attention map itself. That's kind of like this slope. And if at inference time you wanna go much, much larger, they just kind of stretch that slope out to a longer, longer number of positions. And because the slope is kind of continuous and you can interpret it, it all works out now.Now one of [00:18:00] the, the funny things we found is like with flash attention, it saved so much memory and like improved performance so much that even as early as I kind of last year, like we were profiling models with, with very long context lines up to like, you know, the 65 k that you seen in release, we just never really got around to using it cuz we didn't really know what we might use it for.And also it's very hard to train stably. So we started experimenting with alibi integration, then we suddenly found that, oh wow, stability improves dramatically and now we can actually work together with alibi in a long context lens. That's how we got to like our story writer model where we can stably train these models out to very, very long context lenses and, and use them performantly.Jonathan: Yeah.Swyx: And it's also why you don't have a firm number. Most people now have a firm number on the context line. Now you're just like, eh, 65 to 85Abhinav: Oh yeah, there's, there's a, there's a big age to be 64 K or 65 k. 65 k plus.Swyx: Just do powers of twos. So 64 isn't, you know. Jonathan: Right, right. Yeah. Yeah. But we could, I mean, technically the context length is infinite.If you give me enough memory, um, you know, we can just keep going forever. We had a debate over what number to say is the longest that we could handle. We picked 84 cakes. It's the longest I expect people to see easily in practice. But, you know, we played around for even longer than that and I don't see why we couldn't go longer.Swyx: Yeah. Um, and so for those who haven't read the blog posts, you put the Great Gatsby in there and, uh, asked it to write an epilogue, which seemed pretty impressive.Jonathan: Yeah. There are a bunch of epilogues floating around internally at Mosaic. Yeah. That wasn't my favorite. I think we all have our own favorites.Yeah. But there are a bunch of really, really good ones. There was one where, you know, it's Gatsby's funeral and then Nick starts talking to Gatsby's Ghost, and Gatsby's father shows up and, you know, then he's [00:19:30] at the police station with Tom. It was very plot heavy, like this is what comes next. And a bunch of that were just very Fitzgerald-esque, like, you know, beautiful writing.Um, but it was cool to just see that Wow, the model seemed to actually be working with. You know, all this input. Yeah, yeah. Like it's, it's exciting. You can think of a lot of things you could do with that kind of context length.FINE-TUNING FOR CREATIVITY [00:19:50]Swyx: Is there a trick to fine tuning for a creative task rather than, um, factual task?Jonathan: I don't know what that is, but probably, yeah, I think, you know, the person, um, Alex who did this, he did fine tune the model explicitly on books. The goal was to try to get a model that was really a story writer. But, you know, beyond that, I'm not entirely sure. Actually, it's a great question. Well, no, I'll ask you back.How would you measure that? Swyx: Uh, God, human feedback is the solve to all things. Um, I think there is a labeling question, right? Uh, in computer vision, we had a really, really good episode with Robo Flow on the segment. Anything model where you, you actually start human feedback on like very, I think it's something like 0.5% of the, the overall, uh, final, uh, uh, labels that you had.But then you sort augment them and then you, you fully automate them, um, which I think could be applied to text. It seems intuitive and probably people like snorkel have already raised ahead on this stuff, but I just haven't seen this applied in the language domain yet.Jonathan: It, I mean there are a lot of things that seem like they make a lot of sense in machine learning that never work and a lot of things that make zero sense that seem to work.So, you know, I've given up trying to even predict. Yeah, yeah. Until I see the data or try it, I just kind shg my shoulders and you know, you hope for the best. Bring data or else, right? Yeah, [00:21:00] exactly. Yeah, yeah, yeah.Alessio: The fine tuning of books. Books three is like one of the big data sets and there was the whole.Twitter thing about trade comments and like, you know, you know, I used to be a community moderator@agenius.com and we've run into a lot of things is, well, if you're explaining lyrics, do you have the right to redistribute the lyrics? I know you ended up changing the license on the model from a commercial use Permitted.Swyx: Yeah let's let them. I'm not sure they did. Jonathan: So we flipped it for about a couple hours. Swyx: Um, okay. Can we, can we introduce the story from the start Just for people who are under the loop. Jonathan: Yeah. So I can tell the story very simply. So, you know, the book three data set does contain a lot of books. And it is, you know, as I discovered, um, it is a data set that provokes very strong feelings from a lot of folks.Um, that was one, one guy from one person in particular, in fact. Um, and that's about it. But it turns out one person who wants a lot of attention can, you know, get enough attention that we're talking about it now. And so we had a, we had a discussion internally after that conversation and we talked about flipping the license and, you know, very late at night I thought, you know, maybe it's a good thing to do.And decided, you know, actually probably better to just, you know, Stan Pat's license is still Apache too. And one of the conversations we had was kind of, we hadn't thought about this cuz we had our heads down, but the Hollywood writer Strike took place basically the moment we released the model. Mm-hmm.Um, we were releasing a model that could do AI generated creative content. And that is one of the big sticking points during the strike. Oh, the optics are not good. So the optics aren't good and that's not what we want to convey. This is really, this is a demo of the ability to do really long sequence lengths and.Boy, you know, [00:22:30] that's, that's not timing that we appreciated. And so we talked a lot internally that night about like, oh, we've had time to read the news. We've had time to take a breath. We don't really love this. Came to the conclusion that it's better to just leave it as it is now and learn the lesson for the future.But certainly that was one of my takeaways is this stuff, you know, there's a societal context around this that it's easy to forget when you're in the trenches just trying to get the model to train. And you know, in hindsight, you know, I might've gone with a different thing than a story writer. I might've gone with, you know, coder because we seem to have no problem putting programmers out of work with these models.Swyx: Oh yeah. Please, please, you know, take away this stuff from me.OPEN SOURCE LICENSES AND ETHICAL CONSIDERATIONS [00:23:00]Jonathan: Right. You know, so it's, I think, you know, really. The copyright concerns I leave to the lawyers. Um, that's really, if I learned one thing teaching at a law school, it was that I'm not a lawyer and all this stuff is a little complicated, especially open source licenses were not designed for this kind of world.They were designed for a world of forcing people to be more open, not forcing people to be more closed. And I think, you know, that was part of the impetus here, was to try to use licenses to make things more closed. Um, which is, I think, against the grain of the open source ethos. So that struck me as a little bit strange, but I think the most important part is, you know, we wanna be thoughtful and we wanna do the right thing.And in that case, you know, I hope with all that interesting licensing fund you saw, we're trying to be really thoughtful about this and it's hard. I learned a lot from that experience. Swyx: There’s also, I think, an open question of fair use, right? Is training on words of fair use because you don't have a monopoly on words, but some certain arrangements of words you do.And who is to say how much is memorization by a model versus actually learning and internalizing and then. Sometimes happening to land at the right, the [00:24:00] same result.Jonathan: And if I've learned one lesson, I'm not gonna be the person to answer that question. Right, exactly. And so my position is, you know, we will try to make this stuff open and available.Yeah. And, you know, let the community make decisions about what they are or aren't comfortable using. Um, and at the end of the day, you know, it still strikes me as a little bit weird that someone is trying to use these open source licenses to, you know, to close the ecosystem and not to make things more open.That's very much against the ethos of why these licenses were created.Swyx: So the official mosaic position, I guess is like, before you use TC MPC 7B for anything commercial, check your own lawyers now trust our lawyers, not mosaic’s lawyers.Jonathan: Yeah, okay. Yeah. I'm, you know, our lawyers are not your lawyers.Exactly. And, you know, make the best decision for yourself. We've tried to be respectful of the content creators and, you know, at the end of the day, This is complicated. And this is something that is a new law. It's a new law. It's a new law that hasn't been established yet. Um, but it's a place where we're gonna continue to try to do the right thing.Um, and it's, I think, one of the commenters, you know, I really appreciated this said, you know, well, they're trying to do the right thing, but nobody knows what the right thing is to even do, you know, the, I guess the, the most right thing would've been to literally not release a model at all. But I don't think that would've been the best thing for the community either.Swyx: Cool.Well, thanks. Well handled. Uh, we had to cover it, just causeJonathan:  Oh, yes, no worries. A big piece of news. It's been on my mind a lot.TRAINING STABILITY ENHANCEMENT [00:25:15]Swyx: Yeah. Yeah. Well, you've been very thoughtful about it. Okay. So a lot of these other ideas in terms of architecture, flash, attention, alibi, and the other data sets were contributions from the rest of the let's just call it open community of, of machine learning advancements. Uh, but Mosaic in [00:25:30] particular had some stability improvements to mitigate loss spikes, quote unquote, uh, which, uh, I, I took to mean, uh, your existing set of tools, uh, maybe we just co kind of covered that. I don't wanna sort of put words in your mouth, but when you say things like, uh, please enjoy my empty logbook.How much of an oversell is that? How much, you know, how much is that marketing versus how much is that reality?Abhinav: Oh yeah. That, that one's real. Yeah. It's like fully end-to-end. Um, and I think.Swyx: So maybe like what, what specific features of Mosaic malibu?Abhinav: Totally, totally. Yeah. I think I'll break it into two parts.One is like training stability, right? Knowing that your model's gonna basically get to the end of the training without loss spikes. Um, and I think, you know, at the 7B scale, you know, for some models like it ha it's not that big of a deal. As you train for longer and longer durations, we found that it's trickier and trickier to avoid these lost spikes.And so we actually spent a long time figuring out, you know, what can we do about our initialization, about our optimizers, about the architecture that basically prevents these lost spikes. And you know, even in our training run, if you zoom in, you'll see small intermittent spikes, but they recover within a few hundred steps.And so that's kind of the magical bit. Our line is one of defenses we recover from Las Vegas, like just naturally, right? Mm-hmm. Our line two defense was that we used determinism and basically really smart resumption strategies so that if something catastrophic happened, we can resume very quickly, like a few batches before.And apply some of these like, uh, interventions. So we had these kinds of preparations, like a plan B, but we didn't have to use them at all for MPT 7B training. So, that was kind of like a lucky break. And the third part of like basically getting all the way to the empty law book is having the right training infrastructure.[00:27:00]So this is basically what, like is, one of the big selling points of the platform is that when you try to train these models on hundreds of GPUs, not many people outside, you know, like deep industry research owners, but the GPUs fail like a lot. Um, I would say like almost once every thousand a 100 days.So for us on like a big 512 cluster every two days, basically the run will fail. Um, and this is either due to GPUs, like falling off the bus, like that's, that's a real error we see, or kind of networking failures or something like that. And so in those situations, what people have normally done is they'll have an on-call team that's just sitting round the clock, 24-7 on slack, once something goes wrong.And if then they'll basically like to try to inspect the cluster, take nodes out that are broken, restart it, and it's a huge pain. Like we ourselves did this for a few months. And as a result of that, because we're building such a platform, we basically step by step automated every single one of those processes.So now when a run fails, we have this automatic kind of watch talk that's watching. It'll basically stop the job. Test the nodes cord in anyone's that are broken and relaunch it. And because our software's all deterministic has fast resumption stuff, it just continues on gracefully. So within that log you can see sometimes I think maybe at like 2:00 AM or something, the run failed and within a few minutes it's back up and running and all of us are just sleeping peacefully.Jonathan: I do wanna say that was hard one. Mm-hmm. Um, certainly this is not how things were going, you know, many months ago, hardware failures we had on calls who were, you know, getting up at two in the morning to, you know, figure out which node had died for what reason, restart the job, have to cord the node. [00:28:30] Um, we were seeing catastrophic loss spikes really frequently, even at the 7B scale that we're just completely derailing runs.And so this was step by step just ratcheting our way there. As Abhi said, to the point where, Many models are training at the moment and I'm sitting here in the studio and not worrying one bit about whether the runs are gonna continue. Yeah. Swyx: I'm, I'm not so much of a data center hardware kind of guy, but isn't there existing software to do this for CPUs and like, what's different about this domain? Does this question make sense at all?Jonathan: Yeah, so when I think about, like, I think back to all the Google fault tolerance papers I read, you know, as an undergrad or grad student mm-hmm. About, you know, building distributed systems. A lot of it is that, you know, Each CPU is doing, say, an individual unit of work.You've got a database that's distributed across your cluster. You wanna make sure that one CPU failing can't, or one machine failing can't, you know, delete data. So you, you replicate it. You know, you have protocols like Paxos where you're literally, you've got state machines that are replicated with, you know, with leaders and backups and things like that.And in this case, you were performing one giant computation where you cannot afford to lose any node. If you lose a node, you lose model state. If you lose a node, you can't continue. It may be that, that in the future we actually, you know, create new versions of a lot of our distributed training libraries that do have backups and where data is replicated so that if you lose a node, you can detect what node you've lost and just continue training without having to stop the run, you know?Pull from a checkpoint. Yeah. Restart again on different hardware. But for now, we're certainly in a world where if anything dies, that's the end of the run and you have to go back and recover from it. [00:30:00]DATA READINESS & TRAINING PREPARATION [00:30:00]Abhinav: Yeah. Like I think a big part, a big word there is like synchronous data pluralism, right? So like, we're basically saying that on every step, every GP is gonna do some work.They're gonna stay in sync with each other and average their, their gradients and continue. Now that there are algorithmic techniques to get around this, like you could say, oh, if a GP dies, just forget about it. All the data that's gonna see, we'll just forget about it. We're not gonna train on it.But, we don't like to do that currently because, um, it makes us give up determinism, stuff like that. Maybe in the future, as you go to extreme scales, we'll start looking at some of those methods. But at the current time it's like, we want determinism. We wanted to have a run that we could perfectly replicate if we needed to.And it was, the goal is figure out how to run it on a big cluster without humans having to babysit it. Babysit it. Alessio: So as you mentioned, these models are kind of the starting point for a lot of your customers To start, you have a. Inference product. You have a training product. You previously had a composer product that is now kind of not rolled into, but you have like a super set of it, which is like the LLM foundry.How are you seeing that change, you know, like from the usual LOP stack and like how people train things before versus now they're starting from, you know, one of these MPT models and coming from there. Like worship teams think about as they come to you and start their journey.Jonathan: So I think there's a key distinction to make here, which is, you know, when you say starting from MPT models, you can mean two things.One is actually starting from one of our checkpoints, which I think very few of our customers are actually going to do, and one is starting from our configuration. You can look at our friends at Rep for that, where, you know, MPT was in progress when Refl [00:31:30] came to us and said, Hey, we need a 3 billion parameter model by next week on all of our data.We're like, well, here you go. This is what we're doing, and if it's good enough for us, um, hopefully it's good enough for you. And that's basically the message we wanna send to our customers. MPT is basically clearing a path all the way through where they know that they can come bring their data, they can use our training infrastructure, they can use all of our amazing orchestration and other tools that abhi just mentioned, for fault tolerance.They can use Composer, which is, you know, still at the heart of our stack. And then the l l M Foundry is really the specific model configuration. They can come in and they know that thing is gonna train well because we've already done it multiple times. Swyx: Let's dig in a little bit more on what should people have ready before they come talk to you? So data architecture, eval that they're looking, etc.Abhinav: Yeah, I, I mean, I think we'll accept customers at any kind of stage in their pipeline. You know, like I'd say science, there's archetypes of people who have built products around like some of these API companies and reach a stage or maturity level where it's like we want our own custom models now, either for the purpose of reducing cost, right?Like our inference services. Quite a bit cheaper than using APIs or because they want some kind of customization that you can't really get from the other API providers. I'd say the most important things to have before training a big model. You know, you wanna have good eval metrics, you know, some kind of score that you can track as you're training your models and scaling up, they can tell you you're progressing.And it's really funny, like a lot of times customers will be really excited about training the models, right? It's really fun to like launch shelves on hundreds of gfs, just all around. It's super fun. But then they'll be like, but wait, what are we gonna measure? Not just the training loss, right? I mean, it's gotta be more than that.[00:33:00]So eval metrics is like a, it's a good pre-req also, you know, your data, you know, either coming with your own pre-training or fine-tune data and having like a strategy to clean it or we can help clean it too. I think we're, we're building a lot of tooling around that. And I think once you have those two kinds of inputs and sort of the budget that you want, we can pretty much walk you through the rest of it, right?Like that's kind of what we do. Recently we helped build CR FM's model for biomedical language a while back. Jonathan: Um, we can. That's the center of research for foundation models. Abhi: Exactly, exactly.Jonathan: Spelling it out for people. Of course.Abhinav: No, absolutely. Yeah, yeah. No, you've done more of these than I have.Um, I think, uh, basically it's sort of, we can help you figure out what model I should train to scale up so that when I go for my big run company, your here run, it's, uh, it's predictable. You can feel confident that it's gonna work, and you'll kind of know what quality you're gonna get out before you have to spend like a few hundred thousand dollars.DYNAMIC REAL-TIME MODEL EVALUATION [00:34:00]Alessio: The rap Reza from rap was on the podcast last week and, uh, they had human eval and then that, uh, I'm Jon Eval, which is like vibe based. Jonathan: And I, I do think the vibe based eval cannot be, you know, underrated really at the, I mean, at the end of the day we, we did stop our models and do vibe checks and we did, as we monitor our models, one of our evals was we just had a bunch of prompts and we would watch the answers as the model trained and see if they changed cuz honestly, You know, I don't really believe in any of these eval metrics to capture what we care about.Mm-hmm. But when you ask it, uh, you know, I don't know. I think one of our prompts was to suggest games for a three-year-old and a seven-year-old. That would be fun to play. Like that was a lot more [00:34:30] valuable to me personally, to see how that answer evolved and changed over the course of training. So, you know, and human eval, just to clarify for folks, human human eval is an automated evaluation metric.There's no humans in it at all. There's no humans in it at all. It's really badly named. I got so confused the first time that someone brought that to me and I was like, no, we're not bringing humans in. It's like, no, it's, it's automated. They just called it a bad name and there's only a hundred cents on it or something.Abhinav: Yeah. Yeah. And, and it's for code specifically, right?Jonathan: Yeah. Yeah. It's very weird. It's a, it's a weird, confusing name that I hate, but you know, when other metrics are called hella swag, like, you know, you do it, just gotta roll with it at this point. Swyx: You're doing live evals now. So one, one of the tweets that I saw from you was that it is, uh, important that you do it paralyzed.Uh, maybe you kind of wanna explain, uh, what, what you guys did.Abhinav: Yeah, for sure. So with LLM Foundry, there's many pieces to it. There's obviously the core training piece, but there's also, you know, tools for evaluation of models. And we've kind of had one of the, I think it's like the, the fastest like evaluation framework.Um, basically it's multi GPU compatible. It runs with Composer, it can support really, really big models. So basically our framework runs so fast that even Azure models are training. We can run these metrics live during the training. So like if you have a dashboard like weights and biases, you kind of watch all these evil metrics.We have, like, 15 or 20 of them honestly, that we track during the run and add negligible overhead. So we can actually watch as our models go and feel confident. Like, it's not like we wait until the very last day to, to test if the models good or notJonathan: That's amazing. Yeah. I love that we've gotten this far into the conversation.We still haven't talked about efficiency and speed. Those are usually our two watch words at Mosaic, which is, you know, that's great. That says that we're [00:36:00] doing a lot of other cool stuff, but at the end of the day, um, you know, Cost comes first. If you can't afford it, it doesn't matter. And so, you know, getting things down cheap enough that, you know, we can monitor in real time, getting things down cheap enough that we can even do it in the first place.That's the basis for everything we do.OPEN SCIENCE FOR AFFORDABLE AI RESEARCH [00:36:00]Alessio: Do you think a lot of the questions that we have around, you know, what data sets we should use and things like that are just because training was so expensive before that, we just haven't run enough experiments to figure that out. And is that one of your goals is trying to make it cheaper so that we can actually get the answers?Jonathan: Yeah, that's a big part of my personal conviction for being here. I think I'm, I'm still in my heart, the second year grad student who was jealous of all his friends who had GPUs and he didn't, and I couldn't train any models except in my laptop. And that, I mean, the lottery ticket experiments began on my laptop that I had to beg for one K 80 so that I could run amist.And I'm still that person deep down in my heart. And I'm a believer that, you know, if we wanna do science and really understand these systems and understand how to make them work well, understand how they behave, understand what makes them safe and reliable. We need to make it cheap enough that we can actually do science, and science involves running dozens of experiments.When I finally, you know, cleaned out my g c s bucket from my PhD, I deleted a million model checkpoints. I'm not kidding. There were over a million model checkpoints. That is the kind of science we need, you know, that's just what it takes. In the same way that if you're in a biology lab, you don't just grow one cell and say like, eh, the drug seems to work on that cell.Like, there's a lot more science you have to do before you really know.Abhinav: Yeah. And I think one of the special things about Mosaic's kind of [00:37:30] position as well is that we have such, so many customers all trying to train models that basically we have the incentive to like to devote all these resources and time to do this science.Because when we learn which pieces actually work, which ones don't, we get to help many, many people, right? And so that kind of aggregation process I think is really important for us. I remember way back there was a paper about Google that basically would investigate batch sizes or something like that.And it was this paper that must have cost a few million dollars during all the experience. And it was just like, wow, what a, what a benefit to the whole community. Now, like now we all get to learn from that and we get, we get to save. We don't have to spend those millions of dollars anymore. So I think, um, kind of mosaical science, like the insights we get on, on data, on pre-screening architecture, on all these different things, um, that's why customers come to us.Swyx: Yeah, you guys did some really good stuff on PubMed, G B T as well. That's the first time I heard of you. Of you. And that's also published to the community.Abhinav: Yeah, that one was really fun. We were like, well, no one's really trained, like fully from scratch domain specific models before. Like, what if we just did a biomed one?Would it still work? And, uh, yeah, I'd be really excited. That did, um, we'll probably have some follow up soon, I think, later this summer.Jonathan: Yeah. Yes. Stay tuned on that. Um, but I, I will say just in general, it's a really important value for us to be open in some sense. We have no incentive not to be open. You know, we make our money off of helping people train better.There's no cost to us in sharing what we learn with the community. Cuz really at the end of the day, we make our money off of those custom models and great infrastructure and, and putting all the pieces together. That's honestly where the Mosaic name came from. Not off of like, oh, we've got, you know, this one cool secret trick [00:39:00] that we won't tell you, or, you know, closing up.I sometimes, you know, in the past couple weeks I've talked to my friends at places like Brain or, you know, what used to be Brain Now Google DeepMind. Oh, I R I P Brain. Yeah. R i p Brian. I spent a lot of time there and it was really a formative time for me. Um, so I miss it, but. You know, I kind of feel like we're one of the biggest open research labs left in industry, which is a very sad state of affairs because we're not very big.Um, but at least can you say how big the team is actually? Yeah. We were about 15 researchers, so we're, we're tiny compared to, you know, the huge army of researchers I remember at Brain or at fair, at Deep Mind back, you know, when I was there during their heydays. Um, you know, but everybody else is kind of, you know, closed up and isn't saying very much anymore.Yeah. And we're gonna keep talking and we're gonna keep sharing and, you know, we will try to be that vanguard to the best of our ability. We're very small and I, I can't promise we're gonna do what those labs used to do in terms of scale or quantity of research, but we will share what we learn and we will try to create resources for the community.Um, I, I dunno, I just, I believe in openness fundamentally. I'm an academic at heart and it's sad to me to watch that go away from a lot of the big labs. THE OPEN APPROACH [00:40:15]Alessio: We just had a live pod about the, you know, open AI snow mode, uh, post that came out and it was one of the first time I really dove into Laura and some of the this new technologies, like how are you thinking about what it's gonna take for like the open approach to really work?Obviously today, GPT four is still, you know, part of like that state-of-the-art model for a [00:40:30] lot of tasks. Do you think some of the innovation and kind of returning methods that we have today are enough if enough people like you guys are like running these, these research groups that are open? Or do you think we still need a step function improvement there?Jonathan: I think one important point here is the idea of coexistence. I think when you look at, I don't know who won Linux or Windows, the answer is yes. Microsoft bought GitHub and has a Windows subsystem for Linux. Linux runs a huge number of our servers and Microsoft is still a wildly profitable company.Probably the most successful tech company right now. So who won open source or closed source? Yes. Um, and I think that's a similar world that we're gonna be in here where, you know, it's gonna be different things for different purposes. I would not run Linux on my laptop personally cuz I like connecting to wifi and printing things.But I wouldn't run Windows on one of my surfers. And so I do think what we're seeing with a lot of our customers is, do they choose opening IR mosaic? Yes. There's a purpose for each of these. You have to send your data off to somebody else with open eyes models. That's a risk. GPT four is amazing and I would never promise someone that if they come to Mosaic, they're gonna get a GPT four quality model.That's way beyond our means and not what we're trying to do anyway. But there's also a whole world for, you know, domain specific models, context specific models that are really specialized, proprietary, trained on your own data that can do things that you could never do with one of these big models. You can customize in crazy ways like G B T four is not gonna hit 65 K context length for a very long time, cuz they've already trained that [00:42:00] model and you know, they haven't even released the 32 K version yet.So we can, you know, we can do things differently, you know, by being flexible. So I think the answer to all this is yes. But we can't see the open source ecosystem disappear. And that's the scariest thing for me. I hear a lot of talk in academia about, you know, whatever happened to that academic research on this field called information retrieval?Well, in 1999 it disappeared. Why? Because Google came along and who cares about information retrieval research when you know you have a Google Scale, you know, Web Scale database. So you know, there's a balance here. We need to have both. Swyx: I wanna applaud you, Elaine. We'll maybe edit it a little like crowd applause, uh, line.Cuz I, I think that, um, that is something that as a research community, as people interested in progress, we need to see these things instead of just, uh, seeing marketing papers from the advertising GPT 4.Jonathan: Yeah. I, I think I, you know, to get on my soapbox for 10 more seconds. Go ahead. When I talk to policymakers about, you know, the AI ecosystem, the usual fear that I bring up is, Innovation will slow because of lack of openness.I've been complaining about this for years and it's finally happened. Hmm. Why is Google sharing, you know, these papers? Why is Open AI sharing these papers? There are a lot of reasons. You know, I have my own beliefs, but it's not something we should take for granted that everybody's sharing the work that they do and it turns out well, I think we took it for granted for a while and now it's gone.I think it's gonna slow down the pace of progress. In a lot of cases, each of these labs has a bit of a monoculture and being able to pass ideas [00:43:30] back and forth was a lot of what kept, you know, scientific progress moving. So it's imperative not just, you know, for the open source community and for academia, but for the progress of technology.That we have a vibrant open source research community.THE FUTURE OF MOSAIC [00:44:11]Swyx: There’s a preview of the ecosystem and commentary that we're, we're gonna do. But I wanna close out some stuff on Mosaic. You launched a bunch of stuff this month. A lot of stuff, uh, actually was, I was listening to you on Gradient descent, uh, and other podcasts we know and love.Uh, and you said you also said you were not gonna do inference and, and, and last week you were like, here's Mosaic ML inference. Oops. So maybe just a, at a high level, what was Mosaic ml and like, what is it growing into? Like how do you conceptualize this? Jonathan: Yeah, and I will say gradient, when graded dissent was recorded, we weren't doing inference and had no plans to do it.It took a little while for the podcast to get out. Um, in the meantime, basically, you know, one thing I've learned at a startup, and I'm sure abhi can comment on this as well, focus is the most important thing. We have done our best work when we've been focused on doing one thing really well and our worst work when we've tried to do lots of things.Yeah. So, We don't want to do inference, we don't want to have had to do inference. Um, and at the end of the day, our customers were begging us to do it because they wanted a good way to serve the models and they liked our ecosystem. And so in some sense, we got dragged into it kicking and screaming. We're very excited to have a product.We're going to put our best foot forward and make something really truly amazing. But there is, you know, that's something that we were reluctant to do. You know, our customers convinced us it would be good for our business. It's been wonderful for business and we are gonna put everything into this, but you know, back when grading dissent came out, I [00:45:00] was thinking like, or when we recorded it or focused, oh God, like focus is the most important thing.I've learned that the hard way multiple times that Mosaic, abhi can tell you like, you know, I've made a lot of mistakes on not focusing enough. Um, boy inference, that's a whole second thing, and a whole different animal from training. And at the end of the day, when we founded the company, our belief was that inference was relatively well served at that time.There were a lot of great inference companies out there. Um, training was not well served, especially efficient training. And we had something to add there. I think we've discovered that as the nature of the models have changed, the nature of what we had to add to inference changed a lot and there became an opportunity for us to contribute something.But that was not the plan. But now we do wanna be the place that people come when they wanna train these big, complex, difficult models and know that it's gonna go right the first time and they're gonna have something they can servee right away. Um, you know, really the rep example of, you know, with 10 days to go saying, Hey, can you please train that model?And, you know, three or four days later the model was trained and we were just having fun doing interesting, fine tuning work in it for the rest of the 10 days, you know. That also requires good inference. Swyx: That’s true, that's true. Like, so running evals and, and fine tuning. I'm just putting my business hat on and you know, and Alessio as well, like, uh, I've actually had fights with potential co-founders about this on the primary business.Almost like being training, right? Like essentially a one-time cost.Jonathan: Who told you it was a one time cost? What, who, who told you that?Swyx: No, no, no, no. Correct me. Jonathan: Yeah. Yeah. Let me correct you in two ways. Um, as our CEO Navine would say, if he were here, when you create version 1.0 of your software, do you then fire all the engineers?Of [00:46:30] course not. You never, like, MPT has a thousand different things we wanted to do that we never got to. So, you know, there will be future models.Abhinav: And, and the data that's been trained on is also changing over time too, right? If you wanna ask anything about, I guess like May of 2023, we'll have to retrain it further and so on.Right? And I think this is especially true for customers who run like the kind of things that need to be up to date on world knowledge. So I, I think like, you know, the other thing I would say too is that, The malls we have today are certainly not the best malls we'll ever produce. Right. They're gonna get smaller, they're gonna get faster, they're gonna get cheaper, they're gonna get lower latency, they're gonna get higher quality.Right? And so you always want the next gen version of MPT and the one after that and one after that. There's a reason that even the GPT series goes three, four, and we know there's gonna be a five. Right? Um, so I I I also don't see as a, as a one-time cost.Jonathan: Yeah. Yeah. And I, if you wanna cite a stat on this, there are very, very few stats floating around on training versus inference cost.Mm-hmm. One is this blog post from I think David Patterson at Google, um, on the energy usage of ML at Google. And they break down and say three fifths of energy over the previous three years. I think this 2022 article was for inference, and two fifths were for training. And so actually that, you know, this is Google, which is serving models to billions of users.They're probably the most inference heavy place in the world. It's only a two fifth, three fifth breakdown, and that's energy training. Hardware is probably more expensive because it has fancier networking. That could be a 50 50 cost breakdown. And that's Google for a lot of other folks. It's gonna be weighed even more heavily, in favor of training.SPEED AND EFFICIENCY [00:48:01]Swyx: Amazing answer. Well, thanks. Uh, we can, we can touch on a little bit [00:48:00] on, uh, efficiency and speed because we, we, uh, didn't mention about that. So right now people spend between three to 10 days. You, you spend 10 days on, on mpc, seven rep spend three days. What's feasible? What's what Do you wanna get it down to?Abhinav: Oh, for, for these original models? Yeah. Yeah. So I think, um, this is probably one of the most exciting years, I think for training efficiency, just generally speaking, because we have the, the combination of a couple things, like one is like this next generation of hardware, like the H 100 s coming out from Nvidia, which on their own should be like, at least like a two x improvement or they 100 s on top of that, there's also a new floating point format f P eight, um, which could also deliver that alone.Does it? Yes. Yeah. Yeah. How, what, why? Oh, the f p thing? Yeah. Yeah. So basically what's happening is that, you know, when we do all of our math, like in the models matrix, multiplication, math, we do it in a particular precision. We started off in 32 bit precision a few years ago, and then in video came with 16 bit, and over the course of several years, we've all figured out how to do 16 bit training and that basically, you know, due to the harder requirements like.Increase the throughput by two x, reduce the cost by two x. That's about to happen again with FBA eight, like starting this year. And with Mosaic, you know, we've already started profiling L L M training with f p eight on H 100 s. We're seeing really, really good improvements there. And so you're gonna see a huge cost reduction this year just from this hardware fact alone.On top of that, you know, there's a lot of architectural applications. We're looking at ways to introduce some forms of sparsity, not necessarily like the, the, the super unstructured sparsity like lottery ticket. Um, which not that I'm sure I'm really happy to talk about. Um, but, but, um, are there ways of doing, like you [00:49:30] gating or like, kind of like m moe style architectures?So, you know, I think originally, you know, what was like 500 k. I think to try and train a Jeep, the equality model, if at the end of the year we could get that down to a hundred k, that would be fantastic.Swyx: That is this year's type of thing. Jonathan: Not, not, like, that's not a pie in the sky thing. Okay. It is not, it's not a place we are now, but I think it is a, you know, I don't think more than a year in the future these days, cuz it's impossible.I think that is very much a 2023 thing. Yeah. Yeah. Okay. And hold me to that later this year.Swyx: G PT three for a hundred K, let's go. Um, and then also stable diffusion originally reported to be 600 K. Uh, you guys can get it done for under 50. Anything different about image models that we should image, to text?Jonathan: Um, I mean I think the, the most important part in all this is, you know, it took us a while to get 50 down by almost seven x. That was our original kind of proof of concept project for Mosaic. You know, just at the beginning to show like, you know, we can even do this and our investors should give us more money.But what I love about newer models that come out is they're always really slow. We haven't figured out how to optimize them yet. And so there's so much work to be done. So getting, you know, in that case, I guess from the cost you mentioned like a 12 x cost reduction in stable diffusion. Mm-hmm. Honestly it was a lot easier than getting a seven X for RESNET 50 an image net or a three X for Burt, cuz the architecture was much newer and there were a lot of inefficiencies to improve.Um, you know, I'm guessing that's gonna continue to be the case as we lean toward the bleeding edge and try to, you know, push the bleeding edge. I hope that, you know, in some sense you'll see smaller speed ups from us because the new models will come from us and they'll already be fast.Alessio: So that's making existing [00:51:00] things better with the, the long boy, the 60 5K context window, uh, you've doubled instead of the r.There was the R M T a couple weeks ago that had a possible 1 million. Uh, that's the unlimited former thing that came out last week, which is theoretically limitless context. What should people think about trade offs? Implications? You mentioned memories kind of start to become one of the bounds.Yeah. What's the right number? Like is it based on the customer's needs? Like how would you advise customers and startups who might be building their own models?Jonathan: It's all contextual. You know, there's a lot of buzz coming for long contexts lately with a lot of these papers. None of them are exact. In terms of the way that they're doing attention.And so there's, you know, to some extent there's an approximation or a trade off between doing some kind of inexact or approximate or hierarchical or, you know, non quadratic attention versus doing it explicitly correctly the quadratic way. I'm a big fan of approximation, so I'm eager to dig into these papers.If I've learned one thing from writing and reading papers, it's to believe nothing until I've implemented it myself. And we've certainly been let down many, many, many times at Mosaic by papers that look very promising until we implement them and realize, you know, here's how they cook the books on their data.Here's, you know, the one big caveat that didn't show up in the paper. So I look at a lot of this with skepticism until, you know, I believe nothing until I re-implement it. And in general, I'm rewarded for doing that because, you know, a lot of this stuff doesn't end up working quite as well in practice.This is promised in a paper, the [00:52:30] incentives just aren't there, which is part of the reason we went with just pure quadratic attention here. Like it's known to work. We didn't have to make an approximation. There's no asterisk or caveat. This was in some sense a sheer force of will by our amazing engineers.Alessio: So people want super long context because, you know, they wanna feed more documents and right now people do it with embeddings and feed them into the context window. How do you kind of see that changing? Are we gonna get to a point where like, you know, maybe it's 60 4k, maybe it's 120 k, where it's like, okay.You know, semantic search and embeddings are gonna work better than just running a million parameters, like a million token context window.Jonathan: Do, do you wanna say the famous thing about 64 K? Does somebody wanna say that, that statement, the, you know, the 64 K is all you'll ever need? The Bill Gates statement about Rams.Swyx: Andre Kaparthi actually made that comparison before that, uh, context is essentially Ram,Jonathan: if I get quoted here saying 60 4K is all you need, I will be wrong. We have no idea. People are gonna get ambitious. Yes. Um, GPT four has probably taken an image and turning it into a bunch of tokens and plugging it in.I'm guessing each image is worth a hell of a lot of tokens. Um, maybe that's not a thousand words. Not a thousand words, but, you know, probably a thousand words worth of tokens, if not even more so. Maybe that's the reason they did 32 k. Maybe, you know, who knows? Maybe we'll wanna put videos in these models.Like every time that we say, ah, that isn't that model big enough, somebody just gets more ambitious. Who knows? TRENDS AND TRANSFORMERS [00:54:00]Swyx: Right? Um, you've famously made one. [00:54:00] Countertrend, uh, bet, which is, uh, you, you're actually betting that, uh, transformers will stick around for a long time. Jonathan: How is that counter trend? Swyx: Counter trend is in, you just said, a lot of things won't last.Right. A lot of things will get replaced, uh, really easily, butJonathan: transformers will stick around. I mean, look at the history here. How long did the Convolutional neural network stick around for? Oh wait. They're still here and vision Transformers still haven't replaced them. Mm-hmm. How long did r and n stick around for?Decades. And, you know, they're still alive and kicking in a bunch of different places, so, you know. The fundamental architecture improvements are really hard to come by. I can't wait to collect from Sasha on that bet.Abhinav: I, I think a lot of your bet hinges on what counts as attention, right.Swyx: Wait, what do you mean?Well, how, how can that change? Oh, because it'll be approximated.Abhinav: Well, I suppose if, if we ever replace like the Qk multiplication, something that looks sort of like it, I, I wonder who, who, who comes out on top here.Jonathan: Yeah. I mean at the end of the day is a feed forward network, you know, that's fully connected, just a transformer with very simple attention.Mm-hmm. Um, so Sasha better be very generous to me cause it's possible that could change, but at the end of the day, we're still doing Transformers the way, you know, Vaswani had all intended back six years ago now, so, I don't know, things. Six years is a pretty long time. What's another four years at this point?Alessio: Yeah. What do you think will replace it if you lose Ben? What do you think? You would've lost  it time?Jonathan: If I knew that I'd be working on it.Abhinav:  I think it's gonna be just like MLPs, you know, that's the only, that's the only way we can go, I think at this point, because Thelp, I, I dunno. Oh, just basically down to, to um, to linear layers.[00:55:30]Oh, mostly the percepts. Exactly. Got, yeah. Yeah. Yeah. Cuz the architecture's been stripped, simplified so much at this point. I think, uh, there's very little left other than like some linear layers, some like residual connections and, and of course the attention, um, dot product.Jonathan: But you're assuming things will get simpler, maybe things will get more complicated.Swyx: Yeah, there's some buzz about like, the hippo models. Hungry, hungry hippos.Jonathan: I, I mean there's always buzz about something, um, you know, that's not to dismiss this work or any other work, but there's always buzz about something. I tend to wait a little bit to see if things stand the test of time for like two weeks.Um, at this point, it used to be, you know, a year, but now it's down to two weeks. Oh. But you know, I'm. I don't know. I don't like to follow the hype. I like to see what sticks around, what people actually manage to build off of. Swyx: I have a follow up question actually on that. Uh, what's a, what's an egregiously overrated paper that once you actually looked into it fell apart completely?Jonathan: I'm not going down that path. Okay. I, you know, I even, even though I think there are papers that, you know, did not hold up under scrutiny, I don't think any of this was out of malice. And so I don't wanna go down that path. Alessio: Yeah. I know you already talked about your focus on open research. Are you mostly gonna focus on open models or are there also, are you working on configurations that are more just for your customers and private, like, what percentage of your time are you focusing on, on open work?Jonathan: It's a little fuzzy. I mean, I think at the end of the day you have to ask what is the point of our business? Our business is not just to train a bunch of open models and give them to the world. That would, our VCs probably wouldn't be very happy if that were the case. The open [00:57:00] models serve our business because they're demos.A demo does not mean we give away everything. Um, a demo does not mean every single thing we do is shared with the world, but. We do have a business imperative to share with the world, which I kind of like. That was part of the design of the company, was making sure we had an imperative to do science and an imperative to share.But we are still a company and we do have to make money, but it would be a disaster for our business if we didn't share. And that's by design from the start. So, you know, there's certainly going to be some work that we do that is for our customers only, but by and large for anything that we wanna advertise to customers, there has to be something that is meaningful and useful that's out there in the world.Otherwise we can't convince people that we have it. Abhinav: Yeah, I think like this, our recent inference product also makes the decision easier for us, right? So even since these open malls like we've developed so far, um, you can actually like, you know, uh, query them on our inference api, like our starter tier, and we basically charge like a, a per token fee.Very, very similar to the other API fighters. So there are pathways by which, you know, like even the open mall we provide for free still end up like helping our business out, right? You can customize them, deploy them on our, on our platform, and that way we, we still make money off of them.Alessio: Do you wanna jump into the landing ground?Anything else that you guys wanna cover that we didn't get to?Jonathan: This has been great. These are great questions. Swyx: Do you want to dish on why Sparsity is not a focus for Mosaic?Jonathan: Um, I can just say that, you know, sparsity is not a focus for Mosaic and I am definitely over lottery tickets when I give my mosaic talk.The first slide is a, you know, a circle with a slash through it over a lottery ticket. [00:58:30] Um, and anyone who mentions lottery tickets, I ask to leave the room. Um, cuz you know there's other work out there. But Abhi, please feel free to dish on sparsity.Abhinav: Yeah, I, I think it really comes down to the fact that we don't have hardware yet that can accelerate it.Right? Or at least it's been mostly true for a long period of time. So the kinds of sparsity that the lottery check was working on was like if you put random zeros in the, in the weights, you know, and basically we found basically the fast year is that yes, you can turn most of the weights to zeros and the model still does kind of work, but there's no hardware out there that can take a matrix with a bunch of zeros and one without and make it go fast.Now, the one caveat for this, and this is gonna sound like a bit of advertisement, is, is Cereus actually, and they've been, since the beginning, they've built that architecture for Sparsity and they've actually published some research papers just earlier this year showing that yes, they really can train with Sparsity and get, this is, uh, sparse.U P T. Exactly. Yeah, exactly right. So, the final missing piece is really like, okay, we have the science to show you can train with sparse models, you know, from initialization even, or, or close initialization. Um, the last piece is just, is there a piece of hardware that actually speeds it up and gives you a cost savings?In which case, like the, the field is wide open. Jonathan: The other big challenge here is that if you want to make sparsity go fast in general right now on standard hardware, you do need it to be structured in various ways. And any incremental amount of structure that you force on the sparsity dramatically reduces the quality of the resulting model that you get up to the point where if you remove just, you know, entire neurons from the model, you're just making the layers smaller and that really hurts the quality of the model.So these models, steel is all you need. These models love unstructured [01:00:00] sparsity. Um, and yeah, if there were a chip and a software package that made it really, really easy to accelerate it, I bet we would be doing it at Mosaic right now. Alessio: This is like Sarah Hooker's point with the hardware lottery post, talking about lotteries.Absolutely. Where you know, if you don't have the right hardware, some models, architectures just can't emerge quickly enough.Abhinav: This there, there's like an invariance to think of, which is that today's popular models always run fast on today's hardware. Like this, this has to be true. Mm-hmm. Right? Like there's no such thing as a popular model that runs slow cuz no one would've developed it.Yeah. Um, so it's kind of like with the new architectures, right? If there's new hardware that can do sparsity, you have to co-evolve like a new architecture that works with it. And then those two pair together really well. Transformers and GPUs are like a match made in heaven. Jonathan: How would say transformers and GPUs are a match made in heaven.Yeah. And we're lucky that they work on GPUs, but the folks at Google D designed them for TPUs cuz TPUs and R and Ns were not a match made in heaven.LIGHTNING ROUND AND CLOSING [1:00:55]Alessio: All right, we have three questions. One is on acceleration, one on exploration, and then just a takeaway for the audience. And you can, you know, either of you can start and the other can finish.So the first one is, what has already happened in AI That thought would take much longer than it has?Abhinav: Do you have an answer, Jon? Jonathan: Yeah, I have answer everything. Um, you know, I, I remember when GPT two came out and I looked at that and went, eh, you know, that doesn't seem very exciting. And gosh, it's already 1.5 billion parameters.You know, they can't possibly keep getting better as they make it bigger. And then GPT three came out and I was like, eh, it's slightly better at [01:01:30] generating text. Yeah, who cares? And you know, I've been wrong again and again and again. That. Next token prediction, making things big can produce useful models.To be fair, pretty much all of us were wrong about that. So I can't take that precisely on myself. Otherwise, Google, Facebook and Microsoft Research would all have had killer large language models way before opening I ever got the chance to do it. Um, opening I made a very strange bet and it happened to work out very well.But yeah, diffusion models, like they're pretty stupid at the end of the day and they produce beautiful images, it’s astounding.Abhinav: Yeah, I think my, my answer is gonna be like the, the chatbots at scale, like idea, like basically I thought it would be quite a while before, you know, like hundreds of millions of people will be talking to AI models for a large portion of the data, but now there's many startups and companies not, not just open with chat pt, but, but you know, like character and others where, um, it, it's really astounding, like how many people are actually developing like emotional connections to these, to these AI models.And I don't think I was. Would've predicted that like September, October of last year. But you know, the inflection point of the last six months has been really surprising.Swyx: I haven't actually tried any of these models, but I, I don't know. It seems like a very educational thing. It's like, oh, talk to Genius can, but like that's a very educational use case.Right? Right. Like what, what do you think they're using for, I guess, emotional support?Abhinav: Well, yes. I mean, I think some of them are sort of like, yeah, like either for emotional support or honestly just friends and stuff. Right. I mean, I think like, you know, loneliness mental health is a really a big problem everywhere.And so the most interesting I think I've found is that if you go to the subreddits, you know, for those communities and you see like how they [01:03:00] talk about and think about their like AI friends and like these characters, it's, it's, it's like out of a science fiction book, like I would never expect this to be like reality.Swyx: Yeah. What do you think are the most interesting unsolved questions in ai?Abhinav: I'm really interested in seeing how far down we can go in terms of precision and, and stuff like that. Particularly similar to the BF16 FP thing. Swyx: Okay. Um, there's also like just quantizing until like it's two bits.Abhinav: Yeah, exactly. Like, or even like down to analog or something like that. Because our brains obviously are not running on digital logic and stuff and so, you know, how many orders of magnitude do we have remaining in kind of like just these um, things and I wonder if some of these problems just get easier with scale.Like there have been sort of hints in some papers that, you know, it becomes easier to quantize or easier to prune as it gets bigger and bigger. So maybe as we, almost as a natural consequence of a scaling up over the next few years, will we just naturally become easier and easier to just start going to like four bits or two that are even binary leg weights.Jonathan: I want to know how small we can go in a different way. I just want to know how efficient we can make it to get models that are this good. That was my research question for my entire PhD lottery tickets were one way to get at that. That's now kind of the research question I'm chasing at Mosaic in a sense.I, you know, open ai has shown us that there is one path to getting these incredible capabilities that is scale. I hope that's not the only path. I hope there are lots of ways of getting there. There's better modeling, there are better algorithms. I hate the neuroscience metaphors, but in some sense, our existence and our brains are, you know, evidence that there is at least one other way to get to these kinds of incredible capabilities that doesn't require, you know, [01:04:30] a trillion parameters and megawatts and megawatts and gazillions of dollars.So, you know, I do wonder how small we can go? Is there another path to get to these capabilities without having to do it this way? If it's there, I hope we find it at Mosaic.Swyx: Yeah my, my favorite fact is something on the order of the human brain runs on 30 watts of energy, and so we are, we're doing like dozens of orders of magnitude off on that one.Abhinav: I, I don't think you can get like one gpu, one different. Yeah.Alessio: If there’s one message you want everyone. To remember when thinking about this thing. There's a lot of, you know, fear mongering. There's a lot of messaging being spread around, like, what should people think about in ai? What should be top of mind for them?Jonathan: I'll go for it. Which is, you know, stay balanced. They're the people who really feed into the hype or who, you know, eat up the hype. They're the people who are, you know, big pessimists or react very strongly against the hype, or to some extent are in denial. Stay balanced, embrace the fact that we've built extraordinarily useful tools.Um, but we haven't built a g I and you know, personally, I don't think we're anywhere close to that. You know, so stay balanced and follow the science. I think that's really, that's what we try to do around Mosaic. We try to focus on what's useful to people, what will, you know, hopefully make the world a better place.We try our best on that, but especially, you know, how we can follow the science and use data to be our guide, not just, you know, talk a lot, you know, try to talk through our work instead.Abhinav: And I would also say just kinda like research done in the open. I think like, you know, there's no computing with the, the open community, [01:06:00] right?Just in volume, the number of like, kind of eyeballs you basically have, like looking at your models at the, even at the problems with the models, at ways we improve them. Um, I just think, you know, yeah, research done in the open. It will, it will be the way forward, both to keep our models safe and to bely, like examine the consequences of these AI models like in the world.Alessio: Awesome. Thank you so much guys for coming on.Swyx: and thanks for keeping AI open. Abhinav: Thank you for having us. Jonathan: Yeah. Thank you so much for having us. Get full access to Latent Space at www.latent.space/subscribe
undefined
May 16, 2023 • 1h 2min

Guaranteed quality and structure in LLM outputs - with Shreya Rajpal of Guardrails AI

Tomorrow, 5/16, we’re hosting Latent Space Liftoff Day in San Francisco. We have some amazing demos from founders at 5:30pm, and we’ll have an open co-working starting at 2pm. Spaces are limited, so please RSVP here!One of the biggest criticisms of large language models is their inability to tightly follow requirements without extensive prompt engineering. You might have seen examples of ChatGPT playing a game of chess and making many invalid moves, or adding new pieces to the board. Guardrails AI aims to solve these issues by adding a formalized structure around inference calls, which validates both the structure and quality of the output. In this episode, Shreya Rajpal, creator of Guardrails AI, walks us through the inspiration behind the project, why it’s so important for models’ outputs to be predictable, and why she went with an XML-like syntax. Guardrails TLDRGuardrails AI rules are created as RAILs, which have three main “atomic objects”:* Output: what should the output look like?* Prompt: template for requests that can be interpolated* Script: custom rules for validation and correctionEach RAIL can then be used as a “guard” when calling an LLM. You can think of a guard as a wrapper for the API call. Before returning the output, it will validate it, and if it doesn’t pass it will ask the model again. Here’s an example of a bad SQL query being returned, and what the ReAsk query looks like: Each RAIL is also model-agnostic. This allows for output consistency across different models, even if they have slight differences in how they are prompted. Guardrails can easily be used with LangChain and other tools to structure your outputs!Show Notes* Guardrails AI* Text2SQL* Use Guardrails and GPT to play valid chess* Shreya’s AI Tinkerers demo* Hazy Research Lab* AutoPR* Ian Goodfellow* GANs (Generative Adversarial Networks)Timestamps* [00:00:00] Shreya's Intro* [00:02:30] What's Guardrails AI?* [00:05:50] Why XML instead of YAML or JSON?* [00:10:00] SQL as a validation language?* [00:14:00] RAIL composability and package manager?* [00:16:00] Using Guardrails for agents* [00:23:50] Guardrails "contracts" and guarantees* [00:31:30] SLAs for LLMs* [00:40:00] How to prioritize as a solo founder in open source* [00:43:00] Guardrails open source community involvement* [00:46:00] Working with Ian Goodfellow* [00:50:00] Research coming out of Stanford* [00:52:00] Lightning RoundTranscriptAlessio: [00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio partner and CTO-in-Residence at Decibel Partners. I'm joined by my cohost Swyx, writer and editor of Latent Space.Swyx: And today we have Shreya Rajpal in the studio. Welcome Shreya.Shreya: Hi. Hi. Excited to be here.Swyx: Excited to have you too.This has been a long time coming, you and I have chatted a little bit and excited to learn more about guardrails. We do a little intro for you and then we have you fill in the blanks. So you, you got your bachelor's at IIT Delhi minor in computer science with focus on AI, which is super relevant now. I bet you didn't think about that in undergrad.Shreya: Yeah, I think it's, it's interesting because like, I started working in AI back in 2014 and back then I was like, oh, it's, it's here. This is like almost changing the world already. So it feels like that that like took nine years, that meme of like, almost like almost arriving the thing.So yeah, I, it's felt this way where [00:01:00] it's almost shared. It's almost changed the world for as long as I've been working in it.Swyx: Yeah. That's awesome. Maybe we can explore your, like the origins of your interests, because then you went on to U I U C to do your master's also in ai. And then it looks like you went to drive.ai to work on Perception and then to Apple S P G as, as the cool kids call it special projects group working with Ian Goodfellow.Yeah, that's right. And then you were at pretty base up until recently? Actually, I don't know if you've quit yet. I have, yeah. Okay, good, good, good. You haven't updated e LinkedIn, but we're getting the by breaking news that you're working on guardrails full-time. Yeah, well that's the professional history.We can double back to fill in the blanks on anything. But what's a personal side? You know, what's not on your LinkedIn that people should know about you?Shreya: I think the most obvious thing, this is like, this is still professional, but the most obvious thing that isn't on my LinkedIn yet is, is Guardrails.So, yeah. Like you mentioned, I haven't updated my LinkedIn yet, but I quit some time ago and I've been devoting like all of my energy. Yeah. Full-time working on Guardrails and growing the open source package and building out exciting features, et cetera. So that's probably the thing that's missing the most.I think another. More personal skill, which I [00:02:00] think I'm like kind of okay for an amateur and that isn't on my LinkedIn is, is pottery. So I really enjoy pottery and yeah, don't know how to slot that in amongst, like, all of the AI. So that's not in there. Swyx: Well, you like shaping things into containers where, where like unstructured things and kind of flow in, so, yeah, yeah, yeah. See I can, I can spin it for you.Shreya: I should, I should use that. Yeah. Yeah.Alessio: Maybe for the audience, you wanna give a little bit of intro on Guardrails AI, what it is, why you wanted to start itShreya: Yeah, yeah, for sure. So Guardrails or, or the need for Guardrails really came up as I was kind of like building some of my own projects in the space and like really solving some of my own problems.So this was back of like end of last year I was kind of building some applications, like everybody else was very excited about the space. And I built some stuff and I quickly realized that yeah, I could, you know it works like pretty well a bunch of times, but like a lot of other times it really does not work as I, the developer of this tool, like, want my tool to work.And then as a developer like I can tell that there's very few tools available for me to like, get this to, you know cooperate [00:03:00] with me, like get it to follow directions, etc. And the only tool I really have is this prompt. And there's only so, so far you can go with like, putting instructions in like caps, adding a bunch of exclamations and being like, follow my instructions. Like give me this output this way. And so I think like part of it was, You know that it's not reliable, et cetera. But also as a user, it just if I'm building an application for a user, I just want the user to have a have a certain experience using it. And there's just not enough control to me, not enough, like knobs for me to tune, you know as a developer to do that.So guardrails kind of like came up as a way to just like, manage this better. The tool basically, I was like, okay. As I'm building this, I know from the ground up, like what is the experience I want the user to add, to have like, what is a great LLM output look like for me? And so I wanted a tool that allows me to kind of specify that and enforce those constraints.As I was thinking of this, I was like, this should be very extensible, very flexible so that there's a bunch of use cases that can be handled, et cetera. But the need really like, kind of came up from my own from my own, like I was basically solving for my own pain points.[00:04:00]So that's a little bit of the history, but what the tool does is that it allows you to kind of like specify. It's this two-part system where there's a specification framework and then there's like a code that enforces that specification on the LLM outputs. So the specification framework allows you to be like as coarse or as fine grained as you care about.So you can essentially think about what is the, on a very like first order business, like where is the structure and what are the types, etc, of the output that I want. If you want structured outputs from LLMs. But you can also go like very into semantic correctness with this, with a. I just released something this morning, which is that if you're summarizing a bunch of documents, make sure that it's a very faithful summary.Make sure that there's like coherence amongst like what the output is, et cetera. So you can have like all of these semantic guarantees as well. And guardrails created like rails, like a reliable AI markup language that allows you to specify that. And along with that, there's like code that backs up that specification and it makes sure that a, you're just generating prompts that are more likely to get you the output in the right manner to start out with.And then once you get that output all of the specification criteria you entered is like [00:05:00] systematically validated and like corrected. And there's a bunch of like tools in there that allow you a lot of control to like handle failures much more gracefully. So that's in a nutshell what guardrails does.Awesome.Alessio: And this is model agnostic. People can use it on any model.Shreya: Yeah, that's right. When I was doing my prototyping, I like was developing with like OpenAI, as I'm sure like a bunch of other developers were. But since then I've added support where you can basically like plug in any, essentially any function or any callable as long as you, it has a string input.String output you can plug it in there and I've had people test it out with a bunch of other models and get pretty good results. Yeah.Alessio: That's awesome. Why did you start from XML instead of YAML or JSON?Shreya: Yeah. Yeah. I think it's a good question. It's also the question I get asked the most. Yes. I remember we chat about this as well the first chat and I was like, wait, okay, let's get it out of the way. Cause I'm sure you answered this a lot.Shreya: So it is I didn't start out with it is the truth. Like, I think I started out from this code first framework service initially like Python classes, et cetera. And I was like, wait, this is too verbose. This is like I, as I'm thinking about what I want, I truly just [00:06:00] want this is like, this is what this dictionary should look like for me, right?And having to like create classes on top of that just seemed like a higher upfront cost. Like obviously there's a balance there. Like there's some flexibility that classes and code affords you that maybe isn't there in a declarative markup language. But that that was my initial kind of like balance there.And then within markup languages, I experimented with the bunch, but the idea, like a few aesthetic things about xml, like really appeal to me, as unusual as that may sound. But I think one is this idea of like properties off. Any field that you're getting back from an LLM, right. So I think one of the initial ones that I was experimenting with was like TypeScript, et cetera.And with TypeScript, like all of the control you have is like, you try to like stuff as much information as possible in the name of the key, right? But that's not really sufficient because like in, in XML or, or what gars allows you to do is like maybe add like descriptions for each field that you're getting, which like is, is really very helpful because that almost acts as a proxy prompt.You know, and, and it gets you like better outputs. You can add in like what the correctness criteria or what the validity criteria is for this field, et [00:07:00] cetera. That also gets like passed through to the prompt, et cetera. And these are all like, Properties for a single field, right? But fields themselves can be containers and can have like other nested like fields within them.And so the separation of like what's a property of a field versus what's like child of a field, et cetera, was like nice to me. And having like all of this metadata contained within this one, like tag was like kind of elegant. It also mapped very well to this idea of like error handling or like event handling because like each field may fail in weird ways.It's very inspired from H T M L in that way, in that you have these like event handlers for like, oh, if this validity criteria for this field fails maybe I wanna re-ask the large language model and here's my re-asking parameters, et cetera. Whereas like, if other criteria fail there's like maybe other ways to do to handle that.Like maybe I don't care about it as much. Right. So, so that seemed pretty elegant to me. That said, I've talked to a lot of people who are very opinionated about it. My, like, the thing that I was optimizing for was essentially that it seemed clean to me compared to like other things I tried out and seemed as close to English as [00:08:00] possible.I tested it out with, with a bunch of friends you know, who did not have tag backgrounds or worked in tag but weren't like engineers and it like and they resonated and they were able to pick it up. But I think you'll see updates in the works where I meet people where they are in terms of like, people who, especially like really hate xml.Like there's something in the works where there'll be like a code first version of this. And also like other markup languages, which I'm actively exploring. Like what is a, what is a joyful experience to have for like other market languages. Yeah. DoSwyx: you think that non-technical people would.Use rail was because I was, I was just surprised by your mention that you tested it on non-technical people. Is that a design goal? Yeah, yeah,Shreya: for sure. Wow. Okay. We're seeing this big influx of, of of people who are building tools with these applications who are kind of like, not machine learning people.And I think like, that's truly the kind of like big explosion that we're seeing. Right. And a lot of them are like getting so much like value out of like lms, but because it allows you like earlier if you were to like, I don't know. Build a web scraper, you would need to do this like via code.[00:09:00] But now like you can get not all the way, but like a decent amount of way there, like with just English. And that is very, very powerful. So it is a design goal to like have like essentially low floor, high ceiling is, was like absolutely a design goal. So if, if you're used to plain English and prompting using Chad PK with plain English, then you can it should be very easy for you to kind of like pick this up and there's not a lot of gap there, but like you can also build like pretty complex workflows with guardrails and it's like very adaptable in that way.Swyx: The thing about having custom language is essentially other people can build. Stuff that compiles to you. Mm-hmm. Which is also super nice and, and visual layers on top. Like essentially HTML is, is xml, like mm-hmm. And people then build the WordPress that is for non-technical people to interface with html.Shreya: I don't know. Yeah, yeah. No, absolutely. I think like in the very first week that Guardrails was out, like somebody reached out to me and they were pm and they essentially were like, I don't, you know there's a lot of people on my team who would love to use this, but just do not write code.[00:10:00] Like what is the, where is a visual interface for building something like this? But I feel like that's, that's another reason for why XML was appealing, because it's essentially like a document structuring, like it's a way to think about like documents as trees, right? And so again, if you're thinking about like what a visual interface would be, then maps going nicely to xml.But yeah. So those are some of the design considerations. Yeah.Swyx: Oh, I was actually gonna ask this at the end, but I'm gonna bring it up now. Did you explore sql, like. Syntax. And obviously there's a project now l m qr, which I'm sure you've looked at. Yeah. Just compare, contrast, anything.Shreya: Yeah. I think from my use case, like I was very, how I wanted to build this package was like essentially very, very focused on developer ergonomics.And so I didn't want to like add a lot of overhead or add a lot of like, kind of like high friction essentially like learning a whole new dialect of sequel or a sequel like language is seems like a much bigger overhead to me compared to like doing things in XML or doing things in a markup language, which is much more intuitive in some ways.So I think that was part of the inspiration for not exploring sql. I'd looked into it very briefly, but I mean, I think for my, for my own workflows, [00:11:00] I wanted to make it like as easy as possible to like wrap whatever LLM API calls you make. And, and to me that design was in markup or like in XML, where you just define your desiredSwyx: structures.For what it's worth. I agree with you. I would be able to argue for LMQL because SQL is the proven language for business analysts. Right. Like less technical, like let's not have technical versus non-technical. There's also like less like medium technical people Yeah. Who learn sql. Yeah. Yeah. But I, I agree with you.Shreya: Yeah. I think it depends. So I have I've received like, I think the why XML question, like I mentioned is like one of the things I get most, but I also hear like this feedback from other people, which is like all of like essentially enterprises are also like very comfortable with xml, right? So I guess even within the medium technical people, it's like different cohorts of like Yeah.Technologies people are used to and you know, what they would find kind of most comfortable, et cetera. Yeah. And,Swyx: Well, you have a good shot at establishing the standard, which is pretty exciting. I'm someone who has come from a, a long background with React, the JavaScript framework. I don't know if you.And it's kind of has that approach of [00:12:00] taking a templating XML like language to describe something that was typically previously described in Code. I wonder if you took any inspiration from that? If you want to just exchange notes on anything from that like made React successful. Cuz I, I spent a few years studying that.Yeah.Shreya: I'm happy to talk about it, but I will say that I am very uneducated when it comes to front end, so Yeah, that's okay. So I might say some things that like aren't, aren't valid or like don't really, don't really map very well, but I'm gonna give it a shot anyway. So I don't know if it was React specifically.I think just this idea of marrying essentially like event handlers, like with the declarative framework. Yes. And with this idea of being able to like insert scripts, et cetera, and quote snippets into that. Like, that was super duper appealing to me. And that was like something like where you're programming with.Like Gabriels and, and Rail specifically is essentially a way to like program with large language models outside of using like just national language. Right? And so like just thinking of like what are the different like programming workflows that people typically need and like what would be the most elegant way to add that in there?I think that was an inspiration. So I basically looked at like, [00:13:00] If you're familiar with Guardrails and you know that you can insert like dynamic scripting into a rail specification, so you can register custom validators within rail. You can maybe have like essentially code snippets where things are like lists or things are like dynamically generated array, et cetera, within GAR Rail.So that kind of resonated a lot to like using JavaScript injected within like HTML files. And I think other inspiration was like I mentioned this before, but the event handlers was like something that was very appealing, how validators are configured in guardrails right now. How you tack on specific validators that's kind of inspired from like c s s and adding like style tags, et cetera, to specific Oh, inline styling.Okay. Yeah, yeah, yeah, exactly. Wow. So that was like some of the inspiration, I guess that and pedantic and like how pedantic kind of like does its validation. I think those two were probably like the two biggest inspirations while building building the current version of guardrails. Swyx: One part of the design of React is composability.Can I import a guardrails thing from into another guardrails project? [00:14:00] I see. That paves the way for guardrails package managers or libraries or Right. Reusable components, essentially. I think that'sShreya: pretty interesting. Do you wanna expand on that a little bit more? Swyx: Like, so for example, you have guardrails for a specific use case and you want to like, use that, use it in a bigger thing. And then just compose it up. Yeah.Shreya: Yeah. I wanna say that, I think that should be pretty straightforward. I'm trying to think about like, use cases where people have done that, but I think that kind of maps into like chaining or like building complex workflows generally. Right. So how I think about guardrails is that like, I.If you're doing something like chaining, you essentially are composing together these like multiple LLM API calls and you have these like different atomic units of each LLM API calls, right? So where guardrails kind of slots in is add like one of those nodes. It essentially adds guarantees, et cetera, and make sure that you know, that that one node is like water tied, et cetera, in terms of the, the output that is, that it has.So each node in your graph or tree or in your dag would essentially have like a guardrails config associated with it. And you can kind of like use your favorite chaining libraries, like nine chain, et cetera, to like then compose this further together. [00:15:00] I think I've seen like one of the first actually community projects that was like built using guardrails, like had chaining and then had like different rails for each node of that chain.Essentially,Alessio: I'm building an agent internally for us. And Guardrails are obviously very exciting because once you set the initial prompt, like the model creates its own prompts. Can the models create rails for themselves? Like, have you tried this out? Like, can they understand what the output is supposed to be and like where their ownShreya: specs?Yeah. Yeah. I think this is a very interesting question. So I haven't personally tried this out, but I've ha I've received this request you know, a few different times. So on the roadmap like seeing how this can be done, but I think in general, like in all of the prompt engineering experiments I've done, et cetera, I don't see like why with, especially with like few short examples that shouldn't be possible.But that's, that's a fun like experiment. I wanna try out,Alessio: I was just thinking about this because if you think about Baby a gi mm-hmm. And some of these projects mm-hmm. A lot of them are just loops of prompts. Yeah. You know so I can see a future [00:16:00] in which. A lot of these loops are kind off the shelf thing and then you bring your own rails mm-hmm.To make sure that they work the way you expect them to be instead of expecting the model to do everything for you. Yeah. What are your thoughts on agents and kind of like how this plays together? I feel like when you start it, people were mostly just using this for a single prompt. You know, now you have this like automated chainShreya: happening.Yeah. I think agents are like absolutely fascinating in how. Powerful they are, but also how unruly they are sometimes. Right? And how hard to control they are. But I think in general, this kind of like ties into even with machine learning or like all of the machine learning applications that I worked on there's a reason like you don't have like fully end-to-end ML applications even in you know, so I, I worked in self-driving for example, like a driveway.I at driveway you don't have a fully end-to-end deep learning driving system, right? You essentially have like smaller components of it that are deep learning and then you have some kind of guarantees, et cetera, at those interfaces of those boundaries. And then you have like other maybe more deterministic competence, et cetera.So essentially like the [00:17:00] interesting thing about the agent framework for me is like how we will kind of like break this up into smaller tasks and then like assign those guarantees kind of at e each outputs. It's a problem that I've been like thinking about, but it's also like frankly a hard problem to solve because you're.Because the goals are auto generated. You know, there's also like the, the correctness criteria for those goals also needs to be auto generated, right? Which is like a little bit antithetical to you knowing ahead of time, like, what, what a correct output for me for a developer or for your application kind of looking like.So I think like that's the interesting crossroads. But I do think, like with that said, I think guardrails are like absolutely essential for Asian frameworks, right? Like partially because like, not just making sure they're like constrained and they're safe, et cetera, but also, frankly, to just make sure that they're doing what you want them to do, right?And you get the right output from them. So it is a problem. Like I'm, I'm thinking a bunch about, I think just, just this idea of like, how do you make sure that it's not it's not just models checking each other, but there's like some more determinism, some more notion of like guarantees that can be backed up in there.I think like that's [00:18:00] the, that would be like super compelling to me, and that is kind of like the solution that I would be interested in putting out. But yeah, it's, it's something that I'm thinking about for sure. I'mSwyx: curious in the scope of the problem. I feel like we need to. I think a lot of people, when they hear about AI progress, they always assume that, oh, that just if it's not good now, just wait a year later.And I think obviously, I think that's something that you have to think about as well, right? Like how much of what guardrails is gonna do is going to be Threatens or competed with by GC four having 32,000 context tokens. Just like what do you think are like the invariables in model capabilities that you're betting on versus like stuff that you would not bet on because you just expected to get better?Yeah.Shreya: Yeah. I think that's a great question, and I think just this way of thinking about invariables, et cetera is something that is very core to how I've been thinking about this problem and like why I also chose to work on this problem. So, I think again, and this is like guided by some of my past experience in machine learning and also kind of like looking at like how these problems are, how like other applications that I've had a lot [00:19:00] of interest, like how some of the ML challenges have been solved in there.So I think like context, like longer context, length is going to arrive for sure. We are gonna start saying we're already seeing like some, some academic papers and you know, we're gonna start seeing a lot more of them like translated into actual applications.Swyx: This is the new transformer thing that was being sent around with like a millionShreya: context.Yeah. I also, I think my my husband is a PhD student you know, at Stanford and then his lab also does research basically in like some of the more efficient architectures for Oh, that'sSwyx: a secret weapon for guard rails. Oh my god. What? Tell us more.Shreya: Yeah, I think, I think their lab is pretty exciting.This is a shouted to the hazy research lab at Stanford. And yeah, I think like some of, there's basically some active research there about like, basically looking into like newer architectures, like not just transform. Yeah, it might not be the most I've been artifact more architecture.Yeah, more architectural research that allows for like longer context length. So longer context, length is arriving for sure. Yeah. Lower latency lower memory efficiency, et cetera. So that is actually some of my background. I worked in that in my previous jobs, something I'm familiar with.I think there's like known recipes for making [00:20:00] this work. And it's, it's like a problem like once, essentially it's a problem of just kind of like a lot of experimentation and like finding exactly what configurations kind of get you there. So that will also arrive, both of those things combined, you know will like drive down the cost of running inference on these models.So I, all of those trends are coming for sure. I think the trend that. Are the problem that is not solved by these trends is the problem of like determinism on machine learning models, like fundamentally machine learning models, deep learning models specifically, like are impossible to add guarantees on even with temperature zero.Oh, absolutely. Even with temperature zero, it's not the same as like seed equals zero or seed equals like a fixed amount. Mm-hmm. So even if with temperature zero with the same inputs, you run it multiple times, you'll essentially see that you don't get the same output multiple times. Right.Combined with this, System where you don't even actually own the model yourself, right? So the models are updated from under you all the time. Like for building guardrails, like I had to do a bunch of prompt engineering, right? So that users get like really great structured outputs, like share of the bat [00:21:00] without like having to do any work.And I had this where I developed something and it worked and then it ended up like for some internal model version, updated, ended up like not being functional anymore and I had to go back to the drawing board and you know, do that prompt engineering again. There's a bit of a digression, but I do see that as like a strength of guardrails in that like the contract that I'm providing is not between the user.So the user has a contract with me essentially. And then like I am making sure that we are able to do prompt engineering to get like the output from the LLM. And so it kind of like takes away a lot of that burden of having to figure that out for the user, right? So there's a little bit of a digression, but these models change all the time.And temperature zero does not equal like seed zero or fixed seed rather. And so even with all of the trends that we're gonna see arriving pretty soon over the next year, if not sooner, this idea of like determinism reproducibility is not gonna change, right? Ignoring reproducibility is a whole other problem of like the really, really, really long tail of like inputs and outputs that are not covered by, by tests and by training data, [00:22:00] et cetera.And it is like virtually impossible to cover that. You kind of like, this is not simply a problem where like, Throwing more data at the model is going to solve. Right? Yeah. Because like, people are building like genuinely really fascinating, really amazing complex applications and like, and these are just developers, like users are then using those applications in many diverse complex ways.And so it's hard to figure out like, what if you get like weird way word prompts that you know, like aren't, that you didn't kind of account for, et cetera. And so there's no amount of like scaling laws essentially that kind of account for those problems. They can be like internal guardrails, et cetera.Of course. And I would be very surprised if like open air, for example, like doesn't have their own internal guardrails. You can already see it in like some, some differences for example, like URLs like tend to be valid URLs now. Right. Whereas it really Yeah, I didn't notice that.It's my, it's my kind of my job to like keep track of, keep it, yeah. So I'm sure that's, If that's the case that like there's some internal guard rails, and I'm sure that that would be a trend that we would kind of see. But even with that there's like a ton of use cases and a [00:23:00] ton of kind of like application areas where like there's different requirements from different types of guard rails are valuable in different requirements.So this is a problem essentially that would be like, harder to solve or next to impossible to solve with just data, with just scaling up the models. So you would need kind of this ensemble basically of, of LLMs of like these really powerful models along with like deterministic guarantees, rule-based heuristics, et cetera, more traditional you know machine learning tools and like you ensemble all of these together and you end up getting something that you know, is greater than the sum of it.Its parts in terms of what it's able to do. So I think like that is the inva that I'm thinking of is like the way that people would be developing these applications. I will followSwyx: up on, on that because I'm super excited. So when you sent mentioned you have people have a contract with guardrails.I'm actually looking at the validators page on your docs, something, you have something like 20 different contracts that people can have. I'll name some of them just just so that people can have an, have an idea, but also highly encourage people to check it out. Is profanity free, is a, is a good one.Bug-free Python. And that's, that's also pretty, [00:24:00] pretty cool. You have similar to document and extracted summary sentences match. Which I think is, is like don't hallucinate,Shreya: right? Yeah. It's, it's essentially making sure that if you're generating summaries the summary should be very faithful.Yeah. Should be like citable attributable, et cetera to the source text.Swyx: Right. Valid url, which we talked about. Mm-hmm. Maybe open AI is doing a little bit more of internally. Mm-hmm. Maybe open AI uses card rails. You don know be a great endorsement. Uhhuh what is surprisingly popular and what is, what do you think is like underrated?Out of all your contracts? Mm-hmm.Shreya: Mm-hmm. Okay. I think that the, well, not surprisingly, but the most obvious popular ones for me that I've seen are like structure, structure type, et cetera. Anything that kind of guarantees that. So this isn't specifically in the validators, this is essentially like part of the gut, the core proposition.Yeah, the core proposition. I think that is like very popular, but that's also kind of like the first order. Problem that people are kind of solving. I think the sequel thing, for example, it's very exciting because I had just released this like two days ago and then I already got some inbound with like people kinda swapping, like building these products and of swapping it out internally and you know, [00:25:00] getting a lot of value out of what the sequel bug-free SQL provides.So I think like the bug-free SQL is a great example because you can see like how complex these validators can really go because you end up seeing like bug-free sql. What it does is it kind of like takes a connection string or maybe a, a schema file, et cetera. It creates a sandbox SQL environment for you, like from that.And it does that at startups so that like every time you're getting like a text to SQL Query, you're not having to do pay that cost time and time again. It takes that query, it like executes that query on that sandbox in that sandbox environment and then sees if that query is executable or not.And then if there's any errors that you know, like. Packages of those errors very nicely. And if you've configured re-asking it sends it back to the model and you know, basically make sure that that like it tries to get corrected. Sequel. So I think I have an example up there in the docs to be in there, like in applications or something where you can kind of see like how it corrects like weird table names, like weird predicates, et cetera.I think there's other kind of like, You can build pretty complex systems with this. So other things in there are like it takes [00:26:00] information about your database and then injects it into the prompt with like, here's the schema of this table. It automatically, like given a national language query, it finds like what the most similar examples are from the history of like, serving this model and like injects those into the prompt, et cetera.So you end up getting like this very kind of well thought out validator and this very well thought out contract that is, is just way, way, way better than just asking in plain English, the large language model to give you something, right? So I think that is the kind of like experience that I wanna provide.And I basically, you'll see more often the package, my immediateSwyx: response is like, that's cool. It does more than I thought it was gonna do, which is just check the SQL syntax. But you're actually checking against schema, which is. Highly, highly variable. Yeah. It'sShreya: slow though. I love that question. Yeah. Okay.Yeah, so I think like, here's where this idea of like, it doesn't have to be like, you don't have to send every request to your L so you're sampling. Okay. So you can essentially figure out, so for example, like there's like how what guardrails essentially does is there's like corrective actions and re-asking is like one of those corrective actions, [00:27:00] right?But there's like a ton other ways to handle it. Like there's maybe deterministic fixes, like programmatic fixes, there's maybe default values. There's this doesn't work like quite work for sql, but if you're doing like a bunch of structured data and if you know there's an invalid value, you can just filter it or you can just refrain from asking, et cetera.So there's a ton of ways where you can like, just handle errors more gracefully. And the one I kind of wanna point out here is programmatically fixing something that is wrong, like on, on the client side instead of just sending over another request. To the large language model. So for sql, I think the example that I talked about earlier that essentially has like an incorrect table name and to correct the table name, you end up sending another request.But you can think about like other ways to handle disgracefully, right? Like essentially looking at essentially a fuzzy matching with like the existing table names in the repository and in, in the database. And you know, like matching any incorrect names to that. And so you can think of like merging this re-asking thing with like, other error handling things that like smaller, easier errors are able, you can handle them programmatically by just Doing this in like the more patching, patching or I, I guess the more like [00:28:00] classical ML way essentially, like not the super fancy deep learning is like, I think ML 2.0.But like, and this, I, I've been calling it like ML 3.0, but like, even in like ML 1.0 ways you can like, think of how to do this, right? So you're not having to make these like really expensive calls. And so that builds a very powerful system, right? Where you essentially have this, like, depending on what your error is, you don't like, always use G P D three or, or your favorite L M API when you don't need to, you essentially are able to like combine these like other ways, other error handling techniques, like very gracefully so that you get correct outbursts, validated outbursts, and you get them for cheap and like faster, et cetera.So that's, I think there's some other SQL validation things that are in there. So I think like exclude SQL Predicates. Yeah, exclude SQL Predicates. And then there's one about columns that if like some columns are like sensitive columnSwyx: prisons. Yeah. Yeah. Oh, just check if it's there.Shreya: Check if it's there and you know, if there's like only certain columns that you wanna show it to the user and like, maybe like other columns have like private data or sensitive data you know, you can like exclude those and you can think of doing this on the table level.So this is very [00:29:00] easy to do just locally. Right. Like, so there's like different ways essentially to kind of like handle this, which makes for like a more compelling way to build theseSwyx: systems. Yeah. Yeah. By the way, I think we're proving out why. XML was a better choice than SQL Cause now, now you're wrapping sql.Yeah. Yeah. It's pretty cool. Cause you're talking about the text to SQL application example that you put out. It actually puts something, a design choice that isn't talked about very much in center focus, which is your logs. Your logs are gorgeous. I'm sure that took work. I'm sure that's a strong opinion of yours.Yeah. Why do you spend so much time on logs? Just like, how do you, how do you think about designing these things? Should everyone do it this way? What are the drawbacks? Like? Is any like,Shreya: yeah, I'm so excited about this idea of logs because you know, you're like, all of this data is like in there for free, right?Like if you're, if you're do like any validation that is run, like essentially in memory, and then also I write it out to file, et cetera. You essentially get like this you get a history of this was the prompt that was run. This was the this was the L raw LLM output. This was the validation that was run.This was the output of those validations. This [00:30:00] was any corrective actions, et cetera, that were taken. And I think that's like very, like as a developer, like, I'm so happy to see that I use these logs like personally as well.Swyx: Yeah, they're colored. They're like nicely, like there's like form double borders on the, on the logs.I've never seen this in any ML tooling at all.Shreya: Oh, thanks. Yeah. I appreciate it. Yeah, I think this was mostly. For once again, like solving my own problems, which is like, I was building a lot of these things and you know, doing a lot of dog fooding and doing a lot of application building like in notebooks.Yeah. And so in a notebook I wanted to kind of see like what the easiest way to kind of interact with it was. And, and that was kind of what I ended up building. I really appreciate that. I think that's, that's very nice to, nice to hear. I think I'm also thinking about what are, what are interesting ways to be able to like whittle down very deeply into like what kind of went wrong or what is going right when you're like running, running an application and like what the nice kind of interface to design that would be.So yeah, thinking about that problem. Don't have anything on there yet, but, but I do really like this idea of really as a developer you're just like, you really want like all the visibility you can get into what's, [00:31:00] what's happening right. Under the hood. And I wanna be able to provide that. Yeah.Yeah.Swyx: I mean the, the, the downside I'll point out just quickly cuz we, we should, we should move on is that this is not machine readable. So like, how does it work with like a Datadog or, you know? Yeah,Shreya: yeah, yeah, yeah. Well, we can deal with that later. I think that's that's basically my answer as well, that I, I'll do, yeah.Problem for future sreya, basically.Alessio: Yeah. You call Gabriel's SLAs for l m outputs. You know, historically SLAs are pretty objective there's the five nines availability, things like that. How do you build them in a sarcastic system when, say, my queries, like draft me a marketing article. Mm-hmm. Like, Have you read an SLA for something like that?Yeah. But in terms of quality and like, in terms of we talked about what's slow and like latency, like Hmm. Sometimes I would read away more and I, and have a better copy of like, have you thought about what are like the, the access of measurement for some of these things and how should people think about it?Shreya: Yeah, the copy example is interesting because [00:32:00] I think for any of these things, the SLAs are purely on like content and output, not on time. I don't guardrails I don't think even can make any guarantees on the time that it'll take to make these external API calls. But like, even within quality, it's this idea of like, if you're able to communicate what you desire.Either programmatically or by using a model in the loop, then that is something that can be enforced, right? That is something that can be validated and checked. So for example, like for writing content copy, like what's interesting is like for example, if you can break down the copy that you wanna write into, like this is a title, this is maybe a TLDR description, this is a more detailed take on the, the changes or the product announcement, et cetera.And you wanna hit like maybe three, like some set of points in there. So you already kind of like start thinking of like, what was a monolith of like copy to you in, in terms of like smaller building blocks, et cetera. And then on those building blocks you can essentially like then add like certain guarantees.So you can say that let's say like length or readability is a [00:33:00] guarantee. So some of the updates that I pushed today on, on summarization and like specific guards for summarization, one of them essentially was that like the reading time for the summary should be within like some certain amount, right?And so that's like you can start enforcing like all of those guarantees, like on each individual block. So I think like, Some of those things are. Naturally harder to do and you know, like are harder to automate ways. So essentially like, does this copy, I don't know, is this witty or something, right. Or is this Yeah.Something that I guess like the model doesn't have a good idea for, but like other things, as long as you can kind of like enforce them and like check them either via model or programmatically, it's something that you can like start building some some notion of like guarantees around. Yeah.Yeah. So that's why I think about it.Alessio: Yeah. This is super interesting because right now a lot of products are kind of the same because all I do is they call it the model and some are prompted a little differently, but you can only guess so much delta between them in the future. It's be, it'll be really interesting to have products differentiate with the amount of guardrails that they give you.Like you already [00:34:00] see that, Ooh, with open AI today when some people complain that too many of the responses have too much like, Well actually in it where it's like, oh, you ask a question, it's like, but you should remember that's actually not good. And remember this other side of the story and, and all of that.And some people don't want to have that in their automated generation. So, yeah. I'm really curious, and I think to Sean's point before about importing guardrails into products, like if there's a default amount of guardrails that you have and like you've being the provider of it, like that's really powerful.And then maybe there's a faction that is against guardrails and it's like they wanna, they wanna break out, they wanna be free. Yeah. So it's a. Interesting times. Yeah.Shreya: I think to that, like what I, I was actually chatting with someone who was building some application for content creators where like authenticity you know, was a big requirement, like of what they cared about in the right output.And so within authenticity, like why conventional models were not good for them is that they already have a lot of like quote unquote guardrails right. To, to I guess like [00:35:00] appeal to like certain certain sections of the audience to essentially be very cleaned up and then that was like an undesirable trade because that, for them, like, almost took away from that authenticity, et cetera.Right. So I think just this idea of like, I guess like what a guardrail means is like so different for different applications. Like I, I guess like I, there's like about 20 or so things in there. I think there's like a few more that I've added this morning, which Yes. Which are not Yeah. Which are not updated and then in the end.But there's like a lot of the, a lot of the common workflows, like you do have an understanding of like what the right. I guess like what is an appropriate constraint for this? Right. Of course, things like summarization, four things like text sequel, but there's also like so many like just this wide variety of like applications, which are so fascinating to learn about where you, you would wanna build something in-house, which is like your, so which is your secret sauce.And so how Guardrail is kind of designed or, or my intention with designing is that here's this way of breaking down what this problem is, right? Of like getting some determinism, getting some guarantees from your LM outputs. [00:36:00] And you can use this framework and like go crazy with it. Like build whatever you want, right?Like if you want this output to be more authentic or, or, or less clean or whatever, you can like add that in there, like making sure that it does have maybe some profanity and that's a desirable output for you. So I think like the framework side of it is very exciting to me as this, as this way of solving the problem.And then you can build your custom validators or use the ones that I provide out of the box. Yeah. Yeah.Alessio: So chat plugins, it's another big piece of this and. A lot of the integrations are very thin specs and like a lot of prompting, for example, a lot of them are asking to not mention the competitors. I think the Expedia one said, please do not mention any other travel website on the internet.Do not give any other alternative to what we do. Yeah. How do you see all these things come together? Like, do you see guardrails as something that not only helps with the prompting, but also helps with bringing external data into these things, and especially with agents going on any website, do you see each provider having like their own [00:37:00] guardrail where it's like, Hey, this is what you can expect from us, or this is what we want to provide?Or do you think that's, that's not really what, what you're interested in guardrailsShreya: being? Yeah, I think agents are a very fascinating question for me. I don't think I like quite know what the right, who the right owner for this guardrail is. Right. And maybe, I don't know if you guys wanna keep this in there or like maybe cut this front of my answer out, up to, up to you guys.I'm, I'm fine either way, but I think like that problem is, A harder problem to solve just from like a framework design perspective as well. Right. I think this idea of like, okay, right now it's just in the prompt, like don't mention competitors, et cetera. Like that is exactly that use case.Or I feel like, okay, if I was that business owner, right, and if I wanted to build this application, like, is that sufficient? There's like so much prompt injection, right? And you can get, or, or just so much like, just like an absolute lack of guarantees. Like, and, and it's hard to even detect that this is happening.Like let's say I have this running in production and then turns out that there was like some sort of leakage, et cetera, and you know, like my bot has actually been talking about like all of my competitors forever, [00:38:00] right? Like, that's a, that's a substantial risk. And so just this idea of like needing this like post-hoc validation to ensure deterministically that like it does what you want it to do is like, just so is like.As a developer putting myself in the shoes of like people building business applications like that is what gives me like peace of mind, right? So this framework, I think, like applies very well within those settings.Swyx: I'll go right into, we're gonna broaden out a little bit into commentary on other parts of the ecosystem that might, that might be interesting.So I think you and I. Talks briefly about this, but I think the, the broader population should know about it, which is that you also have an LLM API wrapper. Mm-hmm. So, such that the way, part of the way that guardrails works is you in, inject part of the few shot example into the prompt.Mm-hmm. And then you also do re-asking in all the other stuff post, I dunno what the pipeline is in, in, in your terminology. So essentially you have an API wrapper for open ai.completion.com dot create. But so does LangChain, so does Hellicone so does everyone I can name like five other people who are all fighting essentially for [00:39:00] the base layer, LLM API wrapper.Mm-hmm. I think this is valuable real estate, but I don't know how you like, think about working with other people or do you wanna be the base layer, likeShreya: I feel pretty collaboratively about it. I also feel like there's, like lang chain is doing like, it's so flexible as a framework, right?Like you can solve so many of your problems in there. And I think like it's, I, I have like a lang chain integration. I have a GPT Index / Llama integration, et cetera. And I think my view on this is that I wanna integrate with everybody. I think it is valuable real estate. It's not personally real estate that I'm interested in.Like you can essentially bring the LLM callable or the LLM API that's in there. It's just like some stub of a function that you can just add your favorite thing in there, right? It just, the only requirement is that string in first string output, that is all the requirement. And then you can bring in your own favorite component from your own favorite library in order to do that.And so, yeah, it's, I think like I'm pretty focused on this problem of like what is the guardrail that you would wanna build for a certain applications? So it's valuable real estate. I'm sure that people don't own [00:40:00] it.Swyx: It's, as long as people give you a way to insert your stuff, you're good.Shreya: Yeah, yeah. Yeah. I do think that, like I've chat with a bunch of people and then different applications and I do think that the abstractions that I have haven't failed me yet. Like it is very flexible. It is very easy to slot in into any workflow. Yeah.Swyx: I would love to ask about the meta elements of working on guardrails.This is your first company, but you launched five things this morning. The pace of the good AI projects that I've seen out there, like LangChain launches 10 things a week or whatever, I don't know. Surely that's something that you prioritize. How do you, how do you think about like, shipping versus like going going back and like testing and working in community and all the other stuff that you're managing?How do you prioritize? Shreya: That’s such a wonderful question. Yeah. A very hard question as well. I don't know if I would have a good answer for this. I think right now it's instinctive. Like I have a whole kind of stack ranked list of like things I wanna do and features I wanna build and like, support, et cetera.Combined with that is like a feature request I get or maybe some bugs, et cetera, that folks report. So I'm pretty focused on like any failures, any [00:41:00] feature requests from the community. So if those come up, I th those tend to Trump like anything else that I'm working on. But outside of that I have like this whole pool of ideas and like pool of features I wanna build and I kind of.Constantly kind of keep stack ranking them and like pushing something out. So I'm spending like I'm thinking about this problem constantly and as, as a function of that, I have like a ton of ideas for like what would be cool to build and, and what would be the right way to like, do certain things and yeah, wanna basically kind of like I keep jotting it down and keep thinking of like every time I cross something off the list.I think about like, what's the next exciting thing to work on. I think simultaneously with that we mentioned that at the beginning of this conversation, but like this idea of like what the right interface for rail is, right? Like, is it the xl, is it code, et cetera. So I think like those are like fundamental kind of design questions and I'm you know, collaborating with folks and trying to figure that out now.And yeah, I think that's like a parallel project that I'm hoping that yeah, you'll basically, that we'll be out soon. Like in termsSwyx: of the levers, how do you, like, let's just say in like a typical week, is it like 50% [00:42:00] calls with partners mm-hmm. And potential users and just understanding your use cases and the 50% building would you move that, that percentage anyway anywhere?Would you add in something that's significant?Shreya: I think it's frankly very variable week to week. So, yeah. I think early on when I released Guardrails I was like, here's how I'm thinking about this problem. Right? Yeah. Don't need anyone else. You just no, but actually to the contrary, it was like, this is like, I'm very opinionated about like what the right way to solve this is.And this is all of the problems I've thought about and like, and I know this framework maps well to these sets of problems, right? What are your problems? Like there's this whole other like big population of people that are building and you know, I basically wanna make sure that I have like user empathy and I have like I'm able to understand what people are doing and like make sure the framework like maps well.So I think I did a lot of that, like. Immediately after the release, like talking to a lot of teams and talking to a lot of users. I think since then, I basically feel like I have a fair idea of like, you know what's great about it, what's mediocre about it, and what's like, not good about it? And that helps kind of guide my prioritization list of like what I [00:43:00] wanna ship and what I wanna build.So now it's more kind of like, I would say, yeah, back to being more, more balanced. Alessio: All the companies we work with that are in open source, I always try and have them think through open source as a distribution model. Mm-hmm. Or like a development model. I was looking in the contributors list, and you have by far the most code, the second largest contributor. It's your husband. And after that it kind of goes, goes or magnitude lower. What have you found kind of working in, in open source in like a very fast moving project for, for the first time? You know, it's a, like with my husband, it's the community. No, no. It's the, it's the community like, A superpower to you?Do you feel like, do you feel like having to explain why you're doing things a certain way, like getting people buy in is maybe slowing you down when things move so quickly? I'm, I'm always interested to hears people's thoughts.Shreya: Oh that's a good question. I think like, there's part of like, I think guardrails at that stage, right?You know, I have like feature requests and I have [00:44:00] contributors, but I think right now, like I'm doing the bulk of like supporting those feature requests, et cetera. So I think a goal for me, and I remember we chatted about this as well you know, when we, when we spoke last, we're just like, okay.You know, getting into that point where, yeah, you, you essentially like kind of start nurturing and like getting more contributions from like the open source. So I think like that's one of the things that yeah. Is kind of the next goal for me. Yeah, it's been pretty. Fun. I, I would say like up until now, because I haven't made any big breaking a API changes, et cetera, so I haven't like, needed that community input.I think like one of the big ones that is coming right now is like the code, right? Like the code first, a API for creating rails. So I think like that was kind of important for like nailing that user experience, et cetera. So the, so the collaborators that I'm working with, there's basically an an R F C and community input, et cetera, and you know, what the best way to do that would be.And so that's actually, frankly, been like pretty fun as well to see the community be like opinionated about like, here's how I'm doing it and like, this works for me, this doesn't work for me, et cetera. So that's been like new for me as well. Like, I [00:45:00] think I am my previous company we also had like open source project and it was built on open source, but like, this is the first time that I've created a project with an open source project with like that level of engagement.So that's been pretty fun.Swyx: I'm always curious about like potential future business model, modern sensation,Shreya: anything like that. Yeah. I think I'm interested in entrepreneurship generally, honestly, trying to figure out like what the, all of those questions, right?Like business model, ISwyx: think a lot of people are in your shoes, right? They're developers. Mm-hmm. They and see a lot of energy they would like to start working on with open source projects. Mm-hmm. What is a deciding factor? What do you think people should think about when deciding whether or not, Hey, this is just a project that I maintained versus, Nope, I'm going to do the whole thing that get funding and allShreya: that.I think for me So I'm already kind of like I'm al I'm working on the open source full time. I think like the motivating thing for me was that, okay, this is. A problem that would need to get solved, like one way or another.This we talked about in variance earlier, and I do think that this is a, like being able to, like, I think if, if there's a contraction or a correction and [00:46:00] the, these LMS like don't have the kind of impact that we're, we're all hoping they would, I think it would be because of like, this problem because people kind of find that it's not as useful when it's running at very large scales when it's running in production, et cetera.So I think like that was very, that gave me a lot of conviction that it's something that I kind of wanted to work on and that was a switch for me. That it gave me the conviction to, for example, quit my job. Yeah. Also, yeah. Slightly confidential. Off the record. Off the record, yeah. Yeah.Alessio: We're not gonna talk about. Special project at Apple. That's a, that's very secret. Yeah. But you overlap Apple with Ian Goodfellow, which is obviously a, a very public figure in the AI space.Swyx: Actually, not that many people know what he did, so maybe we can, she can introduce Ian Goodfellow as well.Shreya: But, yeah, so Ian Goodfellow is the creator of Ganz or a generative adversarial network.So this was, I think I'm gonna mess up between 1215, I think 14, 15 ish if I remember correctly. So he basically created gans as a PhD student. As a PhD student. And he has a pretty interesting story of like how he thought of them and how [00:47:00] he kind of, Built the, and I I'm sure there's like interviews in like podcasts, et cetera with him where he talks about it, where like, how he got the idea for it and how he kind of like wrote the paper and did the experiments.So gans essentially were kind of like the first wave of generative images where you would see essentially kind of like fake auto-generated images, you know conditioned on like certain distributions. And so they were like very many variants of gans, like DC GAN, I'm gonna mess up the pronunciation, but dub, I'm just gonna call it w GaN.Mm-hmm. GAN Yeah. That like, you would essentially see these like really wonderful generative art. And I do think that like so I, I got the chance to work with him while at Apple. He had just moved to Apple from Google Brain and was building the cross-functional machine learning team within SPG.And I got the chance to work with him, which is very exciting. I learned so much and he is a fantastic manager and yeah, really, really enjoyed working withAlessio: him. And then he, he quit his job when they forced him to go back to the office. Right? That's theSwyx: Oh, really? Oh,Alessio: I didn't see that. Oh, okay. I think he basically, apple was like, you gotta go [00:48:00] back to the office.He said peace. That justSwyx: went toon. I'm curious, like what's some, some things that you learned from Ian that, or maybe some stories that,Shreya: Could be interesting. So there's like one, maybe machine learning specific and like one, maybe not machine learning specific and just general, like career stuff.Yeah. So the ML specific one was that well, Very high level. I think like working with him, you just truly see the creativity. And like after I worked with him, I was like, yeah, I, I totally get that. This is the the guy, like how his, how his brain works it's totally, it's so obvious that this is the guy who made like gans work basically.So I think he, when he does machine learning and when he thinks about like problems to solve, he thinks about it from a very creative out of the box way of thinking about it. And we kind of saw that with like, some of the problems where he was working on where anytime he had like feedback or suggestions on the, on the approaches that I was taking, I was like, wow, this is really exciting and like very creative and yeah, it was very, very cool to work on.So that was very high level machine learning.Swyx: I think the apple, apple standing by with like a blow dart if you, if like, say anymore.Shreya: I think the, the non-technical stuff, which [00:49:00] was I think truly made him such a fantastic manager. But when I went to Apple, I was, you know maybe a year outta school outta my job at that point.And I remember that I like most new grads was. Had like, okay, I, I need to kind of solve this problem on my own before I kind of get external help. Yeah. Yeah. And like, one of my first, I think probably my first or second week, like Ian and I, we were para programming and I remember that we were working together and like some setup issues were happening.And he would wait like exactly 45 seconds before he would like, fire up a message on Slack and like, how do I, how do I kind of fix this? How do they do this? And it just like totally transformed like, like, they're just like us, you know? I think not even that, it's that like. I kind of realized that I was optimizing for the wrong thing, right?By trying to like solve this myself. And instead of just if I'm running into a problem posting on Slack and like getting collaborative information, it wasn't that, yeah, it was, it was more the idea of my job is not like to solve this myself. My job is to solve this period.Mm-hmm. And the fastest way to solve this is the most, is the most correct way to do it. And like, [00:50:00] yeah, I truly, like, he's one of my favorite people. And I truly enjoyed working with him a lot, but that was one of my, Super early into my job there. Like I, I learned that that was You're verySwyx: lucky to do that.Yeah. Yeah. That's awesome. I love learning about the people side. Mm-hmm. You know, because that's what we deal with on a day-to-day basis, so. Mm-hmm. It's really nice to Yeah. To hear about that kind of stuff. Yeah. I was gonna go into one more academia question and then we'll go into lighting rounds.So you're close to Stanford. There'sShreya: obviously a lot of By, by my, yeah. My, my husband basically. Yeah. He doesn't have aSwyx: choice. There's a lot of interesting things coming on to Stanford, right. Vicuna, Alpaca and, and Stanford home. Are you keeping a close eye on like, the academic outputs? What are you seeing that is interesting to you?Shreya: I think obviously because of I'm, I'm focused on this problem, definitely looking at like how people are, you know thinking about the guard rails and like kind of adding more constraints.Swyx: It's such a great name by the way. I love it. Every time I see people say Guardrails, I'm like, yeah. Shreya: Yeah, I appreciate that. So I think like that is definitely one of the things. I think other ones are kind of like more out of like curiosity because of like some ML problems that I worked on in the past. Like I, [00:51:00] I mentioned that I worked on a efficient ml, so looking into like how people are doing, like more efficient inference.I think that is very fascinating to me. Mm-hmm. So, yeah, looking into that. I think evaluation helm was pretty exciting, really looking forward to like longer context length and seeing what's possible with that. More better fine tuning with like maybe lower data, et cetera. I think those are all some of the themes that I'm interested in.Swyx: Yeah. Yeah. Okay. So just because you have more expertise with efficiency, are you talking about quantization? Are you talking about pruning? Are you talking about. Distillation. I doShreya: think that the right way to solve these problems is always like to a mix. Yeah. A mix. Everything of them and like ensemble, all of these methods together.So I think, yeah, basically there's this like constant like tug of war and like push and pull between adding like some of these colonization for example, like improved memory, improved latency, et cetera. But then immediately you get like a performance hit, right? So like there's this like balance between like making it smaller and making it more efficient, but like not losing out on like what that performance is.And it's a big kind of experimentation framework. It's like understanding like where the bottlenecks are. So it's very, it's [00:52:00] very. You know, exploratory and experimental in nature. And so it's hard to kind of like be prescriptive about this is exactly what would work. It like, truly depends, like use case to use case architecture to architecture, hardware to hardware, et cetera.Yeah. WannaAlessio: jump into lightning round? Yeah. You ready?Shreya: I, IAlessio: hope so. Yeah. So we have five questions. Mm-hmm. And yeah, just respond in a sentence or two. Sean sometimes has the follow up tendency to follow up questions. The light. Yeah. You wanna get more info, which is, which is be ready. So the first one we always ask is what's your favorite AI product?Shreya: Very boring answer, but co-pilot life changing. Yeah. Yeah. Absolutely. Love it. Yeah.Swyx: Surprisingly not that many people have called out copilot in Oh, really? In our interviews. Cuz everyone's going to arts, like, they're like mid journeys, they will diff stuff. I see. Gotcha. But yeah, co-pilot is is great.Underrated. Yeah. It's still for $10 a month.Shreya: I mean, why not? Yeah. It's, it's, it's so wonderful.Swyx: I'm looking forward to co-pilot X, which is sort of the next iteration. Yeah.Shreya: I was testing on my co-pilot, so I [00:53:00] just got upgrade my laptop and then setting up vs code. And then I got co-pilot labs, I think is it?Or experimental. Yeah. Even that like Yes. Brushes and stuff. Yeah. Yeah. Yeah.Swyx: That was pretty cool. Talk to Amelia, who works on GitHub next. They, they build copilot labs and there's the voice component, which I don't know if you've tried. Oh, I, I stick whisper with co-pilot.Shreya: I see. It's just like your instructions and, yeah.Yeah. Oh,wellSwyx: also I have rsi. Mm-hmm. So actually sometimes it, it hurts when I type. I So, see it's actually super helpful to talk to your,Shreya: ah, interesting. Okay. Id, yeah, it's pretty, yeah. I, it was, Playing around with it yesterday, I was like, wow, this is so cool.Swyx: Yeah. Next question. What is something you thought would take much longer than, but it's already here.Like this is an acceleration question.Shreya: Let's see. Yeah, maybe this is getting like too developer focused too. Code focused. It's, but I, I do think like a lot of the auto generating code stuff is is really freaking cool. And I think especially if combine it with like maybe testing, right? Mm-hmm.Where you have like code and then you have like test to make sure the code work. And like you have this like, kind of like iterative loop until you refinement, until you're able to kind of [00:54:00] like self-heal code or like automatically generate code. I think like that is superSwyx: fascinating to you. Are you referring to some productsShreya: or demos that Actually I wouldn't give a, a plug for like basically this GitHub action called AutoPR, which like one of my community contributors kind of built using guardrails.And so the idea of what auto PR does is it takes a GitHub issue and if you have the right label for it, it automatically triggers this action where you create a PR given the issue text, et cetera. Huh? Yeah. Oh, it's so cool. It's, so your issue is the prompt. Yeah. Amongst like, other things other like Other context that you don't like?I'm gonna try this out right now. Yeah. Yeah. This is crazy. Yeah, it, it's, it's really cool. So I think like these types of workflows, it will take time before we can use them seamlessly, but Yeah. Truly very fascinating. Alessio: There's another open source project called a Wolverine by BiobootloaderYeah. Yeah, it's cool. It's really cool. It's basically like self-healing code. Yeah. You just let it run and then it makes a mistake and runs in a REPL, takes the code and ask it to just give you the diff and [00:55:00] like drops out the code and runs it again. It justSwyx: automates what I do anyway. Exactly.Alessio: So we can focus on the podcast.Shreya: This is one of the things that won't be automated away. Yeah. I think like, yeah, I, I saw over bringing, I think it was pretty cool and I think I'm very excited about that problem also because if you can think about it as like framing it within the context of these validators, et cetera, right?Like I think so bug-free sequel. What that does is like exactly that workflow of like generates code, executes, it takes failures, re-ask, et cetera. So implements that whole workflow like within a validator. Yeah. Swyx:The future is here.Alessio: Well, this kind of ties into the next question.A year from now, what will be will be the most surprised by in AI?Shreya: Hmm. Yeah. Not to be a downer, but I do think that like how hard it is to truly take these things to production and like get consistently amazing user experiences from it. But I think like this, yeah, we're at that stage where there's basically like a little bit of a gap between like what, what you kind of [00:56:00] see as being very exciting.And I think it's like, it's a demonstration of what's possible with this, right? But like, closing that gap between like what's possible versus like what's consistently deliverable. I think it's, it's a harder problem to solve. So I do think that it's gonna take some time before all of these experiences are like absolutely wonderful.So yeah, I think like a year from now we'll kind of like find some of these things taking a little bit longer than expected.Swyx: Request for startups or request for product. What's an AI thing you would pay for if somebodyShreya: built it? I think this is already exists and I just kind of maybe have to hook it up, et cetera, but I would a hundred percent pay for this, like emails.Emails in my tone. Oh, I see. Yeah, no, keep yeah,Swyx: emails, list your specs. Like what, what should it do? What should IShreya: not do? Yeah. I think like, I basically have an idea always of like this is tldr what I want this email to say. Sure. I want it to be in my tone so that it's not super formal, it's not super like lax, et cetera.I want it to be like tours and short and I want it to like I wanted to have context of like a previous history and maybe some [00:57:00] other like links, et cetera that I'm adding. So I wanted to hook it up to like, some of my data sources and do that. I think that would, I would like pay Yeah.Good money for that every month. Yeah. Nice.Alessio: I, I bill one the only as the, the email trend as the context, but then as a bunch of things like For example, for me it's like if this company is not in the developer tool space, I'm gonna pass on it. So direct to pass email, if the person is asking to schedule, please ask them to send them to send me their calendarly so I can pick a time from there.All these different things I see. But sometimes it's a new thread with somebody you already spoken with a bunch of times, so it should pull all of that stuff too. But I open source all of it because I don't want to deal with storing peoples email. It'sShreya: like the, the hardest thing. Do you find that it does tone well?Like does it match your tone or doesAlessio: it I have to use right now public figures as a I see thing. So it, I do things like write like Paul Graham or write or like, people that are like, have a lot of variety. Oh, that's actually pretty cool. Yeah. You know? Yeah. Yeah. It works pretty well. I see. Nice.There's some things Paul Graham would not [00:58:00] say that it writes in the, in the emails, but overall I would say probably like 20% of the drafts it creates are like, Usually good to go, like 70% it needs some work. And then there's like the 10% that is like, I have no idea why you just said that. It's completely like out of left field.I see. Yeah. But it will, it'll get better if I spend more time on it. But you know, it kind of adds up because I use G B D four, I get a lot of emails, so like having an autodraft responses for everything in my inbox, it, it adds up. So maybe the pattern of having, based on the label you put on the email to auto generate, it'sShreya: it's good.Oh, that's pretty cool. Yeah. And actually, yeah, as a separate follower, I would love to know like all of the ways it messes up and, you know if we get on guard, let's talk about it now. Let's,Swyx: yeah. Sometimes it doesn't, your project should use guardrails.Alessio: Yeah. No, no, no. Definitely. I think sometimes it doesn't understand the, the email is not a pitch, so somebody emails me something that's like unrelated and then it's like, oh, thank you.[00:59:00]But since you're not working in the space, I'm not gonna be investing in you. But good luck with the rest of your fundraise. But it's like, never mention a fundraise, but because in the prompt, it, as part of the prompt is like, if it's a pitch and it's not in the space, a pre-draft, an email, it thinks it has to do it a lot more than it should.Or like, same with scheduling somebody you know, any sales call that, any sales email that I get, it always wants to schedule a call with them. And I was like, I don't wanna meet with them, I don't wanna buy this thing. But the, the context of the email is like, they wanna schedule something so the responders you know, is helping you schedule, but it doesn't know that I don't want to, doesShreya: it like autodraft all, like is there any input that you give for each email or does it autodraft everything?Alessio: I just give it the tread and then a blank blank slate. I don't give it anything else because I wanted to run while I'm not in the inbox, but yours. It's a little better. What I'm doing is draft generation. What you wanna do is like draft expansion. So instead of looking at the [01:00:00] inbox in your case, you will look at the draft folder and look through each draft and expend the draft.Yeah, to be a full response, which makes a lot of sense.Shreya: Yeah, that's pretty interesting. I, I can think of like some guardrails that I can know quick, quick and dirty guardrails that I can hook up that would make some of those problems like go away. Yeah. Yeah,Swyx: like as in do they existShreya: now or they don't exist?They don't exist now, but I can like, think about like, I'm like always looking for problems so yeah. This is aSwyx: API design issue, right? Because if, if one conversation, you come away with like three guardrails and then another conversation, you come, none of three guardrails. How do you think about like, there's so many APIs that you could possibly do, right?You need to design for generally composable orShreya: reusable APIs. Yeah, so I would probably like break this down into like, like a relevant action item guardrail or something, right? And it's basically like essentially only talk about, or only like the action items should only be things that are within the context of those emails.And if something hasn't been mentioned, don't add context about that. So that would probably be a generic gar that I could, I could add. And then you, you could probably configure it with like, what are the sets of like [01:01:00] follow up action items that you typically have and, and correct for it that way.Swyx: We, we just heard a new API being designed live, which doesn't happen very often.Shreya: It's very cool. Yeah. AndAlessio: last but not least, if there's one thing you want people to take away about AI and kind of this moment that we're in, in technology, what would that be?Shreya: I do think this is the most exciting time in machine learning, as least as long as I've been working on it.And so I do think, like, frankly, we're all just so lucky to kind of be living through this and it's just very fascinating to be part of that. I think at the same time the technology is so exciting that you, you get like, Driven by wanting to use it. But I think like really thinking about like what's the best way to use it along with like other systems that have existed so that it's more kind of like task focused and like outcome focused rather than like technology focused.So this kind of like obviously I'm biased because I feel this way because I've designed guardrails this way that it kind of like merges LLMs with rules and heuristics and like traditional ML, et cetera. But I do think [01:02:00] that like this, this general framework of like thinking about how to build ML products is something that I'm bullish on and something I'd want people to like think about as well.Yeah.Alessio: Awesome. Well thank you so much for comingShreya: Yeah, absolutely. Thanks for inviting me. Get full access to Latent Space at www.latent.space/subscribe
undefined
May 8, 2023 • 51min

The AI Founder Gene: Being Early, Building Fast, and Believing in Greatness — with Sharif Shameem of Lexica

Thanks to the over 42,000 latent space explorers who checked out our Replit episode! We are hosting/attending a couple more events in SF and NYC this month. See you if in town!Lexica.art was introduced to the world 24 hours after the release of Stable Diffusion as a search engine for prompts, gaining instant product-market fit as a world discovering generative AI also found they needed to learn prompting by example.Lexica is now 8 months old, serving 5B image searches/day, and just shipped V3 of Lexica Aperture, their own text-to-image model! Sharif Shameem breaks his podcast hiatus with us for an exclusive interview covering his journey building everything with AI!The conversation is nominally about Sharif’s journey through his three startups VectorDash, Debuild, and now Lexica, but really a deeper introspection into what it takes to be a top founder in the fastest moving tech startup scene (possibly ever) of AI. We hope you enjoy this conversation as much as we did!Full transcript is below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice!Timestamps* [00:00] Introducing Sharif* [02:00] VectorDash* [05:00] The GPT3 Moment and Building Debuild* [09:00] Stable Diffusion and Lexica* [11:00] Lexica’s Launch & How it Works* [15:00] Being Chronically Early* [16:00] From Search to Custom Models* [17:00] AI Grant Learnings* [19:30] The Text to Image Illuminati?* [20:30] How to Learn to Train Models* [24:00] The future of Agents and Human Intervention* [29:30] GPT4 and Multimodality* [33:30] Sharif’s Startup Manual* [38:30] Lexica Aperture V1/2/3* [40:00] Request for AI Startup - LLM Tools* [41:00] Sequencing your Genome* [42:00] Believe in Doing Great Things* [44:30] Lightning RoundShow Notes* Sharif’s website, Twitter, LinkedIn* VectorDash (5x cheaper than AWS)* Debuild Insider, Fast company, MIT review, tweet, tweet* Lexica* Introducing Lexica* Lexica Stats* Aug: “God mode” search* Sep: Lexica API * Sept: Search engine with CLIP * Sept: Reverse image search* Nov: teasing Aperture* Dec: Aperture v1* Dec - Aperture v2* Jan 2023 - Outpainting* Apr 2023 - Aperture v3* Same.energy* AI Grant* Sharif on Agents: prescient Airpods tweet, Reflection* MiniGPT4 - Sharif on Multimodality* Sharif Startup Manual* Sharif Future* 23andMe Genome Sequencing Tool: Promethease* Lightning Round* Fave AI Product: Cursor.so. Swyx ChatGPT Menubar App.* Acceleration: Multimodality of GPT4. Animated Drawings* Request for Startup: Tools for LLMs, Brex for GPT Agents* Message: Build Weird Ideas!TranscriptAlessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO on Residence at Decibel Partners. I'm joined by my co-host Wix, writer and editor of Latent Space. And today we have Sharish Amin. Welcome to the studio. Sharif: Awesome. Thanks for the invite.Swyx: Really glad to have you. [00:00] Introducing SharifSwyx: You've been a dream guest, actually, since we started drafting guest lists for this pod. So glad we could finally make this happen. So what I like to do is usually introduce people, offer their LinkedIn, and then prompt you for what's not on your LinkedIn. And to get a little bit of the person behind the awesome projects. So you graduated University of Maryland in CS. Sharif: So I actually didn't graduate, but I did study. Swyx: You did not graduate. You dropped out. Sharif: I did drop out. Swyx: What was the decision behind dropping out? Sharif: So first of all, I wasn't doing too well in any of my classes. I was working on a side project that took up most of my time. Then I spoke to this guy who ended up being one of our investors. And he was like, actually, I ended up dropping out. I did YC. And my company didn't end up working out. And I returned to school and graduated along with my friends. I was like, oh, it's actually a reversible decision. And that was like that. And then I read this book called The Case Against Education by Brian Kaplan. So those two things kind of sealed the deal for me on dropping out. Swyx: Are you still on hiatus? Could you still theoretically go back? Sharif: Theoretically, probably. Yeah. Still on indefinite leave. Swyx: Then you did some work at Mitra? Sharif: Mitra, yeah. So they're lesser known. So they're technically like an FFRDC, a federally funded research and development center. So they're kind of like a large government contractor, but nonprofit. Yeah, I did some computer vision work there as well. [02:00] VectorDashSwyx: But it seems like you always have an independent founder bone in you. Because then you started working on VectorDash, which is distributed GPUs. Sharif: Yes. Yeah. So VectorDash was a really fun project that we ended up working on for a while. So while I was at Mitra, I had a friend who was mining Ethereum. This was, I think, 2016 or 2017. Oh my God. Yeah. And he was mining on his NVIDIA 1080Ti, making around like five or six dollars a day. And I was trying to train a character recurrent neural network, like a character RNN on my iMessage text messages to make it like a chatbot. Because I was just curious if I could do it. Because iMessage stores all your past messages from years ago in a SQL database, which is pretty nifty. But I wanted to train it. And I needed a GPU. And it was, I think, $60 to $80 for a T4 on AWS, which is really slow compared to a 1080Ti. If you normalize the cost and performance versus the 1080Ti when someone's mining Ethereum, it's like a 20x difference. So I was like, hey, his name was Alex. Alex, I'll give you like 10 bucks if you let me borrow your 1080Ti for a week. I'll give you 10 bucks per day. And it was like 70 bucks. And I used it to train my model. And it worked great. The model was really bad, but the whole trade worked really great. I got a really high performance GPU to train my model on. He got much more than he was making by mining Ethereum. So we had this idea. I was like, hey, what if we built this marketplace where people could rent their GPUs where they're mining cryptocurrency and machine learning researchers could just rent them out and pay a lot cheaper than they would pay AWS. And it worked pretty well. We launched in a few months. We had over 120,000 NVIDIA GPUs on the platform. And then we were the cheapest GPU cloud provider for like a solid year or so. You could rent a pretty solid GPU for like 20 cents an hour. And cryptocurrency miners were making more than they would make mining crypto because this was after the Ethereum crash. And yeah, it was pretty cool. It just turns out that a lot of our customers were college students and researchers who didn't have much money. And they weren't necessarily the best customers to have as a business. Startups had a ton of credits and larger companies were like, actually, we don't really trust you with our data, which makes sense. Yeah, we ended up pivoting that to becoming a cloud GPU provider for video games. So we would stream games from our GPUs. Oftentimes, like many were located just a few blocks away from you because we had the lowest latency of any cloud GPU provider, even lower than like AWS and sometimes Cloudflare. And we decided to build a cloud gaming platform where you could pretty much play your own games on the GPU and then stream it back to your Mac or PC. Swyx: So Stadia before Stadia. Sharif: Yeah, Stadia before Stadia. It's like a year or so before Stadia. Swtx: Wow. Weren't you jealous of, I mean, I don't know, it sounds like Stadia could have bought you or Google could have bought you for Stadia and that never happened? Sharif: It never happened. Yeah, it didn't end up working out for a few reasons. The biggest thing was internet bandwidth. So a lot of the hosts, the GPU hosts had lots of GPUs, but average upload bandwidth in the United States is only 35 megabits per second, I think. And like a 4K stream needs like a minimum of 15 to 20 megabits per second. So you could really only utilize one of those GPUs, even if they had like 60 or 100. [05:00] The GPT3 Moment and Building DebuildSwyx: And then you went to debuild July 2020, is the date that I have. I'm actually kind of just curious, like what was your GPT-3 aha moment? When were you like GPT-3-pilled? Sharif: Okay, so I first heard about it because I was also working on another chatbot. So this was like after, like everything ties back to this chatbot I'm trying to make. This was after working on VectorDash. I was just like hacking on random projects. I wanted to make the chatbot using not really GPT-2, but rather just like it would be pre-programmed. It was pretty much you would give it a goal and then it would ask you throughout the week how much progress you're making to that goal. So take your unstructured response, usually a reply to a text message, and then it would like, plot it for you in like a table and you could see your progress over time. It could be for running or tracking calories. But I wanted to use GPT-3 to make it seem more natural because I remember someone on Bookface, which is still YC's internal forum. They posted and they were like, OpenAI just released AGI and it's GPT-3. I asked it like a bunch of logic puzzles and it solved them all perfectly. And I was like, what? How's no one else talking about this? Like this is either like the greatest thing ever that everyone is missing or like it's not that good. So like I tweeted out if anyone could get me access to it. A few hours later, Greg Brockman responded. Swyx: He is everywhere. Sharif: He's great. Yeah, he's on top of things. And yeah, by that afternoon, I was like messing around with the API and I was like, wow, this is incredible. You could chat with fake people or people that have passed away. You could like, I remember the first conversation I did was this is a chat with Steve Jobs and it was like, interviewer, hi. What are you up to today on Steve? And then like you could talk to Steve Jobs and it was somewhat plausible. Oh, the thing that really blew my mind was I tried to generate code with it. So I'd write the function for a JavaScript header or the header for a JavaScript function. And it would complete the rest of the function. I was like, whoa, does this code actually work? Like I copied it and ran it and it worked. And I tried it again. I gave more complex things and like I kind of understood where it would break, which was like if it was like something, like if it was something you couldn't easily describe in a sentence and like contain all the logic for in a single sentence. So I wanted to build a way where I could visually test whether these functions were actually working. And what I was doing was like I was generating the code in the playground, copying it into my VS code editor, running it and then reloading the react development page. And I was like, okay, cool. That works. So I was like, wait, let me just put this all in like the same page so I can just compile in the browser, run it in the browser and then submit it to the API in the browser as well. So I did that. And it was really just like a simple loop where you just type in the prompt. It would generate the code and then compile it directly in the browser. And it showed you the response. And I did this for like very basic JSX react components. I mean, it worked. It was pretty mind blowing. I remember staying up all night, like working on it. And it was like the coolest thing I'd ever worked on at the time so far. Yeah. And then I was like so mind blowing that no one was talking about this whole GPT three thing. I was like, why is this not on everyone's minds? So I recorded a quick 30 second demo and I posted on Twitter and like I go to bed after staying awake for like 20 hours straight. When I wake up the next morning and I had like 20,000 likes and like 100,000 people had viewed it. I was like, oh, this is so cool. And then I just kept putting demos out for like the next week. And yeah, that was like my GPT three spark moment. Swyx: And you got featured in like Fast Company, MIT Tech Review, you know, a bunch of stuff, right? Sharif: Yeah. Yeah. I think a lot of it was just like the API had been there for like a month prior already. Swyx: Not everyone had access. Sharif: That's true. Not everyone had access. Swyx: So you just had the gumption to tweet it out. And obviously, Greg, you know, on top of things as always. Sharif: Yeah. Yeah. I think it also makes a lot of sense when you kind of share things in a way that's easily consumable for people to understand. Whereas if you had shown a terminal screenshot of a generating code, that'd be pretty compelling. But whereas seeing it get rendered and compiled directly in front of you, there's a lot more interesting. There's also that human aspect to it where you want to relate things to the end user, not just like no one really cares about evals. When you can create a much more compelling demo explaining how it does on certain tasks. [09:00] Stable Diffusion and LexicaSwyx: Okay. We'll round it out soon. But in 2022, you moved from Debuild to Lexica, which was the search engine. I assume this was inspired by stable diffusion, but I can get the history there a little bit. Sharif: Yeah. So I was still working on Debuild. We were growing at like a modest pace and I was in the stable... Swyx: I was on the signup list. I never got off. Sharif: Oh yeah. Well, we'll get you off. It's not getting many updates anymore, but yeah, I was in the stable diffusion discord and I was in it for like many hours a day. It was just like the most exciting thing I'd ever done in a discord. It was so cool. Like people were generating so many images, but I didn't really know how to write prompts and people were like writing really complicated things. They would be like, like a modern home training on our station by Greg Rutkowski, like a 4k Unreal Engine. It's like that there's no way that actually makes the images look better. But everyone was just kind of copying everyone else's prompts and like changing like the first few words. Swyx: Yeah. Yeah. Sharif: So I was like using the discord search bar and it was really bad because it showed like five images at a time. And I was like, you know what? I could build a much better interface for this. So I ended up scraping the entire discord. It was like 10 million images. I put them in a database and I just pretty much built a very basic search engine where you could just type for type a word and then it returned all the prompts that had that word. And I built the entire website for it in like 20, in like about two days. And we shipped it the day I shipped it the day after the stable diffusion weights were open sourced. So about 24 hours later and it kind of took off in a way that I never would have expected. Like I thought it'd be this cool utility that like hardcore stable diffusion users would find useful. But it turns out that almost anyone who mentioned stable diffusion would also kind of mention Lexica in conjunction with it. I think it's because it was like it captured the zeitgeist in an easy to share way where it's like this URL and there's this gallery and you can search. Whereas running the model locally was a lot harder. You'd have to like to deploy it on your own GPU and like set up your own environment and like do all that stuff. Swyx: Oh, my takeaway. I have two more to add to the reasons why Lexica works at the time. One is lower latency is all you need. So in other words, instead of waiting a minute for your image, you could just search and find stuff that other people have done. That's good. And then two is everyone knew how to search already, but people didn't know how to prompt. So you were the bridge. Sharif: That's true. Yeah. You would get a lot better looking images by typing a one word prompt versus prompting for that one word. Yeah. Swyx: Yeah. That is interesting. [11:00] Lexica’s Explosion at LaunchAlessio: The numbers kind of speak for themselves, right? Like 24 hours post launch, 51,000 queries, like 2.2 terabytes in bandwidth. Going back to the bandwidth problem that you have before, like you would have definitely run into that. Day two, you doubled that. It's like 111,000 queries, four and a half terabytes in bandwidth, 22 million images served. So it's pretty crazy. Sharif: Yeah. I think we're, we're doing like over 5 billion images served per month now. It's like, yeah, that's, it's pretty crazy how much things have changed since then. Swyx: Yeah. I'm still showing people like today, even today, you know, it's been a few months now. This is where you start to learn image prompting because they don't know. Sharif: Yeah, it is interesting. And I, it's weird because I didn't really think it would be a company. I thought it would just be like a cool utility or like a cool tool that I would use for myself. And I really was just building it for myself just because I didn't want to use the Discord search bar. But yeah, it was interesting that a lot of other people found it pretty useful as well. [11:00] How Lexica WorksSwyx: So there's a lot of things that you release in a short amount of time. The God mode search was kind of like, obviously the first thing, I guess, like maybe to talk about some of the underlying technology you're using clip to kind of find, you know, go from image to like description and then let people search it. Maybe talk a little bit about what it takes to actually make the search magic happen. Sharif: Yeah. So the original search was just using Postgres' full text search and it would only search the text contents of the prompt. But I was inspired by another website called Same Energy, where like a visual search engine. It's really cool. Do you know what happened to that guy? I don't. Swyx: He released it and then he disappeared from the internet. Sharif: I don't know what happened to him, but I'm sure he's working on something really cool. He also worked on like Tabnine, which was like the very first version of Copilot or like even before Copilot was Copilot. But yeah, inspired by that, I thought like being able to search images by their semantics. The contents of the image was really interesting. So I pretty much decided to create a search index on the clip embeddings, the clip image embeddings of all the images. And when you would search it, we would just do KNN search on pretty much the image embedding index. I mean, we had way too many embeddings to store on like a regular database. So we had to end up using FAISS, which is a Facebook library for really fast KNN search and embedding search. That was pretty fun to set up. It actually runs only on CPUs, which is really cool. It's super efficient. You compute the embeddings on GPUs, but like you can serve it all on like an eight core server and it's really, really fast. Once we released the semantic search on the clip embeddings, people were using the search way more. And you could do other cool things. You could do like similar image search where if you found like a specific image you liked, you could upload it and it would show you relevant images as well. Swyx: And then right after that, you raised your seed money from AI grant, NetFreedman, then Gross. Sharif: Yeah, we raised about $5 million from Daniel Gross. And then we also participated in AI grant. That was pretty cool. That was kind of the inflection point. Not much before that point, Lexic was kind of still a side project. And I told myself that I would focus on it full time or I'd consider focusing on it full time if we had broke like a million users. I was like, oh, that's gonna be like years away for sure. And then we ended up doing that in like the first week and a half. I was like, okay, there's something here. And it was kind of that like deal was like growing like pretty slowly and like pretty linearly. And then Lexica was just like this thing that just kept going up and up and up. And I was so confused. I was like, man, people really like looking at pictures. This is crazy. Yeah. And then we decided to pivot the entire company and just focus on Lexica full time at that point. And then we raised our seed round. [15:00] Being Chronically EarlySwyx: Yeah. So one thing that you casually dropped out, the one that slip, you said you were working on Lexica before the launch of Stable Diffusion such that you were able to launch Lexica one day after Stable Diffusion. Sharif: Yeah.Swyx: How did you get so early into Stable Diffusion? Cause I didn't hear about it. Sharif: Oh, that's a good question. I, where did I first hear about Stable Diffusion? I'm not entirely sure. It must've been like somewhere on Twitter or something. That changed your life. Yeah, it was great. And I got into the discord cause I'd used Dolly too before, but, um, there were a lot of restrictions in place where you can generate human faces at the time. You can do that now. But when I first got access to it, like you couldn't do any faces. It was like, there were like a, the list of adjectives you couldn't use was quite long. Like I had a friend from Pakistan and it can generate anything with the word Pakistan in it for some reason. But Stable Diffusion was like kind of the exact opposite where there were like very, very few rules. So that was really, really fun and interesting, especially seeing the chaos of like a bunch of other people also using it right in front of you. That was just so much fun. And I just wanted to do something with it. I thought it was honestly really fun. Swyx: Oh, well, I was just trying to get tips on how to be early on things. Cause you're pretty consistently early to things, right? You were Stadia before Stadia. Um, and then obviously you were on. Sharif: Well, Stadia is kind of shut down now. So I don't know if being early to that was a good one. Swyx: Um, I think like, you know, just being consistently early to things that, uh, you know, have a lot of potential, like one of them is going to work out and you know, then that's how you got Lexica. [16:00] From Search to Custom ModelsAlessio: How did you decide to go from search to running your own models for a generation? Sharif: That's a good question. So we kind of realized that the way people were using Lexica was they would have Lexica open in one tab and then in another tab, they'd have a Stable Diffusion interface. It would be like either a discord or like a local run interface, like the automatic radio UI, um, or something else. I just, I would watch people use it and they would like all tabs back and forth between Lexica and their other UI. And they would like to scroll through Lexica, click on the prompt, click on an image, copy the prompt, and then paste it and maybe change a word or two. And I was like, this should really kind of just be all within Lexica. Like, it'd be so cool if you could just click a button in Lexica and get an editor and generate your images. And I found myself also doing the all tab thing, or it was really frustrating. I was like, man, this is kind of tedious. Like I really wish it was much simpler. So we just built generations directly within Lexica. Um, so we do, we deployed it on, I don't remember when we first launched, I think it was November, December. And yeah, people love generating directly within it. [17:00] AI Grant LearningsSwyx: I was also thinking that this was coming out of AI grants where, you know, I think, um, yeah, I was like a very special program. I was just wondering if you learned anything from, you know, that special week where everyone was in town. Sharif: Yeah, that was a great week. I loved it. Swyx: Yeah. Bring us, bring us in a little bit. Cause it was awesome. There. Sharif: Oh, sure. Yeah. It's really, really cool. Like all the founders in AI grants are like fantastic people. And so I think the main takeaway from the AI grant was like, you have this massive overhang in compute or in capabilities in terms of like these latest AI models, but to the average person, there's really not that many products that are that cool or useful to them. Like the latest one that has hit the zeitgeist was chat GPT, which used arguably the same GPT three model, but like RLHF, but you could have arguably built like a decent chat GPT product just using the original GPT three model. But no one really did it. Now there were some restrictions in place and opening. I like to slowly release them over the few months or years after they release the original API. But the core premise behind AI grants is that there are way more capabilities than there are products. So focus on building really compelling products and get people to use them. And like to focus less on things like hitting state of the art on evals and more on getting users to use something. Swyx: Make something people want.Sharif: Exactly. Host: Yeah, we did an episode on LLM benchmarks and we kind of talked about how the benchmarks kind of constrain what people work on, because if your model is not going to do well, unlike the well-known benchmarks, it's not going to get as much interest and like funding. So going at it from a product lens is cool. [19:30] The Text to Image Illuminati?Swyx: My hypothesis when I was seeing the sequence of events for AI grants and then for Lexica Aperture was that you had some kind of magical dinner with Emad and David Holtz. And then they taught you the secrets of training your own model. Is that how it happens? Sharif: No, there's no secret dinner. The Illuminati of text to image. We did not have a meeting. I mean, even if we did, I wouldn't tell you. But it really boils down to just having good data. If you think about diffusion models, really the only thing they do is learn a distribution of data. So if you have high quality data, learn that high quality distribution. Or if you have low quality data, it will learn to generate images that look like they're from that distribution. So really it boils down to the data and the amount of data you have and that quality of that data, which means a lot of the work in training high quality models, at least diffusion models, is not really in the model architecture, but rather just filtering the data in a way that makes sense. So for Lexica, we do a lot of aesthetic scoring on images and we use the rankings we get from our website because we get tens of millions of people visiting it every month. So we can capture a lot of rankings. Oh, this person liked this image when they saw this one right next to it. Therefore, they probably preferred this one over that. You can do pairwise ranking to rank images and then compute like ELO scores. You can also just train aesthetic models to learn to classify a model, whether or not someone will like it or whether or not it's like, rank it on a scale of like one to ten, for example. So we mostly use a lot of the traffic we get from Lexica and use that to kind of filter our data sets and use that to train better aesthetic models. [20:30] How to Learn to Train ModelsSwyx: You had been a machine learning engineer before. You've been more of an infrastructure guy. To build, you were more of a prompt engineer with a bit of web design. This was the first time that you were basically training your own model. What was the wrap up like? You know, not to give away any secret sauce, but I think a lot of people who are traditional software engineers are feeling a lot of, I don't know, fear when encountering these kinds of domains. Sharif: Yeah, I think it makes a lot of sense. And to be fair, I didn't have much experience training massive models at this scale before I did it. A lot of times it's really just like, in the same way when you're first learning to program, you would just take the problem you're having, Google it, and go through the stack overflow post. And then you figure it out, but ultimately you will get to the answer. It might take you a lot longer than someone who's experienced, but I think there are enough resources out there where it's possible to learn how to do these things. Either just reading through GitHub issues for relevant models. Swyx: Oh God. Sharif: Yeah. It's really just like, you might be slower, but it's definitely still possible. And there are really great courses out there. The Fast AI course is fantastic. There's the deep learning book, which is great for fundamentals. And then Andrej Karpathy's online courses are also excellent, especially for language modeling. You might be a bit slower for the first few months, but ultimately I think if you have the programming skills, you'll catch up pretty quickly. It's not like this magical dark science that only three people in the world know how to do well. Probably was like 10 years ago, but now it's becoming much more open. You have open source collectives like Eleuther and LAION, where they like to share the details of their large scale training runs. So you can learn from a lot of those people. Swyx: Yeah. I think what is different for programmers is having to estimate significant costs upfront before they hit run. Because it's not a thing that you normally consider when you're coding, but yeah, like burning through your credits is a fear that people have. Sharif: Yeah, that does make sense. In that case, like fine tuning larger models gets you really, really far. Even using things like low rank adaptation to fine tune, where you can like fine tune much more efficiently on a single GPU. Yeah, I think people are underestimating how far you can really get just using open source models. I mean, before Lexica, I was working on Debuild and we were using the GP3 API, but I was also like really impressed at how well you could get open source models to run by just like using the API, collecting enough samples from like real world user feedback or real world user data using your product. And then just fine tuning the smaller open source models on those examples. And now you have a model that's pretty much state of the art for your specific domain. Whereas the runtime cost is like 10 times or even 100 times cheaper than using an API. Swyx: And was that like GPT-J or are you talking BERT? Sharif: I remember we tried GPT-J, but I think FLAN-T5 was like the best model we were able to use for that use case. FLAN-T5 is awesome. If you can, like if your prompt is small enough, it's pretty great. And I'm sure there are much better open source models now. Like Vicuna, which is like the GPT-4 variant of like Lama fine tuned on like GPT-4 outputs. Yeah, they're just going to get better and they're going to get better much, much faster. Swyx: Yeah. We're just talking in a previous episode to the creator of Dolly, Mike Conover, which is actually commercially usable instead of Vicuna, which is a research project. Sharif: Oh, wow. Yeah, that's pretty cool. [24:00] Why No Agents?Alessio: I know you mentioned being early. Obviously, agents are one of the hot things here. In 2021, you had this, please buy me AirPods, like a demo that you tweeted with the GPT-3 API. Obviously, one of the things about being early in this space, you can only do one thing at a time, right? And you had one tweet recently where you said you hoped that that demo would open Pandora's box for a bunch of weird GPT agents. But all we got were docs powered by GPT. Can you maybe talk a little bit about, you know, things that you wish you would see or, you know, in the last few, last few weeks, we've had, you know, Hugging GPT, Baby AGI, Auto GPT, all these different kind of like agent projects that maybe now are getting closer to the, what did you say, 50% of internet traffic being skips of GPT agents. What are you most excited about, about these projects and what's coming? Sharif: Yeah, so we wanted a way for users to be able to paste in a link for the documentation page for a specific API, and then describe how to call that API. And then the way we would need to pretty much do that for Debuild was we wondered if we could get an agent to browse the docs page, read through it, summarize it, and then maybe even do things like create an API key and register it for that user. To do that, we needed a way for the agent to read the web page and interact with it. So I spent about a day working on that demo where we just took the web page, serialized it into a more compact form that fit within the 2048 token limit of like GPT-3 at the time. And then just decide what action to do. And then it would, if the page was too long, it would break it down into chunks. And then you would have like a sub prompt, decide on which chunk had the best action. And then at the top node, you would just pretty much take that action and then run it in a loop. It was really, really expensive. I think that one 60 second demo cost like a hundred bucks or something, but it was wildly impractical. But you could clearly see that agents were going to be a thing, especially ones that could read and write and take actions on the internet. It was just prohibitively expensive at the time. And the context limit was way too small. But yeah, I think it seems like a lot of people are taking it more seriously now, mostly because GPT-4 is way more capable. The context limit's like four times larger at 8,000 tokens, soon 32,000. And I think the only problem that's left to solve is finding a really good representation for a webpage that allows it to be consumed by a text only model. So some examples are like, you could just take all the text and pass it in, but that's probably too long. You could take all the interactive only elements like buttons and inputs, but then you miss a lot of the relevant context. There are some interesting examples, which I really like is you could run the webpage or you could run the browser in a terminal based browser. So there are some browsers that run in your terminal, which serialize everything into text. And what you can do is just take that frame from that terminal based browser and pass that directly to the model. And it's like a really, really good representation of the webpage because they do things where for graphical elements, they kind of render it using ASCII blocks. But for text, they render it as actual text. So you could just remove all the weird graphical elements, just keep all the text. And that works surprisingly well. And then there are other problems to solve, which is how do you get the model to take an action? So for example, if you have a booking page and there's like a calendar and there are 30 days on the calendar, how do you get it to specify which button to press? It could say 30, and you can match string based and like find the 30. But for example, what if it's like a list of friends in Facebook and trying to delete a friend? There might be like 30 delete buttons. How do you specify which one to click on? The model might say like, oh, click on the one for like Mark. But then you'd have to figure out the delete button in relation to Mark. And there are some ways to solve this. One is there's a cool Chrome extension called Vimium, which lets you use Vim in your Chrome browser. And what you do is you can press F and over every interactive element, it gives you like a character or two characters. Or if you type those two characters, it presses that button or it opens or focuses on that input. So you could combine a lot of these ideas and then get a really good representation of the web browser in text, and then also give the model a really, really good way to control the browser as well. And I think those two are the core part of the problem. The reasoning ability is definitely there. If a model can score in the top 10% on the bar exam, it can definitely browse a web page. It's really just how do you represent text to the model and how do you get the model to perform actions back on the web page? Really, it's just an engineering problem. Swyx: I have one doubt, which I'd love your thoughts on. How do you get the model to pause when it doesn't have enough information and ask you for additional information because you under specified your original request? Sharif: This is interesting. I think the only way to do this is to have a corpus where your training data is like these sessions of agents browsing the web. And you have to pretty much figure out where the ones that went wrong or the agents that went wrong, or did they go wrong and just replace it with, hey, I need some help. And then if you were to fine tune a larger model on that data set, you would pretty much get them to say, hey, I need help on the instances where they didn't know what to do next. Or if you're using a closed source model like GPT-4, you could probably tell it if you're uncertain about what to do next, ask the user for help. And it probably would be pretty good at that. I've had to write a lot of integration tests in my engineering days and like the dome. Alessio: They might be over. Yeah, I hope so. I hope so. I don't want to, I don't want to deal with that anymore. I, yeah, I don't want to write them the old way. Yeah. But I'm just thinking like, you know, we had the robots, the TXT for like crawlers. Like I can definitely see the DOM being reshaped a little bit in terms of accessibility. Like sometimes you have to write expats that are like so long just to get to a button. Like there should be a better way to do it. And maybe this will drive the change, you know, making it easier for these models to interact with your website. Sharif: There is the Chrome accessibility tree, which is used by screen readers, but a lot of times it's missing a lot of, a lot of useful information. But like in a perfect world, everything would be perfectly annotated for screen readers and we could just use that. That's not the case. [29:30] GPT4 and MultimodalitySwyx: GPT-4 multimodal, has your buddy, Greg, and do you think that that would solve essentially browser agents or desktop agents? Sharif: Greg has not come through yet, unfortunately. But it would make things a lot easier, especially for graphically heavy web pages. So for example, you were using Yelp and like using the map view, it would make a lot of sense to use something like that versus a text based input. Where, how do you serialize a map into text? It's kind of hard to do that. So for more complex web pages, that would make it a lot easier. You get a lot more context to the model. I mean, it seems like that multimodal input is very dense in the sense that it can read text and it can read it really, really well. So you could probably give it like a PDF and it would be able to extract all the text and summarize it. So if it can do that, it could probably do anything on any webpage. Swyx: Yeah. And given that you have some experience integrating Clip with language models, how would you describe how different GPT-4 is compared to that stuff? Sharif: Yeah. Clip is entirely different in the sense that it's really just good at putting images and text into the same latent space. And really the only thing that's useful for is similarity and clustering. Swyx: Like literally the same energy, right? Sharif: Yeah. Swyx: Yeah. And then there's Blip and Blip2. I don't know if you like those. Sharif: Yeah. Blip2 is a lot better. There's actually a new project called, I think, Mini GPT-4. Swyx: Yes. It was just out today. Sharif: Oh, nice. Yeah. It's really cool. It's actually really good. I think that one is based on the Lama model, but yeah, that's, that's like another. Host: It's Blip plus Lama, right? So they, they're like running through Blip and then have Lama ask your, interpret your questions so that you do visual QA. Sharif: Oh, that's cool. That's really clever. Yeah. Ensemble models are really useful. Host: Well, so I was trying to articulate, cause that was, that's, there's two things people are talking about today. You have to like, you know, the moment you wake up, you open Hacker News and go like, all right, what's, what's the new thing today? One is Red Pajama. And then the other one is Mini GPT-4. So I was trying to articulate like, why is this not GPT-4? Like what is missing? And my only conclusion was it just doesn't do OCR yet. But I wonder if there's anything core to this concept of multimodality that you have to train these things together. Like what does one model doing all these things do that is separate from an ensemble of models that you just kind of duct tape together? Sharif: It's a good question. This is pretty related to interoperability. Like how do we understand that? Or how, how do we, why do models trained on different modalities within the same model perform better than two models perform or train separately? I can kind of see why that is the case. Like, it's kind of hard to articulate, but when you have two different models, you get the reasoning abilities of a language model, but also like the text or the vision understanding of something like Clip. Whereas Clip clearly lacks the reasoning abilities, but if you could somehow just put them both in the same model, you get the best of both worlds. There were even cases where I think the vision version of GPT-4 scored higher on some tests than the text only version. So like there might even be some additional learning from images as well. Swyx: Oh yeah. Well, uh, the easy answer for that was there was some chart in the test. That wasn't translated. Oh, when I read that, I was like, Oh yeah. Okay. That makes sense. Sharif: That makes sense. I thought it'd just be like, it sees more of the world. Therefore it has more tokens. Swyx: So my equivalent of this is I think it's a well-known fact that adding code to a language model training corpus increases its ability to do language, not just with code. So, the diversity of datasets that represent some kind of internal logic and code is obviously very internally logically consistent, helps the language model learn some internal structure. Which I think, so, you know, my ultimate test for GPT-4 is to show the image of like, you know, is this a pipe and ask it if it's a pipe or not and see what it does. Sharif: Interesting. That is pretty cool. Yeah. Or just give it a screenshot of your like VS code editor and ask it to fix the bug. Yeah. That'd be pretty wild if it could do that. Swyx: That would be adult AGI. That would be, that would be the grownup form of AGI. [33:30] Sharif’s Startup ManualSwyx: On your website, you have this, um, startup manual where you give a bunch of advice. This is fun. One of them was that you should be shipping to production like every two days, every other day. This seems like a great time to do it because things change every other day. But maybe, yeah, tell some of our listeners a little bit more about how you got to some of these heuristics and you obviously build different projects and you iterate it on a lot of things. Yeah. Do you want to reference this? Sharif: Um, sure. Yeah, I'll take a look at it. Swyx: And we'll put this in the show notes, but I just wanted you to have the opportunity to riff on this, this list, because I think it's a very good list. And what, which one of them helped you for Lexica, if there's anything, anything interesting. Sharif: So this list is, it's pretty funny. It's mostly just like me yelling at myself based on all the mistakes I've made in the past and me trying to not make them again. Yeah. Yeah. So I, the first one is like, I think the most important one is like, try when you're building a product, try to build the smallest possible version. And I mean, for Lexica, it was literally a, literally one screen in the react app where a post-process database, and it just showed you like images. And I don't even know if the first version had search. Like I think it did, but I'm not sure. Like, I think it was really just like a grid of images that were randomized, but yeah, don't build the absolute smallest thing that can be considered a useful application and ship it for Lexica. That was, it helps me write better prompts. That's pretty useful. It's not that useful, but it's good enough. Don't fall into the trap of intellectual indulgence with over-engineering. I think that's a pretty important one for myself. And also anyone working on new things, there's often times you fall into the trap of like thinking you need to add more and more things when in reality, like the moment it's useful, you should probably get in the hands of your users and they'll kind of set the roadmap for you. I know this has been said millions of times prior, but just, I think it's really, really important. And I think if I'd spent like two months working on Lexica, adding a bunch of features, it wouldn't have been anywhere as popular as it was if I had just released the really, really boiled down version alongside the stable diffusion release. Yeah. And then there are a few more like product development doesn't start until you launch. Think of your initial product as a means to get your users to talk to you. It's also related to the first point where you really just want people using something as quickly as you can get that to happen. And then a few more are pretty interesting. Create a product people love before you focus on growth. If your users are spontaneously telling other people to use your product, then you've built something people love. Swyx: So this is pretty, it sounds like you've internalized Paul Graham's stuff a lot. Yeah. Because I think he said stuff like that. Sharif: A lot of these are just probably me taking notes from books I found really interesting or like PG essays that were really relevant at the time. And then just trying to not forget them. I should probably read this list again. There's some pretty personalized advice for me here. Oh yeah. One of my favorite ones is, um, don't worry if what you're building doesn't sound like a business. Nobody thought Facebook would be a $500 billion company. It's easy to come up with a business model. Once you've made something people want, you can even make pretty web forms and turn that into a 200 person company. And then if you click the link, it's to LinkedIn for type form, which is now, uh, I think they're like an 800 person company or something like that. So they've grown quite a bit. There you go. Yeah. Pretty web forms are pretty good business, even though it doesn't sound like it. Yeah. It's worth a billion dollars. [38:30] Lexica Aperture V1/2/3Swyx: One way I would like to tie that to the history of Lexica, which we didn't go over, which was just walk us through like Aperture V1, V2, V3, uh, which you just released last week. And how maybe some of those principles helped you in that journey.Sharif: Yeah. So, um, V1 was us trying to create a very photorealistic version of our model of Sable to Fusion. Uh, V1 actually didn't turn out to be that popular. It turns out people loved not generating. Your marketing tweets were popular. They were quite popular. So I think at the time you couldn't get Sable to Fusion to generate like photorealistic images that were consistent with your prompt that well. It was more so like you were sampling from this distribution of images and you could slightly pick where you sampled from using your prompt. This was mostly just because the clip text encoder is not the best text encoder. If you use a real language model, like T5, you get much better results. Like the T5 XXL model is like a hundred times larger than the clip text encoder for Sable to Fusion 1.5. So you could kind of steer it into like the general direction, but for more complex prompts, it just didn't work. So a lot of our users actually complained that they preferred the 1.5, Sable to Fusion 1.5 model over the Aperture model. And it was just because a lot of people were using it to create like parts and like really weird abstract looking pictures that didn't really work well with the photorealistic model trained solely on images. And then for V2, we kind of took that into consideration and then just trained it more on a lot of the art images on Lexica. So we took a lot of images that were on Lexica that were art, used that to train aesthetic models that ranked art really well, and then filtered larger sets to train V2. And then V3 is kind of just like an improved version of that with much more data. I'm really glad we didn't spend too much time on V1. I think we spent about one month working on it, which is a lot of time, but a lot of the things we learned were useful for training future versions. Swyx: How do you version them? Like where do you decide, okay, this is V2, this is V3? Sharif: The versions are kind of weird where you can't really use semantic versions because like if you have a small update, you usually just make that like V2. Versions are kind of used for different base models, I'd say. So if you have each of the versions were a different base model, but we've done like fine tunes of the same version and then just release an update without incrementing the version. But I think when there's like a clear change between running the same prompt on a model and you get a different image, that should probably be a different version. [40:00] Request for AI Startup - LLM ToolsAlessio: So the startup manual was the more you can actually do these things today to make it better. And then you have a whole future page that has tips from, you know, what the series successor is going to be like to like why everyone's genome should be sequenced. There's a lot of cool stuff in there. Why do we need to develop stimulants with shorter half-lives so that we can sleep better. Maybe talk a bit about, you know, when you're a founder, you need to be focused, right? So sometimes there's a lot of things you cannot build. And I feel like this page is a bit of a collection of these. Like, yeah. Are there any of these things that you're like, if I were not building Lexica today, this is like a very interesting thing. Sharif: Oh man. Yeah. There's a ton of things that I want to build. I mean, off the top of my head, the most exciting one would be better tools for language models. And I mean, not tools that help us use language models, but rather tools for the language models themselves. So things like giving them access to browsers, giving them access to things like payments and credit cards, giving them access to like credit cards, giving them things like access to like real world robots. So like, it'd be cool if you could have a Boston dynamic spot powered by a language model reasoning module and you would like to do things for you, like go and pick up your order, stuff like that. Entirely autonomously given like high level commands. That'd be like number one thing if I wasn't working on Lexica. [40:00] Sequencing your GenomeAnd then there's some other interesting things like genomics I find really cool. Like there's some pretty cool things you can do with consumer genomics. So you can export your genome from 23andMe as a text file, like literally a text file of your entire genome. And there is another tool called Prometheus, I think, where you upload your 23andMe text file genome and then they kind of map specific SNPs that you have in your genome to studies that have been done on those SNPs. And it tells you really, really useful things about yourself. Like, for example, I have the SNP for this thing called delayed sleep phase disorder, which makes me go to sleep about three hours later than the general population. So like I used to always be a night owl and I never knew why. But after using Prometheus it pretty much tells you, oh, you have the specific genome for specific SNP for DSPS. It's like a really tiny percentage of the population. And it's like something you should probably know about. And there's a bunch of other things. It tells you your likelihood for getting certain diseases, for certain cancers, oftentimes, like even weird personality traits. There's one for like, I have one of the SNPs for increased risk taking and optimism, which is pretty weird. That's an actual thing. Like, I don't know how. This is the founder gene. You should sequence everybody. It's pretty cool. And it's like, it's like $10 for Prometheus and like 70 bucks for 23andMe. And it explains to you how your body works and like the things that are different from you or different from the general population. Wow. Highly recommend everyone do it. Like if you're, if you're concerned about privacy, just purchase a 23andMe kit with a fake name. You don't have to use your real name. I didn't use my real name. Swyx: It's just my genes. Worst you can do is clone me. It ties in with what you were talking about with, you know, we want the future to be like this. And like people are building uninspired B2B SaaS apps and you and I had an exchange about this. [42:00] Believe in Doing Great ThingsHow can we get more people to believe they can do great things? Sharif: That's a good question. And I like a lot of the things I've been working on with GP3. It has been like trying to solve this by getting people to think about more interesting ideas. I don't really know. I think one is just like the low effort version of this is just putting out really compelling demos and getting people inspired. And then the higher effort version is like actually building the products yourself and getting people to like realize this is even possible in the first place. Like I think the baby AGI project and like the GPT Asian projects on GitHub are like in practice today, they're not super useful, but I think they're doing an excellent job of getting people incredibly inspired for what can be possible with language models as agents. And also the Stanford paper where they had like the mini version of Sims. Yeah. That one was incredible. That was awesome. Swyx: It was adorable. Did you see the part where they invented day drinking? Sharif: Oh, they did? Swyx: Yeah. You're not supposed to go to these bars in the afternoon, but they were like, we're going to go anyway. Nice. Sharif: That's awesome. Yeah. I think we need more stuff like that. That one paper is probably going to inspire a whole bunch of teams to work on stuff similar to that. Swyx: And that's great. I can't wait for NPCs to actually be something that you talk to in a game and, you know, have their own lives and you can check in and, you know, they would have their own personalities as well. Sharif: Yeah. I was so kind of off topic. But I was playing the last of us part two and the NPCs in that game are really, really good. Where if you like, point a gun at them and they'll beg for their life and like, please, I have a family. And like when you kill people in the game, they're like, oh my God, you shot Alice. Like they're just NPCs, but they refer to each other by their names and like they plead for their lives. And this is just using regular conditional rules on NPC behavior. Imagine how much better it'd be if it was like a small GPT-4 agent running in every NPC and they had the agency to make decisions and plead for their lives. And I don't know, you feel way more guilty playing that game. Alessio: I'm scared it's going to be too good. I played a lot of hours of Fallout. So I feel like if the NPCs were a lot better, you would spend a lot more time playing the game. Yeah. [44:30] Lightning RoundLet's jump into lightning round. First question is your favorite AI product. Sharif: Favorite AI product. The one I use the most is probably ChatGPT. The one I'm most excited about is, it's actually a company in AI grants. They're working on a version of VS code. That's like an entirely AI powered cursor, yeah. Cursor where you would like to give it a prompt and like to iterate on your code, not by writing code, but rather by just describing the changes you want to make. And it's tightly integrated into the editor itself. So it's not just another plugin. Swyx: Would you, as a founder of a low code prompting-to-code company that pivoted, would you advise them to explore some things or stay away from some things? Like what's your learning there that you would give to them?Sharif: I would focus on one specific type of code. So if I'm building a local tool, I would try to not focus too much on appealing developers. Whereas if I was building an alternative to VS code, I would focus solely on developers. So in that, I think they're doing a pretty good job focusing on developers. Swyx: Are you using Cursor right now? Sharif: I've used it a bit. I haven't converted fully, but I really want to. Okay. It's getting better really, really fast. Yeah. Um, I can see myself switching over sometime this year if they continue improving it. Swyx: Hot tip for, for ChatGPT, people always say, you know, they love ChatGPT. Biggest upgrade to my life right now is the, I forked a menu bar app I found on GitHub and now I just have it running in a menu bar app and I just do command shift G and it pops it up as a single use thing. And there's no latency because it just always is live. And I just type, type in the thing I want and then it just goes away after I'm done. Sharif: Wow. That's cool. Big upgrade. I'm going to install that. That's cool. Alessio: Second question. What is something you thought would take much longer, but it's already here? Like what, what's your acceleration update? Sharif: Ooh, um, it would take much longer, but it's already here. This is your question. Yeah, I know. I wasn't prepared. Um, so I think it would probably be kind of, I would say text to video. Swyx: Yeah. What's going on with that? Sharif: I think within this year, uh, by the end of this year, we'll have like the jump between like the original DALL-E one to like something like mid journey. Like we're going to see that leap in text to video within the span of this year. Um, it's not already here yet. So I guess the thing that surprised me the most was probably the multi-modality of GPT four in the fact that it can technically see things, which is pretty insane. Swyx: Yeah. Is text to video something that Aperture would be interested in? Sharif: Uh, it's something we're thinking about, but it's still pretty early. Swyx: There was one project with a hand, um, animation with human poses. It was also coming out of Facebook. I thought that was a very nice way to accomplish text to video while having a high degree of control. I forget the name of that project. It was like, I think it was like drawing anything. Swyx: Yeah. It sounds familiar. Well, you already answered a year from now. What will people be most surprised by? Um, and maybe the, uh, the usual requests for startup, you know, what's one thing you will pay for if someone built it? Sharif: One thing I would pay for if someone built it. Um, so many things, honestly, I would probably really like, um, like I really want people to build more, uh, tools for language models, like useful tools, give them access to Chrome. And I want to be able to give it a task. And then just, it goes off and spins up a hundred agents that perform that task. And like, sure. Like 80 of them might fail, but like 20 of them might kind of succeed. That's all you really need. And they're agents. You can spin up thousands of them. It doesn't really matter. Like a lot of large numbers are on your side. So that'd be, I would pay a lot of money for that. Even if it was capable of only doing really basic tasks, like signing up for a SAS tool and booking a call or something. If you could do even more things where it could have handled the email, uh, thread and like get the person on the other end to like do something where like, I don't even have to like book the demo. They just give me access to it. That'd be great. Yeah. More, more. Like really weird language model tools would be really fun.Swyx: Like our chat, GPT plugins, a step in the right direction, or are you envisioning something else? Sharif: I think GPT, chat GPT plugins are great, but they seem to only have right-only access right now. I also want them to have, I want these like theoretical agents to have right access to the world too. So they should be able to perform actions on web browsers, have their own email inbox, and have their own credit card with their own balance. Like take it, send emails to people that might be useful in achieving their goal. Ask them for help. Be able to like sign up and register for accounts on tools and services and be able to like to use graphical user interfaces really, really well. And also like to phone home if they need help. Swyx: You just had virtual employees. You want to give them a Brex card, right? Sharif: I wouldn't be surprised if, a year from now there was Brex GPT or it's like Brex cards for your GPT agents. Swyx: I mean, okay. I'm excited by this. Yeah. Kind of want to build it. Sharif: You should. Yeah. Alessio: Well, just to wrap up, we always have like one big takeaway for people, like, you know, to display on a signboard for everyone to see what is the big message to everybody. Sharif: Yeah. I think the big message to everybody is you might think that a lot of the time the ideas you have have already been done by someone. And that may be the case, but a lot of the time the ideas you have are actually pretty unique and no one's ever tried them before. So if you have weird and interesting ideas, you should actually go out and just do them and make the thing and then share that with the world. Cause I feel like we need more people building weird ideas and less people building like better GPT search for your documentation. Host: There are like 10 of those in the recent OST patch. Well, thank you so much. You've been hugely inspiring and excited to see where Lexica goes next. Sharif: Appreciate it. Thanks for having me. Get full access to Latent Space at www.latent.space/subscribe
undefined
May 5, 2023 • 44min

No Moat: Closed AI gets its Open Source wakeup call — ft. Simon Willison

It’s now almost 6 months since Google declared Code Red, and the results — Jeff Dean’s recap of 2022 achievements and a mass exodus of the top research talent that contributed to it in January, Bard’s rushed launch in Feb, a slick video showing Google Workspace AI features and confusing doubly linked blogposts about PaLM API in March, and merging Google Brain and DeepMind in April — have not been inspiring. Google’s internal panic is in full display now with the surfacing of a well written memo, written by software engineer Luke Sernau written in early April, revealing internal distress not seen since Steve Yegge’s infamous Google Platforms Rant. Similar to 2011, the company’s response to an external challenge has been to mobilize the entire company to go all-in on a (from the outside) vague vision.Google’s misfortunes are well understood by now, but the last paragraph of the memo: “We have no moat, and neither does OpenAI”, was a banger of a mic drop.Combine this with news this morning that OpenAI lost $540m last year and will need as much as $100b more funding (after the complex $10b Microsoft deal in Jan), and the memo’s assertion that both Google and OpenAI have “no moat” against the mighty open source horde have gained some credibility in the past 24 hours.Many are criticising this memo privately:* A CEO commented to me yesterday that Luke Sernau does not seem to work in AI related parts of Google and “software engineers don’t understand moats”. * Emad Mostaque, himself a perma-champion of open source and open models, has repeatedly stated that “Closed models will always outperform open models” because closed models can just wrap open ones.* Emad has also commented on the moats he does see: “Unique usage data, Unique content, Unique talent, Unique product, Unique business model”, most of which Google does have, and OpenAI less so (though it is winning on the talent front)* Sam Altman famously said that “very few to no one is Silicon Valley has a moat - not even Facebook” (implying that moats don’t actually matter, and you should spend your time thinking about more important things)* It is not actually clear what race the memo thinks Google and OpenAI are in vs Open Source. Neither are particularly concerned about running models locally on phones, and they are perfectly happy to let “a crazy European alpha male” run the last mile for them while they build actually monetizable cloud infrastructure.However moats are of intense interest by everybody keen on productized AI, cropping up in every Harvey, Jasper, and general AI startup vs incumbent debate. It is also interesting to take the memo at face value and discuss the searing hot pace of AI progress in open source. We hosted this discussion yesterday with Simon Willison, who apart from being an incredible communicator also wrote a great recap of the No Moat memo. 2,800 have now tuned in on Twitter Spaces, but we have taken the audio and cleaned it up here. Enjoy!Timestamps* [00:00:00] Introducing the Google Memo* [00:02:48] Open Source > Closed?* [00:05:51] Running Models On Device* [00:07:52] LoRA part 1* [00:08:42] On Moats - Size, Data* [00:11:34] Open Source Models are Comparable on Data* [00:13:04] Stackable LoRA* [00:19:44] The Need for Special Purpose Optimized Models* [00:21:12] Modular - Mojo from Chris Lattner* [00:23:33] The Promise of Language Supersets* [00:28:44] Google AI Strategy* [00:29:58] Zuck Releasing LLaMA* [00:30:42] Google Origin Confirmed* [00:30:57] Google's existential threat* [00:32:24] Non-Fiction AI Safety ("y-risk")* [00:35:17] Prompt Injection* [00:36:00] Google vs OpenAI* [00:41:04] Personal plugs: Simon and TravisTranscripts[00:00:00] Introducing the Google Memo[00:00:00] Simon Willison: So, yeah, this is a document, which Kate, which I first saw at three o'clock this morning, I think. It claims to be leaked from Google. There's good reasons to believe it is leaked from Google, and to be honest, if it's not, it doesn't actually matter because the quality of the analysis, I think stands alone.[00:00:15] If this was just a document by some anonymous person, I'd still think it was interesting and worth discussing. And the title of the document is We Have No Moat and neither does Open ai. And the argument it makes is that while Google and OpenAI have been competing on training bigger and bigger language models, the open source community is already starting to outrun them, given only a couple of months of really like really, really serious activity.[00:00:41] You know, Facebook lama was the thing that really kicked us off. There were open source language models like Bloom before that some G P T J, and they weren't very impressive. Like nobody was really thinking that they were. Chat. G P T equivalent Facebook Lama came out in March, I think March 15th. And was the first one that really sort of showed signs of being as capable maybe as chat G P T.[00:01:04] My, I don't, I think all of these models, they've been, the analysis of them has tend to be a bit hyped. Like I don't think any of them are even quite up to GT 3.5 standards yet, but they're within spitting distance in some respects. So anyway, Lama came out and then, Two weeks later Stanford Alpaca came out, which was fine tuned on top of Lama and was a massive leap forward in terms of quality.[00:01:27] And then a week after that Vicuna came out, which is to this date, the the best model I've been able to run on my own hardware. I, on my mobile phone now, like, it's astonishing how little resources you need to run these things. But anyway, the the argument that this paper made, which I found very convincing is it only took open source two months to get this far.[00:01:47] It's now every researcher in the world is kicking it on new, new things, but it feels like they're being there. There are problems that Google has been trying to solve that the open source models are already addressing, and really how do you compete with that, like with your, it's closed ecosystem, how are you going to beat these open models with all of this innovation going on?[00:02:04] But then the most interesting argument in there is it talks about the size of models and says that maybe large isn't a competitive advantage, maybe actually a smaller model. With lots of like different people fine tuning it and having these sort of, these LoRA l o r a stackable fine tuning innovations on top of it, maybe those can move faster.[00:02:23] And actually having to retrain your giant model every few months from scratch is, is way less useful than having small models that you can tr you can fine tune in a couple of hours on laptop. So it's, it's fascinating. I basically, if you haven't read this thing, you should read every word of it. It's not very long.[00:02:40] It's beautifully written. Like it's, it's, I mean, If you try and find the quotable lines in it, almost every line of it's quotable. Yeah. So, yeah, that's that, that, that's the status of this[00:02:48] Open Source > Closed?[00:02:48] swyx: thing. That's a wonderful summary, Simon. Yeah, there, there's so many angles we can take to this. I, I'll just observe one, one thing which if you think about the open versus closed narrative, Ima Mok, who is the CEO of Stability, has always been that open will trail behind closed, because the closed alternatives can always take.[00:03:08] Learnings and lessons from open source. And this is the first highly credible statement that is basically saying the exact opposite, that open source is moving than, than, than closed source. And they are scared. They seem to be scared. Which is interesting,[00:03:22] Travis Fischer: Travis. Yeah, the, the, the, a few things that, that I'll, I'll, I'll say the only thing which can keep up with the pace of AI these days is open source.[00:03:32] I think we're, we're seeing that unfold in real time before our eyes. And. You know, I, I think the other interesting angle of this is to some degree LLMs are they, they don't really have switching costs. They are going to be, become commoditized. At least that's, that's what a lot of, a lot of people kind of think to, to what extent is it Is it a, a rate in terms of, of pricing of these things?[00:03:55] , and they all kind of become roughly the, the, the same in, in terms of their, their underlying abilities. And, and open source is gonna, gonna be actively pushing, pushing that forward. And, and then this is kind of coming from, if it is to be believed the kind of Google or an insider type type mentality around you know, where is the actual competitive advantage?[00:04:14] What should they be focusing on? How can they get back in into the game? When you know, when, when, when, when currently the, the, the external view of, of Google is that they're kind of spinning their wheels and they have this code red,, and it's like they're, they're playing catch up already.[00:04:28] Like how could they use the open source community and work with them, which is gonna be really, really hard you know, from a structural perspective given Google's place in the ecosystem. But a, a lot, lot, a lot of jumping off points there.[00:04:42] Alessio Fanelli: I was gonna say, I think the Post is really focused on how do we get the best model, but it's not focused on like, how do we build the best product around it.[00:04:50] A lot of these models are limited by how many GPUs you can get to run them and we've seen on traditional open source, like everybody can use some of these projects like Kafka and like Alaska for free. But the reality is that not everybody can afford to run the infrastructure needed for it.[00:05:05] So I, I think like the main takeaway that I have from this is like, A lot of the moats are probably around just getting the, the sand, so to speak, and having the GPUs to actually serve these models. Because even if the best model is open source, like running it at large scale for an end is not easy and like, it's not super convenient to get a lot, a lot of the infrastructure.[00:05:27] And we've seen that model work in open source where you have. The opensource project, and then you have a enterprise cloud hosted version for it. I think that's gonna look really different in opensource models because just hosting a model doesn't have a lot of value. So I'm curious to hear how people end up getting rewarded to do opensource.[00:05:46] You know, it's, we figured that out in infrastructure, but we haven't figured it out in in Alans[00:05:51] Running Models On Device[00:05:51] Simon Willison: yet. I mean, one thing I'll say is that the the models that you can run on your own devices are so far ahead of what I ever dreamed they would be at this point. Like Vicuna 13 b i i, I, I think is the current best available open mo model that I've played with.[00:06:08] It's derived from Facebook Lama, so you can't use it for commercial purposes yet. But the point about MCK 13 B is it runs in the browser directly on web gpu. There's this amazing web l l M project where you literally, your browser downloaded a two gigabyte file. And it fires up a chat g D style interface and it's quite good.[00:06:27] It can do rap battles between different animals and all of the kind of fun stuff that you'd expect to be able to do the language model running entirely in Chrome canary. It's shocking to me that that's even possible, but that kind of shows that once, once you get to inference, if you can shrink the model down and the techniques for shrinking these models, the, the first one was the the quantization.[00:06:48] Which the Lama CPP project really sort of popularized Matt can by using four bits instead of 16 bit floating point numbers, you can shrink it down quite a lot. And then there was a paper that came out days ago suggesting that you can prune the models and ditch half the model and maintain the same level of quality.[00:07:05] So with, with things like that, with all of these tricks coming together, it's really astonishing how much you can get done on hardware that people actually have in their pockets even.[00:07:15] swyx: Just for completion I've been following all of your posts. Oh, sorry. Yes. I just wanna follow up, Simon. You're, you said you're running a model on your phone. Which model is it? And I don't think you've written it up.[00:07:27] Simon Willison: Yeah, that one's vina. I did, did I write it up? I did. I've got a blog post about how it it, it, it knows who I am, sort of, but it said that I invented a, a, a pattern for living called bear or bunny pattern, which I definitely didn't, but I loved that my phone decided that I did.[00:07:44] swyx: I will hunt for that because I'm not yet running Vic on my phone and I feel like I should and, and as like a very base thing, but I'll, okay.[00:07:52] Stackable LoRA Modules[00:07:52] swyx: Also, I'll follow up two things, right? Like one I'm very interesting and let's, let's talk about that a little bit more because this concept of stackable improvements to models I think is extremely interesting.[00:08:00] Like, I would love to MPM install abilities onto my models, right? Which is really awesome. But the, the first thing thing is under-discussed is I don't get the panic. Like, honestly, like Google has the most moats. I I, I was arguing maybe like three months ago on my blog. Like Google has the most mote out of a lot of people because, hey, we have your calendar.[00:08:21] Hey, we have your email. Hey, we have your you know, Google Docs. Like, isn't that a, a sufficient mode? Like, why are these guys panicking so much? I don't, I still don't get it. Like, Sure open source is running ahead and like, it's, it's on device and whatev, what have you, but they have so much more mode.[00:08:36] Like, what are we talking about here? There's many dimensions to compete on.[00:08:42] On Moats - Size, Data[00:08:42] Travis Fischer: Yeah, there's like one of, one of the, the things that, that the author you know, mentions in, in here is when, when you start to, to, to have the feeling of what we're trailing behind, then you're, you're, you're, you're brightest researchers jump ship and go to OpenAI or go to work at, at, at academia or, or whatever.[00:09:00] And like the talent drain. At the, the level of the, the senior AI researchers that are pushing these things ahead within Google, I think is a serious, serious concern. And my, my take on it's a good point, right? Like, like, like, like what Google has modes. They, they, they're not running outta money anytime soon.[00:09:16] You know, I think they, they do see the level of the, the defensibility and, and the fact that they want to be, I'll chime in the, the leader around pretty much anything. Tech first. There's definitely ha ha have lost that, that, that feeling. Right? , and to what degree they can, they can with the, the open source community to, to get that back and, and help drive that.[00:09:38] You know all of the llama subset of models with, with alpaca and Vicuna, et cetera, that all came from, from meta. Right. Like that. Yeah. Like it's not licensed in an open way where you can build a company on top of it, but is now kind of driving this family of, of models, like there's a tree of models that, that they're, they're leading.[00:09:54] And where is Google in that, in that playbook? Like for a long time they were the one releasing those models being super open and, and now it's just they, they've seem to be trailing and there's, there's people jumping ship and to what degree can they, can they, can they. Close off those wounds and, and focus on, on where, where they, they have unique ability to, to gain momentum.[00:10:15] I think is a core part of my takeaway from this. Yeah.[00:10:19] Alessio Fanelli: And think another big thing in the post is, oh, as long as you have high quality data, like you don't need that much data, you can just use that. The first party data loops are probably gonna be the most important going forward if we do believe that this is true.[00:10:32] So, Databricks. We have Mike Conover from Databricks on the podcast, and they talked about how they came up with the training set for Dolly, which they basically had Databricks employees write down very good questions and very good answers for it. Not every company as the scale to do that. And I think products like Google, they have millions of people writing Google Docs.[00:10:54] They have millions of people using Google Sheets, then millions of people writing stuff, creating content on YouTube. The question is, if you wanna compete against these companies, maybe the model is not what you're gonna do it with because the open source kind of commoditizes it. But how do you build even better data?[00:11:12] First party loops. And that's kind of the hardest thing for startups, right? Like even if we open up the, the models to everybody and everybody can just go on GitHub and. Or hugging face and get the waste to the best model, but get enough people to generate data for me so that I can still make it good. That's, that's what I would be worried about if I was a, a new company.[00:11:31] How do I make that happen[00:11:32] Simon Willison: really quickly?[00:11:34] Open Source Models are Comparable on Data[00:11:34] Simon Willison: I'm not convinced that the data is that big a challenge. So there's this PO project. So the problem with Facebook LAMA is that it's not available for, for commercial use. So people are now trying to train a alternative to LAMA that's entirely on openly licensed data.[00:11:48] And that the biggest project around that is this red pajama project, which They released their training data a few weeks ago and it was 2.7 terabytes. Right? So actually tiny, right? You can buy a laptop that you can fit 2.7 terabytes on. Got it. But it was the same exact data that Facebook, the same thing that Facebook Lamb had been trained on.[00:12:06] Cuz for your base model. You're not really trying to teach it fact about the world. You're just trying to teach it how English and other languages work, how they fit together. And then the real magic is when you fine tune on top of that. That's what Alpaca did on top of Lama and so on. And the fine tuning sets, it looks like, like tens of thousands of examples to kick one of these role models into shape.[00:12:26] And tens of thousands of examples like Databricks spent a month and got the 2000 employees of their company to help kick in and it worked. You've got the open assistant project of crowdsourcing this stuff now as well. So it's achievable[00:12:40] swyx: sore throat. I agree. I think it's a fa fascinating point. Actually, so I've heard through the grapevine then red pajamas model.[00:12:47] Trained on the, the data that they release is gonna be releasing tomorrow. And it's, it's this very exciting time because the, the, there, there's a, there's a couple more models that are coming down the pike, which independently we produced. And so yeah, that we, everyone is challenging all these assumptions from, from first principles, which is fascinating.[00:13:04] Stackable LoRA[00:13:04] swyx: I, I did, I did wanted to, to like try to get a little bit more technical in terms of like the, the, the, the specific points race. Cuz this doc, this doc was just amazing. Can we talk about LoRA. I, I, I'll open up to Simon again if he's back.[00:13:16] Simon Willison: I'd rather someone else take on. LoRA, I've, I, I know as much as I've read in that paper, but not much more than that.[00:13:21] swyx: So I thought it was this kind of like an optimization technique. So LoRA stands for lower rank adaptation. But this is the first mention of LoRA as a form of stackable improvements. Where he I forget what, let, just, let me just kind of Google this. But obviously anyone's more knowledgeable please.[00:13:39] So come on in.[00:13:40] Alessio Fanelli: I, all of Lauren is through GTS Man, about 20 minutes on GT four, trying to figure out word. It was I study computer science, but this is not this is not my area of expertise. What I got from it is that basically instead of having to retrain the whole model you can just pick one of the ranks and you take.[00:13:58] One of like the, the weight matrix tests and like make two smaller matrixes from it and then just two to be retrained and training the whole model. So[00:14:08] swyx: it save a lot of Yeah. You freeze part of the thing and then you just train the smaller part like that. Exactly. That seems to be a area of a lot of fruitful research.[00:14:15] Yeah. I think Mini GT four recently did something similar as well. And then there's, there's, there's a, there's a Spark Model people out today that also did the same thing.[00:14:23] Simon Willison: So I've seen a lot of LoRA stable, the stable diffusion community has been using LoRA a lot. So they, in that case, they had a, I, the thing I've seen is people releasing LoRA's that are like you, you train a concept like a, a a particular person's face or something you release.[00:14:38] And the, the LoRA version of this end up being megabytes of data, like, which is, it's. You know, it's small enough that you can just trade those around and you can effectively load multiple of those into the model. But what I haven't realized is that you can use the same trick on, on language models. That was one of the big new things for me in reading the the leaks Google paper today.[00:14:56] Alessio Fanelli: Yeah, and I think the point to make around on the infrastructure, so what tragedy has told me is that when you're figuring out what rank you actually wanna do this fine tuning at you can have either go too low and like the model doesn't actually learn it. Or you can go too high and the model overfit those learnings.[00:15:14] So if you have a base model that everybody agrees on, then all the subsequent like LoRA work is done around the same rank, which gives you an advantage. And the point they made in the, that, since Lama has been the base for a lot of this LoRA work like they own. The, the mind share of the community.[00:15:32] So everything that they're building is compatible with their architecture. But if Google Opensources their own model the rank that they chose For LoRA on Lama might not work on the Google model. So all of the existing work is not portable. So[00:15:46] Simon Willison: the impression I got is that one of the challenges with LoRA is that you train all these LoRAs on top of your model, but then if you retrain that base model as LoRA's becoming invalid, right?[00:15:55] They're essentially, they're, they're, they're built for an exact model version. So this means that being the big company with all of the GPUs that can afford to retrain a model every three months. That's suddenly not nearly as valuable as it used to be because now maybe there's an open source model that's five years old at this point and has like multiple, multiple stacks of LoRA's trained all over the world on top of it, which can outperform your brand new model just because there's been so much more iteration on that base.[00:16:20] swyx: I, I think it's, I think it's fascinating. It's I think Jim Fan from Envidia was recently making this argument for transformers. Like even if we do come up with a better. Architecture, then transformers, they're the sheer hundreds and millions of dollars that have been invested on top of transformers.[00:16:34] Make it actually there is some switching costs and it's not exactly obvious that better architecture. Equals equals we should all switch immediately tomorrow. It's, it's, it's[00:16:44] Simon Willison: kinda like the, the difficulty of launching a new programming language today Yes. Is that pipeline and JavaScript have a million packages.[00:16:51] So no matter how good your new language is, if it can't tap into those existing package libraries, it's, it's not gonna be useful for, which is why Moji is so clever, because they did build on top of Pips. They get all of that existing infrastructure, all of that existing code working already.[00:17:05] swyx: I mean, what, what thought you, since you co-create JAO and all that do, do we wanna take a diversion into mojo?[00:17:10] No, no. I[00:17:11] Travis Fischer: would, I, I'd be happy to, to, to jump in, and get Simon's take on, on Mojo. 1, 1, 1 small, small point on LoRA is I, I, I just think. If you think about at a high level, what the, the major down downsides are of these, these large language models. It's the fact that they well they're, they're, they're difficult to, to train, right?[00:17:32] They, they tend to hallucinate and they are, have, have a static, like, like they were trained at a certain date, right? And with, with LoRA, I think it makes it a lot more amenable to Training new, new updates on top of that, that like base model on the fly where you can incorporate new, new data and in a way that is, is, is an interesting and potentially more optimal alternative than Doing the kind of in context generation cuz, cuz most of like who at perplexity AI or, or any of these, these approaches currently, it's like all based off of doing real-time searches and then injecting as much into the, the, the local context window as possible so that you, you try to ground your, your, your, your language model.[00:18:16] Both in terms of the, the information it has access to that, that, that helps to reduce hallucinations. It can't reduce it, but helps to reduce it and then also gives it access to up-to-date information that wasn't around for that, that massive like, like pre-training step. And I think LoRA in, in, in mine really makes it more, more amenable to having.[00:18:36] Having constantly shifting lightweight pre-training on top of it that scales better than than normal. Pre I'm sorry. Fine tune, fine tuning. Yeah, that, that was just kinda my one takeaway[00:18:45] Simon Willison: there. I mean, for me, I've never been, I want to run models on my own hard, I don't actually care about their factual content.[00:18:52] Like I don't need a model that's been, that's trained on the most upstate things. What I need is a model that can do the bing and bar trick, right? That can tell when it needs to run a search. And then go and run a search to get extra information and, and bring that context in. And similarly, I wanted to be able to operate tools where it can access my email or look at my notes or all of those kinds of things.[00:19:11] And I don't think you need a very powerful model for that. Like that's one of the things where I feel like, yeah, vicuna running on my, on my laptop is probably powerful enough to drive a sort of personal research assistant, which can look things up for me and it can summarize things for my notes and it can do all of that and I don't care.[00:19:26] But it doesn't know about the Ukraine war because the Ukraine war training cutoff, that doesn't matter. If it's got those additional capabilities, which are quite easy to build the reason everyone's going crazy building agents and tools right now is that it's a few lines of Python code, and a sort of couple of paragraphs to get it to.[00:19:44] The Need for Special Purpose Optimized Models[00:19:44] Simon Willison: Well, let's, let's,[00:19:45] Travis Fischer: let's maybe dig in on that a little bit. And this, this also is, is very related to mojo. Cuz I, I do think there are use cases and domains where having the, the hyper optimized, like a version of these models running on device is, is very relevant where you can't necessarily make API calls out on the fly.[00:20:03] and Aug do context, augmented generation. And I was, I was talking with, with a a researcher. At Lockheed Martin yesterday, literally about like, like the, the version of this that's running of, of language models running on, on fighter jets. Right? And you, you talk about like the, the, the amount of engineering, precision and optimization that has to go into, to those type of models.[00:20:25] And the fact that, that you spend so much money, like, like training a super distilled ver version where milliseconds matter it's a life or death situation there. You know, and you couldn't even, even remotely ha ha have a use case there where you could like call out and, and have, have API calls or something.[00:20:40] So I, I do think there's like keeping in mind the, the use cases where, where. There, there'll be use cases that I'm more excited about at, at the application level where, where, yeah, I want to to just have it be super flexible and be able to call out to APIs and have this agentic type type thing.[00:20:56] And then there's also industries and, and use cases where, where you really need everything baked into the model.[00:21:01] swyx: Yep. Agreed. My, my favorite piece take on this is I think DPC four as a reasoning engine, which I think came from the from Nathan at every two. Which I think, yeah, I see the hundred score over there.[00:21:12] Modular - Mojo from Chris Lattner[00:21:12] swyx: Simon, do you do you have a, a few seconds on[00:21:14] Simon Willison: mojo. Sure. So Mojo is a brand new program language you just announced a few days ago. It's not actually available yet. I think there's an online demo, but to zooming it becomes an open source language we can use. It's got really some very interesting characteristics.[00:21:29] It's a super set of Python, so anything written in Python, Python will just work, but it adds additional features on top that let you basically do very highly optimized code with written. In Python syntax, it compiles down the the main thing that's exciting about it is the pedigree that it comes from.[00:21:47] It's a team led by Chris Latner, built L L V M and Clang, and then he designed Swift at Apple. So he's got like three, three for three on, on extraordinarily impactful high performance computing products. And he put together this team and they've basically, they're trying to go after the problem of how do you build.[00:22:06] A language which you can do really high performance optimized work in, but where you don't have to do everything again from scratch. And that's where building on top of Python is so clever. So I wasn't like, if this thing came along, I, I didn't really pay attention to it until j Jeremy Howard, who built Fast ai put up a very detailed blog post about why he was excited about Mojo, which included a, there's a video demo in there, which everyone should watch because in that video he takes Matrix multiplication implemented in Python.[00:22:34] And then he uses the mojo extras to 2000 x. The performance of that matrix multiplication, like he adds a few static types functions sort of struck instead of the class. And he gets 2000 times the performance out of it, which is phenomenal. Like absolutely extraordinary. So yeah, that, that got me really excited.[00:22:52] Like the idea that we can still use Python and all of this stuff we've got in Python, but we can. Just very slightly tweak some things and get literally like thousands times upwards performance out of the things that matter. That's really exciting.[00:23:07] swyx: Yeah, I, I, I'm curious, like, how come this wasn't thought of before?[00:23:11] It's not like the, the, the concept of a language super set hasn't hasn't, has, has isn't, is completely new. But all, as far as I know, all the previous Python interpreter approaches, like the alternate runtime approaches are like they, they, they're more, they're more sort of, Fit conforming to standard Python, but never really tried this additional approach of augmenting the language.[00:23:33] The Promise of Language Supersets[00:23:33] swyx: I, I'm wondering if you have many insights there on, like, why, like why is this a, a, a breakthrough?[00:23:38] Simon Willison: Yeah, that's a really interesting question. So, Jeremy Howard's piece talks about this thing called M L I R, which I hadn't heard of before, but this was another Chris Latner project. You know, he built L L VM as a low level virtual machine.[00:23:53] That you could build compilers on top of. And then M L I R was this one that he initially kicked off at Google, and I think it's part of TensorFlow and things like that. But it was very much optimized for multiple cores and GPU access and all of that kind of thing. And so my reading of Jeremy Howard's article is that they've basically built Mojo on top of M L I R.[00:24:13] So they had a huge, huge like a starting point where they'd, they, they knew this technology better than anyone else. And because they had this very, very robust high performance basis that they could build things on. I think maybe they're just the first people to try and build a high, try and combine a high level language with M L A R, with some extra things.[00:24:34] So it feels like they're basically taking a whole bunch of ideas people have been sort of experimenting with over the last decade and bundled them all together with exactly the right team, the right level of expertise. And it looks like they've got the thing to work. But yeah, I mean, I've, I've, I'm. Very intrigued to see, especially once this is actually available and we can start using it.[00:24:52] It, Jeremy Howard is someone I respect very deeply and he's, he's hyping this thing like crazy, right? His headline, his, and he's not the kind of person who hypes things if they're not worth hyping. He said Mojo may be the biggest programming language advanced in decades. And from anyone else, I'd kind of ignore that headline.[00:25:09] But from him it really means something.[00:25:11] swyx: Yes, because he doesn't hype things up randomly. Yeah, and, and, and he's a noted skeptic of Julia which is, which is also another data science hot topic. But from the TypeScript and web, web development worlds there has been a dialect of TypeScript that was specifically optimized to compile, to web assembly which I thought was like promising and then, and, and eventually never really took off.[00:25:33] But I, I like this approach because I think more. Frameworks should, should essentially be languages and recognize that they're language superset and maybe working compilers that that work on them. And then that is the, by the way, that's the direction that React is going right now. So fun times[00:25:50] Simon Willison: type scripts An interesting comparison actually, cuz type script is effectively a superset of Java script, right?[00:25:54] swyx: It's, but there's no, it's purely[00:25:57] Simon Willison: types, right? Gotcha. Right. So, so I guess mojo is the soup set python, but the emphasis is absolutely on tapping into the performance stuff. Right.[00:26:05] swyx: Well, the just things people actually care about.[00:26:08] Travis Fischer: Yeah. The, the one thing I've found is, is very similar to the early days of type script.[00:26:12] There was the, the, the, the most important thing was that it's incrementally adoptable. You know, cuz people had a script code basis and, and they wanted to incrementally like add. The, the, the main value prop for TypeScript was reliability and the, the, the, the static typing. And with Mojo, Lucia being basically anyone who's a target a large enterprise user of, of Mojo or even researchers, like they're all going to be coming from a, a hardcore.[00:26:36] Background in, in Python and, and have large existing libraries. And the the question will be for what use cases will mojo be like a, a, a really good fit for that incremental adoption where you can still tap into your, your, your massive, like python exi existing infrastructure workflows, data tooling, et cetera.[00:26:55] And, and what does, what does that path to adoption look like?[00:26:59] swyx: Yeah, we, we, we don't know cuz it's a wait listed language which people were complaining about. They, they, the, the mojo creators were like saying something about they had to scale up their servers. And I'm like, what language requires essential server?[00:27:10] So it's a little bit suss, a little bit, like there's a, there's a cloud product already in place and they're waiting for it. But we'll see. We'll see. I mean, emojis should be promising in it. I, I actually want more. Programming language innovation this way. You know, I was complaining years ago that programming language innovation is all about stronger types, all fun, all about like more functional, more strong types everywhere.[00:27:29] And, and this is, the first one is actually much more practical which I, which I really enjoy. This is why I wrote about self provisioning run types.[00:27:36] Simon Willison: And[00:27:37] Alessio Fanelli: I mean, this is kind of related to the post, right? Like if you stop all of a sudden we're like, the models are all the same and we can improve them.[00:27:45] Like, where can we get the improvements? You know, it's like, Better run times, better languages, better tooling, better data collection. Yeah. So if I were a founder today, I wouldn't worry as much about the model, maybe, but I would say, okay, what can I build into my product and like, or what can I do at the engineering level that maybe it's not model optimization because everybody's working on it, but like you said, it's like, why haven't people thought of this before?[00:28:09] It's like, it's, it's definitely super hard, but I'm sure that if you're like Google or you're like open AI or you're like, Databricks, we got smart enough people that can think about these problems, so hopefully we see more of this.[00:28:21] swyx: You need, Alan? Okay. I promise to keep this relatively tight. I know Simon on a beautiful day.[00:28:27] It is a very nice day in California. I wanted to go through a few more points that you have pulled out Simon and, and just give you the opportunity to, to rant and riff and, and what have you. I, I, are there any other points from going back to the sort of Google OpenAI mode documents that, that you felt like we, we should dive in on?[00:28:44] Google AI Strategy[00:28:44] Simon Willison: I mean, the really interesting stuff there is the strategy component, right? The this idea that that Facebook accidentally stumbled into leading this because they put out this model that everyone else is innovating on top of. And there's a very open question for me as to would Facebook relic Lama to allow for commercial usage?[00:29:03] swyx: Is there some rumor? Is that, is that today?[00:29:06] Simon Willison: Is there a rumor about that?[00:29:07] swyx: That would be interesting? Yeah, I saw, I saw something about Zuck saying that he would release the, the Lama weights officially.[00:29:13] Simon Willison: Oh my goodness. No, that I missed. That is, that's huge.[00:29:17] swyx: Let me confirm the tweet. Let me find the tweet and then, yeah.[00:29:19] Okay.[00:29:20] Simon Willison: Because actually I met somebody from Facebook machine learning research a couple of weeks ago, and I, I pressed 'em on this and they said, basically they don't think it'll ever happen because if it happens, and then somebody does horrible fascist stuff with this model, all of the headlines will be Meg releases a monster into the world.[00:29:36] So, so hi. His, the, the, the, a couple of weeks ago, his feeling was that it's just too risky for them to, to allow it to be used like that. But a couple of weeks is, is, is a couple of months in AI world. So yeah, it wouldn't be, it feels to me like strategically Facebook should be jumping right on this because this puts them at the very.[00:29:54] The very lead of, of open source innovation around this stuff.[00:29:58] Zuck Releasing LLaMA[00:29:58] swyx: So I've pinned the tweet talking about Zuck and Zuck saying that meta will open up Lama. It's from the founder of Obsidian, which gives it a slight bit more credibility, but it is the only. Tweet that I can find about it. So completely unsourced,[00:30:13] we shall see. I, I, I mean I have friends within meta, I should just go ask them. But yeah, I, I mean one interesting angle on, on the memo actually is is that and, and they were linking to this in, in, in a doc, which is apparently like. Facebook got a bunch of people to do because they, they never released it for commercial use, but a lot of people went ahead anyway and, and optimized and, and built extensions and stuff.[00:30:34] They, they got a bunch of free work out of opensource, which is an interesting strategy.[00:30:39] There's okay. I don't know if I.[00:30:42] Google Origin Confirmed[00:30:42] Simon Willison: I've got exciting piece of news. I've just heard from somebody with contacts at Google that they've heard people in Google confirm the leak. That that document wasn't even legit Google document, which I don't find surprising at all, but I'm now up to 10, outta 10 on, on whether that's, that's, that's real.[00:30:57] Google's existential threat[00:30:57] swyx: Excellent. Excellent. Yeah, it is fascinating. Yeah, I mean the, the strategy is, is, is really interesting. I think Google has been. Definitely sleeping on monetizing. You know, I, I, I heard someone call when Google Brain and Devrel I merged that they would, it was like goodbye to the Xerox Park of our era and it definitely feels like Google X and Google Brain would definitely Xerox parks of our, of our era, and I guess we all benefit from that.[00:31:21] Simon Willison: So, one thing I'll say about the, the Google side of things, like the there was a question earlier, why are Google so worried about this stuff? And I think it's, it's just all about the money. You know, the, the, the engine of money at Google is Google searching Google search ads, and who uses Chachi PT on a daily basis, like me, will have noticed that their usage of Google has dropped like a stone.[00:31:41] Because there are many, many questions that, that chat, e p t, which shows you no ads at all. Is, is, is a better source of information for than Google now. And so, yeah, I'm not, it doesn't surprise me that Google would see this as an existential threat because whether or not they can be Bard, it's actually, it's not great, but it, it exists, but it hasn't it yet either.[00:32:00] And if I've got a Chatbook chatbot that's not showing me ads and chatbot that is showing me ads, I'm gonna pick the one that's not showing[00:32:06] swyx: me ads. Yeah. Yeah. I, I agree. I did see a prototype of Bing with ads. Bing chat with ads. I haven't[00:32:13] Simon Willison: seen the prototype yet. No.[00:32:15] swyx: Yeah, yeah. Anyway, I I, it, it will come obviously, and then we will choose, we'll, we'll go out of our ways to avoid ads just like we always do.[00:32:22] We'll need ad blockers and chat.[00:32:23] Excellent.[00:32:24] Non-Fiction AI Safety ("y-risk")[00:32:24] Simon Willison: So I feel like on the safety side, the, the safety side, there are basically two areas of safety that I, I, I sort of split it into. There's the science fiction scenarios, the AI breaking out and killing all humans and creating viruses and all of that kind of thing. The sort of the terminated stuff. And then there's the the.[00:32:40] People doing bad things with ai and that's latter one is the one that I think is much more interesting and that cuz you could u like things like romance scams, right? Romance scams already take billions of dollars from, from vulner people every year. Those are very easy to automate using existing tools.[00:32:56] I'm pretty sure for QNA 13 b running on my laptop could spin up a pretty decent romance scam if I was evil and wanted to use it for them. So that's the kind of thing where, I get really nervous about it, like the fact that these models are out there and bad people can use these bad, do bad things.[00:33:13] Most importantly at scale, like romance scamming, you don't need a language model to pull off one romance scam, but if you wanna pull off a thousand at once, the language model might be the, the thing that that helps you scale to that point. And yeah, in terms of the science fiction stuff and also like a model on my laptop that can.[00:33:28] Guess what comes next in a sentence. I'm not worried that that's going to break out of my laptop and destroy the world. There. There's, I'm get slightly nervous about the huge number of people who are trying to build agis on top of this models, the baby AGI stuff and so forth, but I don't think they're gonna get anywhere.[00:33:43] I feel like if you actually wanted a model that was, was a threat to human, a language model would be a tiny corner of what that thing. Was actually built on top of, you'd need goal setting and all sorts of other bits and pieces. So yeah, for the moment, the science fiction stuff doesn't really interest me, although it is a little bit alarming seeing more and more of the very senior figures in this industry sort of tip the hat, say we're getting a little bit nervous about this stuff now.[00:34:08] Yeah.[00:34:09] swyx: So that would be Jeff Iton and and I, I saw this me this morning that Jan Lacoon was like happily saying, this is fine. Being the third cheer award winner.[00:34:20] Simon Willison: But you'll see a lot of the AI safe, the people who've been talking about AI safety for the longest are getting really angry about science fiction scenarios cuz they're like, no, the, the thing that we need to be talking about is the harm that you can cause with these models right now today, which is actually happening and the science fiction stuff kind of ends up distracting from that.[00:34:36] swyx: I love it. You, you. Okay. So, so Uher, I don't know how to pronounce his name. Elier has a list of ways that AI will kill us post, and I think, Simon, you could write a list of ways that AI will harm us, but not kill us, right? Like the, the, the non-science fiction actual harm ways, I think, right? I haven't seen a, a actual list of like, hey, romance scams spam.[00:34:57] I, I don't, I don't know what else, but. That could be very interesting as a Hmm. Okay. Practical. Practical like, here are the situations we need to guard against because they are more real today than that we need to. Think about Warren, about obviously you've been a big advocate of prompt injection awareness even though you can't really solve them, and I, I worked through a scenario with you, but Yeah,[00:35:17] Prompt Injection[00:35:17] Simon Willison: yeah.[00:35:17] Prompt injection is a whole other side of this, which is, I mean, that if you want a risk from ai, the risk right now is everyone who's building puts a building systems that attackers can trivially subvert into stealing all of their private data, unlocking their house, all of that kind of thing. So that's another very real risk that we have today.[00:35:35] swyx: I think in all our personal bios we should edit in prompt injections already, like in on my website, I wanna edit in a personal prompt injections so that if I get scraped, like I all know if someone's like reading from a script, right? That that is generated by any iBot. I've[00:35:49] Simon Willison: seen people do that on LinkedIn already and they get, they get recruiter emails saying, Hey, I didn't read your bio properly and I'm just an AI script, but would you like a job?[00:35:57] Yeah. It's fascinating.[00:36:00] Google vs OpenAI[00:36:00] swyx: Okay. Alright, so topic. I, I, I think, I think this this, this mote is is a peak under the curtain of the, the internal panic within Google. I think it is very val, very validated. I'm not so sure they should care so much about small models or, or like on device models.[00:36:17] But the other stuff is interesting. There is a comment at the end that you had by about as for opening open is themselves, open air, doesn't matter. So this is a Google document talking about Google's position in the market and what Google should be doing. But they had a comment here about open eye.[00:36:31] They also say open eye had no mode, which is a interesting and brave comment given that open eye is the leader in, in a lot of these[00:36:38] Simon Willison: innovations. Well, one thing I will say is that I think we might have identified who within Google wrote this document. Now there's a version of it floating around with a name.[00:36:48] And I look them up on LinkedIn. They're heavily involved in the AI corner of Google. So my guess is that at Google done this one, I've worked for companies. I'll put out a memo, I'll write up a Google doc and I'll email, email it around, and it's nowhere near the official position of the company or of the executive team.[00:37:04] It's somebody's opinion. And so I think it's more likely that this particular document is somebody who works for Google and has an opinion and distributed it internally and then it, and then it got leaked. I dunno if it's necessarily. Represents Google's sort of institutional thinking about this? I think it probably should.[00:37:19] Again, this is such a well-written document. It's so well argued that if I was an executive at Google and I read that, I would, I would be thinking pretty hard about it. But yeah, I don't think we should see it as, as sort of the official secret internal position of the company. Yeah. First[00:37:34] swyx: of all, I might promote that person.[00:37:35] Cuz he's clearly more,[00:37:36] Simon Willison: oh, definitely. He's, he's, he's really, this is a, it's, I, I would hire this person about the strength of that document.[00:37:42] swyx: But second of all, this is more about open eye. Like I'm not interested in Google's official statements about open, but I was interested like his assertion, open eye.[00:37:50] Doesn't have a mote. That's a bold statement. I don't know. It's got the best people.[00:37:55] Travis Fischer: Well, I, I would, I would say two things here. One, it's really interesting just at a meta, meta point that, that they even approached it this way of having this public leak. It, it, it kind of, Talks a little bit to the fact that they, they, they felt that that doing do internally, like wasn't going to get anywhere or, or maybe this speaks to, to some of the like, middle management type stuff or, or within Google.[00:38:18] And then to the, the, the, the point about like opening and not having a moat. I think for, for large language models, it, it, it will be over, over time kind of a race to the bottom just because the switching costs are, are, are so low compared with traditional cloud and sas. And yeah, there will be differences in, in, in quality, but, but like over time, if you, you look at the limit of these things like the, I I think Sam Altman has been quoted a few times saying that the, the, the price of marginal price of intelligence will go to zero.[00:38:47] Time and the marginal price of energy powering that intelligence will, will also hit over time. And in that world, if you're, you're providing large language models, they become commoditized. Like, yeah. What, what is, what is your mode at that point? I don't know. I think they're e extremely well positioned as a team and as a company for leading this space.[00:39:03] I'm not that, that worried about that, but it is something from a strategic point of view to keep in mind about large language models becoming a commodity. So[00:39:11] Simon Willison: it's quite short, so I think it's worth just reading the, in fact, that entire section, it says epilogue. What about open ai? All of this talk of open source can feel unfair given open AI's current closed policy.[00:39:21] Why do we have to share if they won't? That's talking about Google sharing, but the fact of the matter is we are already sharing everything with them. In the form of the steady flow of poached senior researchers until we spent that tide. Secrecy is a moot point. I love that. That's so salty. And, and in the end, open eye doesn't matter.[00:39:38] They are making the same mistakes that we are in their posture relative to open source. And their ability to maintain an edge is necessarily in question. Open source alternatives. Canned will eventually eclipse them. Unless they change their stance in this respect, at least we can make the first move. So the argument this, this paper is making is that Google should go, go like meta and, and just lean right into open sourcing it and engaging with the wider open source community much more deeply, which OpenAI have very much signaled they are not willing to do.[00:40:06] But yeah, it's it's, it's read the whole thing. The whole thing is full of little snippets like that. It's just super fun. Yes,[00:40:12] swyx: yes. Read the whole thing. I, I, I also appreciate that the timeline, because it set a lot of really great context for people who are out of the loop. So Yeah.[00:40:20] Alessio Fanelli: Yeah. And the final conspiracy theory is that right before Sundar and Satya and Sam went to the White House this morning, so.[00:40:29] swyx: Yeah. Did it happen? I haven't caught up the White House statements.[00:40:34] Alessio Fanelli: No. That I, I just saw, I just saw the photos of them going into the, the White House. I've been, I haven't seen any post-meeting updates.[00:40:41] swyx: I think it's a big win for philanthropic to be at that table.[00:40:44] Alessio Fanelli: Oh yeah, for sure. And co here it's not there.[00:40:46] I was like, hmm. Interesting. Well, anyway,[00:40:50] swyx: yeah. They need, they need some help. Okay. Well, I, I promise to keep this relatively tight. Spaces do tend to have a, have a tendency of dragging on. But before we go, anything that you all want to plug, anything that you're working on currently maybe go around Simon are you still working on dataset?[00:41:04] Personal plugs: Simon and Travis[00:41:04] Simon Willison: I am, I am, I'm having a bit of a, so datasets my open source project that I've been working on. It's about helping people analyze and publish data. I'm having an existential crisis of it at the moment because I've got access to the chat g p T code, interpreter mode, and you can upload the sequel light database to that and it will do all of the things that I, on my roadmap for the next 12 months.[00:41:24] Oh my God. So that's frustrating. So I'm basically, I'm leaning data. My interest in data and AI are, are rapidly crossing over a lot harder about the AI features that I need to build on top of dataset. Make sure it stays relevant in a chat. G p t can do most of the stuff that it does already. But yeah the thing, I'll plug my blog simon willis.net.[00:41:43] I'm now updating it daily with stuff because AI move moved so quickly and I have a sub newsletter, which is effectively my blog, but in email form sent out a couple of times a week, which Please subscribe to that or RSS feed on my blog or, or whatever because I'm, I'm trying to keep track of all sorts of things and I'm publishing a lot at the moment.[00:42:02] swyx: Yes. You, you are, and we love you very much for it because you, you are a very good reporter and technical deep diver into things, into all the things. Thank you, Simon. Travis are you ready to announce the, I guess you've announced it some somewhat. Yeah. Yeah.[00:42:14] Travis Fischer: So I'm I, I just founded a company.[00:42:16] I'm working on a framework for building reliable agents that aren't toys and focused on more constrained use cases. And you know, I I, I look at kind of agi. And these, these audigy type type projects as like jumping all the way to str to, to self-driving. And, and we, we, we kind of wanna, wanna start with some more enter and really focus on, on reliable primitives to, to start that.[00:42:38] And that'll be an open source type script project. I'll be releasing the first version of that soon. And that's, that's it. Follow me you know, on here for, for this type of stuff, I, I, I, everything, AI[00:42:48] swyx: and, and spa, his chat PT bot,[00:42:50] Travis Fischer: while you still can. Oh yeah, the chat VT Twitter bot is about 125,000 followers now.[00:42:55] It's still running. I, I'm not sure if it's your credit. Yeah. Can you say how much you spent actually, No, no. Well, I think probably totally like, like a thousand bucks or something, but I, it's, it's sponsored by OpenAI, so I haven't, I haven't actually spent any real money.[00:43:08] swyx: What? That's[00:43:09] awesome.[00:43:10] Travis Fischer: Yeah. Yeah.[00:43:11] Well, once, once I changed, originally the logo was the Chachi VUI logo and it was the green one, and then they, they hit me up and asked me to change it. So it's now it's a purple logo. And they're, they're, they're cool with that. Yeah.[00:43:21] swyx: Yeah. Sending take down notices to people with G B T stuff apparently now.[00:43:26] So it's, yeah, it's a little bit of a gray area. I wanna write more on, on mos. I've been actually collecting and meaning to write a piece of mos and today I saw the memo, I was like, oh, okay. Like I guess today's the day we talk about mos. So thank you all. Thanks. Thanks, Simon. Thanks Travis for, for jumping on and thanks to all the audience for engaging on this with us.[00:43:42] We'll continue to engage on Twitter, but thanks to everyone. Cool. Thanks everyone. Bye. Alright, thanks everyone. Bye. Get full access to Latent Space at www.latent.space/subscribe
undefined
May 3, 2023 • 1h 10min

Training a SOTA Code LLM in 1 week and Quantifying the Vibes — with Reza Shabani of Replit

Latent Space is popping off! Welcome to the over 8500 latent space explorers who have joined us. Join us this month at various events in SF and NYC, or start your own!This post spent 22 hours at the top of Hacker News.As announced during their Developer Day celebrating their $100m fundraise following their Google partnership, Replit is now open sourcing its own state of the art code LLM: replit-code-v1-3b (model card, HF Space), which beats OpenAI’s Codex model on the industry standard HumanEval benchmark when finetuned on Replit data (despite being 77% smaller) and more importantly passes AmjadEval (we’ll explain!)We got an exclusive interview with Reza Shabani, Replit’s Head of AI, to tell the story of Replit’s journey into building a data platform, building GhostWriter, and now training their own LLM, for 22 million developers!8 minutes of this discussion go into a live demo discussing generated code samples - which is always awkward on audio. So we’ve again gone multimodal and put up a screen recording here where you can follow along on the code samples!Recorded in-person at the beautiful StudioPod studios in San Francisco.Full transcript is below the fold. We would really appreciate if you shared our pod with friends on Twitter, LinkedIn, Mastodon, Bluesky, or your social media poison of choice!Timestamps* [00:00:21] Introducing Reza* [00:01:49] Quantitative Finance and Data Engineering* [00:11:23] From Data to AI at Replit* [00:17:26] Replit GhostWriter* [00:20:31] Benchmarking Code LLMs* [00:23:06] AmjadEval live demo* [00:31:21] Aligning Models on Vibes* [00:33:04] Beyond Chat & Code Completion* [00:35:50] Ghostwriter Autonomous Agent* [00:38:47] Releasing Replit-code-v1-3b* [00:43:38] The YOLO training run* [00:49:49] Scaling Laws: from Kaplan to Chinchilla to LLaMA* [00:52:43] MosaicML* [00:55:36] Replit's Plans for the Future (and Hiring!)* [00:59:05] Lightning RoundShow Notes* Reza Shabani on Twitter and LinkedIn* also Michele Catasta and Madhav Singhal* Michele Catasta’s thread on the release of replit-code-v1-3b* Intro to Replit Ghostwriter* Replit Ghostwriter Chat and Building Ghostwriter Chat* Reza on how to train your own LLMs (their top blog of all time)* Our Benchmarks 101 episode where we discussed HumanEval* AmjadEval live demo* Nat.dev* MosaicML CEO Naveen Rao on Replit’s LLM* MosaicML Composer + FSDP code* Replit’s AI team is hiring in North America timezone - Fullstack engineer, Applied AI/ML, and other roles!Transcript[00:00:00] Alessio Fanelli: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners. I'm joined by my co-host, swyx, writer and editor of Latent Space.[00:00:21] Introducing Reza[00:00:21] swyx: Hey and today we have Reza Shabani, Head of AI at Replit. Welcome to the studio. Thank you. Thank you for having me. So we try to introduce people's bios so you don't have to repeat yourself, but then also get a personal side of you.[00:00:34] You got your PhD in econ from Berkeley, and then you were a startup founder for a bit, and, and then you went into systematic equity trading at BlackRock in Wellington. And then something happened and you were now head of AI at Relet. What should people know about you that might not be apparent on LinkedIn?[00:00:50] One thing[00:00:51] Reza Shabani: that comes up pretty often is whether I know how to code. Yeah, you'd be shocked. A lot of people are kind of like, do you know how to code? When I was talking to Amjad about this role, I'd originally talked to him, I think about a product role and, and didn't get it. Then he was like, well, I know you've done a bunch of data and analytics stuff.[00:01:07] We need someone to work on that. And I was like, sure, I'll, I'll do it. And he was like, okay, but you might have to know how to code. And I was like, yeah, yeah, I, I know how to code. So I think that just kind of surprises people coming from like Ancon background. Yeah. Of people are always kind of like, wait, even when people join Relet, they're like, wait, does this guy actually know how to code?[00:01:28] Is he actually technical? Yeah.[00:01:30] swyx: You did a bunch of number crunching at top financial companies and it still wasn't[00:01:34] Reza Shabani: obvious. Yeah. Yeah. I mean, I, I think someone like in a software engineering background, cuz you think of finance and you think of like calling people to get the deal done and that type of thing.[00:01:43] No, it's, it's not that as, as you know, it's very very quantitative. Especially what I did in, in finance, very quantitative.[00:01:49] Quantitative Finance and Data Engineering[00:01:49] swyx: Yeah, so we can cover a little bit of that and then go into the rapid journey. So as, as you, as you know, I was also a quantitative trader on the sell side and the buy side. And yeah, I actually learned Python there.[00:02:01] I learned my, I wrote my own data pipelines there before airflow was a thing, and it was just me writing running notebooks and not version controlling them. And it was a complete mess, but we were managing a billion dollars on, on my crappy code. Yeah, yeah. What was it like for you?[00:02:17] Reza Shabani: I guess somewhat similar.[00:02:18] I, I started the journey during grad school, so during my PhD and my PhD was in economics and it was always on the more data intensive kind of applied economic side. And, and specifically financial economics. And so what I did for my dissertation I recorded cnbc, the Financial News Network for 10 hours a day, every day.[00:02:39] Extracted the close captions from the video files and then used that to create a second by second transcript of, of cmbc, merged that on with high frequency trading, quote data and then looked at, you know, went in and did some, some nlp, tagging the company names, and and then looked at the price response or the change in price and trading volume in the seconds after a company was mentioned.[00:03:01] And, and this was back in. 2009 that I was doing this. So before cloud, before, before a lot of Python actually. And, and definitely before any of these packages were available to make this stuff easy. And that's where, where I had to really learn to code, like outside of you know, any kind of like data programming languages.[00:03:21] That's when I had to learn Python and had to learn all, all of these other skills to work it with data at that, at that scale. So then, you know, I thought I wanted to do academia. I did terrible on the academic market because everyone looked at my dissertation. They're like, this is cool, but this isn't economics.[00:03:37] And everyone in the computer science department was actually way more interested in it. Like I, I hung out there more than in the econ department and You know, didn't get a single academic offer. Had two offer. I think I only applied to like two industry jobs and got offers from both of them.[00:03:53] They, they saw value in it. One of them was BlackRock and turned it down to, to do my own startup, and then went crawling back two and a half years later after the startup failed.[00:04:02] swyx: Something on your LinkedIn was like you're trading Chinese news tickers or something. Oh, yeah. I forget,[00:04:07] Reza Shabani: forget what that was.[00:04:08] Yeah, I mean oh. There, there was so much stuff. Honestly, like, so systematic active equity at, at BlackRock is, was such an amazing. Group and you just end up learning so much and the, and the possibilities there. Like when you, when you go in and you learn the types of things that they've been trading on for years you know, like a paper will come out in academia and they're like, did you know you can use like this data on searches to predict the price of cars?[00:04:33] And it's like, you go in and they've been trading on that for like eight years. Yeah. So they're, they're really ahead of the curve on, on all of that stuff. And the really interesting stuff that I, that I found when I went in was all like, related to NLP and ml a lot of like transcript data, a lot of like parsing through the types of things that companies talk about, whether an analyst reports, conference calls, earnings reports and the devil's really in the details about like how you make sense of, of that information in a way that, you know, gives you insight into what the company's doing and, and where the market is, is going.[00:05:08] I don't know if we can like nerd out on specific strategies. Yes. Let's go, let's go. What, so one of my favorite strategies that, because it never, I don't think we ended up trading on it, so I can probably talk about it. And it, it just kind of shows like the kind of work that you do around this data.[00:05:23] It was called emerging technologies. And so the whole idea is that there's always a new set of emerging technologies coming onto the market and the companies that are ahead of that curve and stay up to date on on the latest trends are gonna outperform their, their competitors.[00:05:38] And that's gonna reflect in the, in the stock price. So when you have a theory like that, how do you actually turn that into a trading strategy? So what we ended up doing is, well first you have to, to determine what are the emergent technologies, like what are the new up and coming technologies.[00:05:56] And so we actually went and pulled data on startups. And so there's like startups in Silicon Valley. You have all these descriptions of what they do, and you get that, that corpus of like when startups were getting funding. And then you can run non-negative matrix factorization on it and create these clusters of like what the various Emerging technologies are, and you have this all the way going back and you have like social media back in like 2008 when Facebook was, was blowing up.[00:06:21] And and you have things like mobile and digital advertising and and a lot of things actually outside of Silicon Valley. They, you know, like shale and oil cracking. Yeah. Like new technologies in, in all these different types of industries. And then and then you go and you look like, which publicly traded companies are actually talking about these things and and have exposure to these things.[00:06:42] And those are the companies that end up staying ahead of, of their competitors. And a lot of the the cases that came out of that made a ton of sense. Like when mobile was emerging, you had Walmart Labs. Walmart was really far ahead in terms of thinking about mobile and the impact of mobile.[00:06:59] And, and their, you know, Sears wasn't, and Walmart did well, and, and Sears didn't. So lots of different examples of of that, of like a company that talks about a new emerging trend. I can only imagine, like right now, all of the stuff with, with ai, there must be tons of companies talking about, yeah, how does this affect their[00:07:17] swyx: business?[00:07:18] And at some point you do, you do lose the signal. Because you get overwhelmed with noise by people slapping a on everything. Right? Which is, yeah. Yeah. That's what the Long Island Iced Tea Company slaps like blockchain on their name and, you know, their stock price like doubled or something.[00:07:32] Reza Shabani: Yeah, no, that, that's absolutely right.[00:07:35] And, and right now that's definitely the kind of strategy that would not be performing well right now because everyone would be talking about ai. And, and that's, as you know, like that's a lot of what you do in Quant is you, you try to weed out other possible explanations for for why this trend might be happening.[00:07:52] And in that particular case, I think we found that, like the companies, it wasn't, it wasn't like Sears and Walmart were both talking about mobile. It's that Walmart went out of their way to talk about mobile as like a future, mm-hmm. Trend. Whereas Sears just wouldn't bring it up. And then by the time an invest investors are asking you about it, you're probably late to the game.[00:08:12] So it was really identifying those companies that were. At the cutting edge of, of new technologies and, and staying ahead. I remember like Domino's was another big one. Like, I don't know, you[00:08:21] swyx: remember that? So for those who don't know, Domino's Pizza, I think for the run of most of the 2010s was a better performing stock than Amazon.[00:08:29] Yeah.[00:08:31] Reza Shabani: It's insane.[00:08:32] swyx: Yeah. Because of their investment in mobile. Mm-hmm. And, and just online commerce and, and all that. I it must have been fun picking that up. Yeah, that's[00:08:40] Reza Shabani: that's interesting. And I, and I think they had, I don't know if you, if you remember, they had like the pizza tracker, which was on, on mobile.[00:08:46] I use it[00:08:46] swyx: myself. It's a great, it's great app. Great app. I it's mostly faked. I think that[00:08:50] Reza Shabani: that's what I heard. I think it's gonna be like a, a huge I don't know. I'm waiting for like the New York Times article to drop that shows that the whole thing was fake. We all thought our pizzas were at those stages, but they weren't.[00:09:01] swyx: The, the challenge for me, so that so there's a, there's a great piece by Eric Falkenstein called Batesian Mimicry, where every signal essentially gets overwhelmed by noise because the people who wants, who create noise want to follow the, the signal makers. So that actually is why I left quant trading because there's just too much regime changing and like things that would access very well would test poorly out a sample.[00:09:25] And I'm sure you've like, had a little bit of that. And then there's what was the core uncertainty of like, okay, I have identified a factor that performs really well, but that's one factor out of. 500 other factors that could be going on. You have no idea. So anyway, that, that was my existential uncertainty plus the fact that it was a very highly stressful job.[00:09:43] Reza Shabani: Yeah. This is a bit of a tangent, but I, I think about this all the time and I used to have a, a great answer before chat came out, but do you think that AI will win at Quant ever?[00:09:54] swyx: I mean, what is Rentech doing? Whatever they're doing is working apparently. Yeah. But for, for most mortals, I. Like just waving your wand and saying AI doesn't make sense when your sample size is actually fairly low.[00:10:08] Yeah. Like we have maybe 40 years of financial history, if you're lucky. Mm-hmm. Times what, 4,000 listed equities. It's actually not a lot. Yeah, no, it's,[00:10:17] Reza Shabani: it's not a lot at all. And, and constantly changing market conditions and made laden variables and, and all of, all of that as well. Yeah. And then[00:10:24] swyx: retroactively you're like, oh, okay.[00:10:26] Someone will discover a giant factor that, that like explains retroactively everything that you've been doing that you thought was alpha, that you're like, Nope, actually you're just exposed to another factor that you're just, you just didn't think about everything was momentum in.[00:10:37] Yeah. And one piece that I really liked was Andrew Lo. I think he had from mit, I think he had a paper on bid as Spreads. And I think if you, if you just. Taken, took into account liquidity of markets that would account for a lot of active trading strategies, alpha. And that was systematically declined as interest rates declined.[00:10:56] And I mean, it was, it was just like after I looked at that, I was like, okay, I'm never gonna get this right.[00:11:01] Reza Shabani: Yeah. It's a, it's a crazy field and I you know, I, I always thought of like the, the adversarial aspect of it as being the, the part that AI would always have a pretty difficult time tackling.[00:11:13] Yeah. Just because, you know, there's, there's someone on the other end trying to out, out game you and, and AI can, can fail in a lot of those situations. Yeah.[00:11:23] swyx: Cool.[00:11:23] From Data to AI at Replit[00:11:23] Alessio Fanelli: Awesome. And now you've been a rep almost two years. What do you do there? Like what does the, the team do? Like, how has that evolved since you joined?[00:11:32] Especially since large language models are now top of mind, but, you know, two years ago it wasn't quite as mainstream. So how, how has that evolved?[00:11:40] Reza Shabani: Yeah, I, so when I joined, I joined a year and a half ago. We actually had to build out a lot of, of data pipelines.[00:11:45] And so I started doing a lot of data work. And we didn't have you know, there, there were like databases for production systems and, and whatnot, but we just didn't have the the infrastructure to query data at scale and to process that, that data at scale and replica has tons of users tons of data, just tons of ripples.[00:12:04] And I can get into, into some of those numbers, but like, if you wanted to answer the question, for example of what is the most. Forked rep, rep on rep, you couldn't answer that back then because it, the query would just completely time out. And so a lot of the work originally just went into building data infrastructure, like modernizing the data infrastructure in a way where you can answer questions like that, where you can you know, pull in data from any particular rep to process to make available for search.[00:12:34] And, and moving all of that data into a format where you can do all of this in minutes as opposed to, you know, days or weeks or months. That laid a lot of the groundwork for building anything in, in ai, at least in terms of training our own own models and then fine tuning them with, with replica data.[00:12:50] So then you know, we, we started a team last year recruited people from, you know from a team of, of zero or a team of one to, to the AI and data team today. We, we build. Everything related to, to ghostrider. So that means the various features like explain code, generate code, transform Code, and Ghostrider chat which is like a in context ide or a chat product within the, in the ide.[00:13:18] And then the code completion models, which are ghostwriter code complete, which was the, the very first version of, of ghostrider. Yeah. And we also support, you know, things like search and, and anything in terms of what creates, or anything that requires like large data scale or large scale processing of, of data for the site.[00:13:38] And, and various types of like ML algorithms for the site, for internal use of the site to do things like detect and stop abuse. Mm-hmm.[00:13:47] Alessio Fanelli: Yep. Sounds like a lot of the early stuff you worked on was more analytical, kind of like analyzing data, getting answers on these things. Obviously this has evolved now into some.[00:13:57] Production use case code lms, how is the team? And maybe like some of the skills changed. I know there's a lot of people wondering, oh, I was like a modern data stack expert, or whatever. It's like I was doing feature development, like, how's my job gonna change? Like,[00:14:12] Reza Shabani: yeah. It's a good question. I mean, I think that with with language models, the shift has kind of been from, or from traditional ml, a lot of the shift has gone towards more like nlp backed ml, I guess.[00:14:26] And so, you know, there, there's an entire skill set of applicants that I no longer see, at least for, for this role which are like people who know how to do time series and, and ML across time. Right. And, and you, yeah. Like you, you know, that exact feeling of how difficult it is to. You know, you have like some, some text or some variable and then all of a sudden you wanna track that over time.[00:14:50] The number of dimensions that it, that it introduces is just wild and it's a totally different skill set than what we do in a, for example, in in language models. And it's very it's a, it's a skill that is kind of you know, at, at least at rep not used much. And I'm sure in other places used a lot, but a lot of the, the kind of excitement about language models has pulled away attention from some of these other ML areas, which are extremely important and, and I think still going to be valuable.[00:15:21] So I would just recommend like anyone who is a, a data stack expert, like of course it's cool to work with NLP and text data and whatnot, but I do think at some point it's going to you know, having, having skills outside of that area and in more traditional aspects of ML will, will certainly be valuable as well.[00:15:39] swyx: Yeah. I, I'd like to spend a little bit of time on this data stack notion pitch. You were even, you were effectively the first data hire at rep. And I just spent the past year myself diving into data ecosystem. I think a lot of software engineers are actually. Completely unaware that basically every company now eventually evolves.[00:15:57] The data team and the data team does everything that you just mentioned. Yeah. All of us do exactly the same things, set up the same pipelines you know, shop at the same warehouses essentially. Yeah, yeah, yeah, yeah. So that they enable everyone else to query whatever they, whatever they want. And to, to find those insights that that can drive their business.[00:16:15] Because everyone wants to be data driven. They don't want to do the janitorial work that it comes, that comes to, yeah. Yeah. Hooking everything up. What like, so rep is that you think like 90 ish people now, and then you, you joined two years ago. Was it like 30 ish people? Yeah, exactly. We're 30 people where I joined.[00:16:30] So and I just wanna establish your founders. That is exactly when we hired our first data hire at Vilify as well. I think this is just a very common pattern that most founders should be aware of, that like, You start to build a data discipline at this point. And it's, and by the way, a lot of ex finance people very good at this because that's what we do at our finance job.[00:16:48] Reza Shabani: Yeah. Yeah. I was, I was actually gonna Good say that is that in, in some ways, you're kind of like the perfect first data hire because it, you know, you know how to build things in a reliable but fast way and, and how to build them in a way that, you know, it's, it scales over time and evolves over time because financial markets move so quickly that if you were to take all of your time building up these massive systems, like the trading opportunities gone.[00:17:14] So, yeah. Yeah, they're very good at it. Cool. Okay. Well,[00:17:18] swyx: I wanted to cover Ghost Writer as a standalone thing first. Okay. Yeah. And then go into code, you know, V1 or whatever you're calling it. Yeah. Okay. Okay. That sounds good. So order it[00:17:26] Replit GhostWriter[00:17:26] Reza Shabani: however you like. Sure. So the original version of, of Ghost Writer we shipped in August of, of last year.[00:17:33] Yeah. And so this was a. This was a code completion model similar to GitHub's co-pilot. And so, you know, you would have some text and then it would predict like, what, what comes next. And this was, the original version was actually based off of the cogen model. And so this was an open source model developed by Salesforce that was trained on, on tons of publicly available code data.[00:17:58] And so then we took their their model, one of the smaller ones, did some distillation some other kind of fancy tricks to, to make it much faster and and deployed that. And so the innovation there was really around how to reduce the model footprint in a, to, to a size where we could actually serve it to, to our users.[00:18:20] And so the original Ghost Rider You know, we leaned heavily on, on open source. And our, our friends at Salesforce obviously were huge in that, in, in developing these models. And, but, but it was game changing just because we were the first startup to actually put something like that into production.[00:18:38] And, and at the time, you know, if you wanted something like that, there was only one, one name and, and one place in town to, to get it. And and at the same time, I think I, I'm not sure if that's like when the image models were also becoming open sourced for the first time. And so the world went from this place where, you know, there was like literally one company that had all of these, these really advanced models to, oh wait, maybe these things will be everywhere.[00:19:04] And that's exactly what's happened in, in the last Year or so, as, as the models get more powerful and then you always kind of see like an open source version come out that someone else can, can build and put into production very quickly at, at, you know, a fraction of, of the cost. So yeah, that was the, the kind of code completion Go Strider was, was really just, just that we wanted to fine tune it a lot to kind of change the way that our users could interact with it.[00:19:31] So just to make it you know, more customizable for our use cases on, on Rep. And so people on Relet write a lot of, like jsx for example, which I don't think was in the original training set for, for cogen. And and they do specific things that are more Tuned to like html, like they might wanna run, right?[00:19:50] Like inline style or like inline CSS basically. Those types of things. And so we experimented with fine tuning cogen a bit here and there, and, and the results just kind of weren't, weren't there, they weren't where you know, we, we wanted the model to be. And, and then we just figured we should just build our own infrastructure to, you know, train these things from scratch.[00:20:11] Like, LMS aren't going anywhere. This world's not, you know, it's, it's not like we're not going back to that world of there's just one, one game in town. And and we had the skills infrastructure and the, and the team to do it. So we just started doing that. And you know, we'll be this week releasing our very first open source code model.[00:20:31] And,[00:20:31] Benchmarking Code LLMs[00:20:31] Alessio Fanelli: and when you say it was not where you wanted it to be, how were you benchmarking[00:20:36] Reza Shabani: it? In that particular case, we were actually, so, so we have really two sets of benchmarks that, that we use. One is human eval, so just the standard kind of benchmark for, for Python, where you can generate some code or you give you give the model a function definition with, with some string describing what it's supposed to do, and then you allow it to complete that function, and then you run a unit test against it and and see if what it generated passes the test.[00:21:02] So we, we always kind of, we would run this on the, on the model. The, the funny thing is the fine tuned versions of. Of Cogen actually did pretty well on, on that benchmark. But then when we, we then have something called instead of human eval. We call it Amjad eval, which is basically like, what does Amjad think?[00:21:22] Yeah, it's, it's exactly that. It's like testing the vibes of, of a model. And it's, it's cra like I've never seen him, I, I've never seen anyone test the model so thoroughly in such a short amount of time. He's, he's like, he knows exactly what to write and, and how to prompt the model to, to get you know, a very quick read on, on its quote unquote vibes.[00:21:43] And and we take that like really seriously. And I, I remember there was like one, one time where we trained a model that had really good you know, human eval scores. And the vibes were just terrible. Like, it just wouldn't, you know, it, it seemed overtrained. So so that's a lot of what we found is like we, we just couldn't get it to Pass the vibes test no matter how the, how[00:22:04] swyx: eval.[00:22:04] Well, can you formalize I'm jal because I, I actually have a problem. Slight discomfort with human eval. Effectively being the only code benchmark Yeah. That we have. Yeah. Isn't that[00:22:14] Reza Shabani: weird? It's bizarre. It's, it's, it's weird that we can't do better than that in some, some way. So, okay. If[00:22:21] swyx: I, if I asked you to formalize Mja, what does he look for that human eval doesn't do well on?[00:22:25] Reza Shabani: Ah, that is a, that's a great question. A lot of it is kind of a lot of it is contextual like deep within, within specific functions. Let me think about this.[00:22:38] swyx: Yeah, we, we can pause for. And if you need to pull up something.[00:22:41] Reza Shabani: Yeah, I, let me, let me pull up a few. This, this[00:22:43] swyx: is gold, this catnip for people.[00:22:45] Okay. Because we might actually influence a benchmark being evolved, right. So, yeah. Yeah. That would be,[00:22:50] Reza Shabani: that would be huge. This was, this was his original message when he said the vibes test with, with flying colors. And so you have some, some ghostrider comparisons ghost Rider on the left, and cogen is on the right.[00:23:06] AmjadEval live demo[00:23:06] Reza Shabani: So here's Ghostrider. Okay.[00:23:09] swyx: So basically, so if I, if I summarize it from a, for ghosting the, there's a, there's a, there's a bunch of comments talking about how you basically implement a clone. Process or to to c Clooney process. And it's describing a bunch of possible states that he might want to, to match.[00:23:25] And then it asks for a single line of code for defining what possible values of a name space it might be to initialize it in amjadi val With what model is this? Is this your, this is model. This is the one we're releasing. Yeah. Yeah. It actually defines constants which are human readable and nice.[00:23:42] And then in the other cogen Salesforce model, it just initializes it to zero because it reads that it starts of an int Yeah, exactly. So[00:23:51] Reza Shabani: interesting. Yeah. So you had a much better explanation of, of that than than I did. It's okay. So this is, yeah. Handle operation. This is on the left.[00:24:00] Okay.[00:24:00] swyx: So this is rep's version. Yeah. Where it's implementing a function and an in filling, is that what it's doing inside of a sum operation?[00:24:07] Reza Shabani: This, so this one doesn't actually do the infill, so that's the completion inside of the, of the sum operation. But it, it's not, it's, it, it's not taking into account context after this value, but[00:24:18] swyx: Right, right.[00:24:19] So it's writing an inline lambda function in Python. Okay.[00:24:21] Reza Shabani: Mm-hmm. Versus[00:24:24] swyx: this one is just passing in the nearest available variable. It's, it can find, yeah.[00:24:30] Reza Shabani: Okay. So so, okay. I'll, I'll get some really good ones in a, in a second. So, okay. Here's tokenize. So[00:24:37] swyx: this is an assertion on a value, and it's helping to basically complete the entire, I think it looks like an E s T that you're writing here.[00:24:46] Mm-hmm. That's good. That that's, that's good. And then what does Salesforce cogen do? This is Salesforce cogen here. So is that invalidism way or what, what are we supposed to do? It's just making up tokens. Oh, okay. Yeah, yeah, yeah. So it's just, it's just much better at context. Yeah. Okay.[00:25:04] Reza Shabani: And, and I guess to be fair, we have to show a case where co cogen does better.[00:25:09] Okay. All right. So here's, here's one on the left right, which[00:25:12] swyx: is another assertion where it's just saying that if you pass in a list, it's going to throw an exception saying in an expectedly list and Salesforce code, Jen says,[00:25:24] Reza Shabani: This is so, so ghost writer was sure that the first argument needs to be a list[00:25:30] swyx: here.[00:25:30] So it hallucinated that it wanted a list. Yeah. Even though you never said it was gonna be a list.[00:25:35] Reza Shabani: Yeah. And it's, it's a argument of that. Yeah. Mm-hmm. So, okay, here's a, here's a cooler quiz for you all, cuz I struggled with this one for a second. Okay. What is.[00:25:47] swyx: Okay, so this is a four loop example from Amjad.[00:25:50] And it's, it's sort of like a q and a context in a chat bot. And it's, and it asks, and Amjad is asking, what does this code log? And it just paste in some JavaScript code. The JavaScript code is a four loop with a set time out inside of it with a cons. The console logs out the iteration variable of the for loop and increasing numbers of of, of times.[00:26:10] So it's, it goes from zero to five and then it just increases the, the delay between the timeouts each, each time. Yeah.[00:26:15] Reza Shabani: So, okay. So this answer was provided by by Bard. Mm-hmm. And does it look correct to you? Well,[00:26:22] the[00:26:22] Alessio Fanelli: numbers too, but it's not one second. It's the time between them increases.[00:26:27] It's like the first one, then the one is one second apart, then it's two seconds, three seconds. So[00:26:32] Reza Shabani: it's not, well, well, so I, you know, when I saw this and, and the, the message and the thread was like, Our model's better than Bard at, at coding Uhhuh. This is the Bard answer Uhhuh that looks totally right to me.[00:26:46] Yeah. And this is our[00:26:47] swyx: answer. It logs 5 5 55, what is it? Log five 50. 55 oh oh. Because because it logs the state of I, which is five by the time that the log happens. Mm-hmm. Yeah.[00:27:01] Reza Shabani: Oh God. So like we, you know we were shocked. Like, and, and the Bard dancer looked totally right to, to me. Yeah. And then, and somehow our code completion model mind Jude, like this is not a conversational chat model.[00:27:14] Mm-hmm. Somehow gets this right. And and, you know, Bard obviously a much larger much more capable model with all this fancy transfer learning and, and and whatnot. Some somehow, you know, doesn't get it right. So, This is the kind of stuff that goes into, into mja eval that you, you won't find in any benchmark.[00:27:35] Good. And and, and it's, it's the kind of thing that, you know, makes something pass a, a vibe test at Rep.[00:27:42] swyx: Okay. Well, okay, so me, this is not a vibe, this is not so much a vibe test as the, these are just interview questions. Yeah, that's, we're straight up just asking interview questions[00:27:50] Reza Shabani: right now. Yeah, no, the, the vibe test, the reason why it's really difficult to kind of show screenshots that have a vibe test is because it really kind of depends on like how snappy the completion is, how what the latency feels like and if it gets, if it, if it feels like it's making you more productive.[00:28:08] And and a lot of the time, you know, like the, the mix of, of really low latency and actually helpful content and, and helpful completions is what makes up the, the vibe test. And I think part of it is also, is it. Is it returning to you or the, the lack of it returning to you things that may look right, but be completely wrong.[00:28:30] I think that also kind of affects Yeah. Yeah. The, the vibe test as well. Yeah. And so, yeah, th this is very much like a, like a interview question. Yeah.[00:28:39] swyx: The, the one with the number of processes that, that was definitely a vibe test. Like what kind of code style do you expect in this situation? Yeah.[00:28:47] Is this another example? Okay.[00:28:49] Reza Shabani: Yeah. This is another example with some more Okay. Explanations.[00:28:53] swyx: Should we look at the Bard one[00:28:54] Reza Shabani: first? Sure. These are, I think these are, yeah. This is original GT three with full size 175. Billion[00:29:03] swyx: parameters. Okay, so you asked GPC three, I'm a highly intelligent question answering bot.[00:29:07] If you ask me a question that is rooted in truth, I'll give you the answer. If you ask me a question that is nonsense I will respond with unknown. And then you ask it a question. What is the square root of a bananas banana? It answers nine. So complete hallucination and failed to follow the instruction that you gave it.[00:29:22] I wonder if it follows if one, if you use an instruction to inversion it might, yeah. Do what better?[00:29:28] Reza Shabani: On, on the original[00:29:29] swyx: GP T Yeah, because I like it. Just, you're, you're giving an instructions and it's not[00:29:33] Reza Shabani: instruction tuned. Now. Now the interesting thing though is our model here, which does follow the instructions this is not instruction tuned yet, and we still are planning to instruction tune.[00:29:43] Right? So it's like for like, yeah, yeah, exactly. So,[00:29:45] swyx: So this is a replica model. Same question. What is the square of bananas? Banana. And it answers unknown. And this being one of the, the thing that Amjad was talking about, which you guys are. Finding as a discovery, which is, it's better on pure natural language questions, even though you trained it on code.[00:30:02] Exactly. Yeah. Hmm. Is that because there's a lot of comments in,[00:30:07] Reza Shabani: No. I mean, I think part of it is that there's a lot of comments and there's also a lot of natural language in, in a lot of code right. In terms of documentation, you know, you have a lot of like markdowns and restructured text and there's also just a lot of web-based code on, on replica, and HTML tends to have a lot of natural language in it.[00:30:27] But I don't think the comments from code would help it reason in this way. And, you know, where you can answer questions like based on instructions, for example. Okay. But yeah, it's, I know that that's like one of the things. That really shocked us is the kind of the, the fact that like, it's really good at, at natural language reasoning, even though it was trained on, on code.[00:30:49] swyx: Was this the reason that you started running your model on hella swag and[00:30:53] Reza Shabani: all the other Yeah, exactly. Interesting. And the, yeah, it's, it's kind of funny. Like it's in some ways it kind of makes sense. I mean, a lot of like code involves a lot of reasoning and logic which language models need and need to develop and, and whatnot.[00:31:09] And so you know, we, we have this hunch that maybe that using that as part of the training beforehand and then training it on natural language above and beyond that really tends to help. Yeah,[00:31:21] Aligning Models on Vibes[00:31:21] Alessio Fanelli: this is so interesting. I, I'm trying to think, how do you align a model on vibes? You know, like Bard, Bard is not purposefully being bad, right?[00:31:30] Like, there's obviously something either in like the training data, like how you're running the process that like, makes it so that the vibes are better. It's like when it, when it fails this test, like how do you go back to the team and say, Hey, we need to get better[00:31:44] Reza Shabani: vibes. Yeah, let's do, yeah. Yeah. It's a, it's a great question.[00:31:49] It's a di it's very difficult to do. It's not you know, so much of what goes into these models in, in the same way that we have no idea how we can get that question right. The programming you know, quiz question. Right. Whereas Bard got it wrong. We, we also have no idea how to take certain things out and or, and to, you know, remove certain aspects of, of vibes.[00:32:13] Of course there's, there's things you can do to like scrub the model, but it's, it's very difficult to, to get it to be better at something. It's, it's almost like all you can do is, is give it the right type of, of data that you think will do well. And then and, and of course later do some fancy type of like, instruction tuning or, or whatever else.[00:32:33] But a lot of what we do is finding the right mix of optimal data that we want to, to feed into the model and then hoping that the, that the data that's fed in is sufficiently representative of, of the type of generations that we want to do coming out. That's really the best that, that you can do.[00:32:51] Either the model has. Vibes or, or it doesn't, you can't teach vibes. Like you can't sprinkle additional vibes in it. Yeah, yeah, yeah. Same in real life. Yeah, exactly right. Yeah, exactly. You[00:33:04] Beyond Code Completion[00:33:04] Alessio Fanelli: mentioned, you know, co being the only show in town when you started, now you have this, there's obviously a, a bunch of them, right.[00:33:10] Cody, which we had on the podcast used to be Tap nine, kite, all these different, all these different things. Like, do you think the vibes are gonna be the main you know, way to differentiate them? Like, how are you thinking about. What's gonna make Ghost Rider, like stand apart or like, do you just expect this to be like table stakes for any tool?[00:33:28] So like, it just gonna be there?[00:33:30] Reza Shabani: Yeah. I, I do think it's, it's going to be table stakes for sure. I, I think that if you don't if you don't have AI assisted technology, especially in, in coding it's, it's just going to feel pretty antiquated. But but I do think that Ghost Rider stands apart from some of, of these other tools for for specific reasons too.[00:33:51] So this is kind of the, one of, one of the things that these models haven't really done yet is Come outside of code completion and outside of, of just a, a single editor file, right? So what they're doing is they're, they're predicting like the text that can come next, but they're not helping with the development process quite, quite yet outside of just completing code in a, in a text file.[00:34:16] And so the types of things that we wanna do with Ghost Rider are enable it to, to help in the software development process not just editing particular files. And so so that means using a right mix of like the right model for for the task at hand. But but we want Ghost Rider to be able to, to create scaffolding for you for, for these projects.[00:34:38] And so imagine if you would like Terraform. But, but powered by Ghostrider, right? I want to, I put up this website, I'm starting to get a ton of traffic to it and and maybe like I need to, to create a backend database. And so we want that to come from ghostrider as well, so it can actually look at your traffic, look at your code, and create.[00:34:59] You know a, a schema for you that you can then deploy in, in Postgres or, or whatever else? You know, I, I know like doing anything in in cloud can be a nightmare as well. Like if you wanna create a new service account and you wanna deploy you know, nodes on and, and have that service account, kind of talk to those nodes and return some, some other information, like those are the types of things that currently we have to kind of go, go back, go look at some documentation for Google Cloud, go look at how our code base does it you know, ask around in Slack, kind of figure that out and, and create a pull request.[00:35:31] Those are the types of things that we think we can automate away with with more advanced uses of, of ghostwriter once we go past, like, here's what would come next in, in this file. So, so that's the real promise of it, is, is the ability to help you kind of generate software instead of just code in a, in a particular file.[00:35:50] Ghostwriter Autonomous Agent[00:35:50] Reza Shabani: Are[00:35:50] Alessio Fanelli: you giving REPL access to the model? Like not rep, like the actual rep. Like once the model generates some of this code, especially when it's in the background, it's not, the completion use case can actually run the code to see if it works. There's like a cool open source project called Walgreen that does something like that.[00:36:07] It's like self-healing software. Like it gives a REPL access and like keeps running until it fixes[00:36:11] Reza Shabani: itself. Yeah. So, so, so right now there, so there's Ghostrider chat and Ghostrider code completion. So Ghostrider Chat does have, have that advantage in, in that it can it, it knows all the different parts of, of the ide and so for example, like if an error is thrown, it can look at the, the trace back and suggest like a fix for you.[00:36:33] So it has that type of integration. But the what, what we really want to do is is. Is merge the two in a way where we want Ghost Rider to be like, like an autonomous agent that can actually drive the ide. So in these action models, you know, where you have like a sequence of of events and then you can use you know, transformers to kind of keep track of that sequence and predict the next next event.[00:36:56] It's how, you know, companies like, like adapt work these like browser models that can, you know, go and scroll through different websites or, or take some, some series of actions in a, in a sequence. Well, it turns out the IDE is actually a perfect place to do that, right? So like when we talk about creating software, not just completing code in a file what do you do when you, when you build software?[00:37:17] You, you might clone a repo and then you, you know, will go and change some things. You might add a new file go down, highlight some text, delete that value, and point it to some new database, depending on the value in a different config file or in your environment. And then you would go in and add additional block code to, to extend its functionality and then you might deploy that.[00:37:40] Well, we, we have all of that data right there in the replica ide. And and we have like terabytes and terabytes of, of OT data you know, operational transform data. And so, you know, we can we can see that like this person has created a, a file what they call it, and, you know, they start typing in the file.[00:37:58] They go back and edit a different file to match the you know, the class name that they just put in, in the original file. All of that, that kind of sequence data is what we're looking to to train our next model on. And so that, that entire kind of process of actually building software within the I D E, not just like, here's some text what comes next, but rather the, the actions that go into, you know, creating a fully developed program.[00:38:25] And a lot of that includes, for example, like running the code and seeing does this work, does this do what I expected? Does it error out? And then what does it do in response to that error? So all, all of that is like, Insanely valuable information that we want to put into our, our next model. And and that's like, we think that one can be way more advanced than the, than this, you know, go straighter code completion model.[00:38:47] Releasing Replit-code-v1-3b[00:38:47] swyx: Cool. Well we wanted to dive in a little bit more on, on the model that you're releasing. Maybe we can just give people a high level what is being released what have you decided to open source and maybe why open source the story of the YOLO project and Yeah. I mean, it's a cool story and just tell it from the start.[00:39:06] Yeah.[00:39:06] Reza Shabani: So, so what's being released is the, the first version that we're going to release. It's a, it's a code model called replica Code V1 three B. So this is a relatively small model. It's 2.7 billion parameters. And it's a, it's the first llama style model for code. So, meaning it's just seen tons and tons of tokens.[00:39:26] It's been trained on 525 billion tokens of, of code all permissively licensed code. And it's it's three epox over the training set. And And, you know, all of that in a, in a 2.7 billion parameter model. And in addition to that, we, for, for this project or, and for this model, we trained our very own vocabulary as well.[00:39:48] So this, this doesn't use the cogen vocab. For, for the tokenize we, we trained a totally new tokenize on the underlying data from, from scratch, and we'll be open sourcing that as well. It has something like 32,000. The vocabulary size is, is in the 32 thousands as opposed to the 50 thousands.[00:40:08] Much more specific for, for code. And, and so it's smaller faster, that helps with inference, it helps with training and it can produce more relevant content just because of the you know, the, the vocab is very much trained on, on code as opposed to, to natural language. So, yeah, we'll be releasing that.[00:40:29] This week it'll be up on, on hugging pace so people can take it play with it, you know, fine tune it, do all type of things with it. We want to, we're eager and excited to see what people do with the, the code completion model. It's, it's small, it's very fast. We think it has great vibes, but we, we hope like other people feel the same way.[00:40:49] And yeah. And then after, after that, we might consider releasing the replica tuned model at, at some point as well, but still doing some, some more work around that.[00:40:58] swyx: Right? So there are actually two models, A replica code V1 three B and replica fine tune V1 three B. And the fine tune one is the one that has the 50% improvement in in common sense benchmarks, which is going from 20% to 30%.[00:41:13] For,[00:41:13] Reza Shabani: for yes. Yeah, yeah, yeah, exactly. And so, so that one, the, the additional tuning that was done on that was on the publicly available data on, on rep. And so, so that's, that's you know, data that's in public res is Permissively licensed. So fine tuning on on that. Then, Leads to a surprisingly better, like significantly better model, which is this retuned V1 three B, same size, you know, same, very fast inference, same vocabulary and everything.[00:41:46] The only difference is that it's been trained on additional replica data. Yeah.[00:41:50] swyx: And I think I'll call out that I think in one of the follow up q and as that Amjad mentioned, people had some concerns with using replica data. Not, I mean, the licensing is fine, it's more about the data quality because there's a lot of beginner code Yeah.[00:42:03] And a lot of maybe wrong code. Mm-hmm. But it apparently just wasn't an issue at all. You did[00:42:08] Reza Shabani: some filtering. Yeah. I mean, well, so, so we did some filtering, but, but as you know, it's when you're, when you're talking about data at that scale, it's impossible to keep out, you know, all of the, it's, it's impossible to find only select pieces of data that you want the, the model to see.[00:42:24] And, and so a lot of the, a lot of that kind of, you know, people who are learning to code material was in there anyway. And, and you know, we obviously did some quality filtering, but a lot of it went into the fine tuning process and it really helped for some reason. You know, there's a lot of high quality code on, on replica, but there's like you, like you said, a lot of beginner code as well.[00:42:46] And that was, that was the really surprising thing is that That somehow really improved the model and its reasoning capabilities. It felt much more kind of instruction tuned afterward. And, and you know, we have our kind of suspicions as as to why there's, there's a lot of like assignments on rep that kind of explain this is how you do something and then you might have like answers and, and whatnot.[00:43:06] There's a lot of people who learn to code on, on rep, right? And, and like, think of a beginner coder, like think of a code model that's learning to, to code learning this reasoning and logic. It's probably a lot more valuable to see that type of, you know, the, the type of stuff that you find on rep as opposed to like a large legacy code base that that is, you know, difficult to, to parse and, and figure out.[00:43:29] So, so that was very surprising to see, you know, just such a huge jump in in reasoning ability once trained on, on replica data.[00:43:38] The YOLO training run[00:43:38] swyx: Yeah. Perfect. So we're gonna do a little bit of storytelling just leading up to the, the an the developer day that you had last week. Yeah. My understanding is you decide, you raised some money, you decided to have a developer day, you had a bunch of announcements queued up.[00:43:52] And then you were like, let's train the language model. Yeah. You published a blog post and then you announced it on Devrel Day. What, what, and, and you called it the yolo, right? So like, let's just take us through like the[00:44:01] Reza Shabani: sequence of events. So so we had been building the infrastructure to kind of to, to be able to train our own models for, for months now.[00:44:08] And so that involves like laying out the infrastructure, being able to pull in the, the data processes at scale. Being able to do things like train your own tokenizes. And and even before this you know, we had to build out a lot of this data infrastructure for, for powering things like search.[00:44:24] There's over, I think the public number is like 200 and and 30 million res on, on re. And each of these res have like many different files and, and lots of code, lots of content. And so you can imagine like what it must be like to, to be able to query that, that amount of, of data in a, in a reasonable amount of time.[00:44:45] So we've You know, we spent a lot of time just building the infrastructure that allows for for us to do something like that and, and really optimize that. And, and this was by the end of last year. That was the case. Like I think I did a demo where I showed you can, you can go through all of replica data and parse the function signature of every Python function in like under two minutes.[00:45:07] And, and there's, you know, many, many of them. And so a and, and then leading up to developer day, you know, we had, we'd kind of set up these pipelines. We'd started training these, these models, deploying them into production, kind of iterating and, and getting that model training to production loop.[00:45:24] But we'd only really done like 1.3 billion parameter models. It was like all JavaScript or all Python. So there were still some things like we couldn't figure out like the most optimal way to to, to do it. So things like how do you pad or yeah, how do you how do you prefix chunks when you have like multi-language models, what's like the optimal way to do it and, and so on.[00:45:46] So you know, there's two PhDs on, on the team. Myself and Mike and PhDs tend to be like careful about, you know, a systematic approach and, and whatnot. And so we had this whole like list of things we were gonna do, like, oh, we'll test it on this thing and, and so on. And even these, like 1.3 billion parameter models, they were only trained on maybe like 20 billion tokens or 30 billion tokens.[00:46:10] And and then Amjad joins the call and he's like, no, let's just, let's just yolo this. Like, let's just, you know, we're raising money. Like we should have a better code model. Like, let's yolo it. Let's like run it on all the data. How many tokens do we have? And, and, and we're like, you know, both Michael and I are like, I, I looked at 'em during the call and we were both like, oh God is like, are we really just gonna do this?[00:46:33] And[00:46:34] swyx: well, what is the what's the hangup? I mean, you know that large models work,[00:46:37] Reza Shabani: you know that they work, but you, you also don't know whether or not you can improve the process in, in In important ways by doing more data work, scrubbing additional content, and, and also it's expensive. It's like, it, it can, you know it can cost quite a bit and if you, and if you do it incorrectly, you can actually get it.[00:47:00] Or you, you know, it's[00:47:02] swyx: like you hit button, the button, the go button once and you sit, sit back for three days.[00:47:05] Reza Shabani: Exactly. Yeah. Right. Well, like more like two days. Yeah. Well, in, in our case, yeah, two days if you're running 256 GP 100. Yeah. Yeah. And and, and then when that comes back, you know, you have to take some time to kind of to test it.[00:47:19] And then if it fails and you can't really figure out why, and like, yeah, it's, it's just a, it's kind of like a, a. A time consuming process and you just don't know what's going to, to come out of it. But no, I mean, I'm Judd was like, no, let's just train it on all the data. How many tokens do we have? We tell him and he is like, that's not enough.[00:47:38] Where can we get more tokens? Okay. And so Michele had this you know, great idea to to train it on multiple epox and so[00:47:45] swyx: resampling the same data again.[00:47:47] Reza Shabani: Yeah. Which, which can be, which is known risky or like, or tends to overfit. Yeah, you can, you can over overfit. But you know, he, he pointed us to some evidence that actually maybe this isn't really a going to be a problem.[00:48:00] And, and he was very persuasive in, in doing that. And so it, it was risky and, and you know, we did that training. It turned out. Like to actually be great for that, for that base model. And so then we decided like, let's keep pushing. We have 256 TVs running. Let's see what else we can do with it.[00:48:20] So we ran a couple other implementations. We ran you know, a the fine tune version as I, as I said, and that's where it becomes really valuable to have had that entire pipeline built out because then we can pull all the right data, de-dupe it, like go through the, the entire like processing stack that we had done for like months.[00:48:41] We did that in, in a matter of like two days for, for the replica data as well removed, you know, any of, any personal any pii like personal information removed, harmful content, removed, any of, of that stuff. And we just put it back through the that same pipeline and then trained on top of that.[00:48:59] And so I believe that replica tune data has seen something like 680. Billion tokens. And, and that's in terms of code, I mean, that's like a, a universe of code. There really isn't that much more out there. And, and it, you know, gave us really, really promising results. And then we also did like a UL two run, which allows like fill the middle capabilities and and, and will be, you know working to deploy that on, on rep and test that out as well soon.[00:49:29] But it was really just one of those Those cases where, like, leading up to developer day, had we, had we done this in this more like careful, systematic way what, what would've occurred in probably like two, three months. I got us to do it in, in a week. That's fun. It was a lot of fun. Yeah.[00:49:49] Scaling Laws: from Kaplan to Chinchilla to LLaMA[00:49:49] Alessio Fanelli: And so every time I, I've seen the stable releases to every time none of these models fit, like the chinchilla loss in, in quotes, which is supposed to be, you know, 20 tokens per, per, what's this part of the yo run?[00:50:04] Or like, you're just like, let's just throw out the tokens at it doesn't matter. What's most efficient or like, do you think there's something about some of these scaling laws where like, yeah, maybe it's good in theory, but I'd rather not risk it and just throw out the tokens that I have at it? Yeah,[00:50:18] Reza Shabani: I think it's, it's hard to, it's hard to tell just because there's.[00:50:23] You know, like, like I said, like these runs are expensive and they haven't, if, if you think about how many, how often these runs have been done, like the number of models out there and then, and then thoroughly tested in some forum. And, and so I don't mean just like human eval, but actually in front of actual users for actual inference as part of a, a real product that, that people are using.[00:50:45] I mean, it's not that many. And, and so it's not like there's there's like really well established kind of rules as to whether or not something like that could lead to, to crazy amounts of overfitting or not. You just kind of have to use some, some intuition around it. And, and what we kind of found is that our, our results seem to imply that we've really been under training these, these models.[00:51:06] Oh my god. And so like that, you know, all, all of the compute that we kind of. Through, with this and, and the number of tokens, it, it really seems to help and really seems to to improve. And I, and I think, you know, these things kind of happen where in, in the literature where everyone kind of converges to something seems to take it for for a fact.[00:51:27] And like, like Chinchilla is a great example of like, okay, you know, 20 tokens. Yeah. And but, but then, you know, until someone else comes along and kind of tries tries it out and sees actually this seems to work better. And then from our results, it seems imply actually maybe even even lla. Maybe Undertrained.[00:51:45] And, and it may be better to go even You know, like train on on even more tokens then and for, for the[00:51:52] swyx: listener, like the original scaling law was Kaplan, which is 1.7. Mm-hmm. And then Chin established 20. Yeah. And now Lama style seems to mean 200 x tokens to parameters, ratio. Yeah. So obviously you should go to 2000 X, right?[00:52:06] Like, I mean, it's,[00:52:08] Reza Shabani: I mean, we're, we're kind of out of code at that point, you know, it's like there, there is a real shortage of it, but I know that I, I know there are people working on I don't know if it's quite 2000, but it's, it's getting close on you know language models. And so our friends at at Mosaic are are working on some of these really, really big models that are, you know, language because you with just code, you, you end up running out of out of context.[00:52:31] So Jonathan at, at Mosaic has Jonathan and Naveen both have really interesting content on, on Twitter about that. Yeah. And I just highly recommend following Jonathan. Yeah,[00:52:43] MosaicML[00:52:43] swyx: I'm sure you do. Well, CAGR, can we talk about, so I, I was sitting next to Naveen. I'm sure he's very, very happy that you, you guys had such, such success with Mosaic.[00:52:50] Maybe could, could you shout out like what Mosaic did to help you out? What, what they do well, what maybe people don't appreciate about having a trusted infrastructure provider versus a commodity GPU provider?[00:53:01] Reza Shabani: Yeah, so I mean, I, I talked about this a little bit in the in, in the blog post in terms of like what, what advantages like Mosaic offers and, and you know, keep in mind, like we had, we had deployed our own training infrastructure before this, and so we had some experience with it.[00:53:15] It wasn't like we had just, just tried Mosaic And, and some of those things. One is like you can actually get GPUs from different providers and you don't need to be you know, signed up for that cloud provider. So it's, it kind of detaches like your GPU offering from the rest of your cloud because most of our cloud runs in, in gcp.[00:53:34] But you know, this allowed us to leverage GPUs and other providers as well. And then another thing is like train or infrastructure as a service. So you know, these GPUs burn out. You have note failures, you have like all, all kinds of hardware issues that come up. And so the ability to kind of not have to deal with that and, and allow mosaic and team to kind of provide that type of, of fault tolerance was huge for us.[00:53:59] As well as a lot of their preconfigured l m configurations for, for these runs. And so they have a lot of experience in, in training these models. And so they have. You know, the, the right kind of pre-configured setups for, for various models that make sure that, you know, you have the right learning rates, the right training parameters, and that you're making the, the best use of the GPU and, and the underlying hardware.[00:54:26] And so you know, your GPU utilization is always at, at optimal levels. You have like fewer law spikes than if you do, you can recover from them. And you're really getting the most value out of, out of the compute that you're kind of throwing at, at your data. We found that to be incredibly, incredibly helpful.[00:54:44] And so it, of the time that we spent running things on Mosaic, like very little of that time is trying to figure out why the G P U isn't being utilized or why you know, it keeps crashing or, or why we, you have like a cuda out of memory errors or something like that. So like all, all of those things that make training a nightmare Are are, you know, really well handled by, by Mosaic and the composer cloud and and ecosystem.[00:55:12] Yeah. I was gonna[00:55:13] swyx: ask cuz you're on gcp if you're attempted to rewrite things for the TPUs. Cause Google's always saying that it's more efficient and faster, whatever, but no one has experience with them. Yeah.[00:55:23] Reza Shabani: That's kind of the problem is that no one's building on them, right? Yeah. Like, like we want to build on, on systems that everyone else is, is building for.[00:55:31] Yeah. And and so with, with the, with the TPUs that it's not easy to do that.[00:55:36] Replit's Plans for the Future (and Hiring!)[00:55:36] swyx: So plans for the future, like hard problems that you wanna solve? Maybe like what, what do you like what kind of people that you're hiring on your team?[00:55:44] Reza Shabani: Yeah. So We are, we're currently hiring for for two different roles on, on my team.[00:55:49] Although we, you know, welcome applications from anyone that, that thinks they can contribute in, in this area. Replica tends to be like a, a band of misfits. And, and the type of people we work with and, and have on our team are you know, like just the, the perfect mix to, to do amazing projects like this with very, very few people.[00:56:09] Right now we're hiring for the applied a applied to AI ml engineer. And so, you know, this is someone who's. Creating data pipelines, processing the data at scale creating runs and and training models and you know, running different variations, testing the output running human evals and, and solving a, a ton of the issues that come up in the, in the training pipeline from beginning to end.[00:56:34] And so, you know, if you read the, the blog post we'll be going into, we'll be releasing additional blog posts that go into the details of, of each of those different sections. You know, just like tokenized training is incredibly complex and you can write, you know, a whole series of blog posts on that.[00:56:50] And so the, those types of really challenging. Engineering problems of how do you sample this data at, at scale from different languages in different RDS and pipelines and, and feed them to you know, sense peace tokenize to, to learn. If you're interested in working in that type of, of stuff we'd love to speak with you.[00:57:10] And and same for on the inference side. So like, if you wanna figure out how to make these models be lightning fast and optimize the the transformer layer to get like as much out of out of inference and reduce latency as much as possible you know, you'd be, you'd be joining our team and working alongside.[00:57:29] Bradley, for example, who was like he, I always embarrass him and he's like the most humble person ever, but I'm gonna embarrass him here. He was employee number seven at YouTube and Wow. Yeah, so when I met him I was like, why are you here? But that's like the kind of person that joins Relet and, you know, he, he's obviously seen like how to scale systems and, and seen, seen it all.[00:57:52] And like he's like the type of person who works on like our inference stack and makes it faster and scalable and and is phenomenal. So if you're just a solid engineer and wanna work on anything related to LLMs In terms of like training inference, data pipelines the applied AI ML role is, is a great role.[00:58:12] We're also hiring for a full stack engineer. So this would be someone on my team who does both the model training stuff, but, but is more oriented towards bringing that AI to to users. And so that could mean many different things. It could mean you know, on the front end building the integrations with the workspace that allow you to, to receive the code completion models.[00:58:34] It means working on Go rider chats, like the conversational ability between. Ghost Writer and what you're trying to do, building the various agents that we want replica to have access to. Creating embeddings to allow people to ask questions about you know, docs or or, or their own projects or, or other teams, projects that they're collaborating with.[00:58:55] All of those types of things are in the, in the kind of full stack role that that I'm hiring for on my team as well. Perfect. Awesome.[00:59:05] Lightning Round[00:59:05] Alessio Fanelli: Yeah, let's jump into Lining Ground. We'll ask you Factbook questions give us a short answer. I know it's a landing ground, but Sean likes to ask follow up questions to the landing ground questions.[00:59:15] So be ready.[00:59:18] swyx: Yeah. This is an acceleration question. What is something you thought would take much longer, but it's already here.[00:59:24] It's coming true much faster than you thought.[00:59:27] Reza Shabani: Ai I mean, it's, it's like I, I know it's cliche, but like every episode of Of Black Mirror that I watched like in the past five years is already Yeah. Becoming true, if not, will become true very, very soon. I remember that during there was like one episode where this, this woman, her boyfriend dies and then they train the data on, they, they go through all of his social media and train a, a chat bot to speak like him.[00:59:54] And at the, and you know, she starts speaking to him and, and it speaks like him. And she's like, blown away by this. And I think everyone was blown away by that. Yeah. That's like old news. That's like, it's, and, and I think that that's mind blowing. How, how quickly it's here and, and how much it's going to keep changing.[01:00:13] Yeah.[01:00:14] swyx: Yeah. Yeah. And, and you, you mentioned that you're also thinking about the social impact of some of these things that we're doing.[01:00:19] Reza Shabani: Yeah. That that'll be, I think one of the. Yeah, I, I think like another way to kind of answer that question is it's, it's forcing us, the, the speed at which everything is developing is forcing us to answer some important questions that we might have otherwise kind of put off in terms of automation.[01:00:39] I think like one of the there's a bit of a tangent, but like, one, one of the things is I think we used to think of AI as these things that would come and take blue collar jobs. And then now, like with a lot of white collar jobs that seem to be like at risk from something like chat G B T all of a sudden that conversation becomes a lot, a lot more important.[01:00:59] And how do we it, it suddenly becomes more important to talk about how do we allow AI to help people as opposed to replace them. And and you know, what changes we need to make over the very long term as a society to kind of Allow you know, people to enjoy the kind of benefits that AI brings to an economy and, and to a society and not feel threatened by it instead.[01:01:23] Alessio Fanelli: Yeah. What do you think a year from now, what will people be the most[01:01:26] Reza Shabani: surprised by? I think a year from now, I'm really interested in seeing how a lot of this technology will be applied to domains outside of chat. And, and I think we're kind of just at the beginning of, of that world you know, chat, G B T, that that took a lot of people by surprise because it was the first time that people started to, to actually interact with it and see what the the capabilities were.[01:01:54] And, and I think it's still just a, a chatbot for many people. And I think that once you start to apply it to actual products, businesses use cases, it's going to become incredibly Powerful. And, and I don't think that we're kind of thinking of the implications for, for companies and, and for the, for the economy.[01:02:14] You know, if you, for example, are like traveling and you want to be able to ask like specific questions about where you're going and plan out your trip, and maybe you wanna know if like if there are like noise complaints in the Airbnb, you just are thinking of booking. And, and you might have like a chat bots actually able to create a query that goes and looks at like, noise complaints that were filed or like construction permits that are filed that are fall within the same date range of your stay.[01:02:40] Like I, I think that that type of like transfer learning when applied to like specific industries and specific products is gonna be incredibly powerful. And I don't think. Anyone has like that much clue in terms of like what's what's going to be possible there and how much a lot of our favorite products might, might change and become a lot more powerful with this technology.[01:03:00] swyx: Request for products or request for startups. What is an AI thing you would pay for if somebody built it with their personal work?[01:03:08] Reza Shabani: Oh, man. The, the, there's a lot of a lot of this type of stuff, but or, or a lot of people trying to build this type of, of thing, but a good L l m IDE is kind of what, what we call it in You mean the one, like the one you work on?[01:03:22] Yeah, exactly. Yeah. Well, so that's why we're trying to build it so that people Okay. Okay. Will pay for it. No, I, but, but I mean, seriously, I think that I, I, I think something that allows you to kind of. Work with different LLMs and not have to repeat a lot of the, the annoyance that kind of comes with prompt engineering.[01:03:44] So think, think of it this way. Like I want to be able to create different prompts and and test them and against different types of models. And so maybe I want to test open AI's models. Google's models. Yeah. Cohere.[01:03:57] swyx: So the playground, like from[01:03:59] Reza Shabani: net Devrel, right? Exactly. So, so like think Nat dot Devrel for Yeah.[01:04:04] For, well, for anything I guess. So Nat, maybe we should say what Nat dot Devrel is for people don't know. So Nat Friedman, Nat Friedman former GitHub ceo. CEO and, and or not current ceo, right? No. Former. Yeah. Went on replica Hired a bounty and, and had a bounty build this website for him.[01:04:25] Yeah. That allows you to kind of compare different language models and and get a response back. Like you, you add one prompt and then it queries these different language models, gets the response back. And it, it turned into this really cool tool that people were using to compare these models.[01:04:39] And then he put it behind a paywall because people were starting to bankrupt him as a result of using it. But but something like that, that allows you to test different models, but also goes further and lets you like, keep the various responses that were, that were generated with these various parameters.[01:04:56] And, and, you know, you can do things like perplexity analysis and how, how widely The, the, the responses differ and over time and using what prompts, strategies and whatnot, I, I do think something like that would be really useful and isn't really built into most ides today. But that's definitely something, especially given how much I'm playing around with prompts and and language models today would be incredibly useful to have.[01:05:22] I[01:05:22] swyx: perceive you to be one layer below prompts. But you're saying that you actually do a lot of prompt engineering yourself because you, I thought you were working on the model, not the prompts, but maybe I'm wrong.[01:05:31] Reza Shabani: No, I, so I work on, on everything. Both, yeah. On, on everything. I think most people still work with pro, I mean, even a code completion model, you're still working with prompts to Yeah.[01:05:40] When you're, when you're you know running inference and, and whatever else. And, you know, instruction tuning, you're working with prompts. And so like, there's There's still a big need for for, for prompt engineering tools as well. I, I do, I guess I should say, I do think that that's gonna go away at some point.[01:05:59] That's my, that's my like, hot take. I don't know if, if you all agree on that, but I do kind of, yeah. I think some of that stuff is going to, to go away at[01:06:07] swyx: some point. I'll, I'll represent the people who disagree. People need problems all the time. Humans need problems all the time. We, you know, humans are general intelligences and we need to tell them to align and prompts our way to align our intent.[01:06:18] Yeah. So, I don't know the, it's a way to inject context and give instructions and that will never go away. Right. Yeah.[01:06:25] Reza Shabani: I think I think you're, you're right. I totally agree by the way that humans are general intelligences. Yeah. Well, I was, I was gonna say like one thing is like as a manager, you're like the ultimate prompt engineer.[01:06:34] Prompt engineer.[01:06:35] swyx: Yeah. Any executive. Yeah. You have to communicate extremely well. And it is, it is basically akin of prompt engineering. Yeah. They teach you frameworks on how to communicate as an executive. Yeah.[01:06:45] Reza Shabani: No, absolutely. I, I completely agree with that. And then someone might hallucinate and you're like, no, no, this is, let's try it this way instead.[01:06:52] No, I, I completely agree with that. I think a lot of the more kind of I guess the algorithmic models that will return something to you the way like a search bar might, right? Yeah. I think that type of You wanted to disappear. Yeah. Yeah, exactly. And so like, I think that type of prompt engineering will, will go away.[01:07:08] I mean, imagine if in the early days of search when the algorithms weren't very good, imagine if you were to go create a middleware that says, Hey type in what you're looking for, and then I will turn it into the set of words that you should be searching for. Yes. To get back the information that's most relevant, that, that feels a little like what prompt engineering is today.[01:07:28] And and sure that would've been really useful. But like then, you know, Google slash yahoo slash search engine Yeah. Would kind of removes that. Like that benefit by improving the, the underlying model. And so I do think that there's gonna be improvements in, in transformer architecture and the models themselves to kind of reduce Like overly yeah.[01:07:51] Like different types of prompt engineering as we know them today. But I completely agree that for the way larger, kind of like more human-like models Yeah. That you'll always need to, we'll talk some form of, of prompt engineering. Yeah. Okay.[01:08:04] Alessio Fanelli: Awesome. And to wrap this up, what's one thing you want everyone to take away about ai?[01:08:09] Both. It can be about work, it can be about personal life and the[01:08:13] Reza Shabani: societal impact. Learn how to use it. I, I would say learn how to learn how to use it, learn how it can help you and, and benefit you. I think there's like a lot of fear of, of ai and, and how it's going to impact society. And I think a lot of that might be warranted, but it, it's in the same way that pretty much anything new that comes along changes society in that way, and it's very powerful and very fundamental.[01:08:36] Like the internet. Change society in a lot of ways. And, and sure kids can go like cheat on their homework by finding something online, but there's also plenty of good that kind of comes out of opening up the the world to, to everyone. And I think like AI's gonna be just another iteration of, of that same thing.[01:08:53] Another example of, of that same thing. So I think the, the people who will be really successful are the ones that kind of understand it know how to use it, know its limitations and, and know how it can make them more productive and, and better at anything they want to do. Awesome. Well, thank[01:09:08] Alessio Fanelli: you so much for coming on.[01:09:10] This was[01:09:10] Reza Shabani: great. Of course. Thank you. Get full access to Latent Space at www.latent.space/subscribe
undefined
Apr 29, 2023 • 1h 16min

Mapping the future of *truly* Open Models and Training Dolly for $30 — with Mike Conover of Databricks

The race is on for the first fully GPT3/4-equivalent, truly open source Foundation Model! LLaMA’s release proved that a great model could be released and run on consumer-grade hardware (see llama.cpp), but its research license prohibits businesses from running it and all it’s variants (Alpaca, Vicuna, Koala, etc) for their own use at work. So there is great interest and desire for *truly* open source LLMs that are feasible for commercial use (with far better customization, finetuning, and privacy than the closed source LLM APIs).The previous leading contenders were Eleuther’s GPT-J and Neo on the small end (<6B parameters), and Google’s FLAN-T5 (137B), PaLM (540B), and BigScience’s BLOOM (176B) on the high end. But Databricks is to my knowledge the first to release not just a cleanly licensed, high quality LLM that can run on affordable devices, but also a simple Databricks notebook that can be customized to be finetuned for your data/desired style - for $30 in 30 minutes on one machine!Mike Conover tells the story of how a small team of Applied AI engineers got convinced Ali Ghodsi and 5,000 of their coworkers to join in the adventure of building the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use. He also indulges our questions on other recent open source LLM projects, CerebasGPT and RedPajama, though we recorded this a week before Stability’s StableLM release. Stick around to the end for some easter eggs featuring AI Drake!Recorded in-person at the beautiful StudioPod studios in San Francisco.Full transcript is below the fold.Show Notes* Mike Conover LinkedIn and Twitter* Dolly 1.0* Dolly 2.0* CICERO and Diplomacy* Dolly and Deepspeed* LLMops: * https://nat.dev/* PromptLayer* HumanLoop* Spreadsheets??* Quadratic* Alessio’s Email GPT Drafter* Open Models* Open Assistant* Cerebras GPT* RedPajama* Reflexion, Recursive Criticism and Improvement* Lightning Round* AI Product: Google Maps* AI People: EleutherAI, Huggingface’s Stas Bekman* AI Prediction: Open LLaMA reproduction, AI Twins of People (AI Drake), Valuing Perplexity * Request for Startups: LLMOps/Benchmarks, Trail MappingTimestamps* [00:00:21] Introducing Mike Conover* [00:03:10] Dolly 1.0* [00:04:18] Making Dolly* [00:06:12] Dolly 2.0* [00:09:28] Gamifying Instruction Tuning* [00:11:36] Summarization - Thumbnails for Language* [00:15:11] CICERO and Geopolitical AI Agents* [00:17:09] Datasets vs Intentional Design* [00:21:44] Biological Basis of AI* [00:23:27] Training Your Own LLMs* [00:28:21] You May Not Need a Large Model* [00:29:59] Good LLM Use cases* [00:31:33] Dolly Cost $30 on Databricks* [00:36:06] Databricks Open Source* [00:37:31] LLMOps and Prompt Tooling* [00:42:26] "I'm a Sheets Maxi"* [00:44:19] AI and Workplace Productivity* [00:47:02] OpenAssistant* [00:47:41] CerebrasGPT* [00:51:35] RedPajama* [00:54:07] Why Dolly > OpenAI GPT* [00:56:19] Open Source Licensing for AI Models* [00:57:09] Why Open Source Models?* [00:58:05] Moving Models* [01:00:34] Learning in a Simulation* [01:01:28] Why Model Reflexion and Self Criticism Works* [01:03:51] Lightning RoundTranscripts[00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio Partner and CT and Residence and Decibel Partners. I'm Joan Bama, cohost swyx Brighter and Editor of Space. Welcome, Mike.[00:00:21] Introducing Mike Conover[00:00:21] Hey, pleasure to be here. Yeah, so[00:00:23] we tend to try to introduce you so that you don't have to introduce yourself. Yep.[00:00:27] But then we also ask you to fill in the blanks. So you are currently a, uh, staff software engineer at Databricks. Uh, but you got your PhD at Indiana on the University of Bloomington in Complex Systems analysis where you did some, uh, analysis of clusters on, on Twitter, which I found pretty interesting.[00:00:43] Yeah. Uh, I highly recommend people checking that out if you're interested in getting information from indirect sources or I, I don't know how you describe it. Yes. Yeah. And then you went to LinkedIn working on. Homepage News, relevance, and then SkipFlag, which is a smart enterprise knowledge graph, which was then acquired, uh, by Workday, where you became director of machine learning engineering and now your Databricks.[00:01:06] So that's the quick bio and we can kind of go over Yeah. Step by step. But, uh, what's not on your LinkedIn that people[00:01:12] should know about you? So, because I worked at LinkedIn, that's actually how new hires introduce themselves at LinkedIn is this question. So I, okay. I have a pat answer to it. Uhhuh. Um, I love getting off trail in the backcountry.[00:01:25] Okay. And I, you know, I think that the sort of like radical responsibility associated to that is clarifies the mind. And I think that the, the things that I really like about machine learning engineering and sort of the topology of high-dimensional spaces kind of manifest when you think about a topographic mat as a contour plot.[00:01:44] You know, it's a two-dimensional projection of a three-dimensional space and it's very much like looking at information visualizations and you're trying to relate your. Localized perception of the environment around you and the contours of, uh, ridges that you see, or basins that you might go into and you're like, there's that little creek down there.[00:02:04] And relate that to the projection that you see on the map. I think it's physically demanding. It's intellectually challenging. It's natural. Beauty is a big part of it, and you're generally spending time with friends, and so I just, I love that. I love that these are camping trips. Uh, multi-day. Yeah. Yeah.[00:02:21] Camping. I, I hunt too, you know, I, um, shoot archery, um, big game back country hunting, but yeah. You know, sometimes it's just, let's take a walk in the woods and see where it goes.[00:02:32] Oh yeah. You ever think about going on one of those, um, journeys in the, uh, the Australian Outbacks? Like where people find themselves?[00:02:40] I'm[00:02:40] a mountain. I'm a mountain guy. I like to You're mountain guy. I like to fly fish. I like to, you like to hill climb? Yeah. Like the outback seems beautiful. I think eight of the 10 most deadly snakes live in Australia. Like I'm, uh, yeah, you're good. You're good. Yeah. Yeah.[00:02:52] Yeah. Any lessons from like, Real hill climbing[00:02:55] versus machine learning, hill climbing.[00:02:56] Great Dude. It's a lot like gradient descent. Yeah, for sure, man. Um, yeah, I that I have remarked on that to myself before for sure. Yeah, I don't, I'm not sure. This is like least resistance, please.[00:03:10] Dolly 1.0[00:03:10] That's awesome. So Dolly, you know, it's kind of come up in the last three weeks you went from a brand new project at Databricks to one of the hottest open source things out there.[00:03:19] So March 24th you had Dolly 1.0. It was a 6 billion parameters model based on GPT-J 6 billion and you saw alpaca training set to train it. First question is, why did you start with GPT-J instead of LLaMA, which was what everybody else was kind of starting from[00:03:34] at the time. Yeah, well, I mean, so, you know, we had talked about this a little before the show, but LLaMA's hard to get.[00:03:40] We had requested the model weights and had just not heard back. And you know, I think our experience with the, um, The original email alias for Dolly, before it was available on hugging face, you get hundreds of people asking for it, and I think it's like, it's easy to just not be able to handle the inbound.[00:03:56] Mm-hmm. And so like, I mean, there was a practical consideration, which is that, you know, we did not have the LLaMA weights, but additionally I think it's like much more interesting if anybody can build it. Right. And so I think that was our, um, and I had worked with the GPT-J model in the past and, and knew it to be high quality from a grammatical ness standpoint.[00:04:15] And so I think it was a reasonable choice. Mm-hmm. Yeah.[00:04:18] Making Dolly[00:04:18] Yeah. Maybe we should, we can also go into the impetus of why you started work on Dolly. Uh, you had been at Databricks for about a year. Mm-hmm. Was there, was this like a top-down directive? Was this your idea? We'll see, uh,[00:04:31] what happened? I've been working in N L P and language understanding for a fair while now.[00:04:36] I mean certainly since Skip flag back in 20 16, 20 17, we can introduce Skip flag is that's, if that's, sorry. You know, we don't have to focus too much on it, but like, this is a, an area how information moves through networks of people is a longstanding interest of mine. And we built a hack day project and I just slacked it to our c e o and I was, you know, this was when ChatGPT came out and it was an integration into the developer experience.[00:05:02] And I was like, as a user, this should exist. I want this. Mm-hmm. We should build this. It doesn't have to be us. And I mean, to our, uh, our leadership team is like 10 years into this journey, probably more than that at Databricks. And they are still. So hungry. It's wild. It's just wild to see these, these people in action, you know, this like this far into the marathon.[00:05:23] And, um, he's like, great, build it. Do make it. So, you know, and I, we had have, uh, full-time responsibilities and infrastructure forecasting and infrastructure optimization. And so we did, you know, and, um, we just started building and, you know, so we'd been working on this class of technologies for, um, several months.[00:05:46] And we had a stack that in part how we were able to kind of pivot on the balls of our feet. Uh, we repurposed a lot of existing code that we had built up, you know, in the past several quarters, um, to, to create Dolly and, and just to[00:05:58] be clear, like is this an internal stack or is this, uh, externally available as data?[00:06:02] Much of what we open sourced what, you know, like that that is a, that is the, the, it's, I mean, no, it's not the exhaustive stack by any account, but it's, it's some of the core components. Okay. Yeah.[00:06:12] Dolly 2.0[00:06:12] It only took 19 days to go from 1.0 to 2.0. Yeah. So 2.0 is 12 billion. So twist the number of parameters. You base this on the model family from Elu.[00:06:23] I instead, and I think the, the biggest change is like instead of using the alpaca turning set, which is change generated, so it has its own limitations, you created a brand new, uh, training data set created by the Databricks employees. So I would love to talk about how you actually made that happen. You know, did you just go around and say, Hey guys, I just need to like today, spend your day coming up with the instruction set?[00:06:47] Or like, did people volunteer to be a part of this?[00:06:50] Yeah, I mean, so again, like a lot of credit to our founding team, they see it, I think as much as anybody you'll talk to who is a new founder or somebody trying to work in this space, like our executives have the fire and will see a, a bright neon meta future that, uh, Databricks will confidently lead.[00:07:12] The world into. And so Ali just sent emails twice a day. Do it, do it. You know, we put together, you know, we, we use the InstructGPT sort of task families, you know, gen content generation, brainstorming close qa, open qa, paraphrasing, things like this, and basically put together these Google forms.[00:07:34] You know, just like, how can we build this as quickly as possible? We see this need, you know, the alpaca trick is amazing that it works. It's amazing that we're highly non-obvious that, you know, for GPT-J or even lLLaMA, you know, hundreds of billions of tokens into the train, this whisper of new data, you know, sort of moves it in, moves the parameter, uh, tensors into a new part of the state space.[00:08:02] I think, you know, my background is roughly in statistical physics related areas, and I think kind of like a phase transition. Mm-hmm. Like ice and water. It's like they're. Very, very little separates the two, but they could not be more different. And so Ali just kept haranging, like a huge email list of people.[00:08:21] Um, thousands and thousands of people. And, um, it worked. The other thing is, you know, to our employees credit, people see the moment and they wanna be part of something. And I think there's just passion and enthusiasm for. Doing this. So it was easier than you would expect[00:08:37] The answer is, so you put some answers in the blog post.[00:08:40] Yeah. And they're pretty comprehensive. Cuz one of the questions was like, how do I build a campfire? Yeah. And then the response was four paragraphs[00:08:46] of actual Truly, and I think Yeah, true. Yeah. And I think part of it is that because of the rapid adoption of these technologies like that, you have hundreds of millions of people, you know, who knows what the numbers are.[00:08:58] But on ChatGPT. People have become educated in terms of, and opinionated about what they expect from these tools. And so I think, you know, a lot of the answers are like, written in the style of what you would want from one of these assistants. And I think just to kind of like riff on how this question of like how the composition, cuz this is really re relevant to our enterprise customers, how the composition of the dataset qualitatively shapes the resulting behaviors of the fine-tuned models that are exposed to that stimulus.[00:09:28] Gamifying Instruction Tuning[00:09:28] You know, you look at a dataset like flan, which is a really, really large dataset that is, I think thousand plus tasks. Um, that's, you know, kind of this. Gold standard instruction data set, and a lot of it's synthesized the responses and we'll talk about evaluation, but the responses are very brief. You know, it's like emit the word positive or negative in relation to the, you know, as a judgment of the sentiment of this utterance.[00:09:52] And so it's, it's very multitask and I think like having thousands of different task types perform sort of irregular, you can't overfit to one specific behavior and so you have to compress and like do many things reasonably well. And so that I think you, you have to kind of wind up in interpolating between different types of behaviors that way.[00:10:12] But there's also like the question of like, when do you predict the end of sequence token? And if your completions, particularly for instruction tuning are short. Our empirical observation is that the fine tune model emits shorter results. And so having how to build a campfire. And like a narrative thoughtful human-like description.[00:10:36] I think it requires that demonstration to get that behavior from the model. And you had a, you had a leaderboard, um, who did[00:10:43] what, uh, any fun shenanigans that came out of, uh, the gamification?[00:10:46] Well, so the thing is like, you know, I think you can just ask people like be helpful. Uh, you know, like, like some people always take it too far and then Sure.[00:10:55] Yeah. Well, so you definitely see a long tail distribution. I think I was looking at the open assistant paper last night, and I think, I mean, don't quote me on this, but something like 12 people accounted for 10% of the total responses, which is super, that's just human systems have that long tail distribution terms of activity thing.[00:11:12] Yeah, yeah, exactly. So it's not surprising. And we see that to a some degree in our data set as well, but, um, not in the way that you would if you opened it up to the, like internet at large. So I, I think people are incentivized coworkers. Yeah. Do the right thing and you know, it's, you know, and also it's our company.[00:11:29] Like we. Want it to actually be useful, not just a performance of usefulness. And I think people got that.[00:11:36] Summarization - Thumbnails for Language[00:11:36] Is there a task[00:11:37] that you found like particularly hard to get data on? Like good data summarization?[00:11:41] Oh, because it's like a, it's both like long, uh, it's long and requires thought, you know, you have to synthesize and as opposed to name all the people in places in this passage from Wikipedia that's like, I can kind of do that while I'm watching television, but like writing an essay.[00:11:59] Yeah, it's a compare is hard. Yeah, there's probably more structure and like in terms of um, like an information theoretic standpoint, how much new signal each record introduces into the model. I expect that summarization is actually. A very demanding task and would not soon become overfit. We're developing our, our, I don't have like definitive answers to how that works because we're still, it's an open research project for the, for the business.[00:12:27] Yeah. Well, I, you know, just categorically, I think sum summarization is becoming more important, the more generative ai. For freights because we kind of need to expand and we see the contract again, in terms of what, uh, what we consume in terms of, uh,[00:12:41] information. Truly. I mean, like, to kind of riff on that, I think the, there's just so much material at your business.[00:12:48] You think about like, uh, PRDs, like, or, you know, product requirement stocks, you know, reasonable people. You kind of want like a zoom lens on language and you want the ability to see the high level structure of something and then be able to get details on demand like you would pan or like, you know, zoom into an information visualization.[00:13:09] I was talking with. Um, The head of AI at Notion about this and who, you know, you guys probably know and as a really remarkable person, and this idea of like, what does a thumbnail for language look like? Because like your visual cortex is structured such that like it's highly evolutionarily conserved to be able to glance at something and perceive its essence.[00:13:28] And that makes seeing a field of thumbnails. Like you guys I think are gonna speak with, um, Lexi folks here shortly. And you can see us like the field of images in response to a query and get a sense for like, oh, these are all like moody cyber punk scenes. Mm-hmm. What is that for language? And maybe it's like, maybe it doesn't exist.[00:13:52] Maybe it's the case. Stop me if I'm getting too far afield here. But you think about clothes as a technology that has shaped our physiology. Right. Like, and our, our phen, our phenotypic expression, we used to be covered in hair. We evolved this technology fire would also be in this class, and our bodies changed in response to it on the very long time scale of human history.[00:14:15] Mm-hmm. It may be the case that AI in the way that the visual cortex has been evolutionarily conserved to be able to rapidly perceive things, shapes how we process information. I don't know. What to do about language right now. It looks like reading a lot of samples from different models and seeing how they perform as we move through the loss curve.[00:14:34] That makes[00:14:34] sense. I mean, if you think about images in text, you don't really have like peripheral vision. You know, when you're like seeing something, you focus on the main thing and then you kind of like start to expand to see the rest. Yes. Like text is kind of like a, the density is like the same across the tax.[00:14:49] Like nothing jumps out when you see a wall of tax versus when you see an NI image. Just like something usually jumps out first. Yes. So I don't have the answer either. Was gonna say, I'm really curious word[00:14:58] clouds, which, but that, that's the thing is like, that's such a joke, right? Wait for me. Yeah, it's like punchline.[00:15:06] You must have[00:15:06] done, you know, your, your Twitter[00:15:08] work. I've cut a few word clouds in my day.[00:15:11] CICERO and Geopolitical AI Agents[00:15:11] Um, you know, I also think like this question of like, what are you most excited about in ai? Like what do you see as the sort of like grandest potential? And one of the things that I reflect on is, is the. Possibility of having agents that are able to, to negotiate intractable geopolitical problems.[00:15:31] So like if you look at like, the Cicero paper from, from Meta, can you recap for those who are making Yeah. So I mean it's, you know, I don't wanna like represent somebody else's work as like you're just talking Yeah, exactly. But like, um, my understanding is that diplomacy is a, um, turn-based negotiating game, like risk where you are all making the decision in simultaneously and you're trying to convince people that you're going to do or not do something.[00:15:56] And, uh, this paper was co-authored with one of the top diplomacy players and Meta built a system that was very, very capable at this negotiating game. I. Can envision nation states operating ais that find game theoretically optimal and sort of non exploitable steady states basically. Mm-hmm. That, you know, if you think about a lot of the large scale geopolitical disputes where it's just like human mediators are unable to find a compromise, ais may be able to satisfying conditions that you're like, yeah, actually I don't, that works for me.[00:16:36] Mm-hmm. And to your point about like how the phobia and attention generally, but like how the actual visual cortex works, the idea that like a great writer says something in a way and it hits unique structures in your brain and you have that chemical cascade, which is understanding, we may be able to design systems that compress very long documents on a per person basis so as to maximize information transfer, and maybe that's what the thumbnail looks like.[00:17:03] Mm-hmm.[00:17:04] Yeah, maybe it's emojis all the way down. I dunno.[00:17:08] Yeah.[00:17:09] Datasets vs Intentional Design[00:17:09] Obviously the dataset is like one of the, the big things in Dolly. Yeah. But you talked about some of these technologies being like discover, not designed, like maybe talk a bit about the process that took it to Dolly and like the experimentation[00:17:21] there.[00:17:22] So it's not my, my friend, my dear friend, Jacob Burk kind of had this insight, which is that AI is you, you design a jet turbine, like for sure you make a plan. Mm-hmm. And you, you know, have some working model of aerodynamics and you execute on the jet turbine. I think that with ai, generally we see. You know, this instruction following behavior that we saw in Dolly was not present in the, the base model.[00:17:53] It, you know, effectively will, it's a, you know, very powerful base model, but it will just complete the prefix as though it's random page on the internet. We had Databricks, but also the community with Alpaca discovered that you can perturb them just, just so, and get quite different behavior. That was not really a design.[00:18:13] I mean, it's designed in the sense that you had an intent and then you saw it happen. But we do not like choose the parameters they are arrived upon. And the question that I have is, what other capabilities are latent in these models, right? GPT-J was two years old. Can it do anything else? That's surprising?[00:18:36] Probably so, and I think you look at, you know, particularly, and this is why the Pithia Suite is so cool, is that, and you know, a ton of credit to, for. Having this vision, and I think it will probably take some time for the research community to, to understand what to do with these artifacts that they've created.[00:18:54] But it's effectively like this matrix of model checkpoints and sizes where you say, I'm gonna take from I think 110 million all the way up to 12 billion, which is what Dolly two is based on. And then at every checkpoint through the training run under, I think it's 2 million. Yeah. Tokens. Yeah. Well, so the, I think the Pithia suite is just trained on the pile, so it's like three, 400 million, which is probably undertrained.[00:19:18] And did you guys see this red? I think it's red Pajama released this morning. They've reproduced the lLLaMA training data set. So, so it's 1.2 trillion tokens and it's, um, I mean, you know, a separate topic, but we looked pretty hard at what it would take to reproduce the LLaMA data set. And it's like, Non-trivial.[00:19:35] I mean, bringing Common Crawl online and then d near de-duping it and you know, filtering it for quality. So the, the Common Crawl data set in LLLaMA is they fit a model to predict whether a page in common crawl is likely to be a reference on Wikipedia. And so that's like a way to like, I don't want lists of phone numbers, for example, or like ads.[00:19:58] All of that is a lot of work. And so anyway, with Pit, I think we can start to ask questions like through this, this matrix with size and like checkpoint depth. We have these different model parameters. How do behaviors emerge through that training process? And at different scales, you know, maybe it will be less of a discovery process.[00:20:22] Maybe we will get more intentional about, like, I want to elicit the fol, I want summarization, I want closed form, question answering. Those are the only things that matter to me. How much data do I need to. Generate or buy, how many parameters do I need to solve that compression problem? And maybe it will become much more deterministic, but right now it feels a lot like we're just trying things and seeing if it works, which is quite different from a lot of engineering disciplines.[00:20:51] I'm curious, does that reflect your experiences? Like Yeah, I[00:20:54] think like we had a whole episode on, um, kind of like scaling loss and everything with Varun from Exafunction. And I feel like the, when the Chinch paper came out, a lot of teams look at their work and they were like, we're just kind of throwing darts.[00:21:07] Exactly. That's now one,[00:21:10] 1.2 to, uh, 1.7 tokens, uh, you know, per, uh, per parameter. And, uh, now we're redoing everything with[00:21:16] 20 tokens. It's exciting, but also as like, you know, I'm, I'm a, an engineer and a hacker, like I'm not a scientist, but I, you know, used to pretend to be a scientist. Not, you know, not really pretend, but like I respect the, I respect the craft and like, It's also very exciting to have something you really don't understand that well, because that's an opportunity to create knowledge.[00:21:41] So that's part of why it's such an exciting time in the field. There's some work[00:21:44] Biological Basis of AI[00:21:44] on with, um, understanding the development of AI progress, uh, using biological basis. Mm-hmm. So in, in some sense, we're a speed running evolution Yeah. With training. Yeah. So in a sense that of just natural discovery of things and, and just kind of throwing epox at it Yeah.[00:22:02] Is, makes intuitive sense to me. But, uh, I do think that it is unintuitive to estimate how different artificial life might evolve differently[00:22:12] from biological life. Yeah. I, so like Richard Dawkins had, um, this sort of toy model called bio morphs. Which, uh, no, I haven't heard of it. Yeah, it's, I think it was dates to the eighties.[00:22:25] So it's a pretty old school demonstration of capabilities. But the idea is that you have, imagine they look, they're little insects that look like vector art. And the parameters of how they are rendered are governed by, you know, it's parametric, right? So some of them have long antennas and some of them have wide bodies and some of them have 10 legs, some of them have four legs.[00:22:46] And the underlying method is, is genetic algorithms where you take subsets of the parameters and kind of recombine them. And you're presented as a user with a three by three grid, and you click based on what you find subjectively beautiful. And so the fitness function, then they're re combined and you render a new set of nine by nine, some of which are mutated.[00:23:05] And so the fitness function is your perception of aesthetic beauty. That is the pressure from the environment. And I think like with things like RLHF where you're having this preference learning task, that is a little different from next token prediction in terms of like what is synthetic life and how are our preferences reflected there?[00:23:23] I think it's a very sort of interesting, yeah, interesting area. Okay. So a[00:23:27] Training Your Own LLMs[00:23:27] lot of people are very inspired by work with Dolly. Obviously Databricks, uh, is doing it. Partially out of the kindness of your hearts, but also to advertise Databricks capabilities. Uh, how should businesses who want to do the similar things for their own data sets and companies, uh, how, how should they think about[00:23:43] going about this?[00:23:44] I really would actually say that it's probably less about advertising our capabilities. I mean, that, you know, we're exercising our capabilities, but I, I really think that to the extent that we can help define some of the moves that reasonable teams would make in creating technologies like this, it, it helps everybody understand more clearly what needs to be done to make it useful and not just interesting.[00:24:08] And so, one, you know, one of the canonical examples that we had in the original Dolly was write a love letter, ed Growlin Poe. Yep. Which is super cool and like very moody. You know, I, I dunno if you guys remember the particulars of it, but it was like, I. The person, the imagined person writing this letter was like, I, I basically couldn't, like, I couldn't stand you, but I can't stop thinking about you, you know, which is a very like, gothic, uh, kinda, uh, mood in, in a letter like that not relevant to the enterprise context.[00:24:39] Right. So, you know, like it's neat that it does it, but if I don't have to buy training data that gets it to write moody, gothic letters to Edgar and Poe, and if I can be choosy about how I invest my token budget, that is useful to many businesses. And so, you know, one of the things that. We're trying to understand more clearly is I, we talked a little bit about like different tasks require that you compress in a way that generalizes, you know, if you think about it, the, the parameters as compressing language and also world knowledge.[00:25:15] The question is like, for a given model size, how many demonstrations of summarization, for example, are required in order to get a really useful, grounded QA bot? And so I think in building these kinds of solutions and sort of seeing how the. Categories of behaviors in the instruction tuning or sort of fine tuning data sets are related to those behaviors, I think will develop a playbook for startups in the enterprise that makes it, um, so that you can move with an economy of motion.[00:25:44] And this is related to evaluations as well. So one of the things that we had talked about sort of before we started recording was the using the EleutherAI evaluation benchmarks, and I think helm and the, you know, there's a bunch of other batteries that you can push your models through. But the metrics that we looked at first when we built the first version of Dolly, and this is on our hanging face page, you can go see this yourself.[00:26:08] The GPT-J model. And the fine-tuned dolly model have almost identical benchmark scores, but the qualitative character of the model just couldn't be any more different. And so I think that it requires better ways to measure the desired behavior, and especially in these enterprise contexts where it's like, is this a good summary and how can I determine that without asking a person?[00:26:37] And maybe it's kind of like you train reward bottles where you, you know, you have sort of a learned preferences and then you show, you know, you take kind of an active learning approach where you show the ones that it's most uncertain about to crowd workers and it's kind of like human in the loop.[00:26:52] Would this be p p o ish?[00:26:54] I mean, potential. That's, so this, that's not an area of expertise in mine yet. You know, this is something that we're also trying to, uh, more deeply understand kind of what the applicability of that stack is to, like, I'm just trying to ship. Mm-hmm. You know, my understanding is that that's somewhat challenging to bring online and also requires a fair number of labels.[00:27:14] And so it's like from an active learning standpoint, uh, my thinking would be more like, You have a reward model that you've trained and you said like, this is based on human judgments from my employees or some crowd workers, what I want from a summarization or a close, close form question answering. And then you basically, you choose new examples to show to humans that are close to the decision boundary and that are like maximally confusing.[00:27:38] It's like, I'm just really not sure rather than things that are far from the decision boundary. And it's, it's kind of like, I actually think there's gonna be, in terms of value creation in the next, let's say 18 to 36 months, there's still room for like old tricks. You know, like not everything has to be generative AI for it to be very valuable and very useful.[00:27:56] And maybe, maybe these models and, and zero shot prompting just eats everything. But it's probably the case that like an ensemble of techniques will be valuable and that you don't have to, you know, establish like room temperature fusion to like, you know, create value in the world, at least for, you know, another year and a half.[00:28:20] You know, like[00:28:21] You May Not Need a Large Model[00:28:21] just, just to spell it out for people trying to, uh, go deep on stuff. Um, maybe leave breadcrumbs. Um, sure. When you say techniques, you don't just mean prompting.[00:28:29] Oh, I mean even like named entity recognition, like Yeah, there's just like classic NLP stuff, you know, like supervised learning. I mean, multi-class classifi.[00:28:37] I have customer support tickets. I want to know whether this is going to be flagged as. P zero. Like that's just, it's not a complicated problem to solve, but it's still very valuable in these models that can deeply understand the essence of something and not necessarily generate language. But understand, I expect that you will see like s because, so for example, inference right now is time consuming.[00:29:04] Mm-hmm. Just, you know, it's like, unless you are really rigorous, and I think it, one of the things I'm excited about at Databricks is that we're, our inference stack is very, very fast. Like orders of magnitude faster than you would get if you took the naive approach. And that leads to very qualitative, like a very different way that you interact with these models.[00:29:22] You can explore more and understand their behavior more when it doesn't take 30 or 40 seconds to generate a sample and it's instead 1800 milliseconds. You know, that's something that's very exciting. But if you need to spend your compute budget, Efficiently and you have tens of thousands of possible things that you could summarize, but you can really only, you know, in a day do so many.[00:29:45] Having some stack ranking of them with a classical machine learning model is just valuable. And I, I expect that you'll see like an ecosystem of tools and that it's not all going to be necessarily agents talking to agents. I could be proven wrong on that. Like, I, I don't know. We'll see. Hey,[00:29:59] Good LLM Use cases[00:29:59] going back to the evolutionary point, I feel like people think that the generative AI piece is like the one with the most like, uh, possible branches of the tree still to explore.[00:30:09] So they're all focusing on that. But like you said, we're probably gonna stop at some point and be like, oh. That thing we were doing is just as good. Let's pair them together and like use that instead of just like trying to make this model do everything.[00:30:22] Yeah. And there, yeah, there are things like categorically that only generative models can accomplish.[00:30:28] And I do think, I mean, one of the reasons that at Databricks we see so much value for companies is that you can, with zero shot prompting, you can say, given this customer support ticket, for example, give me a summary of the key issues represented in it. And then simply by changing that prefix, say, write a thoughtfully composed reply that addresses these issues in the tone and voice of our company.[00:30:53] And imagine you have a model that's been fine tuned on the tone voice that's in your, in your, uh, from your support team. Both of those problems historically would've taken like a reasonable machine learning team, six to eight weeks to build. And frankly, the right, the response, I'm not sure you can do it without generative techniques.[00:31:13] And now your director of sales can do that. You know, and it's like, the thing that might make me look foolish in retrospect is that. Orders of magnitudes cheaper to do it with prompting. And maybe it's like, well, sure the inference costs are non-trivial, but it's just we've saved all of that in time. I don't know.[00:31:33] Dolly Cost $30 on Databricks[00:31:33] We'll see. I'm[00:31:34] always interested in, uh, more economics of, um, of these things. Uh, and one of the headline figures that you guys put out for Dolly was the $30 training cost. Yes. How did you get that number? Was it. Much lower than you expected and just let's just go as deep[00:31:50] as you want. Well, you just think about, so you know, we trained the original dolly on a 100 s and so one of the cool things about this is we're doing this all on Databricks clusters, right?[00:32:00] So this like, this works out of the box on Databricks and like turns out, you know, I think you would probably need slightly different configurations if you were going to do your own full pre-training run on, you know, trillions of tokens. You have to think about things like network interconnect and like placement groups in the data center in a more like opinionated way than you might for spark clusters.[00:32:23] But for multi-node distributed fine tuning, the Databricks stack is great out of the box. That was wonderful to find.[00:32:32] You've been building the perfect fine tuning architecture the whole[00:32:34] time. Yeah. You know, may, maybe it's not perfect yet, but like, It's pretty good. And I think, so for the original Dolly, it was just a single node, and so you can bring up an eight node, a 100 machine, and I'm, you know, I thinking of just the off the rack pricing from the cloud providers, it's about 30 bucks.[00:32:55] I think the actual number's probably less than $30. For How long are you for? It was less than an hour to train the thing. It's 50, I mean it's 50 thou alpacas, 50,000 records. Right.[00:33:04] And you've open sourced the, the notebook, which people can check out what[00:33:07] gonna show notes. There's. The risk that I am making this up is zero.[00:33:11] Yeah. No, no, no. I'm not, I'm[00:33:12] not saying the I know you're not. I'm just saying I'm, I'm, I'm leaving break rooms for people to say, Hey, it, it's 30[00:33:17] bucks, takes an hour. Go do it. It's, it's crazy. And, and that's like the, I mean, you think about, I yeah, I, I, I know for a fact that you're not suggesting that, but it's just like, what's nuts is that you can just try it.[00:33:28] You know, you can, if you have 30 bucks, you can stand this thing up and, um, on a single machine, execute this training run. And I think I talked about like this idea that it's kind of like a phase transition. What's surprising about it, if you were to say, Hey, given a corpus of millions of instruction pairs, you can for.[00:33:50] $10,000, which is still an order of magnitude less than it cost to train the thing, get this qualitatively different behavior. I'd be like, yeah, that that sounds about right. And it's like, yeah, if you have an afternoon, like you can do this. That was not certainly, it was not obvious to me that that was true.[00:34:08] I think especially like, you know, like with libraries, like deep speed that, you know, so deep speed is a, is a library that gives you many different options for dealing with models that don't fit in memory and helping increase the effective batch size by, you know, for example, putting the entire model on a GP on several different GPUs and then having device local batches that are then the gradients are, are accumulated, are sort of aggregated for those, those from those different devices to get an effective batch or sharding the actual different model submodules across GPUs.[00:34:43] And this is all available in the notebook and the, the model that we train does not fit on a single device. And so you have to shard the model across the GPUs to run the training, you know, an incredible time that like this technology is just like free and open source and it's like the Microsoft team and the, you know, the hugging face team have made it so easy.[00:35:04] To accomplish things that even just two years ago really required a PhD. And so it's like level of effort, capital expenditure, substantially less than I would've expected. Yeah.[00:35:17] And you, you sort of co-evolve this cuz you also happen to work on the infrastructure optimization[00:35:21] team. Yeah, I mean that's kind of, um, like, you know, this is really kind of a separate project at Databricks, which is like making sure that we have a great customer experience and that we have the resources that are required for all of our customers.[00:35:37] You can push a button, get a computer, uh, get a Spark cluster. And I think when you look to a world where everybody is using GPUs on Databricks, making sure that we are running as efficiently as possible so that we can make Databricks a place that is extremely cost effective to train and operate these models.[00:35:55] I think you have to solve both problems simultaneously. And I think the company that does that effectively is, um, is gonna create a lot of value for the market.[00:36:06] Databricks Open Source[00:36:06] Yeah. You mentioned Spark, obviously Databricks, you know, Started, like the founders of Databricks created a spark. Yeah. At Berkeley. Then, you know, from an open source project, you start thinking about the enterprise use cases.[00:36:18] You end up building a whole platform. Yeah. You still had a lot of great open source projects like uh, ML Flow, Delta Lakes. Yeah. Um, yeah. Things like that. How are you thinking about that was kind of the ML ops phase. Yeah. Right. As you think about the l lm ops, like needs, you know, like obviously. We can think of some of these models as the spark, so to speak, of this new generation.[00:36:39] Like what are some of the things that you see needed in infrastructure and that maybe you're thinking about building?[00:36:44] Yeah, I mean, um, so kind of first to address this, this matter of open source. I think, you know, Databricks has done a lot of things that, and has released into the public domain a lot of technologies where a reasonable person could have said, you should.[00:37:00] Treat that as IP that you and no one else has. And I think time and again, the story has been more, is better and we all succeed together. And when you create a new class, people rush in to fill it with ideas and use cases and that it's, it's really powerful. It's both good business and it's good for the community.[00:37:21] And Dolly I think is very much a natural extension of that urge, which just, I think reflects our founders tastes and beliefs about markets and, and technology[00:37:31] LLMOps and Prompt Tooling[00:37:31] when it comes to LM ops, which is not a phrase that rolls off the tongue. We'll, we're gonna need something better than that. We, this kinda gets back to like what is a thumbnail for text.[00:37:43] Mm-hmm. One of the things that my team winds up doing a fair amount of right now is like slacking back and forth examples of like generated samples. Okay. Because like these evaluation benchmarks do not capture the behaviors of interest. And so we often have like a reference battery of prompts. Let's say 50 to a hundred.[00:38:03] Write a love letter to Edgar and Poe. Yeah. Give me a list of ins. Like what are, what are one of our things is what are considerations? Like it should keep in mind when planning for a backcountry backpacking trip can you generate a list of reasonable suggestions for a backpacking trip. And you see, as you kind of move the model through the loss curve under instruction tuning that um, that behavior emerges and that like you kind of wind up qualitatively evaluating is the model doing what I want in respect to these prompts that I've seen many different models answer this model or this, this instruction tuning data set is generating shorter completions.[00:38:40] This one is generating the. Wackier completions, you know, this one is much likelier to produce lists all of these things. I don't know if you've seen Nat Devrel. Mm-hmm. I'm sure, of course you have that idea of the grid of like, I want to run inference in parallel on arbitrary prompts and compare and contrast, like tooling like that is going to make it, and especially with a fast inference layer, and this is where I think Databricks has a lot of opportunity to create value for people is being able to serve, interact, and measure the behavior of the model as it changes over time and subject it not only to quantitative.[00:39:19] Benchmarks, but also qualitative subjective benchmarks plus human in the loop feedback where imagine that I burn a model checkpoint and every thousand steps, I send it off to an annotation team and I get a hundred pieces of human feedback on the results. And it's like there's kind of like what is the right volume of human feedback to get to statistical significance?[00:39:43] But I think there is. An ensemble, you know, each of these is like a different perspective on the behavior of the model. A quantitative, qualitative, and then human, uh, feedback at scale. Somebody's going to build a product that does these things well in a delightful user form factor. And that is fast and um, addresses the specific needs of AI developers.[00:40:04] And I think that business will be very successful and I would like for it to be Databricks. Ah, okay.[00:40:10] Teasing what you might be[00:40:11] building. Interesting. You know, and this, not to make forward-looking statements, but it's just like, make sense as obvious as a person, you wanna do it? Mm-hmm. I need that. Yeah.[00:40:19] Yeah. I need that. Yeah. I happen to work at a company.[00:40:21] Yeah. So just to push on, uh, uh, this one a little bit, cuz I have spent some time looking into this. Sure. Have you come across prompt layer? That would be one of the leading tools. And then I think Human Loop has a little bit of it, but yes, it's not a course focus of theirs, is it?[00:40:34] Prompt layer? Yeah. I'll, okay. Send And happy to drop that reference cuz uh, he has reached out to me and I, I looked at his demo video and it, yeah, it kind of is, isn't that in the ballpark? And I think there are a lot of people, uh, zeroing in on it. But the reason I have not done anything in, in, in this area at all is because I could just do it in a spreadsheet.[00:40:51] Like all you need to do is Yeah.[00:40:53] Spreadsheet function that you can, but I mean like editing text and Google Sheets is a drag. Is it? Yeah. I, I mean mm-hmm. What's missing? You know? Oh, so a, like the text editing experience in it, like you're trying to wrap these cells. Okay. And so now you gotta like double click to get into the editing mode.[00:41:12] I think they struggle with large record sets. So like the spreadsheets slow down, you kind of want, this is not some, like a, this specific question of like, how does Google Sheets fail to meet the need is something that, you know, I don't have a talk track around Sure. But like linking it to an underlying data source where it's sort of like persisted.[00:41:34] Cuz now I'm, now I have a bunch of spreadsheets that I'm managing and it's like, those live on in Google Drive, which has kind of a garbage ui. Or is it on my local machine? Am I sending those around? Like, if, can I lock the records so that they can't be annotated later? How do I collect multiple evaluations from different people?[00:41:50] How do I compute summary statistics across those evaluations? Listen, I'm the first person to like, fire up sublime. Yeah. You know, like, keep it simple, right? Yeah. Just for sure. I feel like the, the way that I have talked with colleagues about it is it's like we are emailing around. Photocopies of signed printouts of PDFs and DocuSign doesn't exist yet, and nobody realizes that they're doing this like ridiculous dance.[00:42:16] And I get it. I too have used Google Sheets to solve this problem, and I believe that they're, there's maybe something better. I've Stockholm Syndrome.[00:42:26] "I'm a Sheets Maxi"[00:42:26] So there's a couple more that I would highlight, uh, which is Quadra. Uh, okay. Uh, full disclosure, an investment of mine, but basically Google Sheets implement, implemented a web assembly.[00:42:35] Yeah. And a, and a canvas. Okay. And it speaks Python and sql. Yeah. Yeah. And, uh, and Scala. Yeah. Uh, so I, I think, I think, yeah, there, there's some people working on interesting hearings[00:42:46] at those. And what you could do is like, like imagine that you have a Google Sheets type ui, the ability to select like a column or a range and subject all of those values to a prompt.[00:42:59] Yes. And like say like, I have template filling and I want, that's what I want. My problem[00:43:04] with most other SaaS attempts is people tend to build UIs that get in your way of just free range experimentation. Yes. And I'm a sheet's, uh, maxi. Like if I can do it in a sheet, I'll do[00:43:16] in a sheet, you know? Yeah. Well, and I mean, kind of to continue, like on the sheets, sort of mining that vein, you know, on the, sort of like how does AI impact the workplace and like human productivity?[00:43:29] I think like a, I really like the metaphor, which is comparing, uh, AI technologies to the development, the advent of spreadsheets in the eighties, and this idea that like you had a lot of professionals who were like well educated, like serious people doing serious accounting and finance work, who saw as their kind of core job function manually calculating.[00:43:53] Values in forecasts on paper as like, this is how I create value for the business. And spreadsheets came along and I think. There was a lot of concern that like, what am I gonna do? Yeah. With my days? And it turns out that like I think of it sometimes, like being in a warm bath and you don't notice how nice the water is until you wiggle your toes a little bit.[00:44:14] You kind of get used to your circumstances and you stop noticing the things that would stand out.[00:44:19] AI and Workplace Productivity[00:44:19] So on the subject of how artificial intelligence technologies will shape productivity in the workplace, you have, I think, a good metaphor in comparing this to spreadsheets and the Adventist spreadsheets In the eighties, I think you had a lot of really serious people who were taking, making an earnest effort to be as productive and effective as possible in their lives, who were not making it their business to waste time.[00:44:42] Saw spreadsheet technology come out and it's like, man, well what am I gonna do? I'm the person that calculates things. Like I write it all down and that's how I create value. And then like you start using this new tool and it's like, oh, it turns out that was the Ted most tedious and least rewarding part of my job.[00:44:58] And I'm just so, you know, like I have, like, I still have that human drive to create. You just kind of point it at like more pressing and important problems. And I think that, that we probably don't, especially, and even when it comes to writing, which feels like a very like quintessentially human and creative act, there's a lot of just formulaic writing that you have to do.[00:45:22] Oh yeah. And it's like, maybe I shouldn't be spending my time on all of that kind of boiler plate. And, you know, there's a question of like, should we be spending our time reading boilerplate? And if so, why is there so much boiler plate? But I, I think that humans are incredibly resourceful and incredibly perceptive as to how they can be effective.[00:45:43] And that, you know, the, I think it will free us up to do much more useful things with our time. I think right now[00:45:50] there's still a, a bit of a stigma around, you know, you're using the model mm-hmm. To generate some of the text. But I built a open source, like a email drafter. Yeah. So for all of my emails, I get a G PT four pre-draft response.[00:46:04] And a lot of them I just sent, but now I'm still pretending to be me.[00:46:07] Okay. So that's why I'm talking to you[00:46:09] When I talk to you, you need to fine tune it. Right.[00:46:12] But in the future, maybe it's just gonna be acceptable that it's like, Hey, we don't actually need to spend this time, you and I talking. Yes. It's like, let the agents like cash it out and then come back to us and say, this[00:46:22] is what you're gonna do next.[00:46:23] Articulate your preferences and then you, I think this like trustworthiness is a piece of this here where like hallucinations, T b D, whether it is like actually attractable problem or whether you need other affordances like grounded methods to, to sort of. Is a hallucination, just a form of creativity, like, we'll see.[00:46:42] But um, I do think eventually we'll get to a point where we can, we trust these things to act on our behalf. And that scenario of like calendaring, for example, or just like, you know, even working out contract details, it's like, Just let me tell you exactly what I want and you make sure that you faithfully represent my interests.[00:47:00] That'll be really powerful.[00:47:02] OpenAssistant[00:47:02] So we haven't run this by you, but uh, I think you have a lot of opinions about, you know, the projects that are out there, uh mm-hmm. And three that are, are on mine. For one, you've already mentioned Open Assistant two, cereus, G B T also came out roughly in the same timeframe. I'm not sure if you want to comment on it, I'd like to compare because they, they also had a similar starting point as as you guys, and then three Red Pajama, which, uh, was just out this morning.[00:47:24] Yeah. We might, as might as well get a soundbite from you on your thoughts. So yeah, if you want to pick one, what was the first one? Uh, open Assistant.[00:47:30] Yeah. So, I mean, open Assistant is awesome. I love what they've done. I will be eager to use their free and open data set, uh, to improve the quality of Dolly three.[00:47:41] CerebrasGPT[00:47:41] Yeah, but also just like we're seeing the, the training is, so Cerus is a good example of, you know, I think they were, my understanding, and I don't know that team or really, you know, I haven't looked too closely at the technology, but I have worked with the model is that it's a demonstration of their capabilities on this unique chip that they've designed where they don't have to federate the models out to multiple cards.[00:48:04] But I think if you look at some of the benchmarks, it is on par or maybe a little shy of some of the Ethe I models. And I think that one of the things that you may see here is that the market for foundation models and like the importance of having your own foundation model is actually not that great.[00:48:27] That like you have a few. Core trains that people, I think of these kind of like stem cells where, you know, a stem cell is a piece of is, is a cell that can become more like its surrounding context. It can become anything upon differentiation when it's exposed to eye tissue or kidney tissue. These foundation models sort of are archetypal and then under fine tuning become the specific agent that you have a desire for.[00:48:53] And so I think they're expensive to train. They take a long time to train. Even with thousands of GPUs, I think you're still looking at like a month to stand up some of these really big models, and that's assuming everything goes correctly. And so what Open Assistant is doing is. I think representative of the next stage, which is like open data sets, and that's what the Dolly release is also about, is, I kind of think of it like an upgrade in a video game.[00:49:21] I don't play a ton of video games, but I, you know, I, I used to, and I'm familiar with the concept of like, your character can now double jump. Mm-hmm. Right. Great. You know, it's like, here's a data set that gives it the ability to talk to you. Hmm. Here's a data set that gives it the ability to answer questions over passages from a vector index.[00:49:38] I think anybody who's listening, I think there's a tremendous opportunity to create a lot of value for people by going through this exercise of the unsexy work, of just writing it down and figuring out ways to do that at scale. Some of that looks like semi-synthetic methods, so something I would love to see from the Dolly data set.[00:49:58] Is paraphrasing of all the prompts. So basically you now have multiple different ways of saying the same thing and you have the completions which are correct answers to different variants of the question. I think that will act as like a regular, it's kind of like image augmentation. I was gonna say, you flip it.[00:50:13] Yeah. Yeah. I believe that that will work for language. Like one of the things you could do. Cause we, we saw that within 24 hours the dataset had been translated into Spanish and Japanese. The dolly dataset. Yeah, it was, I mean, you know, it's maybe, yeah. Yeah. Right. Yeah. So that's super cool. Um, and also something that is only possible with open data.[00:50:31] Well, it's only useful with open data, but I just last night was thinking like, I wonder if you could to paraphrase, cuz it's not obvious to me like what the best and state of the most state-of-the-art paraphrasing model is. You could use Google Translate potentially and take the prompt. Translate it to Spanish and then translate it back to English, you get a slightly different way of saying the same thing.[00:50:54] Ah, right. So I think the self instruct paper is really about like few shot prompting to get more prompts and then using large models to get completions and then using human annotators to judge or train a reward model. I think that bootstrapping loop on the back of these open data sets is going to create multimillion scale training corpuses.[00:51:14] And so I, what Open Assistant has done is a, it's a great model. I don't know if you've tried their interactive chat, but it's just really quite an impressive accomplishment. But that the gesture towards open data that you know, the Dolly dataset and the open assistant dataset represent, I think is probably gonna define the next six to nine months of.[00:51:35] RedPajama[00:51:35] Work in this space. Um, and then the red, a red pajama. Red pajama, I mean, yeah, it's like I said, you can do a close read of the LLaMA paper. There's the dataset section and I think they use seven distinct data sets, archive, and I think maybe Stack exchange and common crawl.[00:51:50] Okay. So they have common crawl.[00:51:52] Yep. C4, which is Common crawl, but filtered subset. Yeah. Uh, GitHub archive books. Wikipedia Stack Exchange.[00:51:59] Yes. So, you know, take Common Crawl, for example, when you read the lLLaMA paper. So a common crawl I think is three terabytes in the lLLaMA paper. It's not something you just download from, like it's, you have to produce this data set, or at least the CC net, um, implementation that they reference there.[00:52:18] And you have like a single paragraph in this research paper dedicated to how they produce Common Crawl and they do near de-duplication. They train a model to predict whether something is likely to be a link, a reference link on Wikipedia. And there's just a bunch of other stuff that. Not only from like a, where do you get the model to predict whether something is a link as a reference on Wikipedia when you train it and then like where's your cut point?[00:52:41] You know, now you have kinda this precision recall trade off and it's like those decisions have material impacts on the quality and the character of the model that you learn. But also just from a scale standpoint, like building Common Crawl locally requires like a non-trivial distributed systems left.[00:52:59] And so I think Red Pajama is, and I think it's Mila and Chris Ray's lab hazy research, I think, or at least he's attached and together and I think together is kind of leading. There's a bunch of great teams behind that and so I have no reason to think they didn't do. The hard, difficult work correctly.[00:53:21] Yeah. And now is this major piece of the lift if you're wanting to do a lLLaMA repro in public. And I think that's would very naturally be the next step. And I would be kind of surprised if a train was not currently underway. Everybody agrees. LLLaMA is very, very strong. Also, we agree that it is not open incentives for somebody to spend a couple million bucks and produce it and then be the team that opened this architecture is, are quite high.[00:53:50] Mm-hmm. So I, I think in the next, you know, you asked for like predictions. I think we're five months at most away from a open LLaMA clone that is as high quality as, as what meta is produced. I will be disappointed if that's not the case.[00:54:07] Why Dolly > OpenAI GPT[00:54:07] And I think like there's the big distinction between what is open and what is like, Open in a way that is commercially usable.[00:54:13] Yeah. After that, I know the Dolly two post, you mentioned that you had a lot of inbound with Dolly. Yeah. 1.0. But a lot of businesses could not use it. Yeah. Because of where the data training data came from. Yes. What are some of the use cases that people have? There is, uh, a lot of it kind of like talking to your data.[00:54:30] Are there like, uh, other things that are maybe people are not thinking about using it for?[00:54:34] Yeah, so I mean, we have a number of customers who have reached out with really concrete use cases around customer support ticket resolution. One of the things that a lot of business open AI's models are incredibly powerful, and Databricks wants to be a business where you can use the right tool for the job.[00:54:55] Like if you have information from the public web, let's say you have forum posts, right, that you need to synthesize and process, that's just not sensitive information. You should be able to use truly whatever model. That might be a fine-tuned model that is like laser focused on your problem. It might be a general instruction following model and, and sort of whatever kind of intelligence GPT4 is, it's, you know, it's quite powerful.[00:55:20] You should be able to use those tools. There are definitely use cases in the enterprise where it's like, I either just, I'm not interested in sharing this ip. You know, these are effectively our state secrets. Or from a regulatory and compliance standpoint. I just can't send this data to a third party sub-process or something.[00:55:38] Even as quotidian is like, I just really don't want to go through procurement on this. You know, like it's kind of around those, um, I have some reasons to keep this in house. A lot of use cases like that and that, you know, I'm not a lawyer and so I won't speculate on the sort of actual licensing considerations or the actual obligations, but it's just like people like to be able to move confidently and what we've done with Dolly is make it super clear.[00:56:09] This model and this data set are licensed for commercial use. You can build a business on the back of this. And that, I think is a big part of why the response has been so positive.[00:56:19] Open Source Licensing for AI Models[00:56:19] Hugging face has, uh, the rail license responsible, um mm-hmm. AI license, which isn't recognized as open source yet. So that was the whole problem with stable diffusion, that it's just unclear cuz this, this is completely new license that is, uh, unproven.[00:56:32] But I just find it interesting that the existing open source licensing regime is mostly around code. And right now, you know, the, the value has shifted from code to the waits.[00:56:43] Yes. I think we can go on a three hour rant about the open source initiative and like who decides what an open source license is.[00:56:51] But I think there's a, I think the approach of like, hey, We know what commercial uses. Like this is good for it. Yes, it's good. You're not gonna have to worry about us suing you. It's like, you know, the semantics of it. Clear is always better. Exactly. It's like we don't need to be approved by the osi. Yeah.[00:57:07] You're gonna be okay. Just[00:57:09] Why Open Source Models?[00:57:09] to kind of like continue, like why open source? Yeah. I think that like it is with many eyes, all bugs are shallow. I think the reality is that like we do not know what the challenges we face with AI systems will be. Mm-hmm. And that the likelihood that we can get it a representative and comprehensive solution to the challenges they present by putting it in public and creating research artifacts that people who deal with ethics bias, ai, safety, security, these really sort of thorny issues, that they can take a hard look at how the actual thing is built and how it works and study it comprehensively rather than, Hey, we've got a team for that.[00:57:50] You're gonna mm-hmm. Just, you're just, we're just gonna need you to trust our work. I think I wanna be in that the former future rather than sort of like, I, I hope that people have done this correctly. I hope that this is somebody is taking care of this.[00:58:05] Moving Models[00:58:05] When people[00:58:06] evaluate this, how do you think about moving between models?[00:58:10] You know, obviously we talked about how the data set kind of shapes how the model behaves. Hmm. There's obviously people that might start on open AI and now they wanna try dollies. Yeah. Like what are some of the infrastructure there that maybe needs to be built to allow people to move their prompts from model to model?[00:58:26] Like to figure out, uh, how that works.[00:58:28] That's really interesting. Um, because you see even like moving between GPT3.5 and GPT4 that the behavior, like some things that were not possible on three five are No, I mean, many, many things that were not possible on three five are not possible on four, but you kind of want like slightly different problem formula, like slightly different prompt formulations or.[00:58:51] It's kind of like you want regression tests for prompts, and you could see like an automated system, which is uh, helps design a prompt such that the output of this new model is isomorphic to the outputs of the previous model. And sort of like using a language model to iterate on the prompt. So it just kind of evolves it to like adapt to the new model.[00:59:13] I have two beautiful boys who are, they're just incredible humans and my friend Ben and I built them a, an interactive choose your own adventure storytelling book that uses ChatGPT to generate stories and then options within those stories, and then uses open AI's image generation model Dolly to illustrate.[00:59:36] Those options. And then the kids can kind of choose their way through these stories. And the thing that you really like when you start to really push these things for more than just like single turn prompt response and I'm, I'm, you know, it's fine if it's language and you really need it to be like an api.[00:59:52] Is that like 19 times out, 20 it's like an p i and then the 20th generation. It's like just a totally different format. And he just like, you really like try to in the system prompt really seriously. I just only want you to give me three options. Yeah. And letter A, B, C, you know, I think that from a regression test standpoint, how do you know, like if I run this prompt a hundred times does a hundred out of a hun, does it come back a hundred out of a hundred in the format and sort of character that I require?[01:00:21] That's not something a person can really do effectively, and so I think you do need sort of model meta models that judge the outputs and that manage those migrations. Mm-hmm. Yeah, so I had, that's an interesting. Product class. I hadn't thought about it too much. Yeah.[01:00:34] Learning in a Simulation[01:00:34] When you mentioned before the example of the, you know, back country trip, I was like, yeah, it would be so cool if you had a, like a simulation where like, okay, this is the list you had.[01:00:44] Now I have this game where like I'm putting a character with that inventory and see if they survive in the back country. Cause you can like, you know, the first time I went to Yellowstone to camp, I forgot to pack like a fly for my tent and obviously it rained. That's because, you know, you get punished[01:00:58] right away.[01:00:59] Yeah. That's the environment providing you with a gradient. Exactly. Update your model eight. You should be grateful to have such an excellent Yeah. Mini[01:01:06] these models like the, the evolutionary piece that is missing is like, these models cannot. Die. They cannot break a arm. They cannot, when they make suggestions, like they don't actually Yeah.[01:01:16] Have any repercussion on them. Um, so I'm really curious if in the future, you know, okay, you wanna make a poem, uh, you know, I love poem. Now we're gonna send this structural people. Yeah. And if you get rejected, your model's gonna[01:01:28] Why Model Reflexion and Self Criticism Works[01:01:28] die. So I think like one of the things that's cool about Lang Chain, for example, we all know they're doing awesome work and building useful tools, but these models can tell if they're wrong.[01:01:38] So you can, like, you can ask a model to generate an utterance. And that next token prediction loss function may not capture. You may hallucinate something, you may make something up, but then you can show that generation to the same model. And ask it to tell you if it's correct or not. And it can, it can recognize that it's not, and I think that is a directly a function of the attention weights and that you can attend to the entire.[01:02:03] Whereas like for next token prediction, all I can see is the prefix and I'm just trying to choose and choosing sarcastically. Right. You're f frequently, like it's a weighted sample from the distribution over that soft softmax output vector, which does not have any. Reference to factuality, but when you resubmit to the model and you give it like, here's the entire generated passage, judge it in its completeness.[01:02:25] Well now I can attend to all of the token simultaneously, and it's just a much, much easier problem to solve. And so I think that like, oh, that's a cool insight. Yeah. Yeah. I mean it's, yeah. It's just, this is reflection. Yeah. You, you can just see what you said and like the model may contain enough information to judge it.[01:02:41] And so it's kind of like subject your plan mm-hmm. To an environment and see how it performs. I think like you could probably ask the model, I mean, we can try this today. Here's my plan for a trip. Critique it. Mm-hmm. Right? Like, what are, what are the things that could go wrong with this inventory? And I think that there's one scenario, there's one trajectory for this class of technologies, which would be like self-reflexive models where it is not super linear.[01:03:10] You don't get anything more than what is already contained in the models, and you just kind of saturate and it's like, okay, you need human feedback. There's another scenario, which is the alpha go scenario where models can play themselves and in observing their behavior and interactions they. Get stronger and better and more capable.[01:03:31] That's a much more interesting scenario and this idea that like in considering the entire generated sample, I have more insight than just when I'm sampling the next token. Mm-hmm. Suggests that there may. Be that escape potential in terms of getting super, you know, unsaturated returns on quality.[01:03:51] Lightning Round[01:03:51] Yeah, this was great, Mike kind of we're where a time, maybe we can jump into landing ground next.[01:03:55] We'll read you the questions again. Okay. If you wanna think about it. So, okay. Favorite AI[01:04:00] product? This is a boring answer, but it's true. Google Maps. Ah. And it's, how is it AI A, they're recently doing stuff with Nerf so that you can using Yeah. Multiple different photos. You can explore the interior of a business.[01:04:15] They are also undoubtedly, I mean like, I don't know the team at Google doing this, but digesting the sum total of human knowledge about each entity in their graph to like process that language and make judgements about what is this business? And listen, it's not an AI product, but it is a machine learning product categorically, and it's also an amazing product.[01:04:37] You forget how much you use it. I was at the coffee shop around the corner. I used it to figure out where to come. It was literally 150 meter walk, you know, it's just like that reflexive, but it's also from a, an information visualization. So I love maps. Mm-hmm. I opened our conversation saying that I think a lot about maps, that it is adaptive at multiple scales and will corson and refine the, the information that's displayed requires many, many judgements to be made and sim simultaneously about what is relevant and it's personalized.[01:05:08] It will take your intent. Are you driving? Okay, well show me parking garages preferentially. So it's very adaptive in such subtle ways that we don't notice it. And I think that's like great product design is like good editing. You don't notice it when it's good. Mm-hmm. And so I think Google Maps is an incredible AI ml.[01:05:28] Product accomplishment. Google Maps. Yeah. It's a great pick. Great. Well, and they need the help. Yeah.[01:05:36] It is actually the best ad uh, real estate, right? Like, there should be a ton of people buying ads specifically on Google Maps. Yeah. So they just show up and I, I don't know how big that business is, but it's gotta be huge.[01:05:45] Yeah. And, and then my subsequent thing is like, there should be Google Maps optimization, where you would name your business like Best Barbershop and it would show up as Best Barbershop when you look at it. Yeah,[01:05:55] of course. Right? Yeah. It's like AAA lock picks. Yeah. Right at the front of the Yellow Pages.[01:06:01] Favorite[01:06:01] AI people and communities you wanna shout out?[01:06:03] You know, I don't think that I have necessarily anything super original to say on this front. Um, The best of my understanding, this is an all volunteer effort and it's, you know, incredible what they have been able to accomplish. And it's like kind of in the constellation of projects.[01:06:20] You know, the additionally, I think these are what you would say and answer in response to this question, I think like the hugging face group is, it's kind of like Google Maps in a way, in the sense that you like, forget how complicated the thing that it's doing is, and I think they have. You see like specific people, I was thinking of STAs STAs, who works on the, works on a lot of the deep speed stuff, just super conscientious and like engaged with the community and like that the entire team at Hugging face is incredible and you know, they, you know, have made a lot of what is happening possible in the industry at large.[01:06:53] And so, um, and I think, yeah, this is like the power of open source ultimately Transformers, library, diffusers, all of it. It's just great. It's a great, it's a delightful product experience.[01:07:03] I think a lot of people, like I had, I once had hugging Face explained to me as Free, get LFS hosting. And I think they've, uh, they've moved beyond that in, in[01:07:11] recent years.[01:07:11] Yeah. A bit. Yeah. It's, it's quite strong work. Yeah.[01:07:14] Yeah. A year from now, what will people be the most surprised by in ai? You already[01:07:19] hinted[01:07:19] at? Uh, yeah, but I think that's not, like, I think that won't be surprising, I think as we're on a ballistic trajectory to having like a, an open lLLaMA reproduction. So here's something I think that will happen that we are not, like socially, we don't have a lot of priors for how to deal with, so this ghost writer track just came out this Kanye West Weekend.[01:07:40] Mm-hmm. AI collaboration. He has thoughts, Drake? Yeah. His thoughts. It's not really, Dave has thoughts. It's not really like, I, I like a different breed of hiphop, but like, it's. For an example of the class, it's like that does sound like a thing I might hear on the radio. So there's a world in, so skip flag was this knowledge graph that's builds itself from your workplace communication.[01:08:02] Think about all of the times that you have expressed your position and intent around a given topic in workplace communication or on the internet at large. I think like character AI is going in this direction where you're going to be able to talk to high fidelity avatars that represent the beliefs and intents of people around you, and that it will be both useful and convincing.[01:08:27] I don't know that like society has good models for how to sort of adapt to that existing and that it will, I suspect just on the basis of like what people are doing. Happened rather quickly at first.[01:08:41] Listen, you can definitely tell it's really good. Mm-hmm. I'm really curious what the long-term results are gonna be, because once you listen it once or twice, you can tell that it's like, it's not really like a coherent song kind of written.[01:08:55] But to me that the funniest thing is that actually, so Drake and the Weekend that never made a song together again because they kinda had a, a follow up between then and, and the Weekend at One song where he said, if you made me then replace me. Because Drake basically hinting that like if he didn't put the weekend on his album, he would've never become popular.[01:09:13] Okay. So it's funny that now there's like this AI generated song from the weekend. It just kind of puts the, you know, if you made me then replace me line in in a different context. But I think this will be super interesting for the labels, you know, like a lot of them do on the Masters to a lot of this music they do on, yeah.[01:09:31] A lot of rides. So, At some point, it's much easier to generate music this way than to do it in person. But I still think you need the artist touch.[01:09:39] Just like what is it that is unique and what, you know. I think artists frequently, you know, I, I know in my own writing and sort of like creative process, you sometimes feel like you're just going through the motions.[01:09:50] And it's funny how we have ways of talking about a phrase rolls off the tongue. That's very much like a causal language model. Mm-hmm. Where like we talk about talk tracks. I have a whole spiel, you know, you talk to a startup founder and you're like, oh my God, how many times have you said like, very close, like very tight variance on this Three minutes sometimes.[01:10:10] That's good. Yeah. It's, it's fine. It's just, it's a thing that we do. And so touching on this idea that like some of what we consider creative acts may not actually be creative acts and sort of, is there a pr, is there a market pressure to favor things that are truly creative versus just like formulaic and like re like rehashing kind of the same essence?[01:10:29] I think like art. Transcends boundaries is often the most interesting art to engage with, where it, it truly does confront you with something you haven't considered before. I hope that that's the place where humans play. And that they're kind of like, oh, I just need some lo-fi study beats. It's like, just gimme an infinite stream.[01:10:49] I'm fine. Because I'm just like,[01:10:52] you've seen that chart of like pop uh, songs, declining interns of the key changes, key changes in[01:10:58] Octa ranges. Completely. Completely. And like, I mean, we used to have[01:11:02] Bohemian Rhapsody and, and[01:11:03] yeah, it's a great example of something that would not be priced appropriately.[01:11:08] This is why I, I think perplexity AI is just very well named because we want more perplexity in our lives. Yes, by the way, shout out for replica ai. I don't know if you've come across them, but Absolutely. They are working on the digital twin stuff. Okay. Ai, uh, request for startups. AI thing you would pay for if someone[01:11:21] built it.[01:11:22] Well, so the LM op stuff for sure. Just like make it easy to generate and evaluate samples using multimodal, multimodal, I mean multiple modalities, not images and texts, but rather like humans, quantitative benchmarks and qualitative Oh, samples that I, I am able to evaluate myself, but other AI startups. I think that we have your sister, your wife, your wife has family that works in the park system.[01:11:49] Mm-hmm. Because it is so everybody has access to effectively the same information about what's interesting in the outdoors. I think you get to a lot of trail heads and you have very, very tight parking lots and it's difficult to get to a lot of these beautiful places. And like, um, mere Woods is another example of like, you gotta reserve a parking spot in the woods that's a plumber.[01:12:12] But I think that the US in particular is so unique in that we have such an expansive public lands, and I think that there are a lot of really majestic and beautiful places in the world that are not written about. And so I think from a geospatial standpoint, you could imagine representing each tile on a map like a word deve.[01:12:39] Embedding where you look at the context in which a location exists and the things people have said about it, and you, you kind of distill the essence of a place and you can given a statement about how I wanna spend my day route traffic more evenly. On the surface of the earth so that we are not all competing for the same fixed pool of resources.[01:13:03] I don't know that that's something really that's monetizable in like a, you know, is this gonna be the next 10 billion business sort of way. But like there's so much public land and there's so many back roads and like the days where I have, you know, rumbling down a dirt road, my brother are just the best days of my life.[01:13:22] And, uh, I want more of those. I want systems that help us live as fully as possible as humans. Yeah, there's definitely[01:13:29] a lot of, you know, you got the. The most popular trails. Everybody wants to be there. Yeah. And then there's the less known ones. And I feel like a lot of people back to the text to back is like, they don't know what they're gonna find, you know?[01:13:41] Mm-hmm. There's not like YouTube reviews of all these trails. Totally. But like you can see it. So I think a way to, to better understand that would be, would be cool.[01:13:49] I mean, just to kind of like riff on this just a little more and we can wrap, like I do think there's a AI technology as a swarm management.[01:13:59] Tool, you know, being able to perceive sensor and camera inputs from multiple different agents in a system. And I think about like ultra low powered gliders as an example of like, I would like to be able to get, I mean, there, there are tools now where you can, uh, for 180 bucks get a satellite to take a da a picture today of like a five by five kilometer area.[01:14:21] I just wanna be able to run recon fleets on the back country and get like up to date trail conditions. I don't know that anybody's gonna make any real money doing this, but if it existed, I would use it. So maybe I should build it maybe. Yeah, exactly. Open source. It's part of Databricks longstanding commitment to open source for diversifying new markets.[01:14:44] Awesome. Mike, it was, it was great[01:14:45] to have you. Oh, this was a, yeah. Get full access to Latent Space at www.latent.space/subscribe
undefined
Apr 22, 2023 • 1h 4min

AI-powered Search for the Enterprise — with Deedy Das of Glean

The most recent YCombinator W23 batch graduated 59 companies building with Generative AI for everything from sales, support, engineering, data, and more:Many of these B2B startups will be seeking to establish an AI foothold in the enterprise. As they look to recent success, they will find Glean, started in 2019 by a group of ex-Googlers to finally solve AI-enabled enterprise search. In 2022 Sequoia led their Series C at a $1b valuation and Glean have just refreshed their website touting new logos across Databricks, Canva, Confluent, Duolingo, Samsara, and more in the Fortune 50 and announcing Enterprise-ready AI features including AI answers, Expert detection, and In-context recommendations.We talked to Deedy Das, Founding Engineer at Glean and a former Tech Lead on Google Search, on why he thinks many of these startups are solutions looking for problems, and how Glean’s holistic approach to enterprise probllem solving has brought so much success. Deedy is also just a fascinating commentator on AI current events, being both extremely qualified and great at distilling insights, so we also went over his many viral tweets diving into Google’s competitive threats, AI Startup investing, and his exposure of Indian University Exam Fraud!Show Notes* Deedy on LinkedIn and Twitter and Personal Site* Glean* Glean and Google Moma* Golinks.io* Deedy on Google vs ChatGPT* Deedy on Google Ad Revenue* Deedy on How much does it cost to train a state-of-the-art foundational LLM?* Deedy on Google LaMDA cost* Deedy’s Indian Exam Fraud Story* Lightning Round* Favorite Products: (covered in segment)* Favorite AI People: AI Pub* Predictions: Models will get faster for the same quality* Request for Products: Hybrid Email Autoresponder* Parting Takeaway: Read the research!Timestamps* [00:00:21] Introducing Deedy* [00:02:27] Introducing Glean* [00:05:41] From Syntactic to Semantic Search* [00:09:39] Why Employee Portals* [00:12:01] The Requirements of Good Enterprise Search* [00:15:26] Glean Chat?* [00:15:53] Google vs ChatGPT* [00:19:47] Search Issues: Freshness* [00:20:49] Search Issues: Ad Revenue* [00:23:17] Search Issues: Latency* [00:24:42] Search Issues: Accuracy* [00:26:24] Search Issues: Tool Use* [00:28:52] Other AI Search takes: Perplexity and Neeva* [00:30:05] Why Document QA will Struggle* [00:33:18] Investing in AI Startups* [00:35:21] Actually Interesting Ideas in AI* [00:38:13] Harry Potter IRL* [00:39:23] AI Infra Cost Math* [00:43:04] Open Source LLMs* [00:46:45] Other Modalities* [00:48:09] Exam Fraud and Generated Text Detection* [00:58:01] Lightning RoundTranscript[00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO and residence at Decibel Partners. I'm joined by my, cohost swyx, writer and editor of[00:00:19] Latent Space. Yeah. Awesome.[00:00:21] Introducing Deedy[00:00:21] And today we have a special guest. It's Deedy Das from Glean. Uh, do you go by Deedy or Debarghya? I go by Deedy. Okay.[00:00:30] Uh, it's, it's a little bit easier for the rest of us to, uh, to, to spell out. And so what we typically do is I'll introduce you based on your LinkedIn profile, and then you can fill in what's not on your LinkedIn. So, uh, you graduated your bachelor's and masters in CS from Cornell. Then you worked at Facebook and then Google on search, specifically search, uh, and also leading a sports team focusing on cricket.[00:00:50] That's something that we, we can dive into. Um, and then you moved over to Glean, which is now a search unicorn in building intelligent search for the workplace. What's not on your LinkedIn that people should know about you? Firstly,[00:01:01] guys, it's a pleasure. Pleasure to be here. Thank you so much for having me.[00:01:04] What's not on my LinkedIn is probably everything that's non-professional. I think the biggest ones are I'm a huge movie buff and I love reading, so I think I get through, usually I like to get through 10 books ish a year, but I hate people who count books, so I should say the number. And increasingly, I don't like reading non-fiction books.[00:01:26] I actually do prefer reading fiction books purely for pleasure and entertainment. I think that's the biggest omission from my LinkedIn.[00:01:34] What, what's, what's something that, uh, caught your eye for fiction stuff that you would recommend people?[00:01:38] Oh, I recently, we started reading the Three Body Problem and I finished it and it's a three part series.[00:01:45] And, uh, well, my controversial take is I did not really enjoy the second part, and so I just stopped. But the first book was phenomenal. Great concept. I didn't know you could write alien fiction with physics so Well, and Chinese literature in particular has a very different cadence to it than Western literature.[00:02:03] It's very less about the, um, let's describe people and what they're all about and their likes and dislikes. And it's like, here's a person, he's a professor of physics. That's all you need to know about him. Let's continue with the story. Um, and, and I, I, I, I enjoy it. It's a very different style from, from what I'm used.[00:02:21] Yeah, I, I heard it's, uh, very highly recommended. I think it's being adapted to a TV show, so looking forward[00:02:26] to that.[00:02:27] Introducing Glean[00:02:27] Uh, so you spend now almost four years at gle. The company's not unicorn, but you were on the founding team and LMS and tech interfaces are all the reach now. But you were building this before.[00:02:38] It was cool, so to speak. Maybe tell us more about the story, how it became, and some of the technological advances you've seen. Because I think you started, the company started really close to some of the early GPT models. Uh, so you've seen a lot of it from, from day one.[00:02:53] Yeah. Well, the first thing I'll say is Glean was never started to be a.[00:02:58] Technical product looking for a solution. We were always wanted to solve a very critical problem first that we saw, not only in the companies that we'd worked in before, but in all of the companies that a lot of our, uh, a lot of the founding team had been in past their time at Google. So Google has a really neat tool that already kind of does this internally.[00:03:18] It's called MoMA, and MoMA sort of indexes everything that you'd use inside Google because they have first party API accessed who has permissions to what document and what documents exist, and they rank them with their internal search tool. It's one of those things where when you're at Google, you sort of take it for granted, but when you leave and go anywhere else, you're like, oh my God, how do I function without being able to find things that I've worked on?[00:03:42] Like, oh, I remember this guy had a presentation that he made three meetings ago and I don't remember anything about it. I don't know where he shared it. I don't know if he shared it, but I do know the, it was a, something about X and I kind of wanna find that now. So that's the core. Information retrieval problem that we had set out to tackle, and we realized when we started looking at this problem that enterprise search is actually, it's not new.[00:04:08] People have been trying to tackle enterprise search for decades. Again, pre two thousands people have been trying to build these on-prem enterprise search systems. But one thing that has really allowed us to build it well, A, you now have, well, you have distributed elastic, so that really helps you do a lot of the heavy lifting on core infra.[00:04:28] But B, you also now have API support that's really nuanced on all of the SaaS apps that you use. So back in the day, it was really difficult to integrate with a messaging app. They didn't have an api. It didn't have any way to sort of get the permissions information and get the messaging information. But now a lot of SaaS apps have really robust APIs that really let.[00:04:50] Index everything that you'd want though though. That's two. And the third sort of big macro reason why it's happening now and why we're able to do it well is the fact that the SaaS apps have just exploded. Like every company uses, you know, 10 to a hundred apps. And so just the urgent need for information, especially with, you know, remote work and work from home, it's just so critical that people expect this almost as a default that you should have in your company.[00:05:17] And a lot of our customers just say, Hey, I don't, I can't go back to a life without internal search. And I think we think that's just how it should be. So that's kind of the story about how Glean was founded and a lot of the LLM stuff. It's neat that all, a lot of that's happening at the same time that we are trying to solve this problem because it's definitely applicable to the problem we're trying to solve.[00:05:37] And I'm really excited by some of the stuff that we are able to do with it.[00:05:41] From Syntactic to Semantic Search[00:05:41] I was talking with somebody last weekend, they were saying the last couple years we're going from the web used to be syntex driven. You know, you siegal for information retrieval, going into a symantics driven where the syntax is not as important.[00:05:55] It's like the, how you actually explain the question. And uh, we just asked Sarah from Seek.ai on the previous episode and instead of doing natural language and things like that for enterprise knowledge, it's more for business use cases. So I'm curious to see, you know, The enterprise of the future, what that looks like, you know, is there gonna be way less dropdowns and kind of like, uh, SQL queries and stuff like that.[00:06:19] And it's more this virtual, almost like person that embodies the company that is like a, an LLM in a way. But how do you do that without being able to surface all the knowledge that people have in the organization? So something like Lean is, uh, super useful for[00:06:35] that. Yeah, I mean, already today we see these natural language queries as well.[00:06:39] I, I will say at, at this point, it's still a small fraction of the queries. You see a lot of, a lot of the queries are, hey, what is, you know, just a name of a project or an acronym or a name of a person or some someone you're looking for. Yeah, I[00:06:51] think actually the Glean website explains gleans features very well.[00:06:54] When I, can I follow the video? Actually, video wasn't that, that informative video was more like a marketing video, but the, the actual website was showing screenshots of what you see there in my language is an employee portal. That happens to have search because you also surface like collections, which proactively show me things without me searching anything.[00:07:12] Right. Like, uh, you even have Go links, you should copy it, I think from Google, right? Which like, it's basically, uh, you know, in my mind it's like this is ex Googlers missing Google internal stuff. So they just built it for everyone else. So,[00:07:25] well, I can, I can comment on that. So a, I should just plug that we have a new website as of today.[00:07:30] I don't know how, how it's received. So I saw it yesterday, so let, let me know. I think today we just launch, I don't know when we launched a new one, I think today or yesterday. Yeah,[00:07:38] it's[00:07:38] new. I opened it right now it's different than yesterday.[00:07:41] Okay. It's, it's today and yeah. So one thing that we find is that, Search in itself.[00:07:48] This is actually, I think, quite a big insight. Search in itself is not a compelling enough use case to keep people drawn to your product. It's easy to say Google search is like that, but Google Search was also in an era where that was the only website people knew, and now it's not like that. When you are a new tool that's coming into a company, you can't sit on your high horse and say, yeah, of course you're gonna use my tool to search.[00:08:13] No, they're not gonna remember who you are. They're gonna use it once and completely forget to really get that retention. You need to sort of go from being just a search engine to exactly what you said, Sean, to being sort of an employee portal that does much more than that. And yeah, the Go Links thing, I, I mean, yes, it is copied from Google.[00:08:33] I will say there's a complete other startup called Go links.io that has also copied it from Google and, and everyone, everyone misses Go Links. It's very useful to be able to write a document and just be like, go to go slash this. And. That's where the document is. And, and so we have built a big feature set around it.[00:08:50] I think one of the critical ones that I will call out is the feed. Just being able to see, not just, so documents that are trending in your sub-organization documents that you, we think you should see are a limited set of them, as well as now we've launched something called Mentions, which is super useful, which is all of your tags across all of your apps in one place in the last whatever, you know, time.[00:09:14] So it's like all of the hundred Slack pings that you have, plus the Jira pings, plus the, the, the email, all of that in one place is super useful to have. So you did GitHub. Yeah, we do get up to, we do get up to all the mentions.[00:09:28] Oh my God, that's amazing. I didn't know you had it, but, uh, um, this is something I wish for myself.[00:09:33] It's amazing.[00:09:34] It's still a little buggy right now, but I think it's pretty good. And, and we're gonna make it a lot better as as we go.[00:09:39] Why Employee Portals[00:09:39] This[00:09:39] is not in our preset list of questions, but I have one follow up, which is, you know, I've worked in quite a few startups now that don't have employee portals, and I've worked at Amazon, which had an employee portal, but it wasn't as beautiful or as smart as as glean.[00:09:53] Why isn't this a bigger norm in all[00:09:56] companies? Well, there's several reasons. I would say one reason is just the dynamics of how enterprise sales happens is. I wouldn't say broken. It is, it is what it is, but it doesn't always cater to employees being happy with the best tools. What it does cater to is there's different incentive structures, right?[00:10:16] So if I'm an IT buyer, I have a budget and I need to understand that for a hundred of these tools that are pitched to me all the time, which ones really help the company And the way usually those things are evaluated is does it increase revenue and does it cut cost? Those are the two biggest ones. And for a software like Glean or a search portal or employee portal, it's actually quite difficult when you're in, generally bucketed in the space of productivity to say, Hey, here's a compelling use use case for why we will cut your cost or increase your revenue.[00:10:52] It's just a softer argument that you have to make there. It's just a fundamental nature of the problem versus if you say, Hey, we're a customer support tool. Everyone in SaaS knows that customer support tools is just sort of the. The last thing that you go to when you're looking for ideas, because it's easy to sell.[00:11:08] It's like, here's a metric. How many tickets can your customer support agent resolve? We've built a thing that makes it 20% better. That means it's 1,000 thousand dollars cost savings. Pay us 50 k. Call it a deal. That's a good argument. That's a very simple, easy to understand argument. It's very difficult to make that argument with search, which you're like, okay, you're gonna get see about 10 to 20 searches that's gonna save about this much time, uh, a day.[00:11:33] And that results in this much employee productivity. People just don't buy it as easily. So the first reaction is, oh, we work fine without it. Why do we need this now? It's not like the company didn't work without this tool, and uh, and only when they have it do they realize what they were missing out on.[00:11:50] So it's a difficult thing to sell in, in some ways. So even though the product is, in my opinion, fantastic, sometimes the buyer isn't easily convinced because it doesn't increase revenue or cut cost.[00:12:01] The Requirements of Good Enterprise Search[00:12:01] In terms of technology, can you maybe talk about some of the stack and you see a lot of companies coming up now saying, oh, we help you do enterprise search.[00:12:10] And it's usually, you know, embedding to then do context for like a LLM query mostly. I'm guessing you started as like closer to like the vector side of thing maybe. Yeah. Talk a bit about that and some learning siva and as founders try to, to build products like this internally, what should they think[00:12:27] about?[00:12:28] Yeah, so actually leading back from the last answer, one of the ways a lot of companies who are in the enterprise search space are trying to tackle the problem of sales is to lean into how advance the technology is, which is useful. It's useful to say we are AI powered, LLM powered vector search, cutting edge, state-of-the-art, yada, yada, yada.[00:12:47] Put it all your buzzwords. That's nice, but. The question is how often does that translate to better user experience is sort of, a fuzzy area where it, it's really hard for even users to tell, to be honest. Like you can have one or two great queries and one really bad query and be like, I don't know if this thing is smart.[00:13:06] And it takes time to evaluate and understand how a certain engine is doing. So to that, I think one of the things that we learned from Google, a lot of us come from an ex Google search background, and one of the key learnings is often with search, it's not about how advanced or how complex the technology is, it's about the rigor and intellectual honesty that you put into tuning the ranking algorithm.[00:13:30] That's a painstaking long-term and slow process at Google until I would say maybe 20 17, 20 18. Everything was run off of almost no real ai, so to speak. It was just information retrieval at its core, very basic from the seventies, eighties, and a bunch of these ranking components that are put stacked on top of it that do various tasks really, really well.[00:13:57] So one task in search is query understanding what does the query mean? One task is synonymous. What are other synonyms for this thing that we can also match on? One task is document understanding. Is this document itself a high quality document or not? Or is it some sort of SEO spam? And admittedly, Google doesn't do so well on that anymore, but there's so many tough sub problems that it breaks search down into and then just gets each of those problems, right, to create a nice experience.[00:14:24] So to answer your question, also, vector search we do, but it is not the only way we get results. We do a hybrid approach both using, you know, core IR signal synonymy. Query accentuation with things like acronym expansion, as well as stuff like vector search, which is also useful. And then we apply our level of ranking understanding on top of that, which includes personalization, understanding.[00:14:50] If you're an engineer, you're probably not looking for Salesforce documents. You know, you're probably looking for documents that are published or co-authored by people in your team, in your immediate team, and our understanding of all of your interactions with people around you. Our personalization layer, our good work on ranking is what makes us.[00:15:09] Good. It's not sort of, Hey, drop in LLM and embeddings and we become amazing at search. That's not how we think it[00:15:16] works. Yeah. I think there's a lot of polish that mix into quality products, and that's the difference that you see between Hacker News, demos and, uh, glean, which is, uh, actual, you know, search and chat unicorn.[00:15:26] Glean Chat?[00:15:26] But also is there a glean chat coming? Is is, what do you think about the[00:15:30] chat form factor? I can't say anything about it, but I think that we are experi, my, my politically correct answer is we're experimenting with many technologies that use modern AI and LLMs, and we will launch what we think users like best.[00:15:49] Nice. You got some media training[00:15:51] again? Yeah. Very well handed.[00:15:53] Google vs ChatGPT[00:15:53] We can, uh, move off of Glean and just go into Google search. Uh, so you worked on search for four years. I've always wanted to ask what happens when I type something into Google? I feel like you know more than others and you obviously there's the things you cannot say, but I'm sure Google does a lot of the things that Glean does as well.[00:16:08] How do you think about this Google versus ChatGPT debate? Let's, let's maybe start at a high level based on what you see out there, and I think you, you see a lot of[00:16:15] misconceptions. Yeah. So, okay, let me, let me start with Google versus ChatGPT first. I think it's disingenuous, uh, if I don't say my own usage pattern, which is I almost don't go back to Google for a large section of my queries anymore.[00:16:29] I just use ChatGPT I am a paying plus subscriber and it's sort of my go-to for a lot of things. That I ask, and I also have to train my mind to realize that, oh, there's a whole set of questions in your head that you never realize the internet could answer for you, and that now you're like, oh, wait, I could actually ask this, and then you ask it.[00:16:48] So that's my current usage pattern. That being said, I don't think that ChatGPT is the best interface or technology for all sets of queries. I think humans are obviously very easily excited by new technology, but new technology does not always mean the previous technology was worse. The previous technology is actually really good for a lot of things, and for search in particular, if you think about all the queries that come into Google search, they fall into various kinds of query classes, depending on whatever taxonomy you want to use.[00:17:24] But one sort of way of, of of understanding broad, generally, the query classes is something that is information seeking or exploratory. And for information for exploratory queries. I think there are uses where Google does really well. Like for example, let's say you want to just know a list of songs of this artist in this year.[00:17:49] Google will probably be able to add a hundred percent, tell you that pretty accurately all the time. Or if you want to say understand like what showtimes of movies came out today. So fresh queries, another query class, Google will be really good at that chat, not so good at that. But if you look at information seeking queries, you could even argue that if I ask for information about Donald Trump, Maybe ChatGPT will spit out a reasonable sounding paragraph and it makes sense, but it doesn't give me enough stuff to like click on and go to and navigate to in a news article here.[00:18:25] And I just kind wanna see a lot of stuff happening. So if you really break down the problem, I think it's not as easy as saying ChatGPT is a silver bullet for every kind of information need. There's a lot of information needs, especially for tail queries. So for long. Un before seen queries like, Hey, tell me the cheat code in Doom three.[00:18:43] This level, this boss ChatGPTs gonna blow it out the water on those kind of queries cuz it's gonna figure out all of these from these random sparse documents and random Reddit threads and assemble one consistent answer for you where it takes forever to find this kind of stuff on Google. For me personally, coding is the biggest use case for anything technical.[00:19:02] I just go to ChatGPT cuz parsing through Stack Overflow is just too mentally taxing and I don't care about, even if ChatGPT hallucinates a wrong answer, I can verify that. But I like seeing a coherent, nice answer that I can just kind of good starting point for my research on whatever I'm trying to understand.[00:19:20] Did you see the, the statistic that, uh, the Allin guys have been saying, which is, uh, stack overflow traffic is down 15%? Yeah, I did, I did.[00:19:27] See that[00:19:28] makes sense. But I, I, I don't know if it's like only because of ChatGPT, but yeah, sure. I believe[00:19:33] it. No, the second part was just about if some of the enterprise product search moves out of Google, like cannot, that's obviously a big AdWords revenue driver.[00:19:43] What are like some of the implications in terms of the, the business[00:19:46] there?[00:19:47] Search Issues: Freshness[00:19:47] Okay,[00:19:47] so I would split this answer into two parts. My first part is just talking about freshness, cuz the query that you mentioned is, is specifically the, the issue there is being able to access fresh information. Google just blanket calls his freshness.[00:20:01] Today's understanding of large language models is that it cannot do anything that's highly fresh. You just can't train these things fast enough and cost efficiently enough to constantly index new, new. Sources of data and then serve it at the same time in any way that's feasible. That might change in the future, but today it's not possible.[00:20:20] The best thing that you can get that's close to it is what, you know, the fancy term is retrieval, augmented generation, but it's a fancy way of saying just do the search in the background and then use the results to create the actual response. That's what Bing does today. So to answer the question about freshness, I would say it is possible to do with these methods, but those methods all in all involve using search in the backend to, to sort of get the context to generate the answer.[00:20:49] Search Issues: Ad Revenue[00:20:49] The second part of the answer is, okay, talk about ad revenue. A lot of Google's ad revenue just comes from the fact that over the last two decades, it's figured out how to put ad links on top of a search result page that sometimes users click. Now the user behavior on a chat product is not to click on anything.[00:21:10] You don't click on stuff you just read and you move on. And that actually, in my opinion, has severe impacts on the web ecosystem, on all of Google and all of technology and how we use the internet in the future. And, and the reason is one thing we also take for granted is that this ad revenue where everyone likes to say Google is bad, Google makes money off ads, yada, yada, yada, but this ad revenue kind of sponsored the entire internet.[00:21:37] So you have Google Maps and Google search and photos and drive and all of this great free stuff basically because of ads. Now, when you have this new interface, sure it, it comes with some benefits, but if users aren't gonna click on ads and you replace the search interface with just chat, that can actually be pretty dangerous in terms of what it even means.[00:21:59] To have to create a website, like why would I create a website if no one's gonna come to my. If it's just gonna be used to train a model and then someone's gonna spit out whatever my website says, then there's no incentive. And that kind of dwindles the web ecosystem. In the end, it means less ad revenue.[00:22:15] And then the other existential question is, okay, I'm okay with saying the incumbent. Google gets defeated and there's this new hero, which is, I don't know, open AI and Microsoft. Now reinvent the wheel. All of that stuff is great, but how are they gonna make money? They can make money off, I guess, subscriptions.[00:22:31] But subscriptions is not nearly gonna make you enough. To replace what you can make on ad revenue. Even for Bing today. Bing makes it 11 billion off ad revenue. It's not a society product like it's a huge product, and they're not gonna make 11 billion off subscriptions, I'll tell you that. So even they can't really replace search with this with chat.[00:22:51] And then there are some arguments around, okay, what if you start to inject ads in textual form? But you know, in my view, if the natural user inclination is not to click on something or chat, they're clearly not gonna click on something. No matter how much you try to inject, click targets into your result.[00:23:10] So, That's, that's my long answer to the ads question. I don't really know. I just smell danger in the horizon.[00:23:17] Search Issues: Latency[00:23:17] You mentioned the information augmented generation as well. Uh, I presumably that is literally Bing is probably just using the long context of GPT4 and taking the full text of all the links that they find, dumping it in, and then generating some answer.[00:23:34] Do you think like speed is a concern or people are just people willing to wait for smarter?[00:23:40] I think it's a concern. We noticed that every, every single product I've worked on, there's almost a linear, at least for some section of it, a very linear curve. A linear line that says the more the latency, the less the engagement, so there's always gonna be some drop off.[00:23:55] So it is a concern, but with things like latency, I just kind of presume that time solves these things. You optimize stuff, you make things a little better, and the latency will get down with time. And it's a good time to even mention that. Bard, we just came out today. Google's LLM. For Google's equivalent, I haven't tried it, but I've been reading about it, and that's based off a model called LamDA.[00:24:18] And LamDA intrinsically actually does that. So it does query what they call a tool set and they query search or a calculator or a compiler or a translator. Things that are good at factual, deterministic information. And then it keeps changing its response depending on the feedback from the tool set, effectively doing something very similar to what Bing does.[00:24:42] Search Issues: Accuracy[00:24:42] But I like their framing of the problem where it's just not just search, it's any given set of tools. Which is similar to what a Facebook paper called Tool Former, where you can think of language as one aspect of the problem and language interfaces with computation, which is another aspect of the problem.[00:24:58] And if you can separate those two, this one just talks to these things and figures out what to, how to phrase it. Yeah, so it's not really coming up with the answer. Their claim is like GPT4, for example. The reason it's able to do factual accuracy without search is just by memorizing facts. And that doesn't scale.[00:25:18] It's literally somewhere in the whole model. It knows that the CEO of Tesla is Elon Musk. It just knows that. But it doesn't know that this is a competition. It just knows that. Usually I see CEO, Tesla, Elon, that's all it knows. So the abstraction of language model to computational unit or tool set is an interesting one that I think is gonna be more explored by all of these engines.[00:25:40] Um, and the latency, you know, it'll.[00:25:42] I think you're focusing on the right things there. I actually saw another article this morning about the memorization capability. You know how GPT4 is a lot of, uh, marketed on its ability to answer SAT questions and GRE questions and bar exams and, you know, we covered this in our benchmarks podcast Alessio, but like I forgot to mention that all these answers are out there and were probably memorized.[00:26:05] And if you change them just, just a little bit, the model performance will probably drop a lot.[00:26:10] It's true. I think the most compelling, uh, proof of that, of what you just said is the, the code forces one where somebody I think tweeted, tweeted, tweeted about the, yeah, the 2021. Everything before 2021. It solves everything after.[00:26:22] It doesn't, and I thought that was interesting.[00:26:24] Search Issues: Tool Use[00:26:24] It's just, it's just dumb. I'm interested in two former, and I'm interested in react type, uh, patterns. Zapier just launched a natural language integration with LangChain. Are you able to compare contrast, like what approaches you like when it comes to LMS using[00:26:36] tools?[00:26:37] I think it's not boiled down to a science enough for me to say anything that's uh, useful. Like I think everyone is at a point of time where they're just playing with it. There's no way to reason about what LLMs can and can't do. And most people are just throwing things at a wall and seeing what sticks.[00:26:57] And if anyone claims to be doing better, they're probably lying because no one knows how these things behaves. You can't predict what the output is gonna be. You just think, okay, let's see if this works. This is my prompt. And then you measure and you're like, oh, that worked. Versus the stint and things like react and tool, form are really cool.[00:27:16] But those are just examples of things that people have thrown at a wall that stuck. Well, I mean, it's provably, it works. It works pretty, pretty well. I will say that one of the. It's not really of the framing of what kind of ways can you use LLMs to make it do cool things, but people forget when they're looking at cutting edge stuff is a lot of these LLMs can be used to generate synthetic data to bootstrap smaller models, and it's a less sexy space of it all.[00:27:44] But I think that stuff is really, really cool. Where, for example, I want to tag entities in a sentence that's a very simple classical natural language problem of NER. And what I do is I just, before I had to gather training data, train model, tune model, all of this other stuff. Now what I can do is I can throw GPT4 at it to generate a ton of synthetic data, which looks actually really good.[00:28:11] And then I can either just train whatever model I wanted to train before on this data, or I can use something called like low rank adaptation, which is distilling this large model into a much smaller, cost effective, fast model that does that task really well. And in terms of productionable natural language systems, that is amazing that this is stuff you couldn't do before.[00:28:35] You would have teams working for years to solve NER and that's just what that team does. And there's a great red and viral thread about our, all the NLP teams at Big Tech, doomed and yeah, I mean, to an extent now you can do this stuff in weeks, which is[00:28:51] huge.[00:28:52] Other AI Search takes: Perplexity and Neeva[00:28:52] What about some of the other kind of like, uh, AI native search, things like perplexity, elicit, have you played with, with any of them?[00:29:00] Any thoughts on[00:29:01] it? Yeah. I have played with perplexity and, and niva. Everyone. I think both of those products sort of try to do, again, search results, synthesis. Personally, I think Perplexity might be doing something else now, but I don't see the, any of those. Companies or products are disrupting either open AI or ChatGPT or Google being whatever prominent search engines with what they do, because they're all built off basically the Bing API or their own version of an index and their search itself is not good enough and there's not a compelling use case enough, I think, to use those products.[00:29:40] I don't know how they would make money, a lot of Neeva's way of making money as subscriptions. Perplexity I don't think has ever turned on the revenue dial. I just have more existential concerns about those products actually functioning in the long run. So, um, I think I see them as they're, they're nice, they're nice to play with.[00:29:56] It's cool to see the cutting edge innovation, but I don't really understand if they will be long lasting widely used products.[00:30:05] Why Document QA will Struggle[00:30:05] Do you have any idea of what it might take to actually do like a new kind of like, type of company in this space? Like Google's big thing was like page rank, right? That was like one thing that kind of set them apart.[00:30:17] Like people tried doing search before, like. Do you have an intuition for what, like the LM native page rank thing is gonna be to make something like this exist? Or have we kinda, you know, hit the plateau when it comes to search innovation?[00:30:31] So I, I talk to so many of my friends who are obviously excited about this technology as well, and many of them who are starting LLM companies.[00:30:38] You know, how many companies in the YC batch of, you know, winter 23 are LM companies? Crazy half of them. Right? Right. It's, it's ridiculous. But what I always, I think everyone's struggling with this problem is what is your advantage? What is your moat? I don't see it for a lot of these companies, and, uh, it's unclear.[00:30:58] I, I don't have a strong intuition. My sense is that the people who focus on problem first usually get much further than the people who focus solution first. And there's way too many companies that are solutions first. Which makes sense. It's always been the, a big achilles heel of the Silicon Valley.[00:31:16] We're a bunch of nerds that live in a whole different dimension, which nobody else can relate to, but nobody else. The problem is nobody else can relate to them and we can't relate to their problems either. So we look at tech first, not problem first a lot. And I see a lot of companies just, just do that.[00:31:32] Where I'll tell you one, this is quite entertaining to me. A very common theme is, Hey, LMS are cool, that, that's awesome. We should build something. Well, what should we build? And it's like, okay, consumer, consumer is cool, we should build consumer. Then it's like, ah, nah man. Consumers, consumer's pretty hard.[00:31:49] Uh, it's gonna be a clubhouse gonna blow up. I don't wanna blow up, I just wanna build something that's like, you know, pretty easy to be consistent with. We should go enter. Cool. Let's go enterprise. So you go enterprise. It's like, okay, we brought LMS to the enterprise. Now what problem do we tackle? And it's like, okay, well we can do q and A on documents.[00:32:06] People know how to do that, right? We've seen a couple of demos on that. So they build it, they build q and a on documents, and then they struggle with selling, or they're like, or people just ask, Hey, but I don't ask questions to my documents. Like, you realize this is just not a flow that I do, like I, oh no.[00:32:22] I ask questions in general, but I don't ask them to my documents. And also like what documents can you ask questions to? And they'll be like, well, any of them is, they'll say, can I ask them to all of my documents? And they'll be like, well, sure, if you give them, give us all your documents, you can ask anything.[00:32:39] And then they'll say, okay, how will you take all my document? Oh, it seems like we have to build some sort of indexing mechanism and then from one thing to the other, you get to a point where it's like we're building enterprise search and we're building an LM on top of it, and that is our product. Or you go to like ML ops and I'm gonna help you host models, I'm gonna help you train models.[00:33:00] And I don't know, it's, it seems very solution first and not problem first. So the only thing I would recommend is if you think about the actual problems and talk to users and understand what this can be useful for. It doesn't have to be that sexy of how it's used, but if it works and solves the problem, you've done your job.[00:33:18] Investing in AI Startups[00:33:18] I love that whole evolution because I think quite a few companies ha are, independently finding this path and, going down this route to build a glorified, you know, search spot. We actually interviewed a very problem focused builder, Mickey Friedman, who's very, very focused on products placement, image generation.[00:33:34] , and, you know, she's not focused on anything else in terms of image generation, like just focused on product placement and branding. And I think that's probably the right approach, you know, and, and if you think about like Jasper, right? Like they, they're out of all the other GPT3 companies when, when GPT3 first came out, they built focusing on, you know, writers on Facebook, you know, didn't even market on Twitter.[00:33:56] So like most people haven't heard of them. Uh, I think it's a timeless startup lesson, but it's something to remind people when they're building with, uh, language models. I mean, as a, as an investor like you, you know, you are an investor, you're your scout with me. Doesn't that make it hard to invest in anything like, cuz.[00:34:10] Mostly it's just like the incumbents will get to the innovation faster than startups will find traction.[00:34:16] Really. Like, oh, this is gonna be a hot take too. But, okay. My, my in, in investing, uh, with people, especially early, is often for me governed by my intuition of how they approach the problem and their experience with the technology, and pretty much solely that I don.[00:34:37] Really pretend to be an expert in the industry or the space that's their problem. If I think they're smart and they understand the space better than me, then I mostly convinced as if they've thought through enough of the business stuff, if they've thought through the, the market and everything else. I'm convinced I typically stray away from, you know, just what I just said.[00:34:57] Founders who are like LMS are cool and we should build something with them. That's not like usually very convincing to me. That's not a thesis. But I don't concern myself too much with pretending to understand what this space means. I trust them to do that. If I'm convinced that they're smart and they've thought about it, well then I'm pretty convinced that that they're a good person to, to, to[00:35:20] back.[00:35:21] Cool.[00:35:21] Actually Interesting Ideas in AI[00:35:21] Kinda like super novel idea that you wanna shout.[00:35:25] There's a lot of interesting explorations, uh, going on. Um, I, I, okay, I'll, I'll preface this with I, anything in enterprise I just don't think is cool. It's like including, like, it's just, it's, you can't call it cool, man. You're building products for businesses.[00:35:37] Glean is pretty cool. I'm impressed by Glean. This is what I'm saying. It's, it's cool for the Silicon Valley. It's not cool. Like, you're not gonna go to a dinner party with your parents and be like, Hey mom, I work on enterprise search. Isn't that awesome? And they're not all my, all my[00:35:51] notifications in one place.[00:35:52] Whoa.[00:35:55] So I will, I'll, I'll start by saying, for in my head, cool means like, the world finds this amazing and, and it has to be somewhat consumer. And I do think that. The ideas that are being played with, like Quora is playing with Poe. It's kind of strange to think about, and may not stick as is, but I like that they're approaching it with a very different framing, which is, Hey, how about you talk to this, this chat bot, but let's move out of this, this world where everyone's like, it's not WhatsApp or Telegram, it's not a messaging app.[00:36:30] You are actually generating some piece of content that now everybody can make you use of. And is there something there Not clear yet, but it's an interesting idea. I can see that being something where, you know, people just learn. Or see cool things that GPT4 has said or chatbots have said that's interesting in the image space.[00:36:49] Very contrasted to the language space. There's so much like I don't even begin to understand the image space. Everything I see is just like blows my mind. I don't know how mid journey gets from six fingers to five fingers. I don't understand this. It's amazing. I love it. I don't understand what the value is in terms of revenue.[00:37:08] I don't know where the markets are in, in image, but I do think that's way, way cooler because that's a demo where, and I, and I tried this, I showed GPT4 to, to my mom and my mom's like, yeah, this is pretty cool. It does some pretty interesting stuff. And then I showed the image one and she is just like, this is unbelievable.[00:37:28] There's no way a computer could write do this, and she just could not digest it. And I love when you see those interactions. So I do think image world is a whole different beast. Um, and, and in terms of coolness, lot more cool stuff happening in image video multimodal I think is really, really cool. So I haven't seen too many startups that are doing something where I'm like, wow, that's, that's amazing.[00:37:51] Oh, 11 labs. I'll, I'll mention 11 labs is pretty cool. They're the only ones that I know that are doing Oh, the voice synthesis. Have you tried it? I've only played with it. I haven't really tried generating my own voice, but I've seen some examples and it looks really, really awesome. I've heard[00:38:06] that Descript is coming up with some stuff as well to compete, cuz yeah, this is definitely the next frontier in terms of, podcasting.[00:38:13] Harry Potter IRL[00:38:13] One last thing I I will say on the cool front is I think there is something to be said about. A product that brings together all these disparate advancements in ai. And I have a view on what that looks like. I don't know if everyone shares that view, but if you bring together image generation, voice recognition, language modeling, tts, and like all of the other image stuff they can do with like clip and Dream booth and putting someone's actual face in it.[00:38:41] What you can actually make, this is my view of it, is the Harry Potter picture come to life where you actually have just a digital stand where there's a person who's just capable of talking to you in their voice, in, you know, understandable dialogue. That is how they speak. And you could just sort of walk by, they'll look at you, you can say hi, they'll be, they'll say hi back.[00:39:03] They'll start talking to you. You start talking back to it. That's sort of my, that's my my wild science fiction dream. And I think the technology exists to put all of those pieces together and. The implications for people who are older or saving people over time are huge. This could be a really cool thing to productionize.[00:39:23] AI Infra Cost Math[00:39:23] There's one more part of you that also tweets about numbers and math, uh, AI math essentially is how I'm thinking about it. What gets you into talking about costs and math and, and you know, just like first principles of how to think about language models.[00:39:39] One of my biggest beefs with big companies is how they abstract the cost away from all the engineers.[00:39:46] So when you're working on a Google search, I can't tell you a single number that is cost related at all. Like I just don't know the cost numbers. It's so far down the chain that I have no clue how much it actually costs to run search, and how much these various things cost aside from what the public knows.[00:40:03] And I found that very annoying because when you are building a startup, particularly maybe an enterprise startup, you have to be extremely cognizant about the cost because that's your unit economics. Like your primary cost is the money you spend on infrastructure, not your actual labor costs. The whole thesis is the labor doesn't scale, but the inf.[00:40:21] Does scale. So you need to understand how your infra costs scale. So when it comes to language models, given that these things are so compute heavy, but none of the papers talk about cost either. And it's just bothers me. I'm like, why can't you just tell me how much it costs you to, to build this thing?[00:40:39] It's not that hard to say. And it's also not that hard to figure out. They give you everything else, which is, you know, how many TPUs it took and how long they trained it for and all of that other stuff, but they don't tell you the cost. So I've always been curious because ev all everybody ever says is it's expensive and a startup can't do it, and an individual can't do it.[00:41:01] So then the natural question is, okay, how expensive is it? And that's sort of the, the, the background behind. Why I started doing some more AI math and, and one of the tweets that probably the one that you're talking about is where I compare the cost of LlaMA, which is Facebook's LLM, to PaLM with, uh, my best estimates.[00:41:23] And, uh, the only thing I'll add to that is it is quite tricky to even talk about these things publicly because you get rammed in the comments because by people who are like, oh, don't you know that this assumption that you made is completely BS because you should have taken this cost per hour? Because obviously people do bulk deals.[00:41:42] And yeah, I have two 80 characters. This is what I could have said. But I think ballpark, I think I got close. I, I'd like to imagine, I think I was off maybe by, by by two x on the lower side. I think I took an upper bound and I might have been off by, by two x. So my quote was 4 million for LlaMA and 27 for PaLM.[00:42:01] In fact, later today I'm going to do, uh, one on Bard. So. Oh oh one bar. Oh, the exclusive is that It's four, it's 4 million for Bard two.[00:42:10] Nice. Nice. Which is like, do you think that's like, don't you think that's actually not a lot, like it's a drop in the bucket for these[00:42:17] guys. One, and one of the, the valuable things to note when you're talking about this cost is this is the cost of the final training step.[00:42:24] It's not the cost of the entire process. And a common rebuttal is, well, yeah, this is your cost of the final training process, but in total it's about 10 x this amount cost. Because you have to experiment. You have to tune hyper parameters, you have to understand different architectures, you have to experiment with different kinds of training data.[00:42:43] And sometimes you just screw it up and you don't know why. And you have, you're just spend a lot of time figuring out why you screwed it up. And that's where the actual cost buildup happens, not in the one final last step where you actually train the final model. So even assuming like a 10 x on top of this, I think is, is, is fair for how much it would actually cost a startup to build this from scratch?[00:43:03] I would say.[00:43:04] Open Source LLMs[00:43:04] How do you think about open source in this then? I think a lot of people's big 2023 predictions are an LLM, you know, open source LLM, that is comparable performance to the GPT3 model. Who foots the bill for the mistakes? You know, like when when somebody opens support request that it's not good.[00:43:25] It doesn't really cost people much outside of like a GitHub actions run as people try entering these things separately. Like do you think open source is actually bad because you're wasting so much compute by so many people trying to like do their own things and like, do you think it's better to have a centralized team that organizes these experiments or Yeah.[00:43:43] Any thoughts there? I have some thoughts. I. The most easy comparison to make is to image generation world where, you know, you had Mid Journey and Dolly come out first, and then you had Imad come out with stability, which was completely open source. But the difference there is I think stability. You can pretty much run on your machine and it's okay.[00:44:06] It works pretty fast. So it, so the entire concept of, of open sourcing, it worked and people made forks that fine tuned it on a bunch of different random things and it made variance of stability that could. A bunch of things. So I thought the stability thing, agnostic of the general ethical concerns of training on everyone's art.[00:44:25] I thought it was a cool, cool addition to the sort of trade-offs in different models that you can have in image generation for text generation. We're seeing an equivalent effect with LlaMA and alpaca, which LlaMA being, being Facebook's model, which they didn't really open source, but then the weights got leaked and then people clone them and then they tuned them using GPT4 generated synthetic data and made alpaca.[00:44:50] So the version I think that's out there is only the 7,000,000,001 and then this crazy European c plus plus God. Came and said, you know what, I'm gonna write this entire thing in c plus plus so you can actually run it locally and and not have to buy GPUs. And a combination of those. And of course a lot of people have done work in optimizing these things to make it actually function quickly.[00:45:13] And we can get into details there, but a function of all of these things has enabled people to actually. Semi-good models on their computer. I don't have that much, I don't have any comments on, you know, energy usage and all of that. I don't really have an opinion on that. I think the fact that you can run a local version of this is just really, really cool, but also supremely dangerous because with images, conceivably, people can tell what's fake and what's real, even though there, there's some concerns there as well. But for text it's, you know, like you can do a lot of really bad things with your own, you know, text generation algorithm. You know, if I wanted to make somebody's life hell, I could spam them in the most insidious ways with all sorts of different kinds of text generation indefinitely, which I, I can't really do with images.[00:46:02] I don't know. I find it somewhat ethically problematic in terms of the power is too much for an individual to wield. But there are some libertarians who are like, yeah, why should only open AI have this power? I want this power too. So there's merits to both sides of the argument. I think it's generally good for the ecosystem.[00:46:20] Generally, it will get faster and the latency will get better and the models may not ever reach the size of the cutting edge that's possible, but it could be good enough to do. 80% of the things that bigger model could do. And I think that's a really good start for innovation. I mean, you could just have people come up with stuff instead of companies, and that always unlocks a whole vector of innovation that didn't previously exist.[00:46:45] Other Modalities[00:46:45] That was a really good, conclusion. I, I, I want to ask follow up questions, but also, that was a really good place to end it. Was there any other AI topics that you wanted to[00:46:52] touch on? I think Runway ML is the one company I didn't mention and that, that one's, uh, one to look out for.[00:46:58] I think doing really cool stuff in terms of video editing with generative techniques. So people often talk about the open AI and the Googles of the world and philanthropic and clo and cohere and big journey, all the image stuff. But I think the places that people aren't paying enough attention to that will get a lot more love in the next couple of years.[00:47:19] Better whisper, so better streaming voice recognition, better t t s. So some open source version of 11 labs that people can start using. And then the frontier is sort of multi-modality and videos. Can you do anything with videos? Can you edit videos? Can you stitch things together into videos from images, all sorts of different cool stuff.[00:47:40] And then there's sort of the long tail of companies like Luma that are working on like 3D modeling with generative use cases and taking an image and creating a 3D model from nothing. And uh, that's pretty cool too, although the practical use cases to me are a little less clear. Uh, so that's kind of covers the entire space in my head at least.[00:48:00] I[00:48:00] like using the Harry Potter image, like the moving and speaking images as a end goal. I think that's something that consumers can really get behind as well. That's super cool.[00:48:09] Exam Fraud and Generated Text Detection[00:48:09] To double back a little bit before we go into the lining round, I have one more thing, which is, relevant to your personal story, but then also relevant to our debate, which is a nice blend.[00:48:18] You're concerned about the safety of everyone having access to language models and you know, the potential harm that you can do there. My guess is that you're also not that positive on watermarking. Techniques from internal languages, right? Like maybe randomly sprinkling weird characters so that people can see like that this is generated by an AI model, but also like you have some personal experience with this because you found manipulation in the Indian Exam Board, which, uh, maybe you might be a similar story.[00:48:48] I, I don't know if you like, have any thoughts about just watermarking manipulation, like, you know, ethical deployments of, of, uh,[00:48:55] generated data.[00:48:57] Well, I think those two things are a little separate. Okay. One I would say is for watermarking text data. There is a couple of different approaches. I think there is actual value to that because from a pure technical perspective, you don't want models to train on stuff they've generated.[00:49:13] That's kind of bad for models. Yes. And two is obviously you don't want people to keep using Chatt p t for i, I don't know if you want this to use it for all their assignments and never be caught. Maybe you don't. Maybe you don't. But it, it seems like it's valuable to at least understand that this is a machine generated text versus not just ethically that seems, seems like something that should exist.[00:49:33] So I do think watermarking is, is. A good direction of research and it's, and I'm fairly positive on it. I actually do think people should standardize how that water marketing works across language models so that everyone can detect and understand language models and not just, OpenAI does its own models, but not the other ones and, and so on.[00:49:51] So that's my view on that. And then, and sort of transitioning into the exam data, this is really old one, but it's one of my favorite things to talk about is I. In America, as you know. Usually the way it works is you give your, you, you take your s a t exam, uh, you take a couple of aps, you do your school grades, you apply to colleges, you do a bunch of fluff.[00:50:10] You try to prove how you're good at everything. And then you, you apply to colleges and then it's a, a weird decision based on a hundred other factors. And then they decide whether you get in or not. But if you're rich, you're basically gonna get in anyway. And if you're a legacy, you're probably gonna get in and there's a whole bunch of stuff going on.[00:50:23] And I don't think the system is necessarily bad, but it's just really complicated. And some of the things are weird in India and in a lot of the non developed world, people are like, yeah, okay, we can't scale that. There's no way we can have enough people like. Non rigorously evaluate this cuz there's gonna be too much corruption and it's gonna be terrible at the end cuz people are just gonna pay their way in.[00:50:45] So usually it works in a very simple way where you take an exam that is standardized and sometimes you have many exams, sometimes you have an exam for a different subject. Sometimes it's just one for everything. And you get ranked on that exam and depending on your rank you get to choose the quality and the kind of thing you want to study.[00:51:03] Which this, the kind of thing always surprises people in America where it's not like, oh it's glory land, where you walk in and you're like, I think this is interesting and I wanna study this. Like, no, in the most of the world it's like you're not smart enough to study this, so you're probably not gonna study it.[00:51:18] And there's like a rank order of things that you need to be smart enough to do. So it's, it's different. And therefore these exams. Much more critical for the functioning of the system. So when there's fraud, it's not like a small part of your application going wrong, it's your entire application going wrong.[00:51:36] And that's why, that's just me explaining why this is severe. Now, one such exam is the one that you take in school. There's a, it's called a board exam. You take one in the 10th grade, which doesn't really matter for much, but, and then you take one in the 12th grade when you're about to graduate and that.[00:51:53] How you, where you go to college for a large set of colleges, not all, but a large set of colleges, and based on how much you get on your top five average, you're sort of slotted into a different stream in a d in a, in a different college. And over time, because of the competition between two of the boards that are a duopoly, there's no standardization.[00:52:13] So everyone's trying to like, give more marks than the, the, the other person to attract more students into their board because oh, that means that you can then claim, oh, you're gonna get into a better college if you take our exam and don't go to a school that administers the other exam. What? So it's, and that's, that's the, everyone knew that was happening ish, but there was no data to back it.[00:52:34] But when you actually take this exam as I did, you start realizing that the numbers, the marks make no sense because you're looking at. Kid who's also in your class and you're like, dude, this guy's not smart. How did he get a 90 in English? He's not good at English. Like, you can't speak it. You cannot give him a 90.[00:52:54] You gave me a 90. How did this guy get a 90? So everyone has like their anecdotal, this doesn't make any sense me, uh, moments with, with this exam, but no one has access to the data. So way back when, what I did was I realized they have very little security surrounding the data where the only thing that you need to put in to get access is your role number.[00:53:15] And so as long as you predict the right set of role numbers, you can get everybody's results. So unlike America, also exam results aren't treated with a level of privacy. In India, it's very common to sort of the entire class's results on a bulletin board. And you just see how everyone did and you shamed the people who are stupid.[00:53:32] That's just how it works. It's changed over time, but that's fundamentally a cultural difference. And so when I scraped all these results and I published it, and I, and I did some analysis, what I found was, A couple of very insidious things. One is that in, if you plot the distribution of marks, you generally tend to see some sort of skewed, but pseudo normal distribution where it's a big peak and a, and it falls off on both ends, but you see two interesting patterns.[00:54:01] One that is just the most obvious one, which is Grace Marks, which is the pass grade is 33. You don't see nobody got between 29 and 32 because what they did for every single exam is they just made you pass. They just rounded up to 33, which is okay. I'm not that concerned about whether you give Grace Marks.[00:54:21] It's kind of messed up that you do that, but okay, fine. You want to pass a bunch of people who deserve to fail, do it. Then the other more concerning thing was between 33 and 93, right? That's about 60 numbers, 61 numbers, 30 of those numbers were just missing, as in nobody got 91 on this exam. In any subject in any year.[00:54:44] How, how does that happen? You, you don't get a 91, you don't get a 93, 89, 87, 85, 84. Some numbers were just missing. And at first when I saw this, I'm like, this is definitely some bug in my code. There's no way that, like, there's 91 never happened. And so I started, I remember I asked a bunch of my friends, I'm like, dude, did you ever get a 9 81 in anything?[00:55:06] And they're like, no. And it just unraveled that this is obviously problematic cuz that means that they're screwing with your final marks in some way or the other. Yeah. And, and they're not transparent about how they do it. Then I did, I did the same thing for the other board. We found something similar there, but not, not, not the same.[00:55:24] The problem there was, there was a huge spike at 95 and then I realized what they were doing is they'd offer various exams and to standardize, they would blanket add like a, a, a, a raw number. So if you took the harder math exam, everyone would get plus 10. Arbitrarily, no one. This is not revealed or publicized.[00:55:41] It's randomly, that was the harder exam you guys all get plus 10, but it's capped at 95. That's just this stupid way to standardize. It doesn't make any sense. Ah, um, they're not transparent about it. And it affects your entire life because yeah, this is what gets you into college. And yeah, if you add the two exams up, this is 1.1 million kids taking it every year.[00:56:02] So that's a lot of people's lives that you're screwing with by not understanding numbers and, and not being transparent about how you're manipulating them. So that was the thesis in my view, looking back on it, 10 years later, it's been 10 years at this point. I think the media never did justice to it because to be honest, nobody understands statistics.[00:56:23] So over time it became a big issue then. And then there was a big Supreme court or high court ruling, which said, Hey, you guys can't do this, but there's no transparency. So there's no way of actually ensuring that they're not doing it. They just added a, a level of password protection, so now I can't scrape it anymore.[00:56:40] And, uh, they probably do the same thing and it's probably still as bad, but people aren't. Raising an issue about it. It's really hard to make this people understand the significance of it because people are so compelled to just go lean into the narrative of exams are b******t and we should never trust exams, and this is why it's okay to be dumb.[00:56:59] And it's not, that's not the point, like the point. So, I, I think the, the response was lackluster in retrospect, but that's, that's what I unveiled in 2013. That's fascinating.[00:57:09] You know, in my finance background, uh, the similar case happens with the Madoff funds because if you plot the, the statistical distribution of the, the Madoff funds, you could see that they were just not a normal distribution, and therefore they would, they would probably made up numbers.[00:57:25] And, uh, we also did the same thing in my first job as a, as a regulator in Singapore for, for hedge funds returns. Wow. Which is watermarking. It's this, this is a watermark of a human or, uh, some kind of system. Uh, you know, making it up. And statistically, if you look at the distribution, you can see like this, this violates any reasonable assumption.[00:57:41] Therefore, something's.[00:57:42] Wrong. Well, I see, I see what you mean there. Like in that sense. Yes. That's really cool that you worked on a very similar problem, and I agree that it's messed up. It's a good way to catch liars in[00:57:53] Madoff's case. Like they actually made it a big deal, but I don't know, like I don't see how this was a big, wasn't a bigger deal in India.[00:57:58] But anyway, uh, that's a conversation for another, uh, over drinks perhaps.[00:58:01] Lightning Round[00:58:01] But, so now we're gonna go into the lightning round. Just to cut things off with a, uh, overview. What are your favorite AI people and communities? You mentioned Reddits. Let's be specific about which, uh,[00:58:12] I actually don't really use Reddit that much for, uh, AI stuff.[00:58:16] It was just one, a one-off example. Most of my learnings are Twitter, and I think there are the obvious ones, like everyone follows Riley Goodside now and there's a bunch of like the really famous ones. But I think amongst the lesser known ones, there are, let me say just my favorite one is probably AI Pub because it does a roundup of everybody else's stuff regularly.[00:58:40] I know Brian who runs AI Pub as well, and I just think I find it really useful cuz often it's very hard to catch up on stuff and this gives you the entire roundup of last two weeks, here's what happened in ai.[00:58:51] Good, good, good. Uh, and any other communities like Slack communities, the scores? You don't[00:58:55] do that stuff?[00:58:56] I try to, but I, I don't because it's too time consuming. I prefer reading at my own pace.[00:59:02] Yeah, yeah, yeah. Okay. So, so my, my learning is, uh, start a Twitter like, uh, weekly recap of here's what happened in ai. I mean, it makes sense, right? Like it'll do very well. It was you[00:59:11] very well a year from now. What do you think people will be the most surprised[00:59:15] by in ai?[00:59:17] I think they're gonna be surprised at how much cheaper they're able to bring out, down the cost to, and how much faster that these models get. I'm more optimistic about cost and latency more than I am about just quality improvements at this point. I think modalities will change, but I think quality is near about like a, a maxima that we're gonna achieve.[00:59:42] So this is a request for startups or a request for site projects. What's an AI thing that you would pay for? Is somebody else built[00:59:47] it aside from the Harry Potter image one, which I would definitely, I would pay a lot of money to have like a floating, I don't know, bill Clinton in my room, just saying things back to me whenever I talk to it.[00:59:59] That would be cool. But in terms of other products, uh, if somebody built. A product that would smartly, I know many people have tried to build things like this that would smartly auto respond to things that it can auto respond to. And for the things that are actually important, please don't auto respond and just tell me to do it.[01:00:19] And that distinction, I think is really important. So somewhere in between the automate everything and the just suggest everything hybrid that works well, I think that would be really cool. Yeah. I've thought[01:00:30] about this as well. Even if it doesn't respond for you, it can draft an answer for you to edit.[01:00:35] Right. Uh, so that you, you at least get to review.[01:00:37] I actually built that this morning. If you guys want it. Ooh. You just, oh, with Gmail and then it pre-draft every email in your inbox. Really? But, uh, yeah, you have to change the prompt because my prompt says like, you. Software engineer. I'm a venture capitalist, this is where it works, blah, blah, blah, blah.[01:00:55] But you can modify that and then it, it works. It works. Are you[01:00:58] gonna open source it?[01:01:00] I, I probably will, but it sometimes it's like it cares too much about the prompt. So for example, in the prompt, I was like, if the person is asking about scheduling, suggest the time and public like the calendar, my calendar or give this calendar link in every email.[01:01:15] It will respond. And if you ever wanna chat, here's my calendar. Like no matter what the email was, every email, it would tell them to schedule time. So there's still work to[01:01:24] be done. You're just very helpful. You're just very, very helpful. Well, so actually I have a GitHub version of this, which I actually would pay someone to build, which is read somebody who opened a GitHub issue, like, and, and check if they have missed anything for resolutions.[01:01:38] And then generate response to like request for resolution. And then like me, you know, if, if they haven't answered in like 30 days, close the issue.[01:01:45] Absolutely. And, and one thing I'll add to that is also the idea of the ai, just going in and making PRS for you, I think is super compelling that it just says, Hey, I found all these vulnerabilities, uh, patch man.[01:01:58] Yeah, yeah. We, we got a cell company doing it, so Hello. Yeah, I'll let you know more. Deedee, thank you so much for coming on. I think to wrap it up, um, is there any parting thoughts, kind of like one thing that you want everyone to take away about AI and the impact this kind of have?[01:02:14] Yeah, I think my, my parting thought is I have always been a big fan of people of bridging the gap between research and the end consumer.[01:02:24] And I think this is just a great time to be alive where. If you are interested in AI or if you're even remotely interested, of course you can go build stuff. Of course you can read about it. But I think it's so cool that you sh you can just go read the paper and read the raw things that people did to make this happen.[01:02:42] And I really encourage people to go and read research, follow people on YouTube who are explaining this. Andre Kapai has a great channel where he also explains it. It's just a great time to learn in this space and I would really encourage more people to go and and read the actual stuff. It's really cool.[01:03:01] Thank you[01:03:01] so much, Didi, for coming on. It was a great chat. Um, where can people follow you on Twitter? Any other thing you wanna[01:03:08] plug? I think Twitter is fine. And there's a link to my website from my Twitter too. It's my first name, debark underscore das is my Twitter and dego.com is my website. But you can also just Google DB das and you will find both of those links.[01:03:25] Awesome. All right. Thank you so much.[01:03:27] Thank you. Thanks guys. Get full access to Latent Space at www.latent.space/subscribe
undefined
Apr 13, 2023 • 1h 20min

Segment Anything Model and the Hard Problems of Computer Vision — with Joseph Nelson of Roboflow

2023 is the year of Multimodal AI, and Latent Space is going multimodal too! * This podcast comes with a video demo at the 1hr mark and it’s a good excuse to launch our YouTube - please subscribe! * We are also holding two events in San Francisco — the first AI | UX meetup next week (already full; we’ll send a recap here on the newsletter) and Latent Space Liftoff Day on May 4th (signup here; but get in touch if you have a high profile launch you’d like to make). * We also joined the Chroma/OpenAI ChatGPT Plugins Hackathon last week where we won the Turing and Replit awards and met some of you in person!This post featured on Hacker News.Out of the five senses of the human body, I’d put sight at the very top. But weirdly when it comes to AI, Computer Vision has felt left out of the recent wave compared to image generation, text reasoning, and even audio transcription. We got our first taste of it with the OCR capabilities demo in the GPT-4 Developer Livestream, but to date GPT-4’s vision capability has not yet been released. Meta AI leapfrogged OpenAI and everyone else by fully open sourcing their Segment Anything Model (SAM) last week, complete with paper, model, weights, data (6x more images and 400x more masks than OpenImages), and a very slick demo website. This is a marked change to their previous LLaMA release, which was not commercially licensed. The response has been ecstatic:SAM was the talk of the town at the ChatGPT Plugins Hackathon and I was fortunate enough to book Joseph Nelson who was frantically integrating SAM into Roboflow this past weekend. As a passionate instructor, hacker, and founder, Joseph is possibly the single best person in the world to bring the rest of us up to speed on the state of Computer Vision and the implications of SAM. I was already a fan of him from his previous pod with (hopefully future guest) Beyang Liu of Sourcegraph, so this served as a personal catchup as well. Enjoy! and let us know what other news/models/guests you’d like to have us discuss! - swyxRecorded in-person at the beautiful StudioPod studios in San Francisco.Full transcript is below the fold.Show Notes* Joseph’s links: Twitter, Linkedin, Personal* Sourcegraph Podcast and Game Theory Story* Represently* Roboflow at Pioneer and YCombinator* Udacity Self Driving Car dataset story* Computer Vision Annotation Formats* SAM recap - top things to know for those living in a cave* https://segment-anything.com/* https://segment-anything.com/demo* https://arxiv.org/pdf/2304.02643.pdf * https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/* https://blog.roboflow.com/segment-anything-breakdown/* https://ai.facebook.com/datasets/segment-anything/* Ask Roboflow https://ask.roboflow.ai/* GPT-4 Multimodal https://blog.roboflow.com/gpt-4-impact-speculation/Cut for time:* WSJ mention* Des Moines Register story* All In Pod: timestamped mention* In Forbes: underrepresented investors in Series A* Roboflow greatest hits* https://blog.roboflow.com/mountain-dew-contest-computer-vision/* https://blog.roboflow.com/self-driving-car-dataset-missing-pedestrians/* https://blog.roboflow.com/nerualhash-collision/ and Apple CSAM issue * https://www.rf100.org/Timestamps* [00:00:19] Introducing Joseph* [00:02:28] Why Iowa* [00:05:52] Origin of Roboflow* [00:16:12] Why Computer Vision* [00:17:50] Computer Vision Use Cases* [00:26:15] The Economics of Annotation/Segmentation* [00:32:17] Computer Vision Annotation Formats* [00:36:41] Intro to Computer Vision & Segmentation* [00:39:08] YOLO* [00:44:44] World Knowledge of Foundation Models* [00:46:21] Segment Anything Model* [00:51:29] SAM: Zero Shot Transfer* [00:51:53] SAM: Promptability* [00:53:24] SAM: Model Assisted Labeling* [00:56:03] SAM doesn't have labels* [00:59:23] Labeling on the Browser* [01:00:28] Roboflow + SAM Video Demo * [01:07:27] Future Predictions* [01:08:04] GPT4 Multimodality* [01:09:27] Remaining Hard Problems* [01:13:57] Ask Roboflow (2019)* [01:15:26] How to keep up in AITranscripts[00:00:00] Hello everyone. It is me swyx and I'm here with Joseph Nelson. Hey, welcome to the studio. It's nice. Thanks so much having me. We, uh, have a professional setup in here.[00:00:19] Introducing Joseph[00:00:19] Joseph, you and I have known each other online for a little bit. I first heard about you on the Source Graph podcast with bian and I highly, highly recommend that there's a really good game theory story that is the best YC application story I've ever heard and I won't tease further cuz they should go listen to that.[00:00:36] What do you think? It's a good story. It's a good story. It's a good story. So you got your Bachelor of Economics from George Washington, by the way. Fun fact. I'm also an econ major as well. You are very politically active, I guess you, you did a lot of, um, interning in political offices and you were responding to, um, the, the, the sheer amount of load that the Congress people have in terms of the, the support.[00:01:00] So you built, representing, which is Zendesk for Congress. And, uh, I liked in your source guide podcast how you talked about how being more responsive to, to constituents is always a good thing no matter what side of the aisle you're on. You also had a sideline as a data science instructor at General Assembly.[00:01:18] As a consultant in your own consultancy, and you also did a bunch of hackathon stuff with Magic Sudoku, which is your transition from N L P into computer vision. And apparently at TechCrunch Disrupt, disrupt in 2019, you tried to add chess and that was your whole villain origin story for, Hey, computer vision's too hard.[00:01:36] That's full, the platform to do that. Uh, and now you're co-founder c e o of RoboFlow. So that's your bio. Um, what's not in there that[00:01:43] people should know about you? One key thing that people realize within maybe five minutes of meeting me, uh, I'm from Iowa. Yes. And it's like a funnily novel thing. I mean, you know, growing up in Iowa, it's like everyone you know is from Iowa.[00:01:56] But then when I left to go to school, there was not that many Iowans at gw and people were like, oh, like you're, you're Iowa Joe. Like, you know, how'd you find out about this school out here? I was like, oh, well the Pony Express was running that day, so I was able to send. So I really like to lean into it.[00:02:11] And so you kind of become a default ambassador for places that. People don't meet a lot of other people from, so I've kind of taken that upon myself to just make it be a, a part of my identity. So, you know, my handle everywhere Joseph of Iowa, like I I, you can probably find my social security number just from knowing that that's my handle.[00:02:25] Cuz I put it plastered everywhere. So that's, that's probably like one thing.[00:02:28] Why Iowa[00:02:28] What's your best pitch for Iowa? Like why is[00:02:30] Iowa awesome? The people Iowa's filled with people that genuinely care. You know, if you're waiting a long line, someone's gonna strike up a conversation, kinda ask how you were Devrel and it's just like a really genuine place.[00:02:40] It was a wonderful place to grow up too at the time, you know, I thought it was like, uh, yeah, I was kind of embarrassed and then be from there. And then I actually kinda looking back it's like, wow, you know, there's good schools, smart people friendly. The, uh, high school that I went to actually Ben Silverman, the CEO and, or I guess former CEO and co-founder of Pinterest and I have the same teachers in high school at different.[00:03:01] The co-founder, or excuse me, the creator of crispr, the gene editing technique, Dr. Jennifer. Doudna. Oh, so that's the patent debate. There's Doudna. Oh, and then there's Fang Zang. Uh, okay. Yeah. Yeah. So Dr. Fang Zang, who I think ultimately won the patent war, uh, but is also from the same high school.[00:03:18] Well, she won the patent, but Jennifer won the[00:03:20] prize.[00:03:21] I think that's probably, I think that's probably, I, I mean I looked into it a little closely. I think it was something like she won the patent for CRISPR first existing and then Feng got it for, uh, first use on humans, which I guess for commercial reasons is the, perhaps more, more interesting one. But I dunno, biolife Sciences, is that my area of expertise?[00:03:38] Yep. Knowing people that came from Iowa that do cool things, certainly is. Yes. So I'll claim it. Um, but yeah, I, I, we, um, at Roble actually, we're, we're bringing the full team to Iowa for the very first time this last week of, of April. And, well, folks from like Scotland all over, that's your company[00:03:54] retreat.[00:03:54] The Iowa,[00:03:55] yeah. Nice. Well, so we do two a year. You know, we've done Miami, we've done. Some of the smaller teams have done like Nashville or Austin or these sorts of places, but we said, you know, let's bring it back to kinda the origin and the roots. Uh, and we'll, we'll bring the full team to, to Des Moines, Iowa.[00:04:13] So, yeah, like I was mentioning, folks from California to Scotland and many places in between are all gonna descend upon Des Moines for a week of, uh, learning and working. So maybe you can check in with those folks. If, what do they, what do they decide and interpret about what's cool. Our state. Well, one thing, are you actually headquartered in Des Moines on paper?[00:04:30] Yes. Yeah.[00:04:30] Isn't that amazing? That's like everyone's Delaware and you're like,[00:04:33] so doing research. Well, we're, we're incorporated in Delaware. Okay. We we're Delaware Sea like, uh, most companies, but our headquarters Yeah. Is in Des Moines. And part of that's a few things. One, it's like, you know, there's this nice Iowa pride.[00:04:43] And second is, uh, Brad and I both grew up in Brad Mc, co-founder and I grew up in, in Des Moines. And we met each other in the year 2000. We looked it up for the, the YC app. So, you know, I think, I guess more of my life I've known Brad than not, uh, which is kind of crazy. Wow. And during yc, we did it during 2020, so it was like the height of Covid.[00:05:01] And so we actually got a house in Des Moines and lived, worked outta there. I mean, more credit to. So I moved back. I was living in DC at the time, I moved back to to Des Moines. Brad was living in Des Moines, but he moved out of a house with his. To move into what we called our hacker house. And then we had one, uh, member of the team as well, Jacob Sorowitz, who moved from Minneapolis down to Des Moines for the summer.[00:05:21] And frankly, uh, code was a great time to, to build a YC company cuz there wasn't much else to do. I mean, it's kinda like wash your groceries and code. It's sort of the, that was the routine[00:05:30] and you can use, uh, computer vision to help with your groceries as well.[00:05:33] That's exactly right. Tell me what to make.[00:05:35] What's in my fridge? What should I cook? Oh, we'll, we'll, we'll cover[00:05:37] that for with the G P T four, uh, stuff. Exactly. Okay. So you have been featured with in a lot of press events. Uh, but maybe we'll just cover the origin story a little bit in a little bit more detail. So we'll, we'll cover robo flow and then we'll cover, we'll go into segment anything.[00:05:52] Origin of Roboflow[00:05:52] But, uh, I think it's important for people to understand. Robo just because it gives people context for what you're about to show us at the end of the podcast. So Magic Sudoku tc, uh, techers Disrupt, and then you go, you join Pioneer, which is Dan Gross's, um, YC before yc.[00:06:07] Yeah. That's how I think about it.[00:06:08] Yeah, that's a good way. That's a good description of it. Yeah. So I mean, robo flow kind of starts as you mentioned with this magic Sudoku thing. So you mentioned one of my prior business was a company called Represent, and you nailed it. I mean, US Congress gets 80 million messages a year. We built tools that auto sorted them.[00:06:23] They didn't use any intelligent auto sorting. And this is somewhat a solved problem in natural language processing of doing topic modeling or grouping together similar sentiment and things like this. And as you mentioned, I'd like, I worked in DC for a bit and been exposed to some of these problems and when I was like, oh, you know, with programming you can build solutions.[00:06:40] And I think the US Congress is, you know, the US kind of United States is a support center, if you will, and the United States is sports center runs on pretty old software, so mm-hmm. We, um, we built a product for that. It was actually at the time when I was working on representing. Brad, his prior business, um, is a social games company called Hatchlings.[00:07:00] Uh, he phoned me in, in 2017, apple had released augmented reality kit AR kit. And Brad and I are both kind of serial hackers, like I like to go to hackathons, don't really understand new technology until he build something with them type folks. And when AR Kit came out, Brad decided he wanted to build a game with it that would solve Sudoku puzzles.[00:07:19] And the idea of the game would be you take your phone, you hover hold it over top of a Sudoku puzzle, it recognizes the state of the board where it is, and then it fills it all in just right before your eyes. And he phoned me and I was like, Brad, this sounds awesome and sounds like you kinda got it figured out.[00:07:34] What, what's, uh, what, what do you think I can do here? It's like, well, the machine learning piece of this is the part that I'm most uncertain about. Uh, doing the digit recognition and, um, filling in some of those results. I was like, well, I mean digit recognition's like the hell of world of, of computer vision.[00:07:48] That's Yeah, yeah, MNIST, right. So I was like, that that part should be the, the easy part. I was like, ah, I'm, he's like, I'm not so super sure, but. You know, the other parts, the mobile ar game mechanics, I've got pretty well figured out. I was like, I, I think you're wrong. I think you're thinking about the hard part is the easy part.[00:08:02] And he is like, no, you're wrong. The hard part is the easy part. And so long story short, we built this thing and released Magic Sudoku and it kind of caught the Internet's attention of what you could do with augmented reality and, and with computer vision. It, you know, made it to the front ofer and some subreddits it run Product Hunt Air app of the year.[00:08:20] And it was really a, a flash in the pan type app, right? Like we were both running separate companies at the time and mostly wanted to toy around with, with new technology. And, um, kind of a fun fact about Magic Sudoku winning product Hunt Air app of the year. That was the same year that I think the model three came out.[00:08:34] And so Elon Musk won a Golden Kitty who we joked that we share an award with, with Elon Musk. Um, the thinking there was that this is gonna set off a, a revolution of if two random engineers can put together something that makes something, makes a game programmable and at interactive, then surely lots of other engineers will.[00:08:53] Do similar of adding programmable layers on top of real world objects around us. Earlier we were joking about objects in your fridge, you know, and automatically generating recipes and these sorts of things. And like I said, that was 2017. Roboflow was actually co-found, or I guess like incorporated in, in 2019.[00:09:09] So we put this out there, nothing really happened. We went back to our day jobs of, of running our respective businesses, I sold Represently and then as you mentioned, kind of did like consulting stuff to figure out the next sort of thing to, to work on, to get exposed to various problems. Brad appointed a new CEO at his prior business and we got together that summer of 2019.[00:09:27] We said, Hey, you know, maybe we should return to that idea that caught a lot of people's attention and shows what's possible. And you know what, what kind of gives, like the future is here. And we have no one's done anything since. No one's done anything. So why is, why are there not these, these apps proliferated everywhere.[00:09:42] Yeah. And so we said, you know, what we'll do is, um, to add this software layer to the real world. Will build, um, kinda like a super app where if you pointed it at anything, it will recognize it and then you can interact with it. We'll release a developer platform and allow people to make their own interfaces, interactivity for whatever object they're looking at.[00:10:04] And we decided to start with board games because one, we had a little bit of history there with, with Sudoku two, there's social by default. So if one person, you know finds it, then they'd probably share it among their friend. Group three. There's actually relatively few barriers to entry aside from like, you know, using someone else's brand name in your, your marketing materials.[00:10:19] Yeah. But other than that, there's no real, uh, inhibitors to getting things going and, and four, it's, it's just fun. It would be something that'd be bring us enjoyment to work on. So we spent that summer making, uh, boggle the four by four word game provable, where, you know, unlike Magic Sudoku, which to be clear, totally ruins the game, uh, you, you have to solve Sudoku puzzle.[00:10:40] You don't need to do anything else. But with Boggle, if you and I are playing, we might not find all of the words that adjacent letter tiles. Unveil. So if we have a, an AI tell us, Hey, here's like the best combination of letters that make high scoring words. And so we, we made boggle and released it and that, and that did okay.[00:10:56] I mean maybe the most interesting story was there's a English as a second language program in, in Canada that picked it up and used it as a part of their curriculum to like build vocabulary, which I thought was kind of inspiring. Example, and what happens just when you put things on the internet and then.[00:11:09] We wanted to build one for chess. So this is where you mentioned we went to 2019. TechCrunch Disrupt TechCrunch. Disrupt holds a Hackathon. And this is actually, you know, when Brad and I say we really became co-founders, because we fly out to San Francisco, we rent a hotel room in the Tenderloin. We, uh, we, we, uh, have one room and there's like one, there's room for one bed, and then we're like, oh, you said there was a cot, you know, on the, on the listing.[00:11:32] So they like give us a little, a little cot, the end of the cot, like bled and over into like the bathroom. So like there I am sleeping on the cot with like my head in the bathroom and the Tenderloin, you know, fortunately we're at a hackathon glamorous. Yeah. There wasn't, there wasn't a ton of sleep to be had.[00:11:46] There is, you know, we're, we're just like making and, and shipping these, these sorts of many[00:11:50] people with this hack. So I've never been to one of these things, but[00:11:52] they're huge. Right? Yeah. The Disrupt Hackathon, um, I don't, I don't know numbers, but few hundreds, you know, classically had been a place where it launched a lot of famous Yeah.[00:12:01] Sort of flare. Yeah. And I think it's, you know, kind of slowed down as a place for true company generation. But for us, Brad and I, who likes just doing hackathons, being, making things in compressed time skills, it seemed like a, a fun thing to do. And like I said, we'd been working on things, but it was only there that like, you're, you're stuck in a maybe not so great glamorous situation together and you're just there to make a, a program and you wanna make it be the best and compete against others.[00:12:26] And so we add support to the app that we were called was called Board Boss. We couldn't call it anything with Boggle cause of IP rights were called. So we called it Board Boss and it supported Boggle and then we were gonna support chess, which, you know, has no IP rights around it. Uh, it's an open game.[00:12:39] And we did so in 48 hours, we built an app that, or added fit capability to. Point your phone at a chess board. It understands the state of the chess board and converts it to um, a known notation. Then it passes that to stock fish, the open source chess engine for making move recommendations and it makes move recommendations to, to players.[00:13:00] So you could either play against like an ammunition to AI or improve your own game. We learn that one of the key ways users like to use this was just to record their games. Cuz it's almost like reviewing game film of what you should have done differently. Game. Yeah, yeah, exactly. And I guess the highlight of, uh, of chess Boss was, you know, we get to the first round of judging, we get to the second round of judging.[00:13:16] And during the second round of judging, that's when like, TechCrunch kind of brings around like some like celebs and stuff. They'll come by. Evan Spiegel drops by Ooh. Oh, and he uh, he comes up to our, our, our booth and um, he's like, oh, so what does, what does this all do? And you know, he takes an interest in it cuz the underpinnings of, of AR interacting with the.[00:13:33] And, uh, he is kinda like, you know, I could use this to like cheat on chess with my friends. And we're like, well, you know, that wasn't exactly the, the thesis of why we made it, but glad that, uh, at least you think it's kind of neat. Um, wait, but he already started Snapchat by then? Oh, yeah. Oh yeah. This, this is 2019, I think.[00:13:49] Oh, okay, okay. Yeah, he was kind of just checking out things that were new and, and judging didn't end up winning any, um, awards within Disrupt, but I think what we won was actually. Maybe more important maybe like the, the quote, like the co-founders medal along the way. Yep. The friends we made along the way there we go to, to play to the meme.[00:14:06] I would've preferred to win, to be clear. Yes. You played a win. So you did win, uh,[00:14:11] $15,000 from some Des Moines, uh, con[00:14:14] contest. Yeah. Yeah. The, uh, that was nice. Yeah. Slightly after that we did, we did win. Um, some, some grants and some other things for some of the work that we've been doing. John Papa John supporting the, uh, the local tech scene.[00:14:24] Yeah. Well, so there's not the one you're thinking of. Okay. Uh, there's a guy whose name is Papa John, like that's his, that's his, that's his last name. His first name is John. So it's not the Papa John's you're thinking of that has some problematic undertones. It's like this guy who's totally different. I feel bad for him.[00:14:38] His press must just be like, oh, uh, all over the place. But yeah, he's this figure in the Iowa entrepreneurial scene who, um, he actually was like doing SPACs before they were cool and these sorts of things, but yeah, he funds like grants that encourage entrepreneurship in the state. And since we'd done YC and in the state, we were eligible for some of the awards that they were providing.[00:14:56] But yeah, it was disrupt that we realized, you know, um, the tools that we made, you know, it took us better part of a summer to add Boggle support and it took us 48 hours to add chest support. So adding the ability for programmable interfaces for any object, we built a lot of those internal tools and our apps were kind of doing like the very famous shark fin where like it picks up really fast, then it kind of like slowly peters off.[00:15:20] Mm-hmm. And so we're like, okay, if we're getting these like shark fin graphs, we gotta try something different. Um, there's something different. I remember like the week before Thanksgiving 2019 sitting down and we wrote this Readme for, actually it's still the Readme at the base repo of Robo Flow today has spent relatively unedited of the manifesto.[00:15:36] Like, we're gonna build tools that enable people to make the world programmable. And there's like six phases and, you know, there's still, uh, many, many, many phases to go into what we wrote even at that time to, to present. But it's largely been, um, right in line with what we thought we would, we would do, which is give engineers the tools to add software to real world objects, which is largely predicated on computer vision. So finding the right images, getting the right sorts of video frames, maybe annotating them, uh, finding the right sort of models to use to do this, monitoring the performance, all these sorts of things. And that from, I mean, we released that in early 2020, and it's kind of, that's what's really started to click.[00:16:12] Why Computer Vision[00:16:12] Awesome. I think we should just kind[00:16:13] of[00:16:14] go right into where you are today and like the, the products that you offer, just just to give people an overview and then we can go into the, the SAM stuff. So what is the clear, concise elevator pitch? I think you mentioned a bunch of things like make the world programmable so you don't ha like computer vision is a means to an end.[00:16:30] Like there's, there's something beyond that. Yeah.[00:16:32] I mean, the, the big picture mission for the business and the company and what we're working on is, is making the world programmable, making it read and write and interactive, kind of more entertaining, more e. More fun and computer vision is the technology by which we can achieve that pretty quickly.[00:16:48] So like the one liner for the, the product in, in the company is providing engineers with the tools for data and models to build programmable interfaces. Um, and that can be workflows, that could be the, uh, data processing, it could be the actual model training. But yeah, Rob helps you use production ready computer vision workflows fast.[00:17:10] And I like that.[00:17:11] In part of your other pitch that I've heard, uh, is that you basically scale from the very smallest scales to the very largest scales, right? Like the sort of microbiology use case all the way to[00:17:20] astronomy. Yeah. Yeah. The, the joke that I like to make is like anything, um, underneath a microscope and, and through a telescope and everything in between needs to, needs to be seen.[00:17:27] I mean, we have people that run models in outer space, uh, underwater remote places under supervision and, and known places. The crazy thing is that like, All parts of, of not just the world, but the universe need to be observed and understood and acted upon. So vision is gonna be, I dunno, I feel like we're in the very, very, very beginnings of all the ways we're gonna see it.[00:17:50] Computer Vision Use Cases[00:17:50] Awesome. Let's go into a lo a few like top use cases, cuz I think that really helps to like highlight the big names that you've, big logos that you've already got. I've got Walmart and Cardinal Health, but I don't, I don't know if you wanna pull out any other names, like, just to illustrate, because the reason by the way, the reason I think that a lot of developers don't get into computer vision is because they think they don't need it.[00:18:11] Um, or they think like, oh, like when I do robotics, I'll do it. But I think if, if you see like the breadth of use cases, then you get a little bit more inspiration as to like, oh, I can use[00:18:19] CVS lfa. Yeah. It's kind of like, um, you know, by giving, by making it be so straightforward to use vision, it becomes almost like a given that it's a set of features that you could power on top of it.[00:18:32] And like you mentioned, there's, yeah, there's Fortune One there over half the Fortune 100. I've used the, the tools that Robel provides just as much as 250,000 developers. And so over a quarter million engineers finding and developing and creating various apps, and I mean, those apps are, are, are far and wide.[00:18:49] Just as you mentioned. I mean everything from say, like, one I like to talk about was like sushi detection of like finding the like right sorts of fish and ingredients that are in a given piece of, of sushi that you're looking at to say like roof estimation of like finding. If there's like, uh, hail damage on, on a given roof, of course, self-driving cars and understanding the scenes around us is sort of the, you know, very early computer vision everywhere.[00:19:13] Use case hardhat detection, like finding out if like a given workplace is, is, is safe, uh, disseminate, have the right p p p on or p p e on, are there the right distance from various machines? A huge place that vision has been used is environmental monitoring. Uh, what's the count of species? Can we verify that the environment's not changing in unexpected ways or like river banks are become, uh, becoming recessed in ways that we anticipate from satellite imagery, plant phenotyping.[00:19:37] I mean, people have used these apps for like understanding their plants and identifying them. And that dataset that's actually largely open, which is what's given a proliferation to the iNaturalist, is, is that whole, uh, hub of, of products. Lots of, um, people that do manufacturing. So, like Rivian for example, is a Rubal customer, and you know, they're trying to scale from 1000 cars to 25,000 cars to a hundred thousand cars in very short order.[00:20:00] And that relies on having the. Ability to visually ensure that every part that they're making is produced correctly and right in time. Medical use cases. You know, there's actually, this morning I was emailing with a user who's accelerating early cancer detection through breaking apart various parts of cells and doing counts of those cells.[00:20:23] And actually a lot of wet lab work that folks that are doing their PhDs or have done their PhDs are deeply familiar with that is often required to do very manually of, of counting, uh, micro plasms or, or things like this. There's. All sorts of, um, like traffic counting and smart cities use cases of understanding curb utilization to which sort of vehicles are, are present.[00:20:44] Uh, ooh. That can be[00:20:46] really good for city planning actually.[00:20:47] Yeah. I mean, one of our customers does exactly this. They, they measure and do they call it like smart curb utilization, where uhhuh, they wanna basically make a curb be almost like a dynamic space where like during these amounts of time, it's zoned for this during these amounts of times.[00:20:59] It's zoned for this based on the flows and e ebbs and flows of traffic throughout the day. So yeah, I mean the, the, the truth is that like, you're right, it's like a developer might be like, oh, how would I use vision? And then all of a sudden it's like, oh man, all these things are at my fingertips. Like I can just, everything you can see.[00:21:13] Yeah. Right. I can just, I can just add functionality for my app to understand and ingest the way, like, and usually the way that someone gets like almost nerd sniped into this is like, they have like a home automation project, so it's like send Yeah. Give us a few. Yeah. So send me a text when, um, a package shows up so I can like prevent package theft so I can like go down and grab it right away or.[00:21:29] We had a, uh, this one's pretty, pretty niche, but it's pretty funny. There was this guy who, during the pandemic wa, wanted to make sure his cat had like the proper, uh, workout. And so I've shared the story where he basically decided that. He'd make a cat workout machine with computer vision, you might be alone.[00:21:43] You're like, what does that look like? Well, what he decided was he would take a robotic arm strap, a laser pointer to it, and then train a machine to recognize his cat and his cat only, and point the laser pointer consistently 10 feet away from the cat. There's actually a video of you if you type an YouTube cat laser turret, you'll find Dave's video.[00:22:01] Uh, and hopefully Dave's cat has, has lost the weight that it needs to, cuz that's just the, that's an intense workout I have to say. But yeah, so like, that's like a, um, you know, these, uh, home automation projects are pretty common places for people to get into smart bird feeders. I've seen people that like are, are logging and understanding what sort of birds are, uh, in their background.[00:22:18] There's a member of our team that was working on actually this as, as a whole company and has open sourced a lot of the data for doing bird species identification. And now there's, I think there's even a company that's, uh, founded to create like a smart bird feeder, like captures photos and tells you which ones you've attracted to your yard.[00:22:32] I met that. Do, you know, get around the, uh, car sharing company that heard it? Them never used them. They did a SPAC last year and they had raised at like, They're unicorn. They raised at like 1.2 billion, I think in the, the prior round and inspected a similar price. I met the CTO of, of Getaround because he was, uh, using Rob Flow to hack into his Tesla cameras to identify other vehicles that are like often nearby him.[00:22:56] So he's basically building his own custom license plate recognition, and he just wanted like, keep, like, keep tabs of like, when he drives by his friends or when he sees like regular sorts of folks. And so he was doing like automated license plate recognition by tapping into his, uh, camera feeds. And by the way, Elliot's like one of the like OG hackers, he was, I think one of the very first people to like, um, she break iPhones and, and these sorts of things.[00:23:14] Mm-hmm. So yeah, the project that I want, uh, that I'm gonna work on right now for my new place in San Francisco is. There's two doors. There's like a gate and then the other door. And sometimes we like forget to close, close the gate. So like, basically if it sees that the gate is open, it'll like send us all a text or something like this to make sure that the gate is, is closed at the front of our house.[00:23:32] That's[00:23:32] really cool. And I'll, I'll call out one thing that readers and listeners can, uh, read out on, on your history. One of your most popular initial, um, viral blog post was about, um, autonomous vehicle data sets and how, uh, the one that Udacity was using was missing like one third of humans. And, uh, it's not, it's pretty problematic for cars to miss humans.[00:23:53] Yeah, yeah, actually, so yeah, the Udacity self-driving car data set, which look to their credit, it was just meant to be used for, for academic use. Um, and like as a part of courses on, on Udacity, right? Yeah. But the, the team that released it, kind of hastily labeled and let it go out there to just start to use and train some models.[00:24:11] I think that likely some, some, uh, maybe commercial use cases maybe may have come and, and used, uh, the dataset, who's to say? But Brad and I discovered this dataset. And when we were working on dataset improvement tools at Rob Flow, we ran through our tools and identified some like pretty, as you mentioned, key issues.[00:24:26] Like for example, a lot of strollers weren't labeled and I hope our self-driving cars do those, these sorts of things. And so we relabeled the whole dataset by hand. I have this very fond memory is February, 2020. Brad and I are in Taiwan. So like Covid is actually just, just getting going. And the reason we were there is we were like, Hey, we can work on this from anywhere for a little bit.[00:24:44] And so we spent like a, uh, let's go closer to Covid. Well, you know, I like to say we uh, we got early indicators of, uh, how bad it was gonna be. I bought a bunch of like N 90 fives before going o I remember going to the, the like buying a bunch of N 95 s and getting this craziest look like this like crazy tin hat guy.[00:25:04] Wow. What is he doing? And then here's how you knew. I, I also got got by how bad it was gonna be. I left all of them in Taiwan cuz it's like, oh, you all need these. We'll be fine over in the us. And then come to find out, of course that Taiwan was a lot better in terms of, um, I think, yeah. Safety. But anyway, we were in Taiwan because we had planned this trip and you know, at the time we weren't super sure about the, uh, covid, these sorts of things.[00:25:22] We always canceled it. We didn't, but I have this, this very specific time. Brad and I were riding on the train from Clay back to Taipei. It's like a four hour ride. And you mentioned Pioneer earlier, we were competing in Pioneer, which is almost like a gamified to-do list. Mm-hmm. Every week you say what you're gonna do and then other people evaluate.[00:25:37] Did you actually do the things you said you were going to do? One of the things we said we were gonna do was like this, I think re-release of this data set. And so it's like late, we'd had a whole week, like, you know, weekend behind us and, uh, we're on this train and it was very unpleasant situation, but we relabeled this, this data set, and one sitting got it submitted before like the Sunday, Sunday countdown clock starts voting for, for.[00:25:57] And, um, once that data got out back out there, just as you mentioned, it kind of picked up and Venture beat, um, noticed and wrote some stories about it. And we really rereleased of course, the data set that we did our best job of labeling. And now if anyone's listening, they can probably go out and like find some errors that we surely still have and maybe call us out and, you know, put us, put us on blast.[00:26:15] The Economics of Annotation (Segmentation)[00:26:15] But,[00:26:16] um, well, well the reason I like this story is because it, it draws attention to the idea that annotation is difficult and basically anyone looking to use computer vision in their business who may not have an off-the-shelf data set is going to have to get involved in annotation. And I don't know what it costs.[00:26:34] And that's probably one of the biggest hurdles for me to estimate how big a task this is. Right? So my question at a higher level is tell the customers, how do you tell customers to estimate the economics of annotation? Like how many images do, do we need? How much, how long is it gonna take? That, that kinda stuff.[00:26:50] How much money and then what are the nuances to doing it well, right? Like, cuz obviously Udacity had a poor quality job, you guys had proved it, and there's errors every everywhere. Like where do[00:26:59] these things go wrong? The really good news about annotation in general is that like annotation of course is a means to an end to have a model be able to recognize a thing.[00:27:08] Increasingly there's models that are coming out that can recognize things zero shot without any annotation, which we're gonna talk about. Yeah. Which, we'll, we'll talk more about that in a moment. But in general, the good news is that like the trend is that annotation is gonna become decreasingly a blocker to starting to use computer vision in meaningful ways.[00:27:24] Now that said, just as you mentioned, there's a lot of places where you still need to do. Annotation. I mean, even with these zero shot models, they might have of blind spots, or maybe you're a business, as you mentioned, that you know, it's proprietary data. Like only Rivian knows what a rivian is supposed to look like, right?[00:27:39] Uh, at the time of, at the time of it being produced, like underneath the hood and, and all these sorts of things. And so, yeah, that's gonna necessarily require annotation. So your question of how long is it gonna take, how do you estimate these sorts of things, it really comes down to the complexity of the problem that you're solving and the amount of variance in the scene.[00:27:57] So let's give some contextual examples. If you're trying to recognize, we'll say a scratch on one specific part and you have very strong lighting. You might need fewer images because you control the lighting, you know the exact part and maybe you're lucky in the scratch. Happens more often than not in similar parts or similar, uh, portions of the given part.[00:28:17] So in that context, you, you, the function of variance, the variance is, is, is lower. So the number of images you need is also lower to start getting up to work. Now the orders of magnitude we're talking about is that like you can have an initial like working model from like 30 to 50 images. Yeah. In this context, which is shockingly low.[00:28:32] Like I feel like there's kind of an open secret in computer vision now, the general heuristic that often. Users, is that like, you know, maybe 200 images per class is when you start to have a model that you can rely[00:28:45] on? Rely meaning like 90, 99, 90, 90%, um,[00:28:50] uh, like what's 85 plus 85? Okay. Um, that's good. Again, these are very, very finger in the wind estimates cuz the variance we're talking about.[00:28:59] But the real question is like, at what point, like the framing is not like at what point do it get to 99, right? The framing is at what point can I use this thing to be better than the alternative, which is humans, which maybe humans or maybe like this problem wasn't possible at all. And so usually the question isn't like, how do I get to 99?[00:29:15] A hundred percent? It's how do I ensure that like the value I am able to get from putting this thing in production is greater than the alternative? In fact, even if you have a model that's less accurate than humans, there might be some circumstances where you can tolerate, uh, a greater amount of inaccuracy.[00:29:32] And if you look at the accuracy relative to the cost, Using a model is extremely cheap. Using a human for the same sort of task can be very expensive. Now, in terms of the actual accuracy of of what you get, there's probably some point at which the cost, but relative accuracy exceeds of a model, exceeds the high cost and hopefully high accuracy of, of a human comparable, like for example, there's like cameras that will track soccer balls or track events happening during sporting matches.[00:30:02] And you can go through and you know, we actually have users that work in sports analytics. You can go through and have a human. Hours and hours of footage. Cuz not just watching their team, they're watching every other team, they're watching scouting teams, they're watching junior teams, they're watching competitors.[00:30:15] And you could have them like, you know, track and follow every single time the ball goes within blank region of the field or every time blank player goes into, uh, this portion of the field. And you could have, you know, exact, like a hundred percent accuracy if that person, maybe, maybe not a hundred, a human may be like 95, 90 7% accuracy of every single time the ball is in this region or this player is on the field.[00:30:36] Truthfully, maybe if you're scouting analytics, you actually don't need 97% accuracy of knowing that that player is on the field. And in fact, if you can just have a model run at a 1000th, a 10000th of the cost and goes through and finds all the times that Messi was present on the field mm-hmm. That the ball was in this region of the.[00:30:54] Then even if that model is slightly less accurate, the cost is just so orders of magnitude different. And the stakes like the stakes of this problem, of knowing like the total number of minutes that Messi played will say are such that we have a higher air tolerance, that it's a no-brainer to start to use Yeah, a computer vision model in this context.[00:31:12] So not every problem requires equivalent or greater human performance. Even when it does, you'd be surprised at how fast models get there. And in the times when you, uh, really look at a problem, the question is, how much accuracy do I need to start to get value from this? This thing, like the package example is a great one, right?[00:31:27] Like I could in theory set up a camera that's constantly watching in front of my porch and I could watch the camera whenever I have a package and then go down. But of course, I'm not gonna do that. I value my time to do other sorts of things instead. And so like there, there's this net new capability of, oh, great, I can have an always on thing that tells me when a package shows up, even if you know the, the thing that's gonna text me.[00:31:46] When a package shows up, let's say a flat pack shows up instead of a box and it doesn't know what a flat pack likes, looks like initially. Doesn't matter. It doesn't matter because I didn't have this capability at all before. And I think that's the true case where a lot of computer vision problems exist is like it.[00:32:00] It's like you didn't even have this capability, this superpower before at all, let alone assigning a given human to do the task. And that's where we see like this explosion of, of value.[00:32:10] Awesome. Awesome. That was a really good overview. I want to leave time for the others, but I, I really want to dive into a couple more things with regards to Robo Flow.[00:32:17] Computer Vision Annotation Formats[00:32:17] So one is, apparently your original pitch for Robo Flow was with regards to conversion tools for computer vision data sets. And I'm sure as, as a result of your job, you have a lot of rants. I've been digging for rants basically on like the best or the worst annotation formats. What do we know? Cause most of us, oh my gosh, we only know, like, you know, I like,[00:32:38] okay, so when we talk about computer vision annotation formats, what we're talking about is if you have an image and you, you picture a boing box around my face on that image.[00:32:46] Yeah. How do you describe where that Monty box is? X, Y, Z X Y coordinates. Okay. X, y coordinates. How, what do you mean from the top lefts.[00:32:52] Okay. You, you, you, you take X and Y and then, and then the. The length and, and the width of the, the[00:32:58] box. Okay. So you got like a top left coordinate and like the bottom right coordinate or like the, the center of the bottom.[00:33:02] Yeah. Yeah. Top, left, bottom right. Yeah. That's one type of format. Okay. But then, um, I come along and I'm like, you know what? I want to do a different format where I wanna just put the center of the box, right. And give the length and width. Right. And by the way, we didn't even talk about what X and Y we're talking about.[00:33:14] Is X a pixel count? Is a relative pixel count? Is it an absolute pixel count? So the point is, the number of ways to describe where a box lives in a freaking image is endless, uh, seemingly and. Everyone decided to kind of create their own different ways of describing the coordinates and positions of where in this context of bounding Box is present.[00:33:39] Uh, so there's some formats, for example, that like use re, so for the x and y, like Y is, uh, like the left, most part of the image is zero. And the right most part of the image is one. So the, the coordinate is like anywhere from zero to one. So 0.6 is, you know, 60% of your way right up the image to describe the coordinate.[00:33:53] I guess that was, that was X instead of Y. But the point is there, of the zero to one is the way that we determined where that was in the position, or we're gonna do an absolute pixel position anyway. We got sick, we got sick of all these different annotation formats. So why do you even have to convert between formats?[00:34:07] Is is another part of this, this story. So different training frameworks, like if you're using TensorFlow, you need like TF Records. If you're using PyTorch, it's probably gonna be, well it depends on like what model you're using, but someone might use Coco JSON with PyTorch. Someone else might use like a, just a YAML file and a text file.[00:34:21] And to describe the cor it's point is everyone that creates a model. Or creates a dataset rather, has created different ways of describing where and how a bounding box is present in the image. And we got sick of all these different formats and doing these in writing all these different converter scripts.[00:34:39] And so we made a tool that just converts from one script, one type of format to another. And the, the key thing is that like if you get that converter script wrong, your model doesn't not work. It just fails silently. Yeah. Because the bounding boxes are now all in the wrong places. And so you need a way to visualize and be sure that your converter script, blah, blah blah.[00:34:54] So that was the very first tool we released of robo. It was just a converter script, you know, like these, like these PDF to word converters that you find. It was basically that for computer vision, like dead simple, really annoying thing. And we put it out there and people found some, some value in, in that.[00:35:08] And you know, to this day that's still like a surprisingly painful[00:35:11] problem. Um, yeah, so you and I met at the Dall-E Hackathon at OpenAI, and we were, I was trying to implement this like face masking thing, and I immediately ran into that problem because, um, you know, the, the parameters that Dall-E expected were different from the one that I got from my face, uh, facial detection thing.[00:35:28] One day it'll go away, but that day is not today. Uh, the worst format that we work with is, is. The mart form, it just makes no sense. And it's like, I think, I think it's a one off annotation format that this university in China started to use to describe where annotations exist in a book mart. I, I don't know, I dunno why that So best[00:35:45] would be TF record or some something similar.[00:35:48] Yeah, I think like, here's your chance to like tell everybody to use one one standard and like, let's, let's, can[00:35:53] I just tell them to use, we have a package that does this for you. I'm just gonna tell you to use the row full package that converts them all, uh, for you. So you don't have to think about this. I mean, Coco JSON is pretty good.[00:36:04] It's like one of the larger industry norms and you know, it's in JS O compared to like V xml, which is an XML format and Coco json is pretty descriptive, but you know, it has, has its own sort of drawbacks and flaws and has random like, attribute, I dunno. Um, yeah, I think the best way to handle this problem is to not have to think about it, which is what we did.[00:36:21] We just created a, uh, library that, that converts and uses things. Uh, for us. We've double checked the heck out of it. There's been hundreds of thousands of people that have used the library and battle tested all these different formats to find those silent errors. So I feel pretty good about no longer having to have a favorite format and instead just rely on.[00:36:38] Dot load in the format that I need. Great[00:36:41] Intro to Computer Vision Segmentation[00:36:41] service to the community. Yeah. Let's go into segmentation because is at the top of everyone's minds, but before we get into segment, anything, I feel like we need a little bit of context on the state-of-the-art prior to Sam, which seems to be YOLO and uh, you are the leading expert as far as I know.[00:36:56] Yeah.[00:36:57] Computer vision, there's various task types. There's classification problems where we just like assign tags to images, like, you know, maybe safe work, not safe work, sort of tagging sort of stuff. Or we have object detection, which are the boing boxes that you see and all the formats I was mentioning in ranting about there's instant segmentation, which is the polygon shapes and produces really, really good looking demos.[00:37:19] So a lot of people like instant segmentation.[00:37:21] This would be like counting pills when you point 'em out on the, on the table. Yeah. So, or[00:37:25] soccer players on the field. So interestingly, um, counting you could do with bounding boxes. Okay. Cause you could just say, you know, a box around a person. Well, I could count, you know, 12 players on the field.[00:37:35] Masks are most useful. Polygons are most useful if you need very precise area measurements. So you have an aerial photo of a home and you want to know, and the home's not a perfect box, and you want to know the rough square footage of that home. Well, if you know the distance between like the drone and, and the ground.[00:37:53] And you have the precise polygon shape of the home, then you can calculate how big that home is from aerial photos. And then insurers can, you know, provide say accurate estimates and that's maybe why this is useful. So polygons and, and instant segmentation are, are those types of tasks? There's a key point detection task and key point is, you know, if you've seen those demos of like all the joints on like a hand kind of, kind of outlined, there's visual question answering tasks, visual q and a.[00:38:21] And that's like, you know, some of the stuff that multi-modality is absolutely crushing for, you know, here's an image, tell me what food is in this image. And then you can pass that and you can make a recipe out of it. But like, um, yeah, the visual question in answering task type is where multi-modality is gonna have and is already having an enormous impact.[00:38:40] So that's not a comprehensive survey, very problem type, but it's enough to, to go into why SAM is significant. So these various task types, you know, which model to use for which given circumstance. Most things is highly dependent on what you're ultimately aiming to do. Like if you need to run a model on the edge, you're gonna need a smaller model, cuz it is gonna run on edge, compute and process in, in, in real time.[00:39:01] If you're gonna run a model on the cloud, then of course you, uh, generally have more compute at your disposal Considerations like this now, uh,[00:39:08] YOLO[00:39:08] just to pause. Yeah. Do you have to explain YOLO first before you go to Sam, or[00:39:11] Yeah, yeah, sure. So, yeah. Yeah, we should. So object detection world. So for a while I talked about various different task types and you can kinda think about a slide scale of like classification, then obvious detection.[00:39:20] And on the right, at most point you have like segmentation tasks. Object detection. The bounding boxes is especially useful for a wide, like it's, it's surprisingly versatile. Whereas like classification is kind of brittle. Like you only have a tag for the whole image. Well, that doesn't, you can't count things with tags.[00:39:35] And on the other hand, like the mask side of things, like drawing masks is painstaking. And so like labeling is just a bit more difficult. Plus like the processing to produce masks requires more compute. And so usually a lot of folks kind of landed for a long time on obvious detection being a really happy medium of affording you with rich capabilities because you can do things like count, track, measure.[00:39:56] In some CAGR context with bounding boxes, you can see how many things are present. You can actually get a sense of how fast something's moving by tracking the object or bounding box across multiple frames and comparing the timestamp of where it was across those frames. So obviously detection is a very common task type that solves lots of things that you want do with a given model.[00:40:15] In obviously detection. There's been various model frameworks over time. So kind of really early on there's like R-CNN uh, then there's faster rc n n and these sorts of family models, which are based on like resnet kind of architectures. And then a big thing happens, and that is single shot detectors. So faster, rc n n despite its name is, is very slow cuz it takes two passes on the image.[00:40:37] Uh, the first pass is, it finds par pixels in the image that are most interesting to, uh, create a bounding box candidate out of. And then it passes that to a, a classifier that then does classification of the bounding box of interest. Right. Yeah. You can see, you can see why that would be slow. Yeah. Cause you have to do two passes.[00:40:53] You know, kind of actually led by, uh, like mobile net was I think the first large, uh, single shot detector. And as its name implies, it was meant to be run on edge devices and mobile devices and Google released mobile net. So it's a popular implementation that you find in TensorFlow. And what single shot detectors did is they said, Hey, instead of looking at the image twice, what if we just kind of have a, a backbone that finds candidate bounding boxes?[00:41:19] And then we, we set loss functions for objectness. We set loss function. That's a real thing. We set loss functions for objectness, like how much obj, how object do this part of the images. We send a loss function for classification, and then we run the image through the model on a single pass. And that saves lots of compute time and you know, it's not necessarily as accurate, but if you have lesser compute, it can be extremely useful.[00:41:42] And then the advances in both modeling techniques in compute and data quality, single shot detectors, SSDs has become, uh, really, really popular. One of the biggest SSDs that has become really popular is the YOLO family models, as you described. And so YOLO stands for you only look once. Yeah, right, of course.[00:42:02] Uh, Drake's, uh, other album, um, so Joseph Redman introduces YOLO at the University of Washington. And Joseph Redman is, uh, kind of a, a fun guy. So for listeners, for an Easter egg, I'm gonna tell you to Google Joseph Redman resume, and you'll find, you'll find My Little Pony. That's all I'll say. And so he introduces the very first YOLO architecture, which is a single shot detector, and he also does it in a framework called Darknet, which is like this, this own framework that compiles the Cs, frankly, kind of tough to work with, but allows you to benefit from the speedups that advance when you operate in a low level language like.[00:42:36] And then he releases, well, what colloquially is known as YOLO V two, but a paper's called YOLO 9,000 cuz Joseph Redmond thought it'd be funny to have something over 9,000. So get a sense for, yeah, some fun. And then he releases, uh, YOLO V three and YOLO V three is kind of like where things really start to click because it goes from being an SSD that's very limited to competitive and, and, and superior to actually mobile That and some of these other single shot detectors, which is awesome because you have this sort of solo, I mean, him and and his advisor, Ali, at University of Washington have these, uh, models that are becoming really, really powerful and capable and competitive with these large research organizations.[00:43:09] Joseph Edmond leaves Computer Vision Research, but there had been Alexia ab, one of the maintainers of Darknet released Yola VI four. And another, uh, researcher, Glenn Yer, uh, jocker had been working on YOLO V three, but in a PyTorch implementation, cuz remember YOLO is in a dark implementation. And so then, you know, YOLO V three and then Glenn continues to make additional improvements to YOLO V three and pretty soon his improvements on Yolov theory, he's like, oh, this is kind of its own things.[00:43:36] Then he releases YOLO V five[00:43:38] with some naming[00:43:39] controversy that we don't have Big naming controversy. The, the too long didn't read on the naming controversy is because Glen was not originally involved with Darknet. How is he allowed to use the YOLO moniker? Roe got in a lot of trouble cuz we wrote a bunch of content about YOLO V five and people were like, ah, why are you naming it that we're not?[00:43:55] Um, but you know,[00:43:56] cool. But anyway, so state-of-the-art goes to v8. Is what I gather.[00:44:00] Yeah, yeah. So yeah. Yeah. You're, you're just like, okay, I got V five. I'll skip to the end. Uh, unless, unless there's something, I mean, I don't want, well, so I mean, there's some interesting things. Um, in the yolo, there's like, there's like a bunch of YOLO variants.[00:44:10] So YOLOs become this, like this, this catchall for various single shot, yeah. For various single shot, basically like runs on the edge, it's quick detection framework. And so there's, um, like YOLO R, there's YOLO S, which is a transformer based, uh, yolo, yet look like you only look at one sequence is what s stands were.[00:44:27] Um, the pp yo, which, uh, is PAT Paddle implementation, which is by, which Chinese Google is, is their implementation of, of TensorFlow, if you will. So basically YOLO has like all these variants. And now, um, yo vii, which is Glen has been working on, is now I think kind of like, uh, one of the choice models to use for single shot detection.[00:44:44] World Knowledge of Foundation Models[00:44:44] Well, I think a lot of those models, you know, Asking the first principal's question, like let's say you wanna find like a bus detector. Do you need to like go find a bunch of photos of buses or maybe like a chair detector? Do you need to go find a bunch of photos of chairs? It's like, oh no. You know, actually those images are present not only in the cocoa data set, but those are objects that exist like kind of broadly on the internet.[00:45:02] And so computer visions kind of been like us included, have been like really pushing for and encouraging models that already possess a lot of context about the world. And so, you know, if GB T's idea and i's idea OpenAI was okay, models can only understand things that are in their corpus. What if we just make their corpus the size of everything on the internet?[00:45:20] The same thing that happened in imagery, what's happening now? And that's kinda what Sam represents, which is kind of a new evolution of, earlier on we were talking about the cost of annotation and I said, well, good news. Annotations then become decreasingly necessary to start to get to value. Now you gotta think about it more, kind of like, you'll probably need to do some annotation because you might want to find a custom object, or Sam might not be perfect, but what's about to happen is a big opportunity where you want the benefits of a yolo, right?[00:45:47] Where it can run really fast, it can run on the edge, it's very cheap. But you want the knowledge of a large foundation model that already knows everything about buses and knows everything about shoes, knows everything about real, if the name is true, anything segment, anything model. And so there's gonna be this novel opportunity to take what these large models know, and I guess it's kind of like a form of distilling, like distill them down into smaller architectures that you can use in versatile ways to run in real time to run on the edge.[00:46:13] And that's now happening. And what we're seeing in actually kind of like pulling that, that future forward with, with, with Robo Flow.[00:46:21] Segment Anything Model[00:46:21] So we could talk a bit about, um, about SAM and what it represents maybe into, in relation to like these, these YOLO models. So Sam is Facebook segment Everything Model. It came out last week, um, the first week of April.[00:46:34] It has 24,000 GitHub stars at the time of, of this recording within its first week. And why, what does it do? Segment? Everything is a zero shot segmentation model. And as we're describing, creating masks is a very arduous task. Creating masks of objects that are not already represented means you have to go label a bunch of masks and then train a model and then hope that it finds those masks in new images.[00:47:00] And the promise of Segment anything is that in fact you just pass at any image and it finds all of the masks of relevant things that you might be curious about finding in a given image. And it works remarkably. Segment anything in credit to Facebook and the fair Facebook research team, they not only released the model permissive license to move things forward, they released the full data set, all 11 million images and 1.1 billion segmentation masks and three model sizes.[00:47:29] The largest ones like 2.5 gigabytes, which is not enormous. Medium ones like 1.2 and the smallest one is like 400, 3 75 megabytes. And for context,[00:47:38] for, for people listening, that's six times more than the previous alternative, which, which is apparently open images, uh, in terms of number images, and then 400 times more masks than open[00:47:47] images as well.[00:47:48] Exactly, yeah. So huge, huge order magnitude gain in terms of dataset accessibility plus like the model and how it works. And so the question becomes, okay, so like segment. What, what do I do with this? Like, what does it allow me to do? And it didn't Rob float well. Yeah, you should. Yeah. Um, it's already there.[00:48:04] You um, that part's done. Uh, but the thing that you can do with segment anything is you can almost, like, I almost think about like this, kinda like this model arbitrage where you can basically like distill down a giant model. So let's say like, like let's return to the package example. Okay. The package problem of, I wanna get a text when a package appears on my front porch before segment anything.[00:48:25] The way that I would go solve this problem is I would go collect some images of packages on my porch and I would label them, uh, with bounding boxes or maybe masks in that part. As you mentioned, it can be a long process and I would train a model. And that model it actually probably worked pretty well cause it's purpose-built.[00:48:44] The camera position, my porch, the packages I'm receiving. But that's gonna take some time, like everything that I just mentioned there is gonna take some time. Now with Segment, anything, what you can do is go take some photos of your porch. So we're, we're still, we're still getting that. And then we're asking segment anything, basically.[00:49:00] Do you see, like segment, everything you see here? And, you know, a limitation of segment anything right now is it gives you masks without labels, like text labels for those masks. So we can talk about the way to address that in a, in a moment. But the point is, it will find the package in, in your photo. And again, there might be some positions where it doesn't find the package, or sometimes thing things look a little bit differently and you're gonna have to like, fine tune or whatever.[00:49:22] But, okay, now you've got a, you've got the intelligence of a package finder. Now you wanna deploy that package. Well, you could either call the Segment Everything model api, which hosted on platforms like RoboFlow, and I'm sure other places as well. Or you could probably distill it down to a smaller model.[00:49:38] You can run on the edge, like you wanna run it maybe on like a raspberry pie that just is looking and finding, well, you can't run segment everything on a raspberry pie, but you can run a single shot detector. So you just take all the data that's been basically automatically labeled for you and then you distill it down and train in much, much more efficient, smaller model.[00:49:57] And then you deploy that model to the edge and this is sort of what's gonna be increasingly possible. By the way, this has already happened in in LLMs, right? Like for example, like GPT4 knows. A lot about a lot and people will distill it down in some ways by seeing all the, uh, like code completion will say, let's say you're building a code completion model.[00:50:16] GPT4 can do any type of completion in addition to code completion. If you want to build your own code completion model, cause that's the only task that you're worried about for the future you're building. You could R H L F on all of GPT4 s code completion examples, and then almost kind of use that as distilling down into your own version of a code completion model and almost, uh, have a cheaper, more readily available, simpler model that yes, it only does one task, but that's the only task you need.[00:50:43] And it's a model that you own and it's a model that you can. Deploy more lightly and get more value from. That's sort of what has been represented as possible with, with Segment anything. But that's just on the dataset prep side, right? Like segment anything means you can make your own background removal, you can make your own sort of video editing software.[00:50:59] You can make like any, this promise of trying to make the world be understood and, uh, viewable and programmable just got so much more accessible. Yeah,[00:51:10] that's an incredible overview. I think we should just get your takes on a couple of like, so this is a massive, massive release. There are a lot of sort of small little features that, uh, they, they spent and elaborated in the blog post and the paper.[00:51:24] So I'm gonna pull out a few things to discuss and obviously feel free to suggest anything that you really want to get off your chest.[00:51:29] SAM: Zero Shot Transfer[00:51:29] So, zero shot transfer is.[00:51:31] No. Okay. But, uh, this level of quality, yes, much better. Yeah. So you could rely on large models previously for doing zero shot, uh, detection. But as you mentioned, the scale and size of the data set and resulting model that was trained is, is so much superior.[00:51:48] And that's, uh,[00:51:49] I guess the benefit of having world, world knowledge, um, yes. And being able to rely on that. Okay.[00:51:53] SAM: Promptability[00:51:53] And then prompt model, this is new. I still don't really understand how they did[00:51:58] it. Okay. So, so Sam basically said, why don't we take these 11 million images, 1.1 billion masks, and we'll train a transformer and an image encoder on all of those images.[00:52:14] And that's basically the pre-training that we'll use for passing any candidate image through. We'll pass that through this image encoder. So that's the, um, backbone, if you will, of the model. Then the much lighter parts become, okay, so if I've got that image encoding. I need to interact and understand what's inside the image en coating.[00:52:31] And that's where the prompting comes into play. And that's where the, the mask decoder comes into play in, in the model architecture. So image comes in, it goes through the imaging coder. The image en coder is what took lots of time and resources to train and get the weights for of, of what is Sam. But at inference time, of course, you don't have to re refine those weights.[00:52:49] So image comes in, goes to the image en coder, then you have the image and bedding. And now to interact with that image and embed, that's where you're gonna be doing prompting and the decoding specifically, what comes out of, out of Sam at the image encoding step is a bunch of candidate masks. And those candidate masks are the ones that you say you want to interact with.[00:53:06] What's really cool is there's both prompts for saying like the thing that you're interested in, but then there's also, you can also say the way that you wanna pass a candidate for which mask you're interested in from Sam, is you can just like point and click and say, this is the part of the image I'm interested in.[00:53:24] SAM: Model Assisted Labeling[00:53:24] Which is exactly what, like a, a labeling interface would be, uh, useful for, as an example,[00:53:30] which they actually use to bootstrap their own annotation, it seems.[00:53:33] Exactly. Isn't that pretty cool? Yes, exactly. So this is, this is why I was mentioning earlier that like the way to solve a computer vision problem, you know, like waterfall development versus agile development.[00:53:41] Sure. The same thing, like in machine learning, uh, it took a, it took a little bit, but folks like, oh, we can do this in, in machine learning too. And the way you do it, machine learning is instead of saying, okay, waterfall, I'm gonna take all my images and label them all. Okay, I'm done with the labeling part, now I'm gonna go to the training part.[00:53:55] Okay, I'm done with that part. Now I'm gonna go to the deployment part. A much more agile look would be like, okay, if I have like 10,000 images, let's label the first like hundred and just see what we get and we'll train a model and now we're gonna use that model that we trained to help us label the next thousand images.[00:54:10] And then we're gonna do this on repeat. That's exactly what the SAM team did. Yeah. They first did assisted man, they call it assisted manual. Manual, yeah.[00:54:15] Yep. Yeah. Where, which is uh, 4.3 million mass from 120,000 images.[00:54:19] Exactly. And then semi-automatic, which[00:54:22] is 5.9 million mass and 180,000[00:54:24] images. And in that step, they were basically having the human annotators point out where Sam may have missed a mask and then they did fully auto, which[00:54:32] is the whole thing.[00:54:33] Yes. 11 million images and 1.1[00:54:35] billion mask. And that's where they said, Sam, do your thing and predict all the mask. We won't[00:54:39] even, we won't even judge. Yeah. We just[00:54:41] close our eyes, which is what people are suspecting is happening for training G P T five. Right. Is that we're creating a bunch of candidate task text from G P T four to use in training the, the next g PT five.[00:54:52] So, but by the way, that process, like, you don't have to be a Facebook to take advantage of that. Like That's exactly what, like people building with Rob Flow. That's what you do.[00:54:59] Exactly. That's, this is your tool. That's the onboarding[00:55:01] that I did. That's exactly it. Is that like, okay, like you've got a bunch of images, but just label a few of them first.[00:55:07] Now you've got a, I almost think about it like a, you know, co-pilot is the term now, but I almost, I used to describe it as like a, an army of interns, otherwise known as AI that works alongside you. To have a first guess at labeling images for you, and then you're just kinda like supervising and improving and doing better.[00:55:23] And that relationship is a lot more efficient, a lot more effective. And by the way, by doing it this way, you don't waste a bunch of time labeling images. Like, again, we label images and pursuit of making sure our model learns something. We don't label images to label images, which means if we can label the right images defined by which images most help our model learn things next we should.[00:55:45] So we should look and see where's our model most likely to fail, and then spend our time labeling those images. And that's, that's sort of the tooling that, that we work on, making that exact loop faster and easier. Yeah. Yeah.[00:55:54] I highly recommend everyone try it. It's takes a few minutes. It's, it's great.[00:55:58] It's great. Is there anything else in, in Sam that, Sam specifically that you wanna go over? Or do you wanna go to Robot[00:56:03] SAM doesn't have labels[00:56:03] Full plus Sam? I mentioned one key thing about Sam that it doesn't do, and that is it doesn't outta the box give you labels for your masks. Now the paper. Alludes to the researchers attempting to get that part figured out.[00:56:18] And I think that they will, I think that they were like, we're just gonna publish this first part of just doing all the masks. Cuz that alone is like incredibly transformative for what's possible in, in computer vision. But in the interim, what is happening is people stitching together different models to name those masks, right?[00:56:35] So imagine that you go to Sam and you say, here's an image, and then Sam makes perfect masks of everything in the image. Now you need to know what are these masks, what objects are in these masks? Isn't it[00:56:45] funny that Sam doesn't know because you, you just said it knows[00:56:48] everything. Yeah, it knows it's weird.[00:56:50] It knows all the candidate masks. And that's, that's because that was the function that it was Yeah. Dream for. Yeah. Right, right. Okay. But again, like this is, this is what's going, like this is exactly what multi-modality is going to have happen anyway. You solved it. Yeah. So, yeah, so, so there's a couple different solutions.[00:57:04] I mean, this is where it's. You're begging the question of like, what are you trying to do with Sam? Like if you wanna do Sam, and then you wanna distill it down to deploy a more purpose-built task-specific, faster, cheaper model that you own. Yeah. That's commonly, I think what's gonna happen. So in that context, you're using SAM to accelerate your labeling.[00:57:21] Another way you might wanna use Sam is just in prod outta the box. Like, Sam is gonna produce good candidate labels and I don't need to fine tune anything and I just wanna like, use that as is. Well, in both of these contexts, we need to know the names of the masks that Sam is finding, right? Because like, if we're using Sam to label our stuff, well, telling us the mask isn't so helpful.[00:57:39] Like, in my image of packages, it's like, did you label the door? Did you label the package? I, I need to know what this mask is. There's an[00:57:45] objects nest there. Yeah. That, uh, that we can tell.[00:57:49] Yeah. And so you can use Sam in combination with other models. And pretty soon this is gonna be a single model. Like this podcast is gonna gonna like, I'll make a bold prediction in 30 days.[00:57:59] Like someone will do it, someone will do it in a single model, but with two models. So there's a model, for example, called Grounding DINO. Mm-hmm. Which is zero. Bounding box prediction. Mm-hmm. And with labels, and you interact with Grounding DINO through text prompts. So you could say like, here's an image.[00:58:14] You know, you and I are seated here in the studio. There's cans in front of us. You could say, give me the left can, and it would label bounding box only around the can on the left, like it understands text in that way. So you could use the masks from Sam and then ask Grounding DINO, what are these things?[00:58:29] Or where is X in between the combination of those two things? Boom, you have an automatic working text description of the things that you have in mind. Now again, this isn't perfect, like there will be places that still require human in loop review, and especially like on the novelty of a data set. These things will be be dependent.[00:58:49] But the point is, yes, there's places to improve and yes, you're gonna need to use tooling to do those improvements. The point is like we're starting so far ahead in our process. We're no longer starting at just like, I've got some images, what do I do? We're starting at, I've got some images and candidate descriptions of what's in those images.[00:59:04] How do I now. Mesh these two things together to understand precisely what I want to know from these images. And then deploy this thing because that's where you ultimately capture the value, is deploying this thing and, and envision a lot of that means on the edge because you have things running out in fields where people aren't.[00:59:21] Um, and that usually means constrained compute,[00:59:23] Labeling on the Browser[00:59:23] part of the demo of segment. Anything runs in the browser as well, which is interesting to some people. I I'm not sure how what percent of it was done.[00:59:30] That's what's fascinating. Um, because, and the reason it can do that, right, is because again, the giant image encoder, so remember the steps?[00:59:36] Yeah. It takes an image, the image encoder, and then you prompt from that image encoder. The image en coder is a large model and you need a spun up GPU to run the ongoing encoding that requires meaningful compute. Yeah. But the prompting can run in the browser. It's that lightweight, which means you can provide really fast feedback.[00:59:54] And that's exactly what we did at Robo Flow is we. Sam, and we made it be the world's best labeling tool. Like you can click on anything and Sam immediately says, this is what you wanted. The thing that you wanted to label is in these, this pixel coordinates area. And to be clear, we already had like this like kind of, we call it smart poly, like this thing that, like you could click and it would make regions of, of guesses of interest.[01:00:18] Sam is just such a stepwise improvement that will show, I mean, things that used to take maybe five or six clicks, you can, Sam immediately understands in one click. In one click.[01:00:28] Roboflow +SAM Video Demo[01:00:28] Cool. I, I think we might search over to the, uh, demo, but yeah, I think this is the, the time that we switch to a multimodal podcast and, uh, have a first screen share.[01:00:38] Amazing. So I'll semi nari what's, uh, what's going on, but, uh, we are checking out Joseph's screen and this is the interface of Robo flow. We have, we have Robo Flow before Sam and we have Robo Post Sam, and we're gonna see what, uh, the quality[01:00:53] difference is. Okay, so here is, uh, an image where we have a given weld that we're interested in segmenting this portion of the weld where these two pipes come together.[01:01:06] Yeah. And the weld is highly[01:01:06] irregular. It's kind of like curved in, in both in three dimensions. So it's just not a typical easily segmentable[01:01:13] thing. Yeah. To the human eye. Like pic eye could figure out, you know, probably where this weld starts and stops. But that's gonna take a lot of clicks. Certainly.[01:01:21] Like we could go through and like, we could, you know, this would be like the really old fashioned way of like creating, apparently[01:01:27] this is how they did, uh, lightsabers, that you had to like, mask out lightsabers and then use of the sub in on the, the lights. And you did it for every. So just really super expensive cuz they didn't have any other options.[01:01:39] Wow. And now it's one click in runway.[01:01:41] Wow. Wow. Okay. So open call for someone to make a light saber simulator using Robo Flow. That's awesome. You haven't had one? Not a, I'm aware. Okay. Oh my God, that's a great idea. Yeah. Yeah. Alright. Okay. So we, so that's, that's the very old fashion way now inside Robo Flow, like, uh, before Sam, we did have this thing called Smart Poly.[01:01:58] Uh, and this will still be, still be available for, for users to use. And so if like, I'm, I'm labeling the weld area, I'd go like this. And you know, the first click I'll, I'll narrate a little bit for, for swyx, I clicked on the welded joint. And it got the welded joint, but also includes lots of irrelevant[01:02:12] area, the rest of the, the bottom pipe and then, and the parts on the right.[01:02:15] What is that picking up? Is it picking up on like just the color or is[01:02:17] it like Yeah, this specific model probably wasn't pre-trained on images of welds and pipes and so it just doesn't have a great concept. Yeah. Of what region starts and stop. Now to be clear, I'm not sol here, like part of, part of the thing with robo, I can go say, I can add positive and negative points, so I can say, no, I didn't, I didn't want this part.[01:02:33] Yeah. And so I said I don't want that bottom part of the pipe little better, and I still don't want the bottom part of the pipe. Okay. That's almost, almost there.[01:02:41] There's a lot of space on either side of the weld. Okay. All right.[01:02:43] That's better. So, so four clicks we got, we got our way to, to, you know, the, the weld here.[01:02:48] Yeah. Um, now with Sam. And so we're gonna do the same thing. I'm going to label the weld portion with a single click. It understands the context of, of that, that, that weld. Uh, I was labeling fish, so I thought I was working on fish. So that's like one Okay, that's, that's great. Of like a, a before and after.[01:03:06] But let's talk about maybe some of the other, Examples of things that I might wanna work on. I came with some fun examples. Let's do, um, so I've got this image of two kids playing when I was holding a balloon in the background. There's like a brick wall. The lighting's not great. Yeah, lighting's not fantastic, but um, you know, we can clearly make out what's going on.[01:03:25] So I'm going to click the, uh, the brick wall in the background. Sam immediately labels both sides of the brick wall, even though there is a pole separating view between the left portion of the brick wall and the right portion of the brick wall. So I can just say like, I dunno, I'll just say thing for ease.[01:03:44] Or let's say I wanna do this guy's shoe, and I'm like, actually, you know what, no, I don't want the shoe, I want the whole, uh, person so I can That's two clicks. Two clicks, and Sam immediately got it. Maybe I wanna be even more really precise and get that portion there and miss face a little bit. So we click the face and that's another thing.[01:04:02] Or let's jump to maybe this one's very[01:04:05] fun. Okay, so there's a blue, a chihuahua with a bunch of[01:04:08] balloons. Yeah. So here, let's say like I wanted to do, uh, maybe I just wanted do like the eyes, right? Uhhuh. So I'll click like the left[01:04:15] eye that makes the whole chihuahua light[01:04:17] up so it gets the whole chihuahua.[01:04:19] Now here's where interactivity with models and kind of like a new UX paradigm for interaction with models make some sense. I'm gonna say, okay, I wanted that left eye. I don't want the, like the rest of the dog. Rest of the dog. So I'm gonna say no on this part of the dog. Then I'm gonna go say I go straight to the eye.[01:04:32] Yeah. Yep. I'm gonna say yes on the other eye. Uhhuh boom. Right now you got both eyes. I got both eyes and nothing else. And I could do the same thing with the ear. So I could say like, I want the ear and I click the right ear and it gets the whole again, the whole dog head. But I could say, no, I don't want the dog head.[01:04:46] And it boom recognizes that I want only the right ear. So can[01:04:49] I[01:04:49] ask about, so obviously this is super impressive. Can I ask like, is there a way to generalize this work? Like, I did this work for one image. Can I take a another image of a, the same chihuahua and just say, do that. The, um,[01:05:02] reapply what I did to some degree.[01:05:04] There's a few ways we could do that. The, probably the simplest way is actually going back to what we were talking about where you label a few examples and then you create your own kind of mini model that understands exactly what you're after. Yeah. And then you have that mini model finish the work for you.[01:05:18] And you just do that within robot flow. You just do that within Rob flow? Of course. Yeah. So like, I've got like, so I label, I label a bunch of my images after I've got, you know, we'll say like 10 of them labeled, then I'll kick off, you know, my own custom model. And the nice thing is that like right, I'm building my own ip.[01:05:34] And that's one of the big things that like I'm pretty excited about with, uh, Motomod modality and especially with GBT and some of these things, is that like I can take what these massive models understand. This is a generalist way of saying distill, but I can distill them down into a different architecture that captures that portion of the world.[01:05:54] And use that model for, let's say in this context, I've got an image up of, uh, men kind of in front of a pier and they've got aprons on. I can build my own apron detector. Again, this is sort of like in some context, like if I wanna build a task specific model and, and Sam knows everything that it knows, I can either go the route of trying to use Sam zero shot plus another model to label the, the, the mask images that might be limiting cuz of just the compute intensity that Sam requires to run and, you know, maybe wanna build some of my own IP and make use of some of my own data.[01:06:24] But these are kinda the two routes that I think we'll see continue to evolve. And I can use text prompting with Grounding DINO plus Sam to get a sense of which portions of the image I care about. And then I'm probably gonna need to do a little bit of QA of, of that. But, Like the dataset prep process and the biggest inhibitor to creating your own value in IP just got so much simpler.[01:06:49] And I think that, um, I think we're the first ones to go live with this, so that's, yeah, I'm, I'm very thrilled about that. We're recording[01:06:54] this earlier, but it's, uh, when, when this podcast drops, it'll be live. Uh, hopefully, you know, if everything goes well, I'll coordinate with you. So, so, so it will be live?[01:07:02] No, it will, it will, it will be live, yes. Yes, yes. Uh, and people can go try it out. Exactly. I guess it'll be just be part of the Rofo platform and I, I, I assume I'll, I'll add a, a blog post to it. Anything else on just, uh, so we're, we're about to zoom out from Sam and computer vision to Easter general AI takes, but, uh, anything else in terms of like future projections of, of the, of what happens next in, in computer vision segmentation or anything in that, in that,[01:07:27] Future Predictions[01:07:27] As you were describing earlier, Sam right now only produces masks.[01:07:30] It can't be text steer to give the context of those masks that's gonna happen in a single architecture without chaining together a couple different architectures. That's, that's for sure. The second thing is, um, multimodality generally will allow us to add more context to the things that we're seeing and doing.[01:07:45] And I'm sure we'll probably talk about this in a moment, but like, that's maybe a good segue into like GPT4 Yeah. And GPT4's capabilities, what we expect, how we're excited about it, the ways that we're already using some of GPT4, and really gonna lean into the capabilities that unlocks from, from imagery and, and a visual prep perspective.[01:08:04] GPT4 Multimodality[01:08:04] Let's go into that. Great. I was watching that keynote on GPT4. I was blown away. What were your reactions as a computer vision company?[01:08:13] Similar. Similar, yeah. Apparently. Um, so Greg Brockman did that demo where he said, make a joke generator website. Apparently that was totally ad hoc, like that. Didn't practiced that at all.[01:08:22] Which, what? Yeah, he just gave it a go. Yeah. I, I think that like the. Generation of code from imagery. I think that like screenshot of a website to rack components within six months. I think stuff like that will be imminently possible, doable and just unlock all kinds of potential.[01:08:38] And then did you see the second one with the Discord screenshot that they posted in?[01:08:42] It was a very quick part of the demo, so a lot of people missed it. But essentially what Logan from opening I did was screenshotted, uh, the Discord screen he was on and then pasted it into the discord that had GPT4 read it and it was able to read every word on it. Yes.[01:08:57] I think OCR is a solved problem[01:08:59] in a large language model as opposed to like a dedicated OCR R model.[01:09:03] Yes. Isn't that that that's, we've[01:09:05] never seen that. That's right. Yeah. And I think OCR like is actually a perfect candidate for like multimodality, right, because it's literally photos of text. Yeah. Yeah. And there's already gonna be like ample training data from all the work that's been done on creating prior OCR models.[01:09:20] Right. But yeah, I think that they probably are about to release the world's best. OCR model. Full stop. Yeah. Well,[01:09:27] Remaining Hard Problems[01:09:27] so I think those were like, kind of what they wanted to show on the demo. I, you know, it's, it's news to me that the, the drawing was impromptu. What's a really hard challenge that you wanna try on GT four once you get access to it, what are you going run[01:09:38] it on?[01:09:39] So, the way I think about like, advances in computer vision and what, uh, capabilities get unlocked, where there's still gonna be problems in ensuring that we're building tooling that really unblocks people. I think that, like if you think about the types of use cases that a model already knows without any training, I think about like a bell curve distribution.[01:09:58] Where in the fat center of the curve you have, uh, what historically has been like the cocoa dataset, common objects and context, a 2014 release from Microsoft, 80 classes, things like chair, silverware, food, car. They say sports ball for all. Sports ball. Did they really? Yeah. In the dataset. Yeah.[01:10:16] That's a, that's hilarious.[01:10:18] Oh[01:10:18] my God. So, yeah. And so you've got like all these, I mean, I, I get why they do that. It's like a capture for all sports. Um, but the point is, like in the fat center, you have these things, these, these objects that are as common as possible. And I think that, and then go to the exact, like long tails of this distribution and the very, very like edge of the tails you have.[01:10:38] Data and problems that are not common or regularly seen, the prevalence of that image may be existing on the web is maybe one way to think about this. And that's where you have like maybe a manufacturer that makes their own good that no one else makes, or a logistics company that knows what their stuff were supposed to look like or maybe your specific house looks like a very notable way or a pattern or, or something like this.[01:10:59] And of course, all these problems depend on like what exactly you want to do, but there will be places where there's just proprietary information that doesn't exist on the web basically. And, um, I think of that like what's happening in vision is that fat middle is steadily expanding outward. The models that are trained on cocoa, you know, do better and better and better on like, making that middle sliver really, really confident.[01:11:23] And then models like clip, which, you know, two years ago, the first kind of multimodality approach, which robos already power like we already have clip powered search and robo and have for over a year. Which, you know, links text and images in a way we haven't seen before it. And that basically increases the generalizability of what models can see.[01:11:45] I think G p D four expands that even further, where like, you get like, even further into like, those, those long, long tails. I don't think that like completely, like, I don't think that like, we'll, like never train again, so to speak. That's kinda like my, my mental model of what's happening, what's gonna continue to happen.[01:11:59] Now that still creates emergent problems for developers. That still creates problems like, like we were talking about earlier. Even if, you know, I have a model that knows everything in the world, that model might be a not mine or it might be a model that I can't run where I need to run it. Uh, maybe a place without internet, maybe a place on the edge, maybe a place that's compute constrained.[01:12:16] So I might need to do like some distilling down. I might have data that's truly proprietary that's like not present on the web. So like I can't rely on this model. I might have a task type that these G B D four and multimodal models are extremely good at visual question answering. And I think they'll be able to describe images in kinda like a freeform text way.[01:12:34] But you're still gonna come, maybe need to massage that text into something useful and, and insightful and, and to be, to be understood. And maybe that's a place where you're like, you know, use like lang chain and things to like, uh, figure out what's going on from, from the candidates descriptions of, of text.[01:12:48] And so there's still gonna be a healthy set of problems to making this stuff be, be usable, but ways that we're thinking about at Roble that I'm very excited about. So we already used GPT4 to do like dataset description with, to be clear, just the text only. Just the text only? Yeah, just the text only.[01:13:02] We're, we're fortunate like Greg and, and Sam back us. Um, uh, but personally, personally,[01:13:06] Sam as in Altman, Sam, not the, yeah, not the model Sam, because the mo the model could be smart enough to[01:13:11] back you. I don't know. That's been a funny confusion this last week. You know? Which, which Sam, which Sam are you talking about?[01:13:15] You were talking a lot about Sam does. So, but, but we don't have, um, visual access to be clear. Text only GPT4 to do dataset description, basically passing it what we already know, like we have, Hey, I have a computer vision model with like these sorts of classes or things like this, and gimme a dataset description that enriches, enriches my dataset.[01:13:31] And then we also of course have like GPT4 powered support, like a lot of folks do of like, uh, we ingested, uh, the 480 blogs and the Ripple blog, the 120 YouTube videos, 280, the you guys, the uh, dozens of open source projects and every page in our. Uh, and our help center. And then we ingested that and now we have a GPT4 powered bot that can generate not only like code snippets, just like GPT4 can do really well, but regurgitate and site and point you to the resources across Robo Flow.[01:13:57] Ask Roboflow (2019)[01:13:57] Shout out to the og uh, robo fans. You are the first to have your own bot, which is Ask Robo Flow. I saw this at Hack News. I was like, wait, this is a harbinger of things to come. And uh,[01:14:06] in 2019, this is where the name road flow comes from. Really? We, we, yes. I was[01:14:10] thinking there's nothing imaging in your, in your, uh, description or your[01:14:13] name.[01:14:14] Yeah. Yeah. Cuz I mean, I think that, um, to build, to build a hundred years end durable company, you can't just be one thing. You gotta, you gotta do everything. You gotta, you gotta be Microsoft anyway, so, yeah, yeah, yeah. One of the first things we were doing with, um, AI in 2019 was we realized Stack Overflow is extremely valuable resource, but it's only in English and programmers come from all around the world.[01:14:33] So logically programmers are gonna be speaking various languages to wanna understand and debug their programs. So we said, with these advances in N L P, don't you think that we could translate Stack Overflow? To every single other language and provide a really useful localized stack overflow. And so we started working on that.[01:14:47] We called it Stack Robo Flow. And then, um, Josh, the founder of, uh, delicious, if you remember that, that site. Mm-hmm. Mm-hmm. He Shawn Pardo, he's like, drop, drop the stack. It's cleaner. Just, just make it be robo Flow. It's a great story.[01:14:59] Oh, love the story behind names. And[01:15:00] from from then on, it's just been, uh, Rob Flow.[01:15:02] Yeah, yeah. Um, which is, you know, been a useful name and it's, and it's stuck. But yeah, like we, I mean actually Stack Rob. Dot com is still up and you can like ask it questions. It's not nearly as good, of course. It's like it's before LLMs. Like it's, uh, but uh, yeah, ask Rob Flow was the very first, you know, programmer completion sort of, sort of guide.[01:15:21] So we've been really excited that, um, others have picked up and done a much better job with that than what we were doing.[01:15:26] How to keep up in AI[01:15:26] Yeah. You have a really sort of hacker mentality, which I love. Uh, obviously you at, at the various hack hackathons in San Francisco. Uh, and maybe we can close out with that. I know we've been running long, so, uh, I'm just gonna zoom out a little bit into the broader sort of personal or meta question about how do you keep up with ai, right?[01:15:41] Like you, you're econ grad, you went into data science, very common path. I I had a similar path as well, and I'm going down this AI journey, um, about six, seven years after you. How do you recommend people keep[01:15:51] up? The way that I do is ingest sources from probably similar places that others do of whether it's the research community is quite active on, on Twitter.[01:15:59] Regularly seen papers linked on, on archived people will be in communities, various discords or even inside the robo flow Slack. People will share papers and things that are, um, meaningful and interesting. But that's just like one part is like ingestion. Yes. Getting ingestion from friends, having like engaged in conversations and just kind of being eyes wide open to various things.[01:16:18] The second part is production. Yeah. And we can kinda like read some tweets and see some demos, but for me when Robo Flow, when Brad and I, uh, were just working on stuff very early, one of the pioneer goals that we had was published three blogs and two YouTube videos per week. And we did that for seven months.[01:16:33] So I was just nonstop producing content and that wasn't just like writing a blog. It'd usually be like, Um, you know, you, you do a blog sometimes, or you do like a, a co-lab notebook, training tutorial, or the point is you're basically like naturally re-implementing the papers and things that you're reading and as you mention you out of[01:16:49] ideas.[01:16:50] Anyway. Yeah. Gotta do something.[01:16:53] I mean, and as you mentioned, I spent some time teaching data science work Yeah. Journal assembly and actually taught a bit about gw and I really became a subscriber to the belief that if you can't describe something simply, then you probably don't understand, don't know it yourself.[01:17:05] Yeah. And so being forced to, to produce things and then Yeah. You mentioned like hackathons, like I still, still have a good hackathon, whether that's internal to our team or inside the outside in the community. And I really look up to folks like, I mean, I'm sure you've probably come across like, uh, you, you recently mentioned that you, you'd spent some time with like the notion founders and you know, they're insanely Yeah.[01:17:22] Curious and you would've. Idea of the stature of, of the business. And I think that that's like an incredibly strong ethos to, to[01:17:30] have, they're billionaires and they're having lunch with me to ask what I think[01:17:34] about I, well, yeah, I mean, I think you have an incredibly good view of what's next and what's coming up and uh, a different purview.[01:17:41] But that's exactly what I mean. Right. Like engage in other folks and legitimately asking them and wanting to glean and, and be curious. Like, I dunno, like I think about someone like Jeff Dean who made map produce and also introduced one of the first versions of TensorFlow. Yeah. Like, he just has to be so innately curious to, I don't even know if it's, if it's called reinventing yourselves at that.[01:18:00] By that time, if you've just like been. Uh, so on the, the cutting edge, but it's not like I think about like someone considering themselves, quote unquote an expert in like TensorFlow or a framework or whatever, and it's like everyone is learning. Some people are just like further ahead on their journey and you can actually catch up pretty quickly with some strong, some strong effort.[01:18:18] So I think that that's a lot of it is like being, is there's just as much the mentality as there is, like the, the resources and then like the, the production. And I mean, you kinda mentioned before we started recording like, oh, you're like the expert on these, these sorts of things. And I don't even think that that's, uh, I spend more time thinking about them than a lot of people, but there's still a ton to ingest and work on and change and improve.[01:18:41] And I think that that's actually a pretty big opportunity for, uh, young companies especially that have a, a habit of being able to move quickly and really focus on like unlocking user value rather than most other things.[01:18:53] Well, that's a perfect way to end things. Uh, thank you for being my and many other people's first introduction to computer vision in the state of the art.[01:19:01] Uh, I'm sure we'll have you back for, you know, whatever else comes, uh, along. But you are literally the perfect guest to talk segment anything, and it was by far the hottest this topic of discussion this past week. So thanks for, uh, taking the[01:19:12] time. I had a ton of fun. Thanks for having me. All right. Thank you. Get full access to Latent Space at www.latent.space/subscribe
undefined
Apr 7, 2023 • 51min

AI Fundamentals: Benchmarks 101

We’re trying a new format, inspired by Acquired.fm! No guests, no news, just highly prepared, in-depth conversation on one topic that will level up your understanding. We aren’t experts, we are learning in public. Please let us know what we got wrong and what you think of this new format!When you ask someone to break down the basic ingredients of a Large Language Model, you’ll often hear a few things: You need lots of data. You need lots of compute. You need models with billions of parameters. Trust the Bitter Lesson, more more more, scale is all you need. Right?Nobody ever mentions the subtle influence of great benchmarking.LLM Benchmarks mark our progress in building artificial intelligences, progressing from * knowing what words go with others (1985 WordNet)* recognizing names and entities (2004 Enron Emails) * and image of numbers, letters, and clothes (1998-2017 MNIST)* language translation (2002 BLEU → 2020 XTREME)* more and more images (2009 ImageNet, CIFAR)* reasoning in sentences (2016 LAMBADA) and paragraphs (2019 AI2RC, DROP)* stringing together whole sentences (2018 GLUE and SuperGLUE)* question answering (2019 CoQA)* having common sense (2018 Swag and HellaSwag, 2019 WinoGrande)* knowledge of all human tasks and professional exams (2021 MMLU)* knowing everything (2022 BIG-Bench)People who make benchmarks are the unsung heroes of LLM research, because they dream up ever harder tests that last ever shorter periods of time.In our first AI Fundamentals episode, we take a trek through history to try to explain what we have learned about LLM Benchmarking, and what issues we have discovered with them. There are way, way too many links and references to include in this email. You can follow along the work we did for our show prep in this podcast’s accompanying repo, with all papers and selected tests pulled out.Enjoy and please let us know what other fundamentals topics you’d like us to cover!Timestamps* [00:00:21] Benchmarking Questions* [00:03:08] Why AI Benchmarks matter* [00:06:02] Introducing Benchmark Metrics* [00:08:14] Benchmarking Methodology* [00:09:45] 1985-1989: WordNet and Entailment* [00:12:44] 1998-2004 Enron Emails and MNIST* [00:14:35] 2009-14: ImageNet, CIFAR and the AlexNet Moment for Deep Learning* [00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference* [00:23:21] 2018-19: Swag and HellaSwag - Common Sense Inference* [00:26:07] Aside: How to Design Benchmarks* [00:26:51] 2021: MMLU - Human level Professional Knowledge* [00:29:39] 2021: HumanEval - Code Generation* [00:31:51] 2020: XTREME - Multilingual Benchmarks* [00:35:14] 2022: BIG-Bench - The Biggest of the Benches* [00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results* [00:38:25] Issue: GPT4 vs the mystery of the AMC10/12* [00:40:28] Issue: Data Contamination* [00:42:13] Other Issues: Benchmark Data Quality and the Iris data set* [00:45:44] Tradeoffs of Latency, Inference Cost, Throughput* [00:49:45] ConclusionTranscript[00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO and residence at Decibel Partners, and I'm joined by my co-host, swyx writer and editor of Latent Space.[00:00:21] Benchmarking Questions[00:00:21] Up until today, we never verified that we're actually humans to you guys. So we'd have one good thing to do today would be run ourselves through some AI benchmarks and see if we are humans.[00:00:31] Indeed. So, since I got you here, Sean, I'll start with one of the classic benchmark questions, which is what movie does this emoji describe? The emoji set is little Kid Bluefish yellow, bluefish orange Puffer fish. One movie does that. I think if you added an octopus, it would be slightly easier. But I prepped this question so I know it's finding Nemo.[00:00:57] You are so far a human. Second one of these emoji questions instead, depicts a superhero man, a superwoman, three little kids, one of them, which is a toddler. So you got this one too? Yeah. It's one of my favorite movies ever. It's the Incredibles. Uh, second one was kind of a letdown, but the first is a.[00:01:17] Awesome. Okay, I'm gonna ramp it up a little bit. So let's ask something that involves a little bit of world knowledge. So when you drop a ball from rest, it accelerates downward at 9.8 meters per second if you throw it downward instead, assuming no air resistance, so you're throwing it down instead of dropping it, it's acceleration immediately after leaving your hand is a 9.8 meters per second.[00:01:38] B, more than 9.8 meters per second. C less than 9.8 meters per second. D cannot say unless the speed of the throw is. I would say B, you know, I started as a physics major and then I changed, but I think I, I got enough from my first year. That is B Yeah. Even proven that you're human cuz you got it wrong.[00:01:56] Whereas the AI got it right is 9.8 meters per second. The gravitational constant, uh, because you are no longer accelerating after you leave the hand. The question says if you throw it downward after leaving your hand, what is the. It is, it goes back to the gravitational constant, which is 9.8 meters per, I thought you said you were a physics major.[00:02:17] That's why I changed. So I'm a human. I'm a human. You're human. You're human. But you, you got them all right. So I can't ramp it up. I can't ramp it up. So, Assuming, uh, the AI got all of that right, you would think that AI will get this one wrong. Mm-hmm. Because it's just predicting the next token, right?[00:02:31] Right. In the complex Z plane, the set of points satisfying the equation. Z squared equals modulars. Z squared is A, a pair points B circle, C, a half line D, online D square. The processing is, this is going on in your head. You got minus three. A line. This is hard. Yes, that is. That is a line. Okay. What's funny is that I think if, if an AI was doing this, it would take the same exact amount of time to answer this as it would every single other word.[00:03:05] Cuz it's computationally the same to them. Right.[00:03:08] Why AI Benchmarks matter[00:03:08] Um, so anyway, if you haven't caught on today, we're doing our first, uh, AI fundamentals episode, which just the two of us, no guess because we wanted to go deep on one topic and the topic. AI benchmarks. So why are we focusing on AI benchmarks? So, GPT4 just came out last week and every time a new model comes out, All we hear about is it's so much better than the previous model on benchmark X, on benchmark Y.[00:03:33] It performs better on this, better on that. But most people don't actually know what actually goes on under these benchmarks. So we thought it would be helpful for people to put these things in context. And also benchmarks evolved. Like the more the models improve, the harder the benchmarks get. Like I couldn't even get one of the questions right.[00:03:52] So obviously they're working and you'll see that. From the 1990s where some of the first ones came out to day, the, the difficulty of them is truly skyrocketed. So we wanna give a, a brief history of that and leave you with a mental model on, okay, what does it really mean to do well at X benchmark versus Y benchmark?[00:04:13] Um, so excited to add that in. I would also say when you ask people what are the ingredients going into a large language model, they'll talk to you about the data. They'll talk to you about the neural nets, they'll talk to you about the amount of compute, you know, how many GPUs are getting burned based on this.[00:04:30] They never talk to you about the benchmarks. And it's actually a shame because they're so influential. Like that is the entirety of how we judge whether a language model is better than the other. Cuz a language model can do anything out of. Potentially infinite capabilities. How do you judge one model versus another?[00:04:48] How do you know you're getting better? And so I think it's an area of intense specialization. Also, I think when. Individuals like us, you know, we sort of play with the language models. We are basically doing benchmarks. We're saying, look, it's, it's doing this awesome thing that I found. Guess what? There have been academics studying this for 20 years who have, uh, developed a science to this, and we can actually benefit from studying what they have done.[00:05:10] Yep. And obviously the benchmarks also drive research, you know, in a way whenever you're working on, in a new model. Yeah. The benchmark kind of constraints what you're optimizing for in a way. Because if you've read a paper and it performs worse than all the other models, like you're not gonna publish it.[00:05:27] Yeah. So in a way, there's bias in the benchmark itself. Yeah. Yeah. We'll talk a little bit about that. Right. Are we optimizing for the right things when we over-optimize for a single benchmark over over some others? And also curiously, when GPT4 was released, they emitted some very. Commonplace industry benchmarks.[00:05:44] So the way that you present yourself, it is a form of marketing. It is a form of trying to say you're better than something else. And, and trying to explain where you think you, you do better. But it's very hard to verify as well because there are certain problems with reproducing benchmarks, uh, especially when you come to large language models.[00:06:02] Introducing Benchmark Metrics[00:06:02] So where do we go from here? Should we go over the, the major concept? Yeah. When it comes to benchmark metrics, we get three main measures. Accuracy, precision, recall accuracy is just looking at how many successful prediction the model does. Precision is the ratio of true positives, meaning how many of them are good compared to the overall amount of predictions made Versus recall is what proportion of the positives were identified.[00:06:31] So if you think. Spotify playlist to maybe make it a little more approachable, precision is looking. How many songs in a Spotify playlist did you like versus recall is looking at of all the Spotify songs that you like in the word, how many of them were put in the in the playlist? So it's more looking at how many of the true positives can you actually bring into the model versus like more focusing on just being right.[00:06:57] And the two things are precision and recall are usually in tension.. If you're looking for a higher position, you wanna have a higher percentage of correct results. You're usually bringing recall down because you lead to kind of like lower response sets, you know, so there's always trade offs. And this is a big part of the benchmarking too.[00:07:20] You know, what do you wanna optimize for? And most benchmarks use this, um, F1 score, which is the harmonic mean of precision and recall. Which is, you know, we'll put it in the show notes, but just like two times, like the, you know, precision Times Recall divided by the sum. So that's one. And then you get the Stanford Helm metrics.[00:07:38] Um, yeah, so ultimately I think we have advanced a lot in the, in the past few decades on how we measure language models. And the most interesting one came out January of this year from Percy Lang's research lab at Stanford, and he's got. A few metrics, accuracy, calibration, robustness, fairness, efficiency, general information bias and toxicity, and caring that your language models are not toxic and not biased.[00:08:03] So is is, mm-hmm. Kind of a new thing because we have solved the other stuff, therefore we get to care about the toxic of, uh, the language models yelling at us.[00:08:14] Benchmarking Methodology[00:08:14] But yeah, I mean, maybe we can also talk about the other forms of how their be. Yeah, there's three main modes. You can need a benchmark model in a zero shot fashion, few shot or fine tune models, zero shots.[00:08:27] You do not provide any example and you're just testing how good the model is at generalizing few shots, you have a couple examples that you provide and then. You see from there how good the model is. These are the number of examples usually represented with a K, so you might see few shots, K equal five, it means five examples were passed, and then fine tune is you actually take a bunch of data and fine tune the model for that specific task, and then you test it.[00:08:55] These all go from the least amount of work required to the most amount of work required. If you're doing zero shots benchmarking, you do not need to have any data, so you can just take 'em out and do. If you're fine tuning it, you actually need a lot of data and a lot of compute time. You're expecting to see much better results from there.[00:09:14] Yeah. And sometimes the number of shots can go up to like a hundred, which is pretty surprising for me to see that people are willing to test these language models that far. But why not? You just run the computer a little bit longer. Yeah. Uh, what's next? Should we go into history and then benchmarks? Yeah.[00:09:29] History of Benchmarking since 1985[00:09:29] Okay, so I was up all night yesterday. I was like, this is a fascinating topic. And I was like, all right, I'll just do whatever's in the G PT three paper. And then I read those papers and they all cited previous papers, and I went back and back and back all the way to 1985. The very first benchmark that I can find.[00:09:45] 1985-1989: WordNet and Entailment[00:09:45] Which is WordNet, which is uh, an English benchmark created in at Princeton University by George Miller and Christian Fellbaum. Uh, so fun fact, Chris George Miller also authored the paper, the Magical Number seven plus Minus two, which is the observation that people have a short term memory of about seven for things.[00:10:04] If you have plus or minus two of seven, that's about all you can sort of remember in the short term, and I just wanted. Say like, this was before computers, right? 1985. This was before any of these personal computers were around. I just wanna give people a sense of how much work manual work was being done by these people.[00:10:22] The database, uh, WordNet. Sorry. The WordNet database contains 155,000 words organized in 175,000 sys. These sys are basically just pairings of nouns and verbs and adjectives and adverbs that go together. So in other words, for example, if you have nouns that are hyper names, if every X is a, is a kind of Y.[00:10:44] So a canine is a hyper name of a dog. It's a holo. If X is a part of Y, so a building is a hollow name of a window. The most interesting one for in terms of formal, uh, linguistic logic is entailment, which captures the relationship between two words, where the verb Y is entailed by X. So if by doing X, you must be doing Y.[00:11:02] So in other words, two, sleep is entailed by two snore because you cannot snore without also sleeping and manually mapping 155,000 words like that, the relationships between all of them in a, in a nested tree, which is. Incredible to me. Mm-hmm. And people just did that on faith. They were like, this will be useful somehow.[00:11:21] Right. Uh, and they were interested in cycle linguistics, like understanding how humans thought, but then it turned out that this was a very good dataset for understanding semantic similarity, right? Mm-hmm. Like if you measure the distance between two words by traversing up and down the graph, you can find how similar to two words are, and therefore, Try to figure out like how close they are and trade a model to, to predict that sentiment analysis.[00:11:42] You can, you can see how far something is from something that is considered a good sentiment or a bad sentiment or machine translation from one language to the other. Uh, they're not 200 word languages, which is just amazing. Like people had to do this without computers. Penn Tree Bank, I was in 1989, I went to Penn, so I always give a shout out to my university.[00:12:01] This one expanded to 4.5 million words of text, which is every uh, wall Street Journal. For three years, hand collected, hand labeled by grad students your tuition dollars at work. So I'm gonna skip forward from the eighties to the nineties. Uh, NYS was the most famous data set that came out of this. So this is the, uh, data set of 60,000.[00:12:25] Training images of, uh, of numbers. And this was the first visual dataset where, uh, people were tr tracking like, you know, handwritten numbers and, and mapping them to digital numbers and seeing what the error rate for them was. Uh, these days I think this can be trained in like e every Hello world for machine learning is just train missed in like four lanes of code.[00:12:44] 1998-2004 Enron Emails and MNIST[00:12:44] Then we have the Enron email data set. Enron failed in 2001. Uh, the emails were released in 2004 and they've been upgraded every, uh, every few years since then. That is 600,000 emails by 150 senior employees of Enron, which is really interesting because these are email people emailing each other back and forth in a very natural.[00:13:01] Context not knowing they're being, they're about to be observed, so you can do things like email classification, email summarization, entity recognition and language modeling, which is super cool. Any thoughts about that be before we go into the two thousands? I think like in a way that kind of puts you back to the bias, you know, in some of these benchmarks, in some of these data sets.[00:13:21] You know, like if your main corpus of benchmarking for entity recognition is a public energy company. Mm-hmm. You know, like if you're building something completely different and you're building a model for that, maybe it'll be worse. You know, you start to see how we started. With kind of like, WordNet is just like human linguistics, you know?[00:13:43] Yes. It's not domain related. And then, um, same with, you know, but now we're starting to get into more and more domain-specific benchmarks and you'll see this increase over time. Yeah. NY itself was very biased towards, um, training on handwritten letter. Uh, and handwritten numbers. So, um, in 2017 they actually extended it to Eist, which is an extended to extension to handwritten letters that seems very natural.[00:14:08] And then 2017, they also had fashion ness, which is a very popular data set, which is images of clothing items pulled from Zando. So you can see the capabilities of computer vision growing from single digit, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, to all the letters of the alphabet. To now we can recognize images, uh, of fashion, clothing items.[00:14:28] So it's pretty. So the big one for deep learning, cuz all of that was just, just the appetizers, just getting started.[00:14:35] 2009-2014 : ImageNet, CIFAR and the AlexNet Moment for Deep Learning[00:14:35] The big one for deep learning was ImageNet, which is where Fafa Lee came into the picture and that's why she's super well known. She started working in 2006 and released it in 2009. Fun fact, she actually met with, uh, Christian Feldbaum, who was, uh, one of the co-authors of, uh, war.[00:14:51] To create ImageNet. So there's a direct lineage from Words to Images. Yeah. And uh, they use Amazon Mechanical Turk to help with classification images. No longer grad students. But again, like I think, uh, this goes, kind of goes back to your observation about bias, like when I am a mechanical Turk worker. And I'm being paid by the image to classify an image.[00:15:10] Do you think I'll be very careful at my job? Right? Yeah. Whereas when I'm a, you know, Enron employee, emailing my, my fellow coworker, trying to just communicate something of, of natural language that is a different type of, uh, environment. Mm-hmm. So it's a pretty interesting benchmark. So it was released in 2009 ish and, you know, people were sort of competing to recognize and classify that properly.[00:15:33] The magic moment for ImageNet came in 2012, uh, which is called the AlexNet moment cuz I think that grad student that, um, created this recognition model was, uh, named Alex, I forget his last name, achieved a error rate of 15%, which is, More than 10% lower than the runner up. So it was used just so much better than the second place that everyone else was like, what are you doing?[00:15:54] Uh, and it turned out that he was, he was the first to use, uh, deep learning, uh, c n n 10 percentage points. So like 15 and the other one was 25. Yeah, exactly. So it was just so much, so much better than the others. It was just unbelievable that no one else was, no other approach was even coming close.[00:16:09] Therefore, everyone from there on out for the next, until today we're just learning the lessons of deep learning because, um, it is so much superior to the other approaches. And this was like a big. Images and visual moment because then you had like a sci-fi 10, which is a, another, like a data set that is mostly images.[00:16:27] Mm-hmm. Focused. Mm-hmm. So it took a little bit before we got back to to text. And nowadays it feels like text, you know, text models are kind of eating the word, you know, we're making the text one multi-model. Yeah. So like we're bringing the images to GBT four instead of the opposite. But yeah, in 2009 we had a, another 60,000 images that set.[00:16:46] 32 by 32. Color images with airplanes, automobiles, like, uh, animals, like all kind of stuff. Like I, I think before we had the numbers, then we had the handwritten letters. Then we had clothing, and then we finally made clothing items came after, oh, clothing items. 2009. Yeah, this is 2009. I skipped, I skipped time a little bit.[00:17:08] Yeah, yeah. But yeah, CFR 10 and CFR 100. CFR 10 was for 10 classes. And that that was chosen. And then obviously they optimized that and they were like, all right, we need a new problem now. So in 20 14, 5 years later, they introduced CFAR 100, which was a hundred classes of other items. And I think this is a very general pattern, which is used.[00:17:25] You create a data set for a specific be. You think it's too hard for machines? Mm-hmm. It lasts for five years before it's no longer too hard for machines, and you have to find a new data set and you have to extend it again. So it's Similarly, we are gonna find that in glue, which is another, which is one of more modern data sets.[00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference[00:17:42] This one came out in 2018. Glue stands for general Language Understanding Evaluation. This is one of the most influential, I think, early. Earlier, um, language model benchmarks, and it has nine tasks. Um, so it has single sentence tasks, similarity and paraphrase tasks and inference tasks. So a single sentence task, uh, would be something like, uh, the Stanford Sentiment Tree Bank, which is a.[00:18:05] Uh, sentences from movie reviews and human annotations of the sentiment, whether it's positive or negative, in a sort of like a four point scale. And your job is to predict the task of a single sentence. This similarity task would involve corpuses, like the Microsoft research paraphrase corpus. So it's a corpus of sentence pairs automatically extracted from online news sources with human annotations for whether or not the sentence is in the para semantically equivalent.[00:18:28] So you just predict true or false and again, Just to call back to the math that we did earlier in this episode, the classes here are imbalance. This data set, for example, is 68% positive. So we report both accuracy and F1 scores. F1 is a more balanced approach because it, it adjusts for, uh, imbalanced, um, data sets.[00:18:48] Mm-hmm. Yeah. And then finally, inference. Inference is the one where we really start to have some kind of logic. So for example, the M N L I. Um, actually I'm, I'm gonna focus on squad, the Stanford questioning question answering dataset. It's another data set of pairs, uh, questions, uh, uh, p question paragraphs, pairs.[00:19:04] So where one of the sentences of the paragraph drawn from Wikipedia contains the answer to the corresponding question, we convert the task into a sentence, para classification by forming a pair between each question in each sentence into corresponding context and filtering out pairs of low overlap. So basically annotating whether or not.[00:19:20] Is the answer to the question inside of this paragraph that I pulled. Can you identify that? And again, like Entailment is kind of included inside of each of these inference tasks because it starts to force the language model to understand whether or not one thing implies the other thing. Mm-hmm. Yeah.[00:19:37] And the, the models evolving. This came out in 2018, lasted one year exactly. One year later, people were like, that's too easy. That's too easy. So in 2019, they actually came out with super. I love how you'll see later with like swag and hella swag. It's like they come up with very good names for these things.[00:19:55] Basically what's super glue dead is stick glue and try and move outside of the single sentence evaluation. So most of the tasks that. Sean was talking about focus on one sentence. Yeah, one sentence, one question. It's pretty straightforward in that way. Superglue kind of at the, so one, it went from single sentence to having some multi sentence and kind of like a context driven thing.[00:20:21] So you might have questions where, The answer is not in the last paragraph that you've read. So it starts to test the, the context window on this model. Some of them are more, in order to know the answer, you need to know what's not in the question kind of thing. So like you may say, Hey, this drink is owned by the Coca-Cola company.[00:20:43] Is this a Pepsi product? You know, so you need to make the connection false. Exactly, yeah. Then you have also like, um, embedded clauses. So you have things that are not exactly said, have to be inferred, and like a lot of this stack is very conversational. So some of the example contain a lot of the, um, um, you know, or this question's very hard to read out.[00:21:07] Yeah, I know. It's like, it sounds like you are saying, um, but no, you're actually, you're actually. And yet I hope to see employer base, you know, helping out child, um, care centers at the place of employment, things like that, that will help out. It's kind of hard to even read it. And then the hypothesis is like they're setting a trend.[00:21:27] It's going from something very simple like a big p d extract to something that is more similar to how humans communicate. Transcripts, like audio transcripts. Exactly. Of how people talk. Yeah. And some of them are also, Plausibility. You know, like most of these models have started to get good at understanding like a clear cause, kind of like a.[00:21:48] You know, cause effect things. But some of the plausible ones are like, for example, this one is a copa. They're called choice of plausible alternatives. The premises, my body cast a shadow over the grass. What's the cost for this alternative? One, the sun was rising. Alternative to the grass was cut.[00:22:07] Obviously it's the sun was rising, but nowhere. In the question we're actually mentioning the sun, uh, we are mentioning the grass. So some models, some of the older models might see the grass and make the connection that the grass is part of the reason, but the models start to get better and better and go from simply looking at the single sentence context to a more of a, a word new, uh, word knowledge.[00:22:27] It's just really impressive, like the fact that. We can expect that out of a model. It still blows my mind. I think we should not take it for granted that when we're evaluating models, we're asking questions like this that is not obvious from just the given text itself. Mm-hmm. So it, it is just coming with a memorized view of the world, uh, or, or world knowledge. And it understands the premise on, on some form. It is not just random noise. Yeah, I know. It's really impressive. This one, I actually wanted multi rc I actually wanted to spring on you as a, as a test, but it's just too long to read. It's just like a very long logic question.[00:23:03] And then it'll ask you to do, uh, comprehension. But uh, yeah, we'll just, we'll just kinda skip that. We'll put it, we'll put it in the show notes, and then you have to prove us that you're a human. Send us the answer exactly. Exactly and subscribe to the podcast. So superglue was a lot harder, and I think also was superseded eventually, pretty soon.[00:23:21] 2018-2019: Swag and HellaSwag - Common Sense Inference[00:23:21] And, uh, yeah, then we started coming onto the more recent cohort of tests. I don't know how to introduce the rest. Uh, there, there are just so many tests here that I, I struggle a little bit picking from these. Uh, but perhaps we can talk about swag and heli swyx since you mentioned it. Yeah. So SWAG stands for situations with Adversarial Generations.[00:23:39] Uh, also came out in 2018, but this guy, zes Etal, likes to name his data sets and his benchmarks in a very memorable way. And if you look at the PDF of the paper, he also has a little icon, uh, image icon for swag. And he doesn't just go by, uh, regular language. So he definitely has a little bit of branding to this and it's.[00:24:00] Part. So I'll give you an example of the kind of problems that swyx poses. Uh, it it is focused on common sense inference. So what's common sense inference? So, for example, given a partial description, like she opened the hood of the car, humans can reason about the situation and anticipate what might come next.[00:24:16] Then she examined the engine. So you're supposed to pick based on what happened in the first part. What is most likely to happen in the second part based on the, uh, multiple choice question, right? Another example would be on stage, a woman takes a seat at the piano. She a, sits on a bench as her sister plays at the doll.[00:24:33] B. Smiles with someone as the music play. C is in the crowd watching the dancers. D nervously set her fingers on the keys, so A, B, C, or D. It's not all of them are plausible. When you look at the rules of English, we're we've, we're not even checking for whether or not produces or predicts grammatical English.[00:24:54] We're checking for whether the language model can correctly pick what is most likely given the context. The only information that you're given is on stage. A woman takes a seat at the piano, what is she most likely to do next? And D makes sense. It's arguable obviously. Sometimes it could be a. In common sense, it's D.[00:25:11] Mm-hmm. So we're training these models to have common. Yeah, which most humans don't have. So it's a, it's already a step up. Obviously that only lasted a year. Uh, and hello, SWAG was no longer, was no longer challenging in 2019, and they started extending it quite a lot more, a lot more questions. I, I forget what, how many questions?[00:25:33] Um, so Swag was a, swag was a data set. A hundred thousand multiple choice questions. Um, and, and part of the innovation of swag was really that you're generating these questions rather than manually coming up with them. Mm-hmm. And we're starting to get into not just big data, but big questions and big benchmarks of the, of the questions.[00:25:51] That's where the adversarial generations come in, but how that swag. Starts pulling in from real world questions and, and data sets like, uh, wikiHow and activity net. And it's just really, you know, an extension of that. I couldn't even add examples just cuz there's so many. But just to give you an idea of, uh, the progress over time.[00:26:07] Aside: How to Design Benchmarks[00:26:07] Most of these benchmarks are, when they're released, they set. Benchmark at a level where if you just randomly guessed all of the questions, you'll get a 25%. That's sort of the, the baseline. And then you can run each of the language models on them, and then you can run, uh, human evaluations on them. You can have median evaluations, and then you have, um, expert evaluations of humans.[00:26:28] So the randoms level was, uh, for halla. swyx was 20. GT one, uh, which is the, uh, 2019 version that got a 41 on the, on the Hello Sue X score. Bert from Google, got 47. Grover, also from Google, got 57 to 75. Roberta from Facebook, got 85 G P T, 3.5, got 85, and then GPT4 got 95 essentially solving hello swag. So this is useless too.[00:26:51] 2021 - MMLU - Human level Professional Knowledge[00:26:51] We need, we need super Hell now's use this. Super hell swyx. I think the most challenging one came from 2021. 2021 was a very, very good year in benchmarking. So it's, we had two major benchmarks that came out. Human eval and M M L U, uh, we'll talk about mm. M L U first, cuz that, that's probably the more, more relevant one.[00:27:08] So M M L U. Stands for measuring mul massive multitask language understanding, just by far the biggest and most comprehensive and most human-like, uh, benchmark that we've had for until 2021. We had a better one in 2022, but we'll talk about that. So it is a test that covers 57 tasks, including elementary, math, US history, computer science law, and more.[00:27:29] So to attain high accuracy on this task, models must possess extensive world knowledge and prop problem solving. Its. Includes practice questions for the GRE test and the U United States, um, m l e, the medical exam as. It also includes questions from the undergrad courses from Oxford, from all the way from elementary high school to college and professional.[00:27:49] So actually the opening question that I gave you for this podcast came from the math test from M M L U, which is when you drop a ball from rest, uh, what happens? And then also the question about the Complex Z plane, uh, but it equally is also asking professional medicine question. So asking a question about thyroid cancer and, uh, asking you to diagnose.[00:28:10] Which of these four options is most likely? And asking a question about microeconomics, again, giving you a, a situation about regulation and monopolies and asking you to choose from a list of four questions. Mm-hmm. Again, random baseline is 25 out of 100 G P T two scores, 32, which is actually pretty impressive.[00:28:26] GT three scores between 43 to 60, depending on the the size. Go. Scores 60, chinchilla scores 67.5, GT 3.5 scores, 70 GPT4 jumps, one in 16 points to 86.4. The author of M M L U, Dan Hendrix, uh, was commenting on GPT4 saying this is essentially solved. He's basically says like, GT 4.5, the, the next incremental improvement on GPT4 should be able to reach expert level human perform.[00:28:53] At which point it is passing simultaneously, passing all the law exams, all the medical exams, all the graduate student exams, every single test from AP history to computer science to. Math to physics, to economics. It's very impressive. Yeah. And now you're seeing, I mean, it's probably unrelated, but Ivy League universities starting to drop the a t as a requirement for getting in.[00:29:16] So yeah. That might be unrelated as well, because, uh, there's a little bit of a culture war there with regards to, uh, the, the inherent bias of the SATs. Yeah. Yeah. But I mean, that's kinda, I mean exactly. That's kinda like what we were talking about before, right? It's. If a model can solve all of these, then like how good is it really?[00:29:33] How good is it as a Exactly. Telling us if a person should get in. It captures it. Captures with just the beginning. Yeah. Right.[00:29:39] 2021: HumanEval - Code Generation[00:29:39] Well, so I think another significant. Benchmark in 2021 was human eval, which is, uh, the first like very notable benchmark for code code generation. Obviously there's a, there's a bunch of research preceding this, but this was the one that really caught my eye because it was simultaneously introduced with Open Eyes Codex, which is the code generation model, the version of G P T that was fine tuned for generating code.[00:30:02] Uh, and that is, Premise of, well, there is the origin or the the language model powering GitHub co-pilot and yeah, now we can write code with language models, just with that, with that benchmark. And it's good too. That's the other thing, I think like this is one where the jump from GT 3.5 to GPT4 was probably the biggest, like GT 3.4 is like 48% on. On this benchmark, GPT4 is 67%. So it's pretty big. Yeah. I think coders should rest a little bit. You know, it's not 90 something, it's, it's still at 67, but just wait two years. You know, if you're a lawyer, if you're a lawyer, you're done. If you're a software engineer, you got, you got a couple more years, so save your money.[00:30:41] Yeah. But the way they test it is also super creative, right? Like, I think maybe people don't understand that actually all of the tests that are given here are very intuitive. Like you. 90% of a function, and then you ask the language model to complete it. And if it completes it like any software engineer would, then you give it a win.[00:31:00] If not, you give it a loss, run that model 164 times, and that is human eval. Yeah. Yeah. And since a lot of our listeners are engineers too, I think the big thing here is, and there was a, a link that we had that I missed, but some of, for example, some of. Coding test questions like it can answer older ones very, very well.[00:31:21] Like it doesn't not answer recent ones at all. So like you see some of like the data leakage from the training, like since it's been trained on the issues, massive data, some of it leaks. So if you're a software engineer, You don't have to worry too much. And hopefully, especially if you're not like in the JavaScript board, like a lot of these frameworks are brand new every year.[00:31:41] You get a lot of new technologies. So there's Oh, there's, oh yeah. Job security. Yes, exactly. Of course. Yeah. You got a new, you have new framework every year so that you have job security. Yeah, exactly. I'll sample, uh, data sets.[00:31:51] 2020 - XTREME - Multilingual Benchmarks[00:31:51] So before we get to big bench, I'll mention a couple more things, which is basically multilingual benchmarks.[00:31:57] Uh, those are basically simple extensions of monolingual benchmarks. I feel like basical. If you can. Accurately predicts the conversion of one word or one part of the word to another part of the word. Uh, you get a score. And, and I think it's, it's fairly intuitive over there. Uh, but I think the, the main benchmarks to know are, um, extreme, which is the, uh, x the x lingual transfer evaluation, the multilingual encoders, and much prefer extreme.[00:32:26] I know, right? Uh, that's why, that's why they have all these, uh, honestly, I think they just wanted the acronym and then they just kinda worked backwards. And then the other one, I can't find it in my notes for, uh, what the other multilingual ones are, but I, I just think it's interesting to always keep in mind like what the other.[00:32:43] Language capabilities are like, one language is basically completely equivalent to another. And I think a lot of AI ethicists or armchair AI ethicists are very angry that, you know, most of the time we optimize for English because obviously that has, there's the most, uh, training corpuses. I really like extreme the work that's being done here, because they took a, a huge amount of effort to make sure they cover, uh, sparse languages like the, the less popular ones.[00:33:06] So they had a lot of, uh, the, the, obviously the, the popular. Uh, the world's top languages. But then they also selected to maximize language diversity in terms of the complete diversity in, uh, human languages like Tamil Telugu, maam, and Sohi and Yoruba from Africa. Mm-hmm. So I just thought like that kind of effort is really commendable cuz uh, that means that the rest of the world can keep up in, in this air race.[00:33:28] Right. And especially on a lot of the more human based things. So I think we talked about this before, where. A lot of Israel movies are more[00:33:36] focused on culture and history and like are said in the past versus a lot of like the Western, did we talk about this on the podcast? No, not on the podcast. We talked and some of the Western one are more focused on the future and kind of like what's to come.[00:33:48] So I feel like when you're, some of the benchmarks that we mentioned before, you know, they have movie reviews as like, uh, one of the. One of the testing things. Yeah. But there's obviously a big cultural difference that it's not always captured when you're just looking at English data. Yeah. So if you ask the a motto, it's like, you know, are people gonna like this movie that I'm writing about the future?[00:34:10] Maybe it's gonna say, yeah, that's a really good idea. Or if I wanna do a movie about the past, it's gonna be like maybe people want to hear about robots. But that wouldn't be the case in, in every country. Well, since you and I speak different languages, I speak Chinese, you speak Italian, I'm sure you've tested the Italian capabilities.[00:34:29] What do you think? I think like as. Italy, it's so much more, um, dialect driven. So it can be, it can be really hard. So what kind of Italian does g PT three speak? Actually Italian, but the reality is most people have like their own, their own like dialect. So it would be really hard for a model to fool. An Italian that it's like somebody from where they are, you know?[00:34:49] Yeah. Like you can actually tell if you're speaking to AI bot in Chinese because they would not use any of the things that human with humans would use because, uh, Chinese humans would use all sorts of replacements for regular Chinese words. Also, I tried one of those like language tutor things mm-hmm.[00:35:06] That people are making and they're just not good Chinese. Not colloquial Chinese, not anything that anyone would say. They would understand you, but they were from, right, right.[00:35:14] 2022: BIG-Bench - The Biggest of the Benches[00:35:14] So, 2022, big bench. This was the biggest of the biggest, of the biggest benchmarks. I think the, the main pattern is really just, Bigger benchmarks rising in opposition to bigger and bigger models.[00:35:27] In order to evaluate these things, we just need to combine more and more and way more tasks, right? Like swag had nine tasks, hello swag had nine more tasks, and then you're, you're just adding and adding and adding and, and just running a battery of tasks all over. Every single model and, uh, trying to evaluate how good they are at each of them.[00:35:43] Big bench was 204 tasks contributed by 442 authors across 132 institutions. The task topics are diverse, drawing from linguistics, childhood development, math, common sense reasoning, biology, physics, social bias, software development, and beyond. I also like the fact that these authors also selected tasks that are not solved by current language models, but also not solvable by memorizing the internet, which is mm-hmm.[00:36:07] Tracking back to a little bit of the issues that we're, we're gonna cover later. Right. Yeah. I think that's, that's super interesting. Like one of, some of the examples would include in the following chess position, find a checkmate, which is, some humans cannot do that. What is the name of the element within a topic number of six?[00:36:22] Uh, that one you can look up, right? By consulting a periodic table. We just expect language models to memorize that. I really like this one cuz it's, uh, it's inherent. It's, uh, something that you can solve.[00:36:32] Identify whether this sentence has an anachronism. So, option one. During the Allied bombardment of the beaches of Iwojima, Ralph spoke loudly into his radio.[00:36:41] And in option two, during the allied bombardment of the beaches of Iwojima, Ralph spoke loudly into his iPhone. And you have to use context of like when iPhone, when Ally bombarding. Mm-hmm. And then sort of do math to like compare one versus the other and realize that okay, this one is the one that's out of place.[00:36:57] And that's asking more and more and more of the language model to do in implicitly, which is actually modeling what we do when we listen to language, which is such a big. Gap. It's such a big advancement from 1985 when we were comparing synonyms. Mm-hmm. Yeah, I know. And it's not that long in the grand scheme of like humanity, you know, like it's 40 years.[00:37:17] It's crazy. It's crazy. So this is a big missing gap in terms of research. Big benches seems like the most comprehensive, uh, set of benchmarks that we have. But it is curiously missing from Gypsy four. Mm-hmm. I don't know. On paper, for code, I only see Gopher two 80. Yeah. On it. Yeah. Yeah. It could be a curious emission because it maybe looks.[00:37:39] Like it didn't do so well.[00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results[00:37:40] Hello, this is Swyx from the editing room sometime in the future. I just wanted to interject that. Uh, we now know why the GPT for benchmark results did not include the big bench. Benchmark, even though that was the state-of-the-art benchmark at the time. And that's because the. Uh, GPC four new the Canary G U I D of the big bench.[00:38:02] Benchmark. Uh, so Canary UID is a random string, two, six[00:38:08] eight six B eight, uh, blah, blah, blah. It's a UID. UID, and it should not be knowable by the language model. And in this case it was therefore they had to exclude big bench and that's. And the issue of data contamination, which we're about to go into right now.[00:38:25] Issue: GPT4 vs the mystery of the AMC10/12[00:38:25] And there's some interesting, if you dive into details of GPT4, there's some interesting results in GPT4, which starts to get into the results with benchmarking, right? Like so for example, there was a test that GPT4 published that is very, very bizarre to everyone who is even somewhat knowledgeable.[00:38:41] And this concerns the Ammc 10 and AMC 12. So the mc. Is a measure of the American math 10th grade student and the AMC12 is a, uh, is a measure of the American 12th grade student. So 12 is supposed to be harder than 10. Because the students are supposed to be older, it's, it's covering topics in algebra, geometry number, theory and combinatorics.[00:39:04] GPT4 scored a 30 on AMC10 and scored a 60 on AMC12. So the harder test, it got twice as good, and 30 was really, really bad. So the scoring format of AMC10. It is 25 questions. Each correct answer is worth six points. Each incorrect answer is worth 1.5 points and unanswered questions receive zero points.[00:39:25] So if you answer every single question wrong, you will get more than GPT4 got on AMC10. You just got everything wrong. Yeah, it's definitely better in art medics, you know, but it's clearly still a, a long way from, uh, from being even a high school student. Yeah. There's a little bit of volatility in these results and it, it shows that we, it's not quite like machine intelligence is not the same, or not linearly scaling and not intuitive as human intelligence.[00:39:54] And it's something that I think we should be. Aware of. And when it freaks out in certain ways, we should not be that surprised because Yeah, we're seeing that. Yeah. I feel like part of it is also human learning is so structured, you know, like you learn the new test, you learn the new test, you learn the new test.[00:40:10] But these models, we kind of throw everything at them all at once, you know, when we train them. So when, when the model is strained, are you excusing the model? No, no, no. I'm just saying like, you know, and you see it in everything. It's like some stuff. I wonder what the percentage of. AMC 10 versus AMC 12.[00:40:28] Issue: Data Contamination[00:40:28] Content online is, yes. This comes in a topic of contamination and memorization. Right. Which we can get into if we, if we, if we want. Yeah. Yeah, yeah. So, uh, we're getting into benchmarking issues, right? Like there's all this advancements in benchmarks, uh, language models. Very good. Awesome. Awesome, awesome. Uh, what are the problems?[00:40:44] Uh, the problem is that in order to train these language models, we are scraping the vast majority of the internet. And as time passes, the. Of previous runs of our tests will be pasted on the internet, and they will go into the corpus and the leg model will be memorizing them rather than reasoning them from first principles.[00:41:02] So in, in the machine, classic machine learning parlance, this would be overfitting mm-hmm. Uh, to the test rather than to the generalizing to the, uh, the results that we really want. And so there's an example of, uh, code forces as well also discovered on GPT4. So Code Forces has annual vintages and there was this guy, uh, C H H Halle on Twitter who ran GPT4 on pre 2021 problems, solved all of them and then ran it on 2022 plus problems and solved zero of them.[00:41:31] And we know that the cutoff for GPT4 was 2021. Mm-hmm. So it just memorized the code forces problems as far as we can tell. And it's just really bad at math cuz it also failed the mc 10 stuff. Mm-hmm. It's actually. For some subset of its capabilities. I bet if you tested it with GPT3, it might do better, right?[00:41:50] Yeah. I mean, this is the, you know, when you think about models and benchmarks, you can never take the benchmarks for what the number says, you know, because say, you know, you're focusing on code, like the benchmark might only include the pre 2021 problems and it scores great, but it's actually bad at generalizing and coming up with new solutions.[00:42:10] So, yeah, that, that's a. Big problem.[00:42:13] Other Issues: Benchmark Data Quality and the Iris data set[00:42:13] Yeah. Yeah. So bias, data quality, task specificity, reproducibility, resource requirements, and then calibrating confidence. So bias is, is, is what you might think it is. Basically, there's inherent bias in the data. So for example, when you think about doctor, do you think about a male doctor, a female doctor, in specifically an image net?[00:42:31] Businessmen, white people will be labeled businessmen, whereas Asian businessmen will be labeled Asian businessmen and that can reinforce harmful serotypes. That's the bias issue. Data quality issue. I really love this one. Okay, so there's a famous image data set we haven't talked about called the pedals or iris.[00:42:47] Iris dataset mm-hmm. Contains measurements of, uh, of, uh, length with petal length and petal with, uh, three different species of iris, iris flowers, and they have labeling issues in. So there's a mini, there's a lowest level possible error rate because the error rate exists in the data itself. And if you have a machine learning model that comes out with better error rate than the data, you have a problem cuz your machine learning model is lying to you.[00:43:12] Mm-hmm. Specifically, there's, we know this for a fact because especially for Iris flowers, the length should be longer than the, than the width. Um, but there. Number of instances in the data set where the length was shorter than the, than the width, and that's obviously impossible. So there was, so somebody made an error in the recording process.[00:43:27] Therefore if your machine learning model fits that, then it's doing something wrong cuz it's biologically impossible. Mm-hmm. Task specificity basically if you're overfitting to, to one type of task, for example, answering questions based on a single sentence or you're not, you know, facing something real world reproducibility.[00:43:43] This one is actually, I guess, the fine details of machine learning, which people don't really like to talk about. There's a lot. Pre-processing and post-processing done in I Python notebooks. That is completely un versions untested, ad hoc, sticky, yucky, and everyone does it differently. Therefore, your test results might not be the same as my test results.[00:44:04] Therefore, we don't agree that your scores are. The right scores for your benchmark, whereas you're self reporting it every single time you publish it on a, on a paper. The last two resource requirements, these are, these are more to do with GPTs. The larger and larger these models get, the harder, the more, more expensive it is to run some.[00:44:22] And some of them are not open models. In other words, they're not, uh, readily available, so you cannot tell unless they run it themselves on, on your benchmark. So for example, you can't run your GPT3, you have to kind of run it through the api. If you don't have access to the API like GPT4, then you can't run it at all.[00:44:39] The last one is a new one from GPT4's Paper itself. So you can actually ask the language models to expose their log probabilities and show you how confident they think they are in their answer, which is very important for calibrating whether the language model has the right amount of confidence in itself and in the GPT4 people. It. They were actually very responsible in disclosing that They used to have about linear correspondence between the amount of confidence and the amount of times it was right, but then adding R L H F onto GPT4 actually skewed this prediction such that it was more confident than it should be. It was confidently incorrect as as people say.[00:45:18] In other words, hallucinating. And that is a problem. So yeah, those are the main issues with benchmarking that we have to deal with. Mm-hmm. Yeah, and a lot of our friends, our founders, we work with a lot of founders. If you look at all these benchmarks, all of them just focus on how good of a score they can get.[00:45:38] They don't focus on what's actually feasible to use for my product, you know? So I think.[00:45:44] Tradeoffs of Latency, Inference Cost, Throughput[00:45:44] Production benchmarking is something that doesn't really exist today, but I think we'll see the, the rise off. And I think the main three drivers are one latency. You know, how quickly can I infer the answer cost? You know, if I'm using this model, how much does each call cost me?[00:46:01] Like is that in line with my business model I, and then throughput? I just need to scale these models to a lot of questions on the ones. Again, I just do a benchmark run and you kind of come up. For quadrants. So if on the left side you have model size going from smallest to biggest, and on the X axis you have latency tolerance, which is from, I do not want any delay to, I'll wait as long as I can to get the right answer.[00:46:27] You start to see different type of use cases, for example, I might wanna use a small model that can get me an answer very quickly in a short amount of time, even though the answer is narrow. Because me as a human, maybe I'm in a very iterative flow. And we have Varun before on the podcast, and we were talking about a kind of like a acceleration versus iteration use cases.[00:46:50] Like this is more for acceleration. If I'm using co-pilot, you know, the code doesn't have to be a hundred percent correct, but it needs to happen kind of in my flow of writing. So that's where a model like that would be. But instead, other times I might be willing, like if I'm asking it to create a whole application, I'm willing to wait one hour, you know, for the model to get me a response.[00:47:11] But you don't have, you don't have a way to choose that today with most models. They kind of do just one type of work. So I think we're gonna see more and more of these benchmark. Focus on not only on the research side of it, which is what they really are today when you're developing a new model, like does it meet the usual standard research benchmarks to having more of a performance benchmark for production use cases?[00:47:36] And I wonder who's gonna be the first company that comes up with, with something like this, but I think we're seeing more and more of these models go from a research thing to like a production thing. And especially going from companies like. Google and Facebook that have kinda unlimited budget for a lot of these things to startups, starting to integrate them in the products.[00:48:00] And when you're on a tight budget paying, you know, 1 cent per thousand tokens or 0.10 cent for a thousand tokens, like it's really important. So I think that's, um, that's what's missing to get a lot of these things to productions. But hopefully we, we see them.[00:48:16] Yeah, the software development lifecycle I'm thinking about really is that most people will start with large models and then they will prototype with that because that is the most capable ones.[00:48:25] But then as they put more and more of those things in production, people always want them to run faster and faster and faster and cheaper. So you will distill towards a more domain specific model, and every single company that puts this into production, we'll, we'll want something like that, but I, I think it's, it's a reasonable bet because.[00:48:41] There's another branch of the AI builders that I see out there who are build, who are just banking on large models only. Mm-hmm. And seeing how far they can stretch them. Right. With building on AI agents that can take arbitrarily long amounts of time because they're saving you lots of, lots of time with, uh, searching the web for you and doing research for you.[00:48:59] And I think. I'm happy to wait for Bing for like 10 seconds if it does a bunch of searches for median. Mm-hmm. Just ends with, ends with the right, right result. You know, I was, I was tweeting the other day that I wanted an AI enabled browser because I was seeing this table, uh, there was an image and I just needed to screenshot an image and say, plot this on a chart for me.[00:49:17] And I just wanted to do that, but it would have to take so many steps and I would be willing to wait for a large model to do that for me. Mm-hmm. Yeah. I mean, web development so far has been, Reduce, reduce, reduce the loading times. You know, it's like first we had the, I don't know about that. There, there are people who disagree.[00:49:34] Oh. But I, I think, like if you think about, you know, the CDN and you think about deploying things at the edge, like the focus recently has been on lowering the latency time versus increasing it.[00:49:45] Conclusion[00:49:45] Yeah. So, well that's the, that's Benchmark 1 0 1. Um. Let us know how we, how you think we did. This is something we're trying for the first time.[00:49:52] We're very inspired by other podcasts that we like where we do a bunch of upfront prep, but then it becomes a single topical episode that is hopefully a little bit more timeless. We don't have to keep keeping up with the news. I think there's a lot of history that we can go back on and. Deepen our understanding of the context of all these evolutions in, uh, language models.[00:50:12] Yeah. And if you have ideas for the next, you know, 1 0 1 fundamentals episode, yeah, let us know in the, in the comments and we'll see you all soon. Bye. Get full access to Latent Space at www.latent.space/subscribe
undefined
Mar 29, 2023 • 42min

Grounded Research: From Google Brain to MLOps to LLMOps — with Shreya Shankar of UC Berkeley

We are excited to feature our first academic on the pod! I first came across Shreya when her tweetstorm of MLOps principles went viral:Shreya’s holistic approach to production grade machine learning has taken her from Stanford to Facebook and Google Brain, being the first ML Engineer at Viaduct, and now a PhD in Databases (trust us, its relevant) at UC Berkeley with the new EPIC Data Lab. If you know Berkeley’s history in turning cutting edge research into gamechanging startups, you should be as excited as we are!Recorded in-person at the beautiful StudioPod studios in San Francisco.Full transcript is below the fold.Edit from the future: Shreya obliged us with another round of LLMOps hot takes after the pod!Other Links* Shreya’s About: https://www.shreya-shankar.com/about/* Berkeley Sky Computing Lab - Utility Computing for the Cloud* Berkeley Epic Data Lab - low-code and no-code interfaces for data work, powered by next-generation predictive programming techniques* Shreya’s ML Principles * Grounded Theory* Lightning Round:* Favorite AI Product: Stability Dreamstudio* 1 Year Prediction: Data management platforms* Request for startup: Design system generator* Takeaway: It’s not a fad!Timestamps* [00:00:27] Introducing Shreya (poorly)* [00:03:38] The 3 V's of ML development* [00:05:45] Bridging Development and Production* [00:08:40] Preventing Data Leakage* [00:10:31] Berkeley's Unique Research Lab Culture* [00:11:53] From Static to Dynamically Updated Data* [00:12:55] Models as views on Data* [00:15:03] Principle: Version everything you do* [00:16:30] Principle: Always validate your data* [00:18:33] Heuristics for Model Architecture Selection* [00:20:36] The LLMOps Stack* [00:22:50] Shadow Models* [00:23:53] Keeping Up With Research* [00:26:10] Grounded Theory Research* [00:27:59] Google Brain vs Academia* [00:31:41] Advice for New Grads* [00:32:59] Helping Minorities in CS* [00:35:06] Lightning RoundTranscript[00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio partner and CTM residence at Decibel Partners. I'm joined by my co-host, swyx writer and editor of Latent Space. Yeah,[00:00:21] it's awesome to have another awesome guest Shankar. Welcome .[00:00:25] Thanks for having me. I'm super excited.[00:00:27] Introducing Shreya (poorly)[00:00:27] So I'll intro your formal background and then you can fill in the blanks.[00:00:31] You are a bsms and then PhD at, in, in Computer Science at Stanford. So[00:00:36] I'm, I'm a PhD at Berkeley. Ah, Berkeley. I'm sorry. Oops. . No, it's okay. Everything's the bay shouldn't say that. Everybody, somebody is gonna get mad, but . Lived here for eight years now. So[00:00:50] and then intern at, Google Machine learning learning engineer at Viaduct, an OEM manufacturer, uh, or via OEM analytics platform.[00:00:59] Yes. And now you're an e I R entrepreneur in residence at Amplify.[00:01:02] I think that's on hold a little bit as I'm doing my PhD. It's a very unofficial title, but it sounds fancy on paper when you say[00:01:09] it out loud. Yeah, it is fancy. Well, so that is what people see on your LinkedIn. What's, what should, what should people know about you that's not on your LinkedIn?[00:01:16] Yeah, I don't think I updated my LinkedIn since I started the PhD, so, I'm doing my PhD in databases. It is not AI machine learning, but I work on data management for building AI and ML powered software. I guess like all of my personal interests, I'm super into going for walks, hiking, love, trying coffee in the Bay area.[00:01:42] I recently, I've been getting into cooking a lot. Mm-hmm. , so what kind of cooking? Ooh. I feel like I really like pastas. But that's because I love carbs. So , I don't know if it's the pasta as much as it's the carb. Do you ever cook for[00:01:56] like large[00:01:57] dinners? Large groups? Yeah. We just hosted about like 25 people a couple weeks ago, and I was super ambitious.[00:02:04] I was like, I'm gonna cook for everyone, like a full dinner. But then kids were coming. and I was like, I know they're not gonna eat tofu. The other thing with hosting in the Bay Area is there's gonna be someone vegan. There's gonna be someone gluten-free. Mm-hmm. . There's gonna be someone who's keto. Yeah.[00:02:20] Good luck, .[00:02:21] Oh, you forgot the seeds. That's the sea disrespects.[00:02:25] I know. . So I was like, oh my God, I don't know how I'm gonna do this. Yeah. The dessert too. I was like, I don't know how I'm gonna make everything like a vegan, keto nut free dessert, just water. It was a fun challenge. We ordered pizza for the children and a lot of people ate the pizza.[00:02:43] So I think , that's what happens when you try to cook, cook for everyone.[00:02:48] Yeah. The reason I dug a bit on the cooking is I always find like if you do cook for large groups, it's a little bit like of an ops situation. Yeah. Like a lot of engineering. A lot of like trying to figure out like what you need to deliver and then like what the pipeline[00:02:59] is and Oh, for sure.[00:03:01] You write that Gantt chart like a day in advance. , did you actually have a ga? Oh, I did. My gosh. Of course I had a Gantt chart. I, I dunno how people, did[00:03:08] you orchestrate it with airflow or ?[00:03:12] I orchestrated it myself. .[00:03:15] That's awesome. But yeah, we're so excited to have you, and you've been a pretty prolific writer, researcher, and thank you.[00:03:20] You have a lot of great content out there. I think your website now says, I'm currently learning how to make machine learning work in the real world, which is a challenge that mm-hmm. , everybody is steaming right now from the Microsoft and Googles of the word that have rogue eyes flirting with people, querying them to people, deploy models to production.[00:03:38] The 3 V's of ML development[00:03:38] Maybe let's run through some of the research you've done, especially on lops. Sure. And how to get these things in production. The first thing I really liked from one of your paper was the, the three VS of ML development. Mm-hmm. , which is velocity validation and versioning. And one point that you were making is that the development workflow of software engineering is kind of very different from ML because ML is very experiment driven.[00:04:00] Correct. There's a lot of changes that you need to make, you need to kill things very quickly if they're not working. So maybe run us through why you decided as kind of those three vs. Being some of the, the core things to think about. and some of the other takeaways from their research. Yeah,[00:04:15] so this paper was conducted as a loosely structured interview study.[00:04:18] So the idea is you interview like three or four people and then you go and annotate all the transcripts, tag them, kind of put the word clouds out there, whatever. There's a bunch of like cool software to do this. Then we keep seeing these, themes of velocity wasn't the word, but it was like experiment quickly or high experimentation rate.[00:04:38] Sometimes it was velocity. And we found that that was like the number one thing for people who were talking about their work in this kind of development phase. We also categorized it into phases of the work. So the life cycle like really just fell into place when we annotated the transcripts. And so did the variables.[00:04:55] And after three or four interviews you iterate on them. You kind of iterate on the questions, and you iterate on the codes or the tags that you give to the transcripts and then you do it again. And we repeated this process like three or four times up to that many people, and the story kind of told itself in a way that[00:05:11] makes sense.[00:05:12] I think, like I was trying to figure out why you picked those, but it's interesting to see that everybody kinda has the same challenges.[00:05:18] It fell out. I think a big thing, like even talking to the people who are at the Microsofts and the Googles, they have models in production. They're frequently training these models in production, yet their Devrel work is so experimental.[00:05:31] Mm-hmm. . And we were like, so it doesn't change. Even when you become a mature organization, you still throw 100 darts at the wall for five of them to stick and. That's super interesting and I think that's a little bit unique to data science and machine learning work.[00:05:45] Bridging Development and Production[00:05:45] Yeah. And one one point you had is kind of how do we bridge the gap between the development environments and the production environments?[00:05:51] Obviously you're still doing work in this space. What are some of the top of mind areas of focus for you in[00:05:57] this area? Yeah, I think it. Right now, people separate these environments because the production environment doesn't allow people to move at the rate that they need to for experimentation. A lot of the times as you're doing like deep learning, you wanna have GPUs and you don't wanna be like launching your job on a Kubernetes cluster and waiting for the results to come.[00:06:17] And so that's just the hardware side of things. And then there is the. Execution stack. Um, you wanna be able to query and create features real time as you're kind of training your model. But in production things are different because these features are kind of scheduled, maybe generated every week.[00:06:33] There's a little bit of lag. These assumptions are not accounted for. In development and training time. Mm-hmm. . So of course we're gonna see that gap. And then finally, like the top level, the interface level. People wanna experiment in notebooks, in environments that like allow them to visualize and inspect their state.[00:06:50] But production jobs don't typically run in notebooks. Yeah, yeah, yeah. I mean there, there are tools like paper mill and et cetera. But it's not the same, right? So when you just look at every single layer of the kind of data technical stack, there's a develop. Side of things and there's a production side of things and they're completely different.[00:07:07] It makes sense why. Way, but I think that's why you get a bunch of bugs that come when you put things in production.[00:07:14] I'm always interested in the elimination of those differences. Mm-hmm. And I don't know if it's realistic, but you know, what would it take for people to, to deploy straight to production and then iterate on production?[00:07:27] Because that's ultimately what you're[00:07:29] aim for. This is exactly what I'm thinking about right now in my PhD for kind of like my PhD. But you said it was database. I think databases is a very, very large field. , pretty much they do everything in databases . But the idea is like, how do we get like a unified development and production experience, Uhhuh, for people who are building these ML models, I think one of the hardest research challenges sits at that execution layer of kind of how do.[00:07:59] Make sure that people are incorporating the same assumptions at development time. Production time. So feature stores have kind of come up in the last, I don't know, couple of years, three years, but there's still that online offline separation. At training time, people assume that their features are generated like just completely, perfectly.[00:08:19] Like there's no lag, nothing is stale. Mm-hmm. , that's the case when trading time, but those assumptions aren't really baked. In production time. Right. Your features are generated, I don't know, like every week or some Every day. Every hour. That's one thing. How do, like, what does that execution model look like to bridge the two and still give developers the interactive latencies with features?[00:08:40] Preventing Data Leakage[00:08:40] Mm-hmm. . I think another thing also, I don't know if this is an interface problem, but how do we give developers the guardrails to not look at data that they're not supposed to? This is a really hard problem. For privacy or for training? Oh, no, just for like training. Yeah. Okay. also for privacy. Okay. But when it comes to developing ML models in production, like you can't see, you don't see future data.[00:09:06] Mm-hmm. . Yeah. You don't see your labels, but at development time it's really easy to. to leak. To leak and even like the seeming most seemingly like innocuous of ways, like I load my data from Snowflake and I run a query on it just to get a sense for, what are the columns in my data set? Mm-hmm. or like do a DF dot summary.[00:09:27] Mm-hmm. and I use that to create my features. Mm-hmm. and I run that query before I do train test. , there's leakage in that process. Right? And there's just at the fun, most fundamental level, like I think at some point at my previous company, I just on a whim looked through like everyone's code. I shouldn't have done that , but I found that like everyone's got some leakage assumptions somewhere.[00:09:49] Oh, mm-hmm. . And it's, it's not like people are bad developers, it's just that. When you have no guard the systems. Yeah, do that. Yeah, you do this. And of course like there's varying consequences that come from this. Like if I use my label as a feature, that's a terrible consequence. , if I just look at DF dot summary, that's bad.[00:10:09] I think there's like a bunch of like unanswered interesting research questions in kind of creating. Unified experience. I was[00:10:15] gonna say, are you about to ban exploratory data analysis ?[00:10:19] Definitely not. But how do we do PDA in like a safe , data safe way? Mm-hmm. , like no leakage whatsoever.[00:10:27] Right. I wanna ask a little small follow up about doing this at Berkeley.[00:10:31] Berkeley's Uniquely Research Lab Culture[00:10:31] Mm-hmm. , it seems that Berkeley does a lot of this stuff. For some reason there's some DNA in Berkeley that just, that just goes, hey, just always tackle this sort of hard data challenges. And Homestate Databricks came out of that. I hear that there's like some kind of system that every five years there's a new lab that comes up,[00:10:46] But what's going on[00:10:47] there? So I think last year, rise Lab which Ray and any scale came out of. Kind of forked into two labs. Yeah. Sky Lab, I have a water bottle from Sky Lab. Ooh. And Epic Lab, which my advisor is a co-PI for founding pi, I don't know what the term is. And Skylabs focus, I think their cider paper was a multi-cloud programming environment and Epic Lab is, Their focus is more like low-code, no-code, better data management tools for this like next generation of Interfa.[00:11:21] I don't even know. These are like all NSF gra uh, grants.[00:11:24] Yeah. And it's five years, so[00:11:26] it could, it could involve, yeah. Who knows what's gonna be, and it's like super vague. Yeah. So I think we're seeing like two different kinds of projects come out of this, like the sky projects of kind of how do I run my job on any cloud?[00:11:39] Whichever one is cheapest and has the most resources for me, my work is kind of more an epic lab, but thinking about these like interfaces, mm-hmm. , better execution models, how do we allow people to reason about the kind of systems they're building much more effectively. Yeah,[00:11:53] From Static Data to Dynamically Updated Data[00:11:53] yeah. How do you think about the impact of the academia mindset when then going into.[00:11:58] Industry, you know, I know one of the points in your papers was a lot of people in academia used with to static data sets. Mm-hmm. , like the data's not updating, the data's not changing. So they work a certain way and then they go to work and like they should think about bringing in dynamic data into Yeah.[00:12:15] Earlier in the, in the workflow, like, , how do you think we can get people to change that mindset? I think[00:12:21] actually people are beginning to change that mindset. We're seeing a lot of kind of dynamic data benchmarks or people looking into kind of streaming datasets, largely image based. Some of them are language based, but I do think it's somewhat changing, which is good.[00:12:35] But what I don't think is changing is the fact that model researchers and Devrel developers want. to create a model that learns the world. Mm-hmm. . And that model is now a static artifact. I don't think that's the way to go. I want people, at least in my research, the system I'm building, models are not a one time thing.[00:12:55] Models as views on Data[00:12:55] Models are views that are frequently recomputed over your data to use database speak, and I don't see people kind of adopting that mindset when it comes to. Kind of research or the data science techniques that people are learning in school. And it's not just like retrain G P T every single day or whatever, but it, it is like, how do I make sure that I don't know, my system is evolving over time.[00:13:19] Mm-hmm. that whatever predictions or re query results that are being generated are. Like that process is changing. Can you give[00:13:27] a, an overview of your research project? I know you mentioned a couple snippets here and there,[00:13:32] but that would be helpful. . I don't have a great pitch yet. I haven't submitted anything, still working on it, but the idea is like I want to create a system for people to develop their ML pipelines, and I want it to be like, Like unifying the development production experience.[00:13:50] And the key differences about this is one, you think of models as like data transformations that are recomputed regularly. So when you write your kind of train or fit functions, like the execution engine understands that this is a process that runs repeatedly. It monitors the data under the hood to refit the computation whenever it's detected.[00:14:12] That kind of like the data distributions have changed. So that way whenever you. Test your pipelines before you deploy them. Retraining is baked in, monitoring is baked in. You see that? And the gold star, the gold standard for me is the number that you get at development time. That should be the number that you get when you deploy[00:14:33] There shouldn't be this expected 10% drop. That's what I know I will have. Made something. But yeah, definitely working on that.[00:14:41] Yeah. Cool. So a year ago you tweeted a list of principles that you thought people should know and you split it very hopefully. I, I thought into beginner, intermediate, advanced, and sometimes the beginner is not so beginner, you know what I mean?[00:14:52] Yeah, definitely. .[00:14:53] The first one I write is like,[00:14:57] so we don't have to go through the whole thing. I, I do recommend people check it out, but also maybe you can pick your favorites and then maybe something you changed your mind.[00:15:03] Principle: Version Everything You Do[00:15:03] I think several of them actually are about versioning , which like maybe that bias the interview studying a little bit.[00:15:12] Yeah. But I, I really think version everything you do, because in experimentation time, because when you do an experiment, you need some version there because if you wanna pr like publish those. , you need something to go back to. And the number of people who like don't version things, it is just a lot. It's also a lot to expect for someone to commit their code every time they like.[00:15:33] Mm-hmm. train their model. But I think like having those practices is definitely worth it. When you say versioning,[00:15:39] you mean versioning code.[00:15:40] versioning code versioning data, like everything around a single like trial run.[00:15:45] So version code get fine. Mm-hmm. versioning data not[00:15:48] as settled. Yeah. I think that part, like you can start with something super hacky, which is every time you run your script, like just save a copy of your training set.[00:16:00] Well, most training sets are not that big. Yeah. Like at least when people are like developing on their computer, it. Whatever. It's not that big. Just save a copy somewhere. Put it ass three, like it's fine. It's worth it. Uhhuh, . I think there's also like tools like dvc like data versioning kind of tools. I think also like weights and biases and these experiment track like ML flow, the experiment tracking tools have these hooks to version your data for you.[00:16:23] I don't know how well they work these days, but . Yeah, just something around like versioning. I think I definitely agree with[00:16:30] Principle: Always validate your Data[00:16:30] I'm. Super, super big into data validation. People call it monitoring. I used to think it was like monitoring. I realize now like how little at my previous company, we just like validated the input data going into these pipelines and even talking to people in the interview study people are not doing.[00:16:48] Data validation, they see that their ML performance is dropping and they're like, I don't know why. What's going on ? And when you dig into it, it's a really fascinating, interesting, like a really interesting research problem. A lot of data validation techniques for machine learning result in too many false positive alerts.[00:17:04] And I have a paper got rejected and we're resubmitting on this. But yeah, like there, it's active research problem. How do you create meaningful alerts, especially when you have tons of features or you have large data sets, that's a really hard problem, but having some basic data validation check, like check that your data is complete.[00:17:23] Check that your schema matches up. Check that your most frequent, like your. Most frequently occurring value is the same. Your vocabulary isn't changing if it's a large language model. These are things that I definitely think I could have. I should have said that I did say data validation, but I didn't like, like spell it out.[00:17:39] Have you, have you looked into any of the current data observability platforms like Montecarlo or Big I I think you, I think you have some experience with that as[00:17:47] well. Yeah. I looked at a Monte car. Couple of years back, I haven't looked into big eye. I think that designing data validation for ML is a different problem because in the machine learning setting, you can allow, there's like a tolerance for how corrupted your data is and you can still get meaningful prediction.[00:18:05] Like that's the whole point of machine learning. Yeah, so like. A lot of the times, like by definition, your data observability platform is gonna give you false positives if you just care about the ML outputs. So the solution really, at least our paper, has this scheme where we learn from performance drops to kind of iterate on the precision of the data validation, but it's a hybrid of like very old databases techniques as well as kind of adapting it to the ML setting.[00:18:33] Heuristics for Model Architecture Selection[00:18:33] So you're an expert in the whole stack. I think I, I talk with a lot of founders, CTOs right now that are saying, how can I get more ML capabilities in, in my application? Especially when it comes to LLMs. Mm-hmm. , which are kind of the, the talk of the town. Yeah. How should people think about which models to use, especially when it comes to size and how much data they need to actually make them useful, for example, PT three is 175 billion parameters co-pilot use as a 12 billion model.[00:19:02] Yeah. So it's much smaller, but it's very good for what it does. Do you have any heuristics or mental models that you use when teams should think about what models to use and how big they need it to be?[00:19:12] Yeah I think that the. Precursor to this is the operational capabilities that these teams have. Do they have the capability to like literally host their own model, serve their own model, or would they rather use an api?[00:19:25] Mm-hmm. , a lot of teams like don't have the capability to maintain the actual model artifact. So even like the process of kind of. Fine tuning A G P T or distilling that, doing something like it's not feasible because they're not gonna have someone to maintain it over time. I see this with like some of the labs, like the people that we work with or like the low-code, no-code.[00:19:47] Or you have to have like really strong ML engineers right over time to like be able to have your own model. So that's one thing. The other thing is these G P T, these, these large language models, they're really good. , like giving you useful outputs. Mm-hmm. compared to like creating your own thing. Mm-hmm.[00:20:02] even if it's smaller, but you have to be okay with the latency. Mm-hmm. and the cost that comes out of it. In the interview study, we talk to people who are keeping their own, like in memory stores to like cash frequently. I, I don't know, like whatever it takes to like avoid calling the Uhhuh API multiple types, but people are creative.[00:20:22] People will do this. I don't think. That it's bad to rely on like a large language model or an api. I think it like in the long term, is honestly better for certain teams than trying to do their own thing on[00:20:36] house.[00:20:36] The LLMOps Stack[00:20:36] How's the L l M ops stack look like then? If people are consuming this APIs, like is there a lot of difference in under They manage the, the data, the.[00:20:46] Well,[00:20:46] I'll tell you the things that I've seen that are unified people need like a state management tool because the experience of working with a L L M provi, like A G P T is, mm-hmm. . I'm gonna try start out with these prompts and as I learn how to do this, I'm gonna iterate on these prompts. These prompts are gonna end up being this like dynamic.[00:21:07] Over time. And also they might be a function of like the most recent queries Tonight database or something. So the prompts are always changing. They need some way to manage that. Mm-hmm. , like I think that's a stateful experience and I don't see the like, like the open AI API or whatever, like really baking that assumption in into their model.[00:21:26] They do keep a history of your[00:21:27] prompts that help history. I'm not so sure. , a lot of times prompts are like, fetch the most recent similar data in my database, Uhhuh, , and then inject that into the pump prompt. Mm-hmm. . So I don't know how, Okay. Like you wanna somehow unify that and like make sure that's the same all the time.[00:21:44] You want prompt compiler. Yeah, . I think there's some startup probably doing that. That's definitely one thing. And then another thing that we found very interesting is that when people put these. LLMs in production, a lot of the bugs that they observe are corrected by a filter. Don't output something like this.[00:22:05] Yes. Or don't do this like, so there's, or please output G on, yeah. . So these pipelines end up becoming a hybrid of like the API uhhuh, they're. Service that like pings their database for the most recent things to put in their prompt. And then a bunch of filters, they add their own filters. So like what is the system that allows people to build, build such a pipeline, this like hybrid kind of filter and ML model and dynamic thing.[00:22:30] So, so I think like, The l l m stack, like is looking like the ML ops thing right in this way of like hacking together different solutions, managing state all across the pipeline monitoring, quick feedback loop.[00:22:44] Yeah. You had one, uh, just to close out the, the tweet thread thing as well, but this is all also relevant.[00:22:50] Shadow Models[00:22:50] You have an opinion about shadowing a less complicated model in production to fall back on. Yeah. Is that a good summary?[00:22:55] The shadowing thing only works in situations where you don. Need direct feedback from. The user because then you can like very reasonably serve it like Yeah, as as long, like you can benchmark that against the one that's currently in production, if that makes sense.[00:23:15] Right. Otherwise it's too path dependent or whatever to.[00:23:18] evaluate. Um, and a lot of services can benefit from shadowing. Like any, like I used to work a lot on predictive analytics, predictive maintenance, like stuff like that, that didn't have, um, immediate outputs. Mm-hmm. or like immediate human feedback. So that was great and okay, and a great way to like test the model.[00:23:36] Got it. But I think as. Increasingly trying to generate predictions that consumers immediately interact with. It might not be I, I'm sure there's an equivalent or a way to adapt it. Mm-hmm. AV testing, stage deployment, that's in the paper.[00:23:53] Keeping Up With Research[00:23:53] Especially with keeping up with all the new thing. That's one thing that I struggle with and I think preparing for this. I read a lot of your papers and I'm always like, how do you keep up with, with all of this stuff?[00:24:02] How should people do it? You know? Like, now, l l M is like the hot thing, right? There's like the, there's like the chinchilla study. There's like a lot of cool stuff coming out. Like what's. U O for like staying on top of this research, reading it. Yeah. How do you figure out which ones are worth reading?[00:24:16] Which ones are kind of like just skim through? I read all of yours really firmly. , but I mean other ones that get skimmed through, how should people figure it out?[00:24:24] Yeah, so I think. I'm not the best person to ask for this because I am in a university and every week get to go to amazing talks. Mm-hmm. and like engage with the author by the authors.[00:24:35] Yeah. Right. Yeah. Yeah. So it's like, I don't know, I feel like all the opportunities are in my lap and still I'm struggling to keep up, if that makes sense. Mm-hmm. . I used to keep like running like a bookmark list of papers or things that I want to read. But I think every new researcher does that and they realize it's not you worth their time.[00:24:52] Right? Like they will eventually get to reading the paper if it's absolutely critical. No, it's, it's true, it's true. So like we've, I've adopted this mindset and like somehow, like I do end up reading things and the things that I miss, like I don't have the fo. Around. So I highly encourage people to take that mentality.[00:25:10] I also, I think this is like my personal taste, but I love looking into the GitHub repos that people are actually using, and that usually gives me a sense for like, what are the actual problems that people have? I find that people on Twitter, like sometimes myself included, will say things, but you, it's not how big of a problem is it?[00:25:29] Mm-hmm. , it's not. Yeah, like , I find that like just looking at the repos, looking at the issues, looking at how it's evolved over time, that really, really helps. So you're,[00:25:40] to be specific, you're not talking about paper repos?[00:25:43] No, no, no, no. I'm talking about tools, but tools also come with papers a lot in, um, databases.[00:25:49] Yeah. Yeah. I think ML specifically, I think there's way too much ML research out there and yeah, like so many papers out there, archive is like, kind of flooded. Yeah.[00:26:00] It's like 16% of old papers produced.[00:26:02] It's, it's crazy. . I don't know if it's a good use of time to try to read all of them, to be completely honest.[00:26:10] Grounded Theory for Problem Discovery[00:26:10] You have a very ethnographic approach, like you do interviews and I, I assume like you just kinda observe and don't Yeah. Uh, prescribe anything. And then you look at those GitHub issues and you try to dig through from like production, like what is this orientation? Is there like a research methodology that you're super influenced by that guides you like this?[00:26:28] I wish that I had. Like awareness and language to be able to talk about this. Uhhuh, , . I[00:26:37] don't know. I, I think it's, I think it's a bit different than others who just have a technology they wanna play with and then they, they just ignore, like they don't do as much, uh, like people research[00:26:47] as[00:26:47] you do. So the HCI I researchers like, Have done this forever and ever and ever.[00:26:53] Yeah. But grounded theory is a very common methodology when it comes to trying to understand more about a topic. Yeah. Which is you go in, you observe a little bit, and then you update your assumptions and you keep doing this process until you have stopped updating your assumptions. . And I really like that approach when it comes to.[00:27:13] Just kind of understanding the state of the world when it comes to like a cer, like LLMs or whatever, until I feel like, like there was like a point in time for like lops on like tabular data prior to these large language models. I feel like I, I'd gotten the space and like now that these like large language models have come out and people are really trying to use them.[00:27:35] They're tabular kind of predictions that they used to in the past. Like they're incorporating language data, they're incorporating stuff like customer feedback from the users or whatever it is to make better predictions. I feel like that's totally changing the game now, and I'm still like, Why, why is this the case?[00:27:52] Was were the models not good enough? Do people feel like they're behind? Mm-hmm. ? I don't know. I try to talk to people and like, yeah, I have no answers.[00:27:59] Google Brain vs Academia[00:27:59] So[00:27:59] how does the industry buzz and focus influence what stuff the research teams work on? Obviously arch language models, everybody wants to build on them.[00:28:08] When you're looking at, you know, other peers in the, in the PhD space, are they saying, oh, I'm gonna move my research towards this area? Or are they just kind of focused on the idea of the[00:28:18] first. . This is a good question. I think that we're at an interesting time where the kind of research a PhD student in an academic institution at CS can do is very different from the research that a large company, because there aren't like, There just aren't the resources.[00:28:39] Mm-hmm. that large companies compute resources. There isn't the data. And so now PhD students I think are like, if they want to do something better than industry could do it, like there's like a different class of problems that we have to work on because we'll never be able to compete. So I think that's, yeah, I think that's really hard.[00:28:56] I think a lot of PhD students, like myself included, are trying to figure out like, what is it that we can do? Like we see the, the state of the field progressing and we see. , why are we here? If we wanna train language model, I don't, but if somebody wants to train language models, they should not be at uc.[00:29:11] Berkeley, , they shouldn't .[00:29:15] I think it's, there's a sort of big, gets bigger mentality when it comes to training because obviously the big companies have all the data, all the money. But I was kind of inspired by Luther ai. Mm-hmm. , um, which like basically did independent reproductions Yeah. Of G P T three.[00:29:30] Don't you think like that is a proof of, of existence that it is possible to do independently?[00:29:34] Totally. I think that kind of reproducing research is interesting because it doesn't lead to a paper. Like PhD students are still like, you can only graduate when you have papers. Yeah. So to have a whole lab set.[00:29:46] I think Stanford is interesting cuz they did do this like reproducing some of the language models. I think it should be a write[00:29:50] a passage for like every year, year one PhD. You[00:29:53] must reproduce everything. I won't say that no one's done it, but I do understand that there's an incentive to do new work because that's what will give you the paper.[00:30:00] Yeah. So will you put 20 of your students to. I feel like only a Stanford or somebody who like really has a plan to make that like a five plus year. Mm-hmm. research agenda. And that's just the first step sort of thing. Like, I can't imagine every PhD student wants to do that. Well, I'm just[00:30:17] saying, I, I, I feel like that there will be clouds, uh, the, the, you know, the big three clouds.[00:30:21] Mm-hmm. Probably the Microsoft will give you credits to do whatever you want. And then it's on you to sort of collect the data but like there of existence that it is possible to[00:30:30] It's definitely possible. Yeah. I think it's significantly harder. Like collecting the data is kind of hard. Like just like because you have the cloud credits doesn't mean like you have a cluster that has SREs backing it.[00:30:42] Mm-hmm. who helped you run your experiments. Right, right. Like if you are at Google Rain. Yeah. I was there what, like five, six years ago. God, like I read an experiment and I didn. Problems. Like it was just there. Problems . It's not like I'm like running on a tiny slur cluster, like watching everything fail every five.[00:31:01] It's like, this is why I don't train models now, because I know that's not a good use of my time. Like I'll be in so many like SRE issues. Yeah. If I do it now, even if I have cloud credits. Right. So, Yeah, I think it's, it can feel disheartening. , your PhD student training models,[00:31:18] well, you're working on better paradigms for everyone else.[00:31:21] You know? That's[00:31:22] the goal. I don't know if that's like forced, because I'm in a PhD program, , like maybe if I were someone else, I'd be training models somewhere else. I don't know. Who knows? Yeah. Yeah.[00:31:30] You've read a whole post on this, right? Choosing between a PhD and going into. Obviously open ai. Mm-hmm. is kinda like the place where if you're a researcher you want to go go work in three models.[00:31:41] Advice for New Grads[00:31:41] Mm-hmm. , how should people think about it? What are like maybe areas of research that are underappreciated in industry that you're really excited about at a PhD level? Hmm.[00:31:52] I think I wrote that post for new grads. . So it might not be as applicable like as a new grad. Like every new grad is governed by, oh, not every, a good number of new grads are governed by, like, I wanna do work on something that's impactful and I want to become very known for this.[00:32:06] Mm-hmm. , like, that's like , like a lot of, but like they don't really, they're walking outta the world for the first time almost. So for that reason, I think that like it's worth working on problems. We'll like work on any data management research or platform in an industry that's like working on Providence or working on making it more efficient to train model or something like.[00:32:29] You know, that will get used in the future. Mm-hmm. . So it might be worth just going and working on that in terms of, I guess like going to work at a place like OpenAI or something. I do think that they're doing very interesting work. I think that it's like not a fad. These models are really interesting.[00:32:44] Mm-hmm. and like, they will only get more interesting if you throw more compute Right. And more data at them. So it, it seems like these industry companies. Doing something interesting. I don't know much more than that. .[00:32:59] Helping Minorities in CS[00:32:59] Cool. What are other groups, organizations, I know you, you're involved with, uh, you were involved with She Plus Plus Helping with the great name.[00:33:07] Yeah, I just[00:33:08] got it.[00:33:10] when you say it[00:33:10] out loud, didn't name Start in 2012. Long time ago. Yeah.[00:33:15] What are some of the organizations you wanna highlight? Anything that that comes to?[00:33:20] Yeah. Well, I mean, shva Plus is great. They work on kind of getting more underrepresented minorities in like high school, interested, kind of encoding, like I remember like organizing this when I was in college, like for high schoolers, inviting them to Stanford and just showing them Silicon Valley.[00:33:38] Mm-hmm. and the number of students who went from like, I don't know what I wanna do to, like, I am going to major or minor in c. Almost all of them, I think. I think like people are just not aware of the opportunities in, like, I didn't really know what a programmer was like. I remember in Texas, , like in a small town, like it's, it's not like one of the students I've mentored, their dad was a vc, so they knew that VC is a career path.[00:34:04] Uhhuh, . And it's like, I didn't even know, like I see like, like stuff like this, right? It's like just raising your a. Yeah. Or just exposure. Mm-hmm. , like people who, kids who grow up in Silicon Valley, I think like they're just in a different world and they see different things than people who are outside of Silicon Valley.[00:34:20] So, yeah, I think Chiles West does a great job of like really trying to like, Expose people who would never have had that opportunity. I think there's like also a couple of interesting programs at Berkeley that I'm somewhat involved in. Mm-hmm. , there's dare, which is like mentoring underrepresented students, like giving research opportunities and whatnot to them and Cs.[00:34:41] That's very interesting. And I'm involved with like a summer program that's like an r u also for underrepresented minorities who are undergrads. , find that that's cool and fun. I don't know. There aren't that many women in databases. So compared to all the people out there. ? Yeah.[00:35:00] My wife, she graduated and applied physics.[00:35:02] Mm-hmm. . And she had a similar, similar feeling when she was in, in school.[00:35:06] Lightning Round[00:35:06] All right. Let's jump into the lining ground. So your favorite AI product.[00:35:12] I really like. Stable diffusion, like managed offerings or whatever. I use them now to generate all of my figures for any talks that I give. I think it's incredible.[00:35:25] I'm able to do this or all of my like pictures, not like graphs or whatever, .[00:35:31] It'd be great if they could do that. Really looking[00:35:34] forward to it. But I, I love, like, I'll put things like bridging the gap between development and production or whatever. I'll do like a bridge between a sandbox and a city. Like, and it'll make it, yeah.[00:35:46] like, I think that's super cool. Yeah. Like you can be a little, I, I enjoy making talks a lot more because of , these like dream studio, I, I don't even know what they're called, what organization they're behind. I think that is from Stability. Stability,[00:35:58] okay. Yeah. But then there's, there's like Lexi there. We interviewed one that's focused on products that's Flare ai, the beauty of stable diffusion being open sources.[00:36:07] Yeah. There's 10[00:36:07] of these. Totally, totally. I'll just use whichever ones. I have credits on .[00:36:13] A lot of people focus on, like have different focuses, like Sure. Mid Journey will have an art style as a focus. Mm-hmm. and then some people have people as the focus for scenes. I, I feel like just raw, stable diffusion two probably is the[00:36:24] best.[00:36:24] Yeah. Yeah. But I don't do, I don't have images of people in my slides . Yeah, yeah. Yeah. That'd be a little bit weird.[00:36:31] So a year from now, what do you think people will be most surprised by in ai? What's on the horizon and about to come, but people don't realize. .[00:36:39] I don't know if this will be, this is related to the AI part of things or like an AI advancement, but I consistently think people underestimate the data management challenges.[00:36:50] Ooh. In putting these things in production. Uhhuh, . And I think people get frustrated that they really try, they see these like amazing prototypes, but they cannot for the life of them, figure out how to leverage them in their organization. And I think. That frustration will be collectively felt by people as it's like it's happened in the past, not for LLMs, but for other machine learning models.[00:37:15] I think people will turn to whatever it, it's just gonna be really hard, but we're gonna feel that collective frustration like next year is what I think.[00:37:22] And we talked a little bit before the show about data management platforms. Yeah. Do you have a spec for what that[00:37:27] is? The broad definition is a system that handles kind of execution.[00:37:33] or orchestration of different like data transformations, data related transformation in your pipeline. It's super broad. So like feature stores, part of it, monitoring is part of it. Like things that are not like your post request to open AI's, p i, , .[00:37:51] What's one AI thing you would pay for if someone built.[00:37:54] So whenever I do like web development or front end projects or like build dashboards, like often I want to manage my styles in a nice way.[00:38:02] Like I wanna generate a color palette, uhhuh, and I wanna manage it, and I wanna inject it throughout the application. And I also wanna be able to change it over time. Yeah. I don't know how to do this. Well, ? Yeah, in like large or E even like, I don't know, just like not even that large of projects. Like recently I was building my own like Jupyter Notebook cuz you can do it now.[00:38:23] I'm super excited by this. I think web assembly is like really changed a lot of stuff. So I was like building my own Jupyter Notebook just for fun. And I used some website to generate a color palette that I liked and then I was like, how do I. Inject this style like consist because I was learning next for the first time.[00:38:39] Yeah. And I was using next ui. Yeah. And then I was like, okay, like I could just use css but then like, is that the way to do it for this? Like co-pilot's not gonna tell me how to do this. There's too many options. Yeah. So just like, let me like just read my code and read and give me a color palette and allow me to change it over time and have this I opera.[00:38:58] With different frameworks, I would pay like $5 a month for this.[00:39:01] Yeah, yeah, yeah. It's, it's a, you know, the classic approach to this is have a design system and then maintain it. Yeah. I'm not designing Exactly. Do this. Yeah, yeah, yeah, yeah. This is where sort of the front end world eats its own tail because there's like, 10 different options.[00:39:15] They're all awesome. Yeah, you would know . I'm like, I have to apologize on behalf of all those people. Cuz like I, I know like all the individual solutions individually, but I also don't know what to recommend to you .[00:39:28] So like that's therein lies is the thing, right? Like, ai, solve this for me please. ,[00:39:35] what's one thing you want everyone to take away about?[00:39:39] I think it's really exciting to me in a time like this where we're getting to see like major technological advances like in front of our eyes. Maybe the last time that we saw something of this scale was probably like, I don't know, like I was young, but still like Google and YouTube and those. It's like they came out and it was like, wow, like the internet is so cool , and I think we're getting to see something like that again.[00:40:05] Yeah. Yeah. I think that's just so exciting. To be a part of it somehow, and maybe I'm like surrounded by a bunch of like people who are like, oh, like it's just a fad or it's just a phase. But I don't think so. Mm-hmm. , I think I'm like fairly grounded. So yeah. That's the one takeaway I have. It's, it's not a fad.[00:40:24] My grandma asked me about chat, g p t, she doesn't know what a database is, but she knows about chat. G p t I think that's really crazy. , what does she, what does she use it for? No, she just like saw a video about it. Ah, yeah. On like Instagram or not, she's not like on like something YouTube. She watches YouTube.[00:40:41] She's sorry. She saw like a video on ChatGPT and she was like, what do you think? Is it a fad? And I was like, oh my god. , she like watched after me with this and I was like, do you wanna try it out? She was like, what ? Yeah,[00:40:55] she should.[00:40:55] Yeah, I did. I did. I don't know if she did. So yeah, I sent it to her though.[00:40:59] Well[00:40:59] thank you so much for your time, Sreya. Where should people find you online? Twitter.[00:41:04] Twitter, I mean, email me if you wanna directly contact me. I close my dms cuz I got too many, like being online, exposing yourself to strangers gives you a lot of dms. . Yeah. Yeah. But yeah, you can contact me via email.[00:41:17] I'll respond if I can. Yeah, if there's something I could actually be helpful with, so, oh,[00:41:22] awesome.[00:41:23] Thank you. Yeah, thanks for, thanks for. Get full access to Latent Space at www.latent.space/subscribe

Get the Snipd
podcast app

Unlock the knowledge in podcasts with the podcast player of the future.
App store bannerPlay store banner

AI-powered
podcast player

Listen to all your favourite podcasts with AI-powered features

Discover
highlights

Listen to the best highlights from the podcasts you love and dive into the full episode

Save any
moment

Hear something you like? Tap your headphones to save it with AI-generated key takeaways

Share
& Export

Send highlights to Twitter, WhatsApp or export them to Notion, Readwise & more

AI-powered
podcast player

Listen to all your favourite podcasts with AI-powered features

Discover
highlights

Listen to the best highlights from the podcasts you love and dive into the full episode