Speaker 1
And a lot of people, a lot of machine learning engineers working on AI stuff, like LLM-based products, actually spend quite a bit of time on these world simulators. If they don't have good user data yet, they want to bootstrap themselves, and they try to do that with LLMs, which is reasonable, at least when you're getting started, to get a sense of, okay, what is happening.
Speaker 2
So, okay, so then you've got a test data set, you've got some tests you've written. So hopefully you're getting back some numbers that tell you how well it's working every time you make a change. And presumably you then iterate against that until you get something that you consider good enough to deploy. There are kind of a few different types of evaluation we've mentioned. So there's the hard-coded test cases, assertions, things like that. There's obviously human review, like you're manually going and annotating these things, or you're getting domain experts to do it. But you mentioned LLM-as-judge as well. And whenever I talk to people about this, I'm often met with skepticism about it, especially because people feel like it's sort of turtles all the way down, right? Like, I've got a messy stochastic thing and I'm going to evaluate it with another messy stochastic thing, and I need to evaluate that thing as well. So a common problem is: how do I trust my LLM-as-judge? How do I make it good? I have opinions about this, but I'd be curious to know how you guys do it.
Speaker 1
Yeah. So this is a really great question, and everyone has this question about how you trust LLM-as-judge. There's a systematic way that you should use LLM-as-judge, and you need to measure its agreement with a human being. That's the only way that you can know whether you can trust it. So, concretely, the way that works is, and I write about this in the blog post "Your AI Product Needs Evals", you need to go through several iterations where you measure the agreement between the human and the LLM-as-judge. You have to make sure your human is also writing critiques of the LLM-as-judge so that you can use those critiques to improve the judge.
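A minimal sketch of what measuring that human-versus-judge agreement can look like, assuming you have collected paired verdicts (the domain expert's pass/fail label and the judge's pass/fail label on the same outputs); the field names and the use of scikit-learn's cohen_kappa_score are illustrative assumptions, not anything prescribed in the conversation:

```python
# Sketch: quantify human <-> LLM-judge agreement on a labeled sample.
# Each record holds the expert's verdict and the judge's verdict for the
# same model output (field names here are hypothetical).
from sklearn.metrics import cohen_kappa_score

labeled_sample = [
    {"output_id": 1, "expert": "pass", "judge": "pass"},
    {"output_id": 2, "expert": "fail", "judge": "pass"},
    {"output_id": 3, "expert": "fail", "judge": "fail"},
    # ... a few hundred examples, per the discussion below
]

expert_labels = [r["expert"] for r in labeled_sample]
judge_labels = [r["judge"] for r in labeled_sample]

# Raw agreement rate plus Cohen's kappa (agreement corrected for chance).
raw_agreement = sum(e == j for e, j in zip(expert_labels, judge_labels)) / len(labeled_sample)
kappa = cohen_kappa_score(expert_labels, judge_labels)

print(f"raw agreement: {raw_agreement:.2%}, Cohen's kappa: {kappa:.2f}")
```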
Speaker 2
And why not let that human just adjust the prompt themselves?
Speaker 1
Oh, you can, but you need to force a human or domain expert to go through the process of annotating, you know, this judge and also making their own assessment, so that you can measure the agreement between the human and the judge. And you need to do this process several times, and in every iteration you need to try to bring the agreement closer. A very interesting thing happens when you do this. One is, okay, the LLM-as-judge gets better, because you get better at prompting it, because you figure out where it's failing. You figure out, from the critiques the human is writing, how to incorporate aspects of that into the prompt, either as examples or otherwise. But also, as much as the LLM becomes aligned with the human, and I know I'm using "aligned" loosely, there's no RLHF that I'm talking about here, it's just correlated, the human actually becomes more aligned with the LLM. They're like, oh, you know what, this is a very interesting problem, this is actually really hard, and I think what the LLM is doing is reasonable, it's fine. There's some empathy that's built the other way, which is counterintuitive. You don't think about that, but it kind of makes the human accept it.
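One way to picture "incorporate aspects of that into the prompt, either as examples or otherwise" is to fold the expert's written critiques of past judge mistakes back into the judge prompt as few-shot examples. This is only a sketch under that assumption; the prompt wording and record structure are hypothetical:

```python
# Sketch: build a judge prompt that carries expert critiques of earlier
# judge errors as few-shot examples (structure and wording are hypothetical).

def build_judge_prompt(critiqued_examples: list[dict], new_output: str) -> str:
    few_shot = "\n\n".join(
        f"Output: {ex['output']}\n"
        f"Correct verdict: {ex['expert_verdict']}\n"
        f"Expert critique: {ex['critique']}"
        for ex in critiqued_examples
    )
    return (
        "You are grading outputs of our assistant as pass or fail.\n\n"
        f"Here are past examples with the domain expert's verdict and critique:\n\n{few_shot}\n\n"
        "Now grade this output. Respond with 'pass' or 'fail' and a short reason.\n"
        f"Output: {new_output}"
    )
```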
Speaker 2
And in your experience, how long does this process take? How many rounds of manual labeling am I having to do? How hard is it actually to align an LLM-as-judge?
Speaker 1
Only three or four rounds of iteration, and maybe that's a bad term, but there are various ways to approach this, and label a few hundred examples, maybe. So we're not talking about masses of work, right? It's not that bad.
Speaker 2
And then once you've got that LLM-as-judge, now that's a scalable eval, and you can also use it for monitoring; you've kind of got an artifact. Do you also fine-tune them to make them quicker and cheaper? Because I guess the other reason people are reticent about wanting to use LLM-as-judge is that it's just expensive.
Speaker 1
It depends. If it's for something simple, like a guard or more of a classification thing, then I might use fine-tuning, but I try not to go there with LLM-as-judge, because it just becomes a meta-problem that I have to worry about, and I try really hard not to. It just adds to the complexity of the system, because now I'm also thinking about whether I should fine-tune the judge as well. So usually, instead of fine-tuning the LLM-as-judge, sometimes I'm like, let's fine-tune the actual model and try to make it better. You can get crazy with the judge; it can become almost an academic exercise.
Speaker 3
Maybe there's a tick-tock process where you go back and forth between your judge and your producer. I wanted to offer my contribution to this. I've lived this, in the sense that I produce a daily AI newsletter that is 99% AI-generated, and I am the human judge annotating every time. I'm nine months in. Today, actually, literally today, was the first time I had 100% agreement with my product's output. So I was surprised when you said three to four times, because it takes a lot. I think it also maybe depends on the open-endedness of your domain and the complexity of the task, in the sense that I'm summarizing all of AI Twitter, Reddit, Discord.
Speaker 1
I really like AI News. Oh, thank you. Have you talked about how you make it or how you do it? Yeah, ask me anything you want.
Speaker 3
But it's literally a data pipeline. We filter, we group, we summarize, we do entity extraction.
Speaker 2
And for anyone listening who doesn't know, AI News is the daily newsletter that basically summarizes all the Discords and Twitter.
Speaker 3
There's stuff that you're supposed to keep up on but you can't, yeah.
Speaker 2
And it tells you what you should actually pay attention to. I enjoy it, I get it every day, and I read the headlines, and it's like, okay, I know what I should, if anything, pay attention to.
Speaker 3
Correct. So today the surprise was that o1-preview, not mini, spat out exactly what I would have written as a human. It was the first time it's done that. I've always had to do some manual override, with me as a layer on top of it going, okay, this is actually the story, this is not. But here, right at the top of the o1-preview output, were the top three stories I wanted.
Speaker 2
I think that's a really good point, though, because that task is a lot bigger and more open-ended than what we would, or what I would, typically encourage people to do for an eval. So one of the things that we try and advise people to do is to break their evals down into small, sort of atomic things that are independent of each other, whereas when you're doing that review on, you know, what the AI News thing has spat out, you're essentially assessing the whole thing.
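For contrast, the "small, atomic" style of eval being described here usually looks like cheap, independent assertions on a single LLM output rather than a judgment on the whole artifact. A hedged sketch; the specific checks are made-up examples, not anything from the episode:

```python
# Sketch: small, atomic assertion-style evals, each checking one property
# of a single LLM output (the specific checks are illustrative).
import json

def assert_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def assert_no_placeholder_text(output: str) -> bool:
    return "lorem ipsum" not in output.lower() and "[TODO]" not in output

def assert_under_word_limit(output: str, limit: int = 200) -> bool:
    return len(output.split()) <= limit

checks = [assert_valid_json, assert_no_placeholder_text, assert_under_word_limit]

def run_evals(outputs: list[str]) -> dict:
    # Pass rate per check across a batch of outputs.
    return {
        check.__name__: sum(check(o) for o in outputs) / len(outputs)
        for check in checks
    }
```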
Speaker 3
A little bit of both, yeah, the whole thing and the micro thing. There's this big and small. We split it into the summarizer prompt, the filter prompt, and the style prompt. Style is the one that is the smallest and probably fine-tunable, but even I have very specific requirements for style that are weird sometimes, and often tuning that style prompt ends up leaking the few-shot examples I have into the data, which I then maybe need a fourth evaluation on, which just becomes a whole mess. So I don't have that part worked out yet. I just kind of rely on people to self-correct whenever I hallucinate a link, but it's not great.
Speaker 1
How much time do you spend as a human on each news blast that goes out?
Speaker 3
I've actually been recording it for my team. We're working on handing it off so that I am no longer in the loop. So, between 30 to 45 minutes a day. The goal is that we eventually build a newspaper with no journalists that scales to every vertical. So it's a product that I think could benefit from the process that you're describing. I think I differ by an order of magnitude from what you said, which is kind of interesting. And I have not written unit-test evals, because we care about sort of different metrics, like topic selection, because I have maybe 400 candidate stories every day and I need like 10. So it's actually a recommendation system, rather than...
Speaker 2
And it's also a relatively low-volume application in terms of the number of LLM outputs you have, compared to a lot of what I've seen, at least.
Speaker 1
Yeah, that's true. The things that I tend to work on are contained in scope, I would say. And a lot of times people come to me with chatbots, like, I want to put a chatbot over my SaaS. I try as hard as I can to convince them not to do that. That's the idea a lot of people start with, but it tends to be highly correlated with not being thoughtful about the product. Like, if you're just trying to put a chatbot over your SaaS: hey, instead of clicking menus or doing anything, you just talk to this chatbot to do anything in my SaaS. Whereas, okay, let's be more thoughtful. Can you integrate AI into your user workflow in a more thoughtful way without having to go to chat? It's a little bit of a smell in a lot of cases, when you just try to put a chatbot on it. And it can also be very hard to evaluate; the surface area is incredible, and it's a moving target. And then also, from a product perspective, sometimes it's just not good.
Speaker 3
I agree with this, in that that is what most people should do. But I still aspire to the chatbot. No, no, I mean, it's great. Yeah. Let me do the bull case. I go into Linear and sometimes I want to do a thing and I don't know how to do it. I do command-K and I figure it out, right? But if I don't, if I type the wrong letter or I don't know the name for a functionality, I can't find it. So at minimum, I should be doing semantic matching for a function. But actually, what I want is a little mini agent, if we call it that, that calls tools, and the tools are all the functionality. There are so many dashboards with like 5,000 settings. Zoom, freaking Zoom. I can't figure out how to turn some of my permissions on and off. Of course they don't have search, but even if they did have search, actually, sorry, they do have search, but there are like 500 matches for the search item that I want and I don't know which one it is. A chatbot would be nice, yeah.
Speaker 1
I think it could work. If your SaaS has a really good API surface area that's really cleanly defined, then it would probably do really well. A lot of SaaS doesn't have that.
Speaker 2
But I agree with what you say about the smell, in terms of not necessarily being thoughtful, right? And this has come up with a few of my other podcast guests, who have spoken about the projects that did or didn't work well. I think post-ChatGPT, there was a rush to slap LLMs on everything, chat with PDF or whatever. And then people have taken a step back and gone, well, actually, where are the bottlenecks? A great example of this was speaking to the founder of Gusto. He was saying how they realized report building was something that people in Gusto spent hours on, and they can actually automate the entire process of report building in a couple of minutes with an LLM call. So rather than just building chat over the SaaS, they went and really spent a lot of time building that feature first, where the amount of leverage a customer gets is enormous. I think their story actually started with a distributed team, so they were like, everyone should build LLM features, and they got a lot of incremental improvements because every team did small things. And then they pulled back and said, actually, if we could do one or two really big things, what would really move the needle for people? And I think not enough people are doing that, so it definitely resonates. Okay, last thing on this area before we move on. Obviously, working as an AI consultant, you get brought in when people are struggling, or when things aren't working, or when people have been disappointed by trying to build with LLMs. You've mentioned evals as being one missing piece that people often get wrong. What else are the challenges people are facing, or common pitfalls? What's the advice you find yourself giving again and again?
Speaker 1
Yeah, a lot of times people come to me and they're like, hey, can you help us? Here's a 10-page PDF of our architecture and tools. Can you take a look and tell us what we can be doing better? And there's no looking at data or evals. It's a very, very common mistake: a focus on thinking that you can move the needle by having better frameworks and tools when you don't even know what's broken. You don't know what's wrong; you just kind of have the sense from vibe checks that you want it to be better. So the most common advice I give is, hey, we have to first instrument your system, and then we have to look at your data and see what's wrong. And really, every time I do it, we instrument their data, start looking at it, and every single time we find, okay, there are some really dumb things happening here that we can fix. It's just a blind spot for so many people who don't think of that. Once you do it, it's fairly obvious that, hey, I should have been doing this for a long time. It's just, for some reason, people don't think of it. So that's one thing. The second thing is, along the same thread of tools and frameworks, when I talk about evals or looking at your data, the first question people ask is, what's a tool? What's a tool that you use to do that? Okay, the tools can help you, but people have an idea that, I'm going to buy a tool that's going to solve that problem. But inherent in it is a process. You have to follow a process of looking at your data. The tool is not going to look at the data for you. It's not going to debug for you. It's not going to do all that stuff.
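"Instrument your system and then look at your data" can start as something very simple: log every LLM call with its inputs, outputs, and metadata so there is something concrete to review. A minimal sketch, assuming an OpenAI-style client; the wrapper and the log format are hypothetical, not a specific tool's API:

```python
# Sketch: minimal instrumentation -- log every LLM call so real traffic can be
# reviewed later (the wrapper and JSONL log format are hypothetical).
import json
import time
import uuid

LOG_PATH = "llm_calls.jsonl"

def logged_completion(client, model: str, messages: list[dict], **kwargs):
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": start,
        "latency_s": round(time.time() - start, 3),
        "model": model,
        "messages": messages,
        "output": response.choices[0].message.content,
    }
    # Append one JSON record per call for later review and eval-set building.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```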
Speaker 2
I'm obviously biased to suggest tools, right? Because I build one of these tools. But actually, I agree with you, in that the customers who get to value with us tend to be problem-aware. They don't come to us having nothing in place and say, we need evals. Typically, what's happened is they've built some jerry-rigged version with spreadsheets, human labeling, and Jupyter notebooks, and they've hit a scale issue where it no longer works with the collaboration challenges they have, and they're trying to deploy things more frequently. But very rarely does someone just show up. We did have this very early on, where people would show up and they'd be like, we're building an LLM product, help us, we need evals, or we need a tool. And we now DQ them in the sales process, right? They're not people who we consider high priority, because we know that they're not going to get to value. And if they don't get to value, they're not going to buy the product, right? So actually, we only have success with people who are somewhat problem-aware.
Speaker 3
What's a clarifying question that you can ask to tell whether someone is problem-aware? This is a new term for me.
Speaker 2
So typically, okay, this is now getting very Humanloop-specific, but very briefly, when someone comes in, on the first call I usually just have a big chat with them about the current stage of what they're building. So who's involved, where they've got to, are they in production, where are they? And in that process, it becomes very clear whether this is a company that has been told they need to do something with AI and doesn't yet fully know what that is. And there's a lot of that. Versus, you've got a process in place that's just not built on a good framework or a good set of tooling, and that's very common. I would say the vast majority of the customers who come through to us have already built something, oftentimes something quite sophisticated, right? Versus, as you say, the people who are much earlier than that: we've tried to help them, and the tool is never going to solve it for you. You need to understand the framework and the process first.
Speaker 1
That's really smart. Yeah. There's this book, The Mom Test, you know. It's like, are you already investing? Can you see that that person is investing in the problem? Yeah. If they're not investing in the problem already, they're probably not serious about solving it.
Speaker 3
I've often thought, you know, basically we have a consultant and a SaaS founder here. Have you ever thought about bundling services with software? So you sell the tool, and then you sell the labor to implement the tool, because they don't have the labor, so they're coming to buy the tool.
Speaker 2
So the way that we're typically starting to do this: firstly, we don't like to sell consulting services ourselves, although we end up doing a lot of that. We have insanely hands-on customer support, which sometimes veers into free consulting. But the reason we don't want to do that as a consulting contract is that I don't want to be on the hook for delivering consulting. Actually, I'm getting a lot of value from the customer from working closely with them, and we're still selling them a product; ultimately, we go away. But where we have been starting to do it is that we have a few customers who are consultants themselves, large consultancies, not individuals, systems integrators. We have a couple who are using it themselves, but increasingly now they are the ones who are going to a customer. It's an early channel for us, but that seems much more likely to me than us doing the consulting ourselves.