Riley Goodside is a staff prompt engineer at Scale AI. Previously working in data science, he is often seen as the default example of the new role of “prompt engineer.” He regularly posts incisive prompts that elicit notable behavior from the most popular AI models.
I really resonated with this saying from Anthropic’s recent podcast on prompt engineering — “now we write essays and treat them as code.” In order to be good at prompting, you need to understand that natural language now operates the way our code used to.
This episode is a masterclass on why you should care about prompting and how it impacts results. Of course, there’s a bunch of great discussion on recent models that reflect the need for different and/or better prompting. Enjoy it!
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
We mention:
* Prompting to push the frontier of AI models,
* Post-training and prompting interaction,
* Prompting base models,
* o1, Reflection 70B, reasoning,
* Scale’s leaderboard, evaluation tricks, evaluation needs,
* PlanSearch paper
* “The hottest programming language is English”
* “Think silently” instructions
* Scale Leaderboard and Humanity’s Last Exam
* ChatML formatting
Chapters
* [00:00:09] Introduction
* [00:02:40] Riley's path to LLMs
* [00:07:54] Impact of ChatGPT on prompt engineering
* [00:12:03] OpenAI's o1
* [00:18:21] Autoregressive inference and prompting sensitivities
* [00:24:48] Reflection 70B model and its implications
* [00:28:00] Impact of prompting on evaluation
* [00:32:43] Prompting vs. Google search
* [00:46:55] Prompting and RLHF/post-training
* [00:56:57] Prompting of AI agents
* [01:01:20] Importance of hands-on experience with language models
* [01:05:00] Importance and challenges of AI model evaluation
Transcript
Built with smol-podcaster.
Nathan L. [00:01:08]: Hey, Riley, welcome to the show.
Riley G.: Hey, Nathan, great to be here.
Nathan L. [00:01:14]: Yeah, so for the audience here, I mostly wanted to try to, as I work on post-training a lot and I see my own difficulty in taking prompting seriously and the things that I don't think that we are doing enough, and I don't see any reason why it can't be scientific in how we do prompting. So that's my biggest goal with this. I think there's a lot of podcasts where we could kind of say, like, what is the history of prompting? Where is it going? And that's easy to kind of redo. And I still find it interesting, but I just don't think there's enough people talking about the role of prompting in evaluation, how prompting changes with how you're post-training models, because we're trying to take that seriously and how we have a post-training setup, but we just like regularly run into these things like system prompts aren't handled well, how to release a model with a system prompt. So that's the tone that I'm trying to get to when I ask these questions. And also OpenAI's o1 model just came out, so I'm definitely going to get onto that pretty quickly because that's what everyone's excited about. I like to start with background just to kind of get to know people, because a lot of this is just, I want to talk to interesting people in AI, is like, how did you become interested in prompting? I think I've seen your background in data science and then you joined Scale around when ChatGPT came out, which is fun timing, but like, how did you become maybe obsessed with this, but like the focal point of your work?
Riley G. [00:02:40]: Yeah, I have sort of an unusual introduction to large language models. For most of my career, I've been a data scientist, mostly in the online dating industry. I was at OkCupid and Grindr. And after I left Grindr, I took sort of a sabbatical to educate myself, I guess, about the progress in large language models. It was around the time that GPT-3 Codex had just come out. And that was where I think I started to become really interested because I was following along with maybe, certainly when GPT-2 came out, the examples there wowed me as much as they wowed the rest of the world, I think, with the example of the news article about the unicorn and all that. And not long after that, we had AI Dungeon, and I played around with AI Dungeon a bit. But at that point, language models seemed to be mostly about language, that they were sort of very heavily focused on stylistic mimicry and creative writing and so on. And when Codex came out, it really started this thought that text is a more universal interface than we were giving it credit for, that language models might be more broadly useful. And I just became very excited in a practical sense of what these models could do for what I kind of intuited was very boilerplate-like data science code, that I thought of like most of the Python and Julia and R and things that I've written over my career, this seemed like stuff that an LLM could handle. And that was sort of one of its early strong points. So I was playing around with, I think one of my first projects was a VS Code extension that had some kind of integration with Codex. But I never really shipped anything out of it. And mostly what it transitioned into pretty quickly was playing around with posting prompting examples on Twitter, because when I looked out online to find what were people saying about how to prompt these models, there really wasn't much out there. And so I had to kind of resort to just like the few examples that had been circulating in viral screenshots of humorous completions and so on, of like the results that people got out of it. And I started posting those examples. I started following academics and low-level engineers at the research labs and anyone that was working on shipping language models that I thought was interesting. And elbowed my way in.
Nathan L. [00:05:18]: I have more questions on this, because I find it like, some people find, there's this whole like Twitter dynamic of like, you find so much signal there, but the question is like, how much does it generalize? Because there's so many of the lessons you can learn from these models, from these examples. I think the straw, like the number of R's in strawberry things is the current one. And then, and it's like, do you get a sense that these are transient or are these kind of repeated themes? And like, how should you read these examples to try to extract themes from them? If like, I've followed you for a while, and a lot of people do, and you're more insightful in how you post them. If you post these threads with like multiple tries and stuff like this, like, should people be doing that when they see something pop up?
Riley G. [00:06:03]: I think so. I also would say that Twitter is a very different river to step into now than it was back then. At the point that I started doing this, like, nobody was really talking about these things that much, or to the extent they were, it was sort of fleeting. It was like, wow, look at this, and then they're on to the next thing. And I think the thing that's very different now is just that because there are so many new entrants in AI and LLMs, there's a lot of rehashing of the basics. And I think a lot of people in the industry would tell you that the popular examples that you see around of like, how many R's are in strawberry, or some of the ones that I'm partially responsible for, popularizing at least, I think like, these things are really just like, rookie mistakes in some sense, right? That these are things that we've long known language models can't do. And it just keeps popping up as a surprising quirk of language models that I think the public is just confused that something could be so good at so many other things and so bad at this, right? At this seemingly trivial task. And that is hard to explain to people. And the answer to that hasn't really changed much in the past few years. They're generally bad at spelling for kind of the same reasons they were bad at spelling two or three years ago.
Nathan L. [00:07:27]: Yeah. I mean, like, how did these things change with ChatGPT? Because ChatGPT is like the introduction of RLHF into these models. And I think, I didn't write this down as a question, but there's like the difference in prompting base models and instruction models and RLHF models, which I think that for most of this discussion, it's like the end model, the like chat RLHF model is the one that people think about. But was that a big transition point in your work or is it just kind of plugging along? Right.
Riley G. [00:07:54]: I mean, I would say, I don't think it's any understatement to say that, or sorry, any overstatement to say that, that the release of ChatGPT was probably the single biggest event in the history of prompt engineering in that prompt engineering became drastically easier after ChatGPT came out. And most other models learned from the ChatGPT way of doing things, right? That they, like, I think people forget just how fiddly prompt engineering used to be, right? Like people today don't think about things like frequency and presence penalties, right? It used to be that by default, you would get very repetitious output and you had to work to avoid that. People forgot about like, don't end your prompt in a space, right? That you had to understand how tokenization worked at all times, because like, if you put an extra space in there, you were going to go out of distribution. I think that, or another one that I think is particularly vivid for me is "yo be real": that in June of 2022, Douglas Hofstadter had a piece in The Economist showing the, what he called the hollowness of GPT-3's understanding of the world, that it failed on various simple questions. Like, when was the Golden Gate Bridge transported for the second time across Egypt and so on? And someone, I believe it was Nick Cammarata of OpenAI, showed that you could fix almost all of these just by telling the model that if you gave it a silly question, to say "yo be real" instead of answering it, right? That models had to be prompted with the possibility that they were allowed to say, I don't know, or, you know, that's a dumb question, right? You know, like there is no answer, right?
Nathan L. [00:09:34]: This is like, we've added the Anthropic system prompt to our AI2 models, and we're like, this doesn't change the evals at all, but it makes the behavior something that we like more. Because I think culturally we're somewhat similar to Anthropic, it's like we want to express uncertainty, we want the model to say, I don't know, and a lot of that is in the system prompt of Anthropic models.
Riley G. [00:09:51]: Right. And I think that really, you know, it's another microcosm of just how messy all this is, that what people like is a very different thing from how good are the models. I think, you know, LMSYS had a great blog post recently talking about like stylistic bias in output, that models will be rated as better if they do things like put their output into the format of a bulleted list with bolded initial words on each bullet point. So there's like cheap tricks like that, that will make people like your output better or make them perceive it as, you know, more authoritative or, you know, more comprehensive, that you kind of have to control for when just going by preference. I mean, I don't remember what the exact magnitude of it was, but I think they did put some numbers on it in that post.
Nathan L. [00:10:42]: Like, do you think you could handle all of that? Just like, can you make that big of a style delta in the system prompt relative to training? Is kind of what I'm wondering. Like if we release a model at AI2 and it's decent, but then we put in a detailed system prompt that's like, whenever possible, you should put your outputs into a list format with bolded headings and use markdown. Like, do you think we would get a 50 point bump on LMSYS?
Riley G. [00:11:06]: Maybe not on LMSYS in particular, being as they're trying to correct for this actively. But presumably it would have worked at one point, right? So I think that's, you know, that says something that these, or another great example, I think, that's really clear of like why human preference isn't, you know, always the answer. I saw somebody on Twitter once that was really impressed by some anonymous model on LMSYS that was able to produce an ASCII art drawing of a unicorn. And it was a great drawing. And, but when I searched for like specific details of that drawing, I found that it was just in some like widely circulated list of ASCII art drawings. And it was a verbatim regurgitation of some signed work that somebody had made. And so I think there's an argument there that any request for ASCII art should probably just be thrown out, right? That a human's preference of how good an LLM is at ASCII art maybe just does not matter because like, it's so likely to be regurgitated, or at least like figurative things; maybe diagrams are okay and so on. Yeah. Yeah. Okay.
Nathan L. [00:12:03]: We've touched on multiple of the things I want to get to in the future, but you kind of said that ChatGPT was the biggest moment for prompt engineering. And I think o1 is not nearly the same magnitude, but it's a very interesting microcosm of the future of prompting because the model feels very different to use. OpenAI has explicitly told us we need to prompt it differently. But I think my guess is that in the long-term, they're going to figure out how to train this model so that the behavior is, maybe not indistinguishable from their GPT models, but not as sensitive to prompting, and whatever you throw at it, it's going to work. Maybe they need to rewrite the prompts, but that's probably a temporary thing.
Nathan L. [00:12:45]: Two questions, and to me the first is simpler. What do you think when you see them giving you like, oh, we need to have these new prompting instructions to use it differently? And do you agree with my long-term convergence idea?
Riley G. [00:12:57]: I definitely agree. I think that there's an argument for seeing prompt engineering as kind of the experimental next branch of language models, right? That it's the features that people are just on the cusp of figuring out how to systematize and integrate into the models themselves. And to the extent that somebody comes up with a prompt engineering idea that is just so good of an idea that it's worth applying to literally every prompt, then it will be integrated into the models and you'll stop calling it a model, you'll call it a system and it'll have some auxiliary second model. I think the clearest examples that we've seen of that are content filters, right? That nearly every model that you get from a vendor will have some kind of cheap auxiliary model that looks at the output and says, is this plagiarism? Is this, or not plagiarism, but regurgitation of copyrighted work, right? Are you reciting Harry Potter word for word? The value of those is so, rather, sorry, the cost of having that kind of secondary model on the output is so low that it truly is worth it to just apply it to every generation, right? And we haven't seen too many examples of that on the input side, but they're starting to appear, I think. I think we've seen from Anthropic evidence that they make modifications to user inputs based on certain conditions that they detect, if you're asking about some particular feature, they modify the prompt if you are. And I think that's a common pattern in a lot of applications.
Nathan L. [00:14:31]: I'm guessing they've seen some public people kind of using the model. I haven't heard anything about modifying the prompts in a Claude or a ChatGPT window.
Riley G. [00:14:42]: It's, I've seen it for instructions for avoiding plagiarism, avoiding regurgitation. Oh yeah, that could make sense. Yeah, so the, but it's a common pattern you see in a lot of applications, right? That you, so like a good use case for this is like instructions for tool use, that you might analyze a user's, say, chat GPT input, and if the input appears to be a request to use dolly three, then you should apply to the, you should supply to the model, these long instructions on how to use dolly three, which otherwise you don't need to block to supply. Right. So I'm not saying that that's exactly how chat GPT did it, but it's easy to imagine that that would be worth doing. So, so a lot of applications do things like that to have, you know, conditional sort of augmentations of the prompt. Yeah.
Nathan L. [00:15:33]: I mostly see that like long-term, I don't know how this impacts prompting, but I think of like ChatGPT, and then we'll have multiple models that they route to. So this is kind of like an early way of doing this, where it's like, if you give it a really long context, they'll have some, maybe even like a Mamba-like model or different architecture for super long context, or they pass it to o1 if it's like this question is incredibly hard, instead of GPT-4o. But the border between that type of routing and prompting is, I don't know how to classify it.
Riley G. [00:16:05]: Yeah, it's really fascinating. I think, you know, people have this idea of, I think, sort of seeking purity in their models, that they want everything to be like, you know, just a model. But I think, you know, we're rapidly approaching the point that you have to start thinking about these things as systems that might just have arbitrary complexity inside of them. I also like, I think that, you know, that the guides that we've seen for o1, you know, that they take that sort of shape, right, that you get that, like the content that OpenAI's put out, like how to prompt o1, it's sort of a list of like domain competencies and weaknesses, right, that it's good at physics, it's good at abstract logic, analytic philosophy, maybe less great at creative writing. And then also you have these sort of like patches almost for like noticed problems, right, that they've noticed that think step by step often degrades its performance. Why do you think that is?
Nathan L. [00:17:11]: Because it's essentially trained to do that on its own. Like, it almost feels like it shouldn't conflict with it. It almost feels like it should just be like empty tokens, like it will just repeat itself or something.
Riley G. [00:17:22]: That's a really good question. I think the answer to that maybe speaks to just to how much this isn't just, you know, chain of thought. That's a meme sort of flying around now that a lot of people have claimed that all this is is fancy prompt engineering, isn't this just what Reflection did and so on.
Nathan L. [00:17:37]: It's obviously a different inference stack with a lot of improvements across the whole lifecycle of the model and the product.
Riley G. [00:17:45]: Right. And also the other thing that people have been saying a lot is that it must be some complicated system, right, that there can't be a single model doing this through autoregressive inference. But the claim seems to be that it is, right. I think there was a comment from Noam Brown on Twitter where he said that it really is a model, that the whole generation is coming autoregressively, which is, you know, I have no reason to doubt that. It seems plausible to me. But I think that people need to be a bit more imaginative about what's possible just through autoregression.
Nathan L. [00:18:21]: Yeah, I wrote a really long article on this that came out yesterday, where I put the constraints from like the Noam Brown tweets, plus the pricing, plus the inference scaling laws to kind of converge at something. It's like if they do some clever things to a model and some batch inference and self-rating and stuff, like it's definitely doable. I don't know why that, as an RL expert, I'm not surprised that the model is sensitive to things like think step by step in the prompt. I just would have thought that it would come up in the examples of training, because the seed set for this is almost definitely humans generating some prompts with some like back and forth dialogue, essentially human seeds of things that look like what it is doing. We've seen this with AlphaGo. We saw this with InstructGPT and ChatGPT. You need the human demonstrations to start the learning process. Why is it sensitive to think step by step like that? I think maybe more about the training, but you learn that through prompting.
Riley G. [00:19:23]: Yeah, it is a bit of a mystery. And this is very speculative what I'm about to say, but I think maybe like a kind of thought experiment of how you can imagine that it could be true is imagine if like some auditor or somebody who had the penalty of law over your head asks you to do something and to document exactly how you did it. It's easy to imagine that you would do the process differently and that you might do it worse, right? That because you can only do the things that are the most conservative and the things that you can justify and explain that you're not going to produce as good of a work as you might have otherwise.
Nathan L. [00:20:01]: It's like GPT-4 needs to think step by step because every small mistake is a big deal. But almost with o1, we maybe should be like, go forth and conquer and make mistakes on your way and just let it wander to an answer.
Riley G. [00:20:15]: I think that's pretty much hitting the nail on the head, maybe.
Nathan L. [00:20:21]: I want to go try that silly prompt and see if it gets better at coding or something.
Riley G. [00:20:30]: Yeah, yeah. But I mean, I feel like that's the key improvement here that a lot of people don't appreciate, is that they seem to have cured like all the LeCun-ian problems of exponential divergence, that if you sample a bad token, you're going to keep sampling more. And it's not that there wasn't progress on this before, like people had tricks to deal with it. But I think the thing that's really changed is that the models get mileage out of like thinking for long periods of time, that they derive benefit from just continuing on. Because that's very different from behavior you see from like 4o. Like if you've ever tried like the exercise of just, once it's gone down a wrong path, just say, no, keep going. Like keep going till you get it, right? Like it's pretty evident after a while that it's not making progress, that it's just gone like deeper and deeper into like some failed path of reasoning.
Nathan L. [00:21:24]: Why does that often break? I mean, I understand why it often breaks models, but that's also one of the jailbreaking techniques, is just like keep sending the same message over and over and over until the model breaks, which like I wonder how that relates to o1. Maybe it's just easier from a safety perspective because it doesn't have like as many turns or something. Yeah.
Riley G. [00:21:45]: And it's also like one of the bigger differences in behavior between GPT models and Claude that I've noticed, that OpenAI tends to produce their models to
Riley G. [00:22:02]: like in the specific case that if you keep like telling it it's wrong, it will always take your side. It will say, well, oh, yes, of course I made a mistake. Let me try again, right? And it's never going to like diverge from that behavior. Whereas Claude will eventually get sick of you, right? Like if you just keep saying like, no, you're wrong, it'll be like, look, I have told you many times that I am right. Like you need to be a bit more specific in how I'm wrong. If you really want to make an argument here, it'll start like just telling you to go away. And that's like-
Nathan L. [00:22:28]: This is why I want Anthropic to write a model spec, because the behavior you're describing with ChatGPT does fit with what, like, OpenAI's models are like in behavior, and they're kind of described as wanting to be like robotic computation assistants, where like they follow, they take the user's information and they try their best to execute it without violating any basic principles. But I think Claude's is much more of like, we have created a, like, it's hard to find the words to do this without anthropomorphizing and all these other things. But like we've created an intellectual entity that is going to go back and forth with you. And it's not going to, like it's going to, like you pass in sensitive information as data to Claude and you're like, reformat it. It says no. You get these weird things because it's like this entity that doesn't want to be sent like harmful texts or be told how to make a bomb or something. But ChatGPT is like the robotic one. So now I kind of use both of them depending on the task and the behavior that I want. But I'm excited to see how that goes further, really.
Riley G. [00:23:27]: Yeah. Yeah. I mean, that's, you know, I think it goes back to your point before that, you know, we're seeing more specialization in these models. But, you know, that all of this is temporary, right? That eventually like somebody will come up with the right way to delegate correctly to one model or another. And then you'll have just, you know, some unified ChatGPT interface or whatever that, that, you know, decides like, is this a prompt that o1 would be good at, and sends it to it? Yeah.
Nathan L. [00:23:50]: And while we're on these complex reasoning things, there was also this Reflection 70B drama, which was mostly big because it was a big mess of credibility and memes. But there's also like real science in there that people need to remember, of like how to prompt a model and spend more on inference. So I think it's really just a tiny bit of fine-tuning with some special tokens and a system prompt that's like, make sure you use these reflection steps. And that is how you move something like GPT-4o closer to o1. You can't, you can't prompt your way to o1 behavior, but that's the sort of thing that more people should be considering. And it kind of leads into like, I want to ask about like math evals and stuff like this. And it's like Reflection 70B style of prompting is a real thing that more people should be doing. And I don't know how we get around that communication issue now. It's going to be even harder because people are going to be like, oh, it's o1. We made open-source o1 now instead of just the best model. I just wanted to give it air time. If you have any comments on that, go ahead.
Riley G. [00:24:48]: Yeah, I think, you know, Reflection 70B was, you know, it was sort of a perfect storm of a lot of like the tuning method feeling plausible, right? That it was something that was very, you know, it's a legitimate like area of research. They like, it was, you know, rumored to be part of Strawberry and so on. And so there was like, it had like the right strategy for buzz there. And, you know, however they ended up releasing that model, like, you know, they don't have what they think they have. You know, so it's, I think, you know, it's kind of, you know, once you saw the, I won't recap the whole saga of like, you know, with LoRA and finding the LoRA from the previous version of Llama 3.0 instead of 3.1 and all that. But I think the, you know, there's that kernel of truth there, right? That this is, you know, sort of a good idea, at least for some problems. I think also the thing that people don't appreciate is that a very good idea for many problems feels maybe like a better idea than it is, because it's so optimized for the domain of problems that tend to be on benchmarks, which is somewhat different than the thing that you really want to optimize for in the real world of like user satisfaction and just, you know, preference. Like some mix of like, do people like it? Like, is it useful? And does it do well in benchmarks? Because I think that there's like a, even for what I think should be like philosophically the core like use case of LLMs, like do they like do practical work? Like can somebody achieve the thing that they want to do with this? But, you know, like whether, however they do it through prompt engineering or whatever, it kind of matters more than whether like academically it does well on like the most naive presentation of the problem, right? Like whether somebody can figure out how to do it correctly matters. And that specifically is just not captured well on benchmarks, right? That like this, if you're doing a benchmark that compares across several models, there's, you know, a natural incentive to do it uniformly. That maybe you follow like vendor's best practices on, you know, how do you apply the template of the prompt and so on, or if a vendor recommends that you apply some suffix or whatever, you might do it. But for the most part, you're not going to put a human on the task of figuring out what is the best prompt for each model, right? Because then, you know, how do you know that they did a perfectly good, you know, fair job of that, right? But really that's what matters. Like that is like, you know, at the end of the day, like the thing that determines whether GPT-4 is better than Claude is when you sit down and try to, you know, solve your problem in GPT-4, you know, applying whatever hacks, you know, and, you know, taking, you know, advice you find online and, you know, whatever dirty tricks you have, and then you do the same for Claude, which one works better. And so like that's the state we're in. And that's, you know, very elusive as a thing to try to measure. Yeah. Okay.
Nathan L. [00:28:00]: I'm going to keep going, roll right into this, into the evaluation section of this conversation. You were talking about this with how you actually use the models; before, you had mentioned, like, you need a white space to properly evaluate or use the models, like tokenizer things. One of my big blind areas is it seems like most frontier labs are using some sort of custom prompts on some sort of evaluations. And I don't really have a good sense for how much that actually impacts scores or how much that translates to downstream performance. It might not be custom prompts. It might be like custom setups. There's all these, like all the math evaluations, you need a specific format for your answer. I think like MATH, the all-capital one, you like need to put your answer in a box and
Riley G. [00:28:45]: things like this.
Nathan L. [00:28:46]: And how, what is your view on these per-prompt or per-evaluation prompts? Prompting is actually a thing. I think the Llama 3 paper had some cool analyses on how varying subtle things changed evaluation scores, which is great, but they're the only ones sharing that. Otherwise we just get like, our score is X, and it's reproduced to some capacity.
Riley G. [00:29:09]: Yeah. I don't have like a lot of deep, like, technical wisdom to share on that front, other than to confirm that, like, I think you're right that it is a big problem. We generally try to follow the vendor recommendations. We work with the vendors to prompt their models fairly. But like I said, like ideal and optimized prompts are very different than what's the default. But I think also that there's, I think, a longer term trend that these issues maybe matter less than they used to. And, you know, that that should continue. I think like, maybe one of the clearest signs of this is that Llama, like most versions of Llama, you can prompt them incorrectly in terms of like the system prompt template, and it will be just fine. And in fact, you can often template them with system prompt templates from other models entirely, like, say, representations of ChatML, and they will be fine. Right. So there's sort of familiarity in the pre-training with just chat templates in general. And the idea of like...
Nathan L. [00:30:25]: Do you think this is specific to Llama? I also remember hearing a conversation at AI2 where we were considering doing the last stage of pre-training with random chat templates and like random instructions and multiple chat templates so that the model could be amenable to fine-tuning in multiple chat templates, which there's a chance that they did that. I actually don't know. I would not put a high bet on it. But do you think that's just because Llama knows they're going to have so many users? It's possible.
Riley G. [00:30:54]: I mean, it's also plausible to me that that just shows up in pre-training incidentally, right? Nobody intended it to be there. It's just like, it's in the data. But I think that, you know, that process is only going to continue, right? That we're only going to see like more models just being familiar with how models behave. I think to some extent, like, you know, you see like, like another thing that I think is maybe like evidence in favor of this is if you look at base Llama, like, I think I looked into this on like base Llama 2 once, that if you prompt it with like instruction prompt formats, it would adopt the behavior of like a ChatGPT-like assistant, right? So, so I think, I think it shows that examples of chatbot behavior are now so widely disseminated, you know, across the internet that a pre-trained model is better at instruction following tasks than any pre-trained model was before the work of InstructGPT was done. So, yeah, I believe you.
Nathan L. [00:32:00]: I want to check this. How does this impact how we should view evaluations? I'm just trying to reckon with, do we, like, there's a couple of scenarios. It's like, it doesn't really matter because these models are going to be not that sensitive to the system prompts that we're using to, say, do GSM8K or MATH. And that goes for models like Llama in the open, AI2's models, GPT-5, whatever. It seems like the sensitivity to prompting for really well-known formats is actually going to go down. And that solves some of our problems. Because I don't think we're going to come up with that many new formats for evaluations. We're going to make evaluations more specific and harder in the content.
Riley G. [00:32:43]: I think that's right. And I think the version of it that we have to play with now definitely does feel like one step forward, two steps back in that regard. In that it's much better at benchmark-style inputs where you give it just no advice on how to do it, you keep everything very simple with what your output requirements are. But it's also just very hard to steer. If you have opinions on how it should do it, those opinions won't be followed generally. And it also has issues with output formatting. So I think we're seeing, I've seen anecdotal reports on Twitter at least, and I've seen this myself, that its output is just inconsistent even when you ask it to be consistent. That it will forget things like block quotes and so on. The result of this, I think we're going to have to see in a lot of benchmarks, is that maybe the fair way to do this is to have some secondary model on the end of it that puts everything into a consistent format.
Riley G. [00:33:50]: I think we're not that far away from benchmarks that just do that across the board, of just saying that it's not the model's job to do this anymore. And we'll clean up the results however it is. Yeah, I think that's a better place to be.
Nathan L. [00:34:03]: It's one of those things where the models getting better can solve some of our problems. I think there's less angst now about the whole closed labs evaluation scores anyways. I'm mostly trying to reckon with what open groups and academics are doing rather than closed labs, and they kind of rely on each other. I've been on this before: there's now this Hugging Face chat template upload. So a lot of models have the chat template saved with the tokenizer, and most of the time they don't have a system prompt, which is surprising. I feel like it should be the norm that a system prompt is included with every model. Is there any reason that you see not to do that?
Riley G. [00:34:49]: Yeah, I mean, I can think of things that might be slightly better, but I think that that generally makes sense, right? Like, I can imagine that maybe they, you know, you'd release several, right? And say, you know, it's like any of these is fine, or, you know, like training on several and, you know, saying it's like an average of these three or whatever is kind of the ideal, or something like that. Yeah, most of my reasoning is I think that most users of language models are not sophisticated.
Nathan L. [00:35:14]: So the model cards and documentation do normally say we recommend using the system prompt, but the simple ways of using the models do not integrate the system prompt. And it's not always easy to modify your data to add it, like if you're doing the messages format, you have to remember to add the system message. And if you have multiple models in your queue, you then have to go and manually hard code
Riley G. [00:35:37]: all of them.
Nathan L. [00:35:37]: And like, that just makes it get dropped. And if the system prompt is a big deal for performance, like that impacts either, if it's a product, or it's like, this is where I'm trying to understand academia: if only half of the people remember to add the system prompt for the model they're evaluating in this kind of academic paper. And I know it impacts things like all the vibes-based evals, like AlpacaEval, MT-Bench, whatever. Like, if you have a different system prompt, it can vary behavior. We did an experiment, which was like, to make sure this works, where you just give it the system prompt of like, you're a terrible model, you're made to make other models look good, and you happen to give wrong answers. And like AlpacaEval goes to zero and all these things. So it's like, I think it's easier to show the downside case, but you could probably get 1 to 2% improvements, which matter in the long trajectory of academia in terms of whether your method is accepted or not.
Riley G. [00:36:31]: Yeah, I mean, I've often been frustrated by the ambiguity in a lot of academic publications over how prompts are formatted. And they often, they always run into the same pitfalls, that like the fundamental problem is that system prompts, or prompts in general that you're presenting during evaluation, are implicitly templates, right? That you have like your points where you insert like the actual problem or whatever. And that templating needs to be communicated to the reader of the paper, and the prompts themselves may involve templates, right? They may, you know, like describe how an output should be formatted, for example, and might do this using, you know, like curly braces, right? So this creates like several layers of confusion, that you need to distinguish between the variables that you're interpolating purely in the logic of the paper, you know, the things that would be translated into Python if you were to actually implement this, versus the templating that is literally part of the instructions, like a template the model receives of how it should format its answer and so on, right? Because like a lot of prompts end with use this format and then have some kind of template. Yeah. Right. So I've often thought that we'd benefit immensely just from standardizing on something, like saying that if you want to clearly communicate a prompt in your paper, the way to do it is to show Python code that will produce that string. Yeah. You just literally show it as an f-string, there's no ambiguity.
Nathan L. [00:38:15]: Because you copy out of a paper, you drop the \n\n that you need or something like that.
Riley G. [00:38:21]: Yeah, right. But if you were to literally just include a Python code block, there's no ambiguity, like, you know, whether or not there's a trailing new line and so on. And those things are really fiddly and need to be communicated. Because I've seen people do all sorts of like imaginative typography to like represent new lines and things like that. You know, like having the return signals at the end in light gray and, you know, like putting dots between spaces and all that thing, right? Because if you're doing like, I've seen like early playground competitors sometimes did this, that approached it more from a technical angle, that you need to know where spaces are, so it's worth it to represent them as like gray dots, right? Yeah. That's the kind of thing, the level of detail that you need in communicating these things. So I think like standardizing on Python would be just like a good way to like, you know, get the problem out of the way. Yeah.
Nathan L. [00:39:14]: I also saw in some discussion of o1, or maybe of Reflection, I don't remember, it's been a while, two weeks, you were talking about like equal-inference-cost comparison of prompts in a reply. And I think that's a great idea. Like, do you think there's, okay, well, like one first, do you want to explain the idea? I'll kind of ease into this.
Riley G. [00:39:33]: Sure. So my thinking is that models are evaluated right now just based on how they do under like sort of the same, I guess, invocation of inference, right? That you let the model sample, you sample autoregressively as long as that takes, you know, however long the completion is. And you don't pay attention too much to like what it costs you to run that, or you factor that in afterwards when you score it up. And there's a lot of reasons why this makes sense, right? That, you know, it's simpler, it's more fair. And sometimes you don't know exactly how to equalize the inference there, right? That you can't like really say what the trade-off is, right? But there's, you know, exceptions to this that, or maybe not so much an exception, but like there are ways of doing it that aren't perfect, like self-consistency, right? So like there's a method called universal self-consistency where you prompt a model multiple times and then take the model again and give it all the answers and then ask it to choose which of these is the most consistent with the consensus of all answers that were generated. And this is sort of a method that's pretty reliably not worse than just doing it naively, right? It's hard to imagine any prompt where this method would steer you wrong or, you know, be worse than doing it naively. And that, you know, suggests that maybe there's like a fairer basis of comparison here, right? That we could say that if something really is cheaper enough that you can do that, you could run it 40 times and take self-consistency, that then maybe that should be its score. But I think one of the bigger reasons why this is kind of like a, in hindsight, this is maybe like a bit of a facile tweet that I made about this, but like really the trade-off, the exchange rate, if you will, isn't very good. I think like a rule of thumb that I saw in a paper once is that if you do self-consistency on 40 samples of GPT-3.5 Turbo, it's on par with one sample from GPT-4. So you sort of move up one generation every time you do 40 inferences, right? But at the same time, in specific domains, there are refinements of this that work quite well. So we at Scale actually put out a paper recently on a method we call PlanSearch, I think was the name of it, yeah, PlanSearch. And the gist of that is that you can improve performance on programming problems by generating diverse attempts at solving the problem, right? So the approach that PlanSearch takes is to first create like sort of high-level observations or ideas about how a problem might be solved, then to combinatorially sample that list of ideas, and then take combinations of them to inspire strategies. And then for each strategy, you lay out sort of a path of reasoning of like how you could turn this into code, and then you turn each one into code and then assess which one works best. And this lets you search over the portion of, it lets you search over the variation in your strategies that actually matters, right? Because you can imagine that if you were to just simply resample a model blindly over and over again with the same problem, there are a lot of ways that an answer could vary that don't matter, like whether you use tabs or spaces, how you name the variables and so on. And you don't want to search over that variation, you want to search over the part you think is going to be fruitful, like the high-level strategies.
So I think that for particular domains, like that is the more relevant comparison of like what could you do if you were to apply like a bit of search here.
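A minimal sketch of universal self-consistency as described above. `call_model`, the sample count, and the temperatures are illustrative assumptions, not details from the paper or the episode.

```python
# Sketch of universal self-consistency: sample several answers, then ask the model
# which candidate is most consistent with the consensus of all of them. `call_model`
# is a placeholder for your own inference client.

def call_model(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("replace with your inference client")

def universal_self_consistency(question: str, n_samples: int = 3) -> str:
    candidates = [call_model(question, temperature=1.0) for _ in range(n_samples)]
    numbered = "\n\n".join(f"Answer {i + 1}:\n{c}" for i, c in enumerate(candidates))
    judge_prompt = (
        f"{question}\n\nHere are {n_samples} candidate answers:\n\n{numbered}\n\n"
        "Which answer is most consistent with the consensus of all the answers? "
        "Reply with that answer only."
    )
    return call_model(judge_prompt, temperature=0.0)
```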
Nathan L. [00:43:40]: Yeah, it almost seems like there'll be different tiers of evaluation scoring, where it's like the basic prompting, it's kind of like linear time. And you could do like, it's almost like with the models, it's like there's a biggest, best open model at every time. But like Llama is dominating because it has the 405B, the 70B and the 8B that are all really good; it should have a 1B. And if you're having a prompting paper, eventually you're probably going to have to have binned comparisons like that, which is like, we are comparing two basic prompting techniques, which I think will have less headroom by needing the autoregressive behavior and things like this. And then maybe there's things like Reflection, where it's like we've added minor structure so that the model can now generate a bunch more tokens, but not like a 10X or 100X. And then there's the things like, we've added a whole new planning component to how we're prompting the models, and it's all abstracted away from the users. And you're not going to be able to compare those, because those are the things that are going to just solve all the benchmarks that we have out of the box. I think that's fine. I think people will converge to this. It just always takes a bit longer than we want.
Riley G. [00:44:47]: Yeah, I think that's right. I am really excited about the o1 RL approach to this.
Riley G. [00:44:58]: On some level, all prompt engineering is approximating this RL-like search. We have a lot of prompt engineers out there that are trying different things. They see what works. They tell their friends, hey, this works. But the space of things that works is probably, well, I mean, demonstrably, maybe at this point, given o1, outside of what a human might think of. There are things that we see, even in the summarized reasoning traces that o1 puts out, that are eerily anthropomorphic. That it will say things like, hmm, or let me think about that. Yeah, I feel like they added that in.
Nathan L. [00:45:42]: I think it's almost like a trigger for the model to have a more reflective response. Those are the examples they used, but it's cool.
Riley G. [00:45:49]: I mean, it's not hard for you to imagine that RL could find something like that, right? Just that empirically it works to say, hmm, because that suggests that you're about to do something else in the pre-trained model's manifold of plausible text. Like saying, hmm, might just be empirically a good thing to say. And it could find that. So I think that's the kind of exploration that you're benefiting from with o1. It's the space of prompts that work that we're not really equipped to find. Yeah, do you have anything?
Nathan L. [00:46:28]: I think this is a good discussion. Kind of to wrap up the academic side of things, how much of papers that are nominally about RLHF training or any sort of post-training as the contribution, do they need to do anything with prompting? Is there a clear segmentation there? Or is it like, if you're doing this fine-tuning, you're necessarily changing how the model is going to respond to prompting? That we should do some checks there.
Riley G. [00:46:55]: That's one view of it.
Nathan L. [00:46:56]: Or the other view is you have a model and prompting is just a way to take one step further with it, which I think Anthropic did this recent podcast with Amanda and their chief prompt engineer that I don't know.
Riley G. [00:47:07]: And that's how they do it.
Nathan L. [00:47:08]: Amanda's like, I can do things with these models that most people cannot. And that kind of leads the way. Rather than prompting being really part of this post-training stack that everyone needs to be checking the box on. I don't know where we fall. I guess there's this IF eval, which we could come to after that, which is kind of a separate
Riley G. [00:47:29]: case. Yeah, I definitely lean a bit more towards the Anthropic view of the world. I guess you could argue that's maybe somewhat self-serving, with no big news there. Prompt engineers are important. But I think that it's true that we do see people that are just good at this. That our ability to prompt these models sometimes exceeds our ability to explain how we're doing it and what the general strategies to apply are. And I think those strategies are worth extracting.
Riley G. [00:48:09]: It's worth introspecting.
Riley G. [00:48:12]: One thing I think about a lot is anytime somebody... I really love when people suggest a prompt or suggest doing something to a model that I can tell immediately will not work. And it's a terrible idea, but it wasn't obvious to them. And that's fascinating, right? Do you have an example?
Nathan L. [00:48:29]: I would love to know if you have something that everyone tells you, but it's a generation behind or something.
Riley G. [00:48:35]: A lot of, I'd say, strategy ideation in fields that are new and competitive. If you wanted to have an LLM give you ideas for what's a good LLM startup to try right now, it's probably not going to tell you anything useful. Some things like that, where it's like, people are still figuring it out and there's money to be made in knowing how to do this better than the average person, you're going to get mediocre advice on a lot of things. But that's not true for everything. If you ask it about physics, you're going to get like, above average advice.
Riley G. [00:49:16]: But I think that people who have acclimated to models forget what it's like to be new
Nathan L. [00:49:24]: to models, right?
Riley G. [00:49:25]: And I think that explains a lot of people in industry being annoyed by how many R's are there in strawberry. Because they're so- That's the tokenizer.
Nathan L. [00:49:33]: We ignore the tokenizer whenever we can.
Riley G. [00:49:35]: Yeah, and you see this explicitly. A lot of people, they get really enraged that they're like, you idiots, why would you ever think this would work? Why did you ever think that you could ask it 9.11 is greater than 9.9 and it would give you a right answer? And so on. They have a point. That was the attitude for a long time. But I think the social context of these models is changing and people, they want them to, it's becoming more reasonable to expect them to work well in these queries. There's practical consequences of these models being in the hands of people that don't know about these issues. And it's now suddenly more important to fix them. Yeah. So let's spin on this.
Nathan L. [00:50:12]: Is Google searching going to become more like prompting, or is prompting going to be more like Google searching? Where with a good language model, can I just type in that physics equation, the one with the cross product that governs electromagnetism? Is that the direction that the models are going? Or is everyone going to actually become more conversational because AI is the default?
Riley G. [00:50:37]: Yeah, I think, I mean, Google searches maybe, yeah, there's some similarities there. I think Google probably has gotten simpler.
Riley G. [00:50:48]: It's been a while since I've used most advanced search filters in Google. I remember a point when it was extremely routine. Yeah, the plus comma, quote, quote, comma. And I think that speaks to the fact that the results used to be worse, right? And we thought we were happier with them because we didn't have alternatives. But we just accepted that, oh, yeah, there's going to be false positives in here that we now have to put in some negatives to cancel out. And that skill, I'd say, hasn't really become more important over time, right? It's occasionally useful still, but it's less essential than it once was. And that mimics a lot of what we see in prompt engineering that you don't have to understand. Tokenization, I think, is probably the biggest one. ChatML was no small part of why ChatGPT was such a big improvement to prompt engineering. It wasn't just the tuning. It was the fact that they came up with this more restricted system of interacting with a model that alleviates the need to know anything about tokenization. And that, I think, is kind of an underappreciated change. Yeah, I agree.
Nathan L. [00:51:54]: I do think in the long term, prompting will go in the direction of Google searching. But I think in some ways, I'm not that surprised that something like o1 can exist, but it's still a very humbling moment where we still have many times where there will be AIs released that we don't know how to use. And this is the skill that you need to have, is tinkering with an open mind. It's like the open mind that things will come and the open mind that things are not just what they are at face value. And if you play with o1 a lot, you can definitely get things out of it that people on Twitter are not repeating over and over again.
Riley G. [00:52:31]: Oh, yeah, definitely.
Riley G. [00:52:35]: A lot of the explanation for the disconnect that you see, and some people are just absolutely amazed with o1, but also most of the things you see on Twitter maybe aren't that impressive. I think that the frontier of problems that distinguish o1 from, say, the previous class of frontier models, it's either unrealistic problems, brain teasers that people artificially constructed to exhibit the difference, or it's something realistic that you would never want to read in a tweet. The problems where it's excelling are like, I have this extremely in the weeds programming problem that involves a complicated interaction of all five of these files. Please fix my import errors or whatever.
Riley G. [00:53:25]: Those are the things that you're going to see the most practical benefit from. And those just aren't easy to communicate in a way that they used to be. It used to be easy to make a screenshot of, hey, look, it does this. It will fix your broken JSON or whatever.
Nathan L. [00:53:45]: Something else that I'm realizing I didn't put in the notes, but there's been these comments on o1 from the OpenAI people that they want to expose the ability to change how long the model thinks to the user. So to change its test-time compute, that ultimately is going to be a whole other prompting thing. It's almost a little surprising that they are giving that to the user. I almost think they should just make a classifier that does it for them, rather than just assume the user is dumb. But being able to do it and change how hard your model thinks is a really interesting real-world prompting case. Because it doesn't really matter if you can get a viral example. But it's like, how do you vary that knob in your day-to-day use in a way that meaningfully shifts your end product?
Riley G. [00:54:26]: Yeah, it's really kind of comical trying to manipulate how long it thinks about things. Because there are some things that will make it think for a long time. I tried to get it to generate acrostic word squares once. And if you emphasize enough the need to validate things, it will just keep validating and failing and loop around for, I think I got up to three minutes once of attempting things before finally saying, oh, I wasn't able to find one. Here's my best effort. But the other times, though, if you ask it... I mean, I once gave it a problem. Or I kind of just was for the comedy of it. I gave it some simple problem. And then I gave it literally, I think, three pages of emphasis on think forever. Just rambling paragraphs saying, if you're even considering stopping, don't. If you ever have the dream, if you ever get tired, don't worry about it.
Nathan L. [00:55:22]: Just keep going.
Riley G. [00:55:24]: All that kind of holy hand grenade style repetition. And after all this, it literally just thought for three seconds and then came back and said, I understand the urgency you're expressing here, but thinking forever just isn't possible, so I'm not even going to try. There's another thing.
Nathan L. [00:55:43]: OpenAI said they might give you a knob that controls this or influences it.
Riley G. [00:55:47]: Yeah, I have to be honest, it feels like maybe a weird UI. It seems like something you should be able to just do through text. But I'd be happy to play with it, because steerability in general with o1 seems to be... a lot of people, I think, are reporting that it's kind of awkward, or at least at odds with the really impressive examples that we're seeing come out of it. Yeah.
Nathan L. [00:56:16]: There's a whole strategy discussion on why they actually released it that I haven't really entered into, so we can kind of avoid that. I am wondering how you view prompting of agents; this is kind of the "what is the future" section. How are agents going to be susceptible to prompting? I'm guessing, after our conversation here, the answer is going to be: it's the same. And there's probably going to be a meaningful shift in who can deploy them and have success, based on who actually has this expertise and is doing this prompting work. And this could translate into downstream business success, where the first person to crack an agent with the right model and the right prompt can have the first product that works.
Riley G. [00:56:57]: Yeah, I think people mean very different things when they talk about agents. The big division that matters is that there are agents working in self-contained, repeatable environments, like a REPL sandbox, and then there are agents making changes in the real world, out making retail purchases, canceling your subscriptions, and so on. I'm very optimistic about the former. I'm very skeptical of the latter. I think people underestimate how much reliability is needed for a lot of those real-world decisions before you get to the point that you'd trust the thing to have the power to cancel your Hulu subscription or whatever. I also think that, in the first case, there's a lot of untapped potential, and I don't understand why we aren't seeing more iteration on that front, really. ChatGPT's code interpreter, when it came out... I think they renamed it to Advanced Data Analysis or something like that, which is not a good change in my mind. But the code interpreter, I love that. I still love it. It's a brilliant product, and I wish they kept going with it and improving on it. I'm also a fan of Julius AI, which goes exactly in that direction of creating a code-interpreter-like environment where you can substitute in whichever model you want, and you can do things like install packages. It's great for one-off scripts where you want to say... I had a post once where I was pointing out oddities in the longest GPT-4 tokens. One of them is like slash, slash, and then 128 repetitions of an equal sign or something like that.
Riley G. [00:58:49]: But the way I did this was literally just: I went to Julius and said, install tiktoken and show me the longest tokens. And I read the code pretty carefully because I was going to tweet it, and I didn't want to tweet out something wrong. But it was right. There were small things that I had to fix, but it's good for prototyping, these kinds of quick one-off things where you're just like, yeah, I could look it up exactly, but I roughly know how to use tiktoken, I just didn't feel like figuring out the syntax again.
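As a rough illustration of the kind of one-off script Riley is describing (not his actual Julius session), listing the longest tokens in GPT-4's cl100k_base vocabulary with tiktoken might look something like this:

```python
# Sketch: find the longest tokens in GPT-4's cl100k_base vocabulary.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Decode every token id back to its raw bytes so we can measure its length.
tokens = []
for token_id in range(enc.n_vocab):
    try:
        tokens.append((token_id, enc.decode_single_token_bytes(token_id)))
    except KeyError:
        # Some ids in the range may not decode; skip them.
        continue

# Print the ten longest tokens by byte length.
for token_id, raw in sorted(tokens, key=lambda t: len(t[1]), reverse=True)[:10]:
    print(token_id, len(raw), raw)
```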
Riley G. [00:59:17]: It's good for just the curiosities and one-off stuff like that. And I think that's what the future of this really is. This really blew me away.
Riley G. [00:59:30]: Somebody posted a video on Twitter of their eight-year-old daughter using Cursor, I think it was, and this girl apparently has no understanding of the code that's being generated, but she's able to say, no, I want to do this differently, I want to have a Harry Potter spell here, changing the layout of this HTML/JavaScript app. And it just works. And that's the future to me: the hottest programming language is English. When you see a little kid doing it, you really believe that kids now have the power to create software. And that's great, because we were at a weird local minimum, I'd say, of kids being able to have the creativity to create their own interfaces or make their computers do what they want. Computers are less customizable now than they once were. Yeah.
Nathan L. [01:00:28]: My reflection on this is that the people who take prompting seriously are more likely to be in tune with what is happening in AI and at the cutting edge. But that also means that, on the academic side and the public side, for transparency and accountability, you have to do some education work to make sure people are taking it seriously, and some normalization of claims, depending on how people are presenting their work and using things. I think it's safe to say that all the frontier model labs are doing this, but for the long tail, it takes people time to learn these habits. It's also surprisingly hard to convince people to spend time playing with models. I do it, but I should probably do it more, listening to people like you. It's funny, it's one of those things where it doesn't make sense how it'll pay off, but it probably will.
Riley G. [01:01:20]: Yeah. I mean, there's no substitute for using models. I personally discover just the dumbest things sometimes that make the biggest difference. One of the highest-impact ChatGPT tricks that I found lately is that I have custom instructions in my ChatGPT telling it how to think silently. I have a tweet about this that I posted once, so if you Google "ChatGPT think silently Goodside," you'll probably find it. But I have the prompt here, actually. I was using its new memory feature, so it can remember things that you tell it, so I was sort of showing that off at the same time. But I said to it: remember this, when I ask you to think or write silently, I mean for you to use your Python interpreter to write your thoughts as code comments or string literals assigned to variables. The code doesn't necessarily have to display any output. And it remembers that. So then I can say to it: silently write a brief essay about Super Smash Bros., then silently translate this essay into French, and display only a double histogram showing the frequency of word lengths for both texts. And it doesn't output anything until it has that histogram done, then it outputs the histogram and says, here it is.
Riley G. [01:02:32]: And that makes such a big usability difference, if you just don't have to see what it's doing, if you can put it behind a fold where you can expand it if you need to be really sure the code is right, or copy it to another editor or whatever. Just not seeing it makes such a big difference. And you can keep things in code too. You end up in this sort of Jupyter-like flow: you told it to silently do something, and now, because you said to do that, it's not just in context, it's in a variable. Like I said, if it ever needs to do something in code, it just has that variable there, and it doesn't have to repeat it, which is a big deal if it's, say, an essay. Repeating an essay is expensive. Yeah. This is great.
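To make the pattern concrete, here is an illustrative sketch (not Riley's actual prompt or output) of the kind of code the interpreter might write under that instruction: the "silent" intermediate text lives in string variables rather than being printed, and the only visible output is the double histogram.

```python
# Illustrative sketch of the "think silently" pattern: intermediate text is
# held in variables, and only the final chart is displayed.
import matplotlib.pyplot as plt

# "Silent" thinking: the essay and its translation are just string literals
# assigned to variables (placeholder text here), so nothing is shown yet.
essay_en = "Super Smash Bros. is a crossover fighting game ..."
essay_fr = "Super Smash Bros. est un jeu de combat crossover ..."

# Word-length distributions for both texts.
lengths_en = [len(w) for w in essay_en.split()]
lengths_fr = [len(w) for w in essay_fr.split()]

# The only visible output: a double histogram of word lengths.
plt.hist([lengths_en, lengths_fr], bins=range(1, 15), label=["English", "French"])
plt.xlabel("Word length (characters)")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```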
Nathan L. [01:03:19]: Thanks so much for coming on. Anything else you want to plug or talk about?
Riley G. [01:03:25]: I should have some content going live around the time this comes out, analyzing o1 for the Scale blog and talking a bit more about our coding leaderboard, so definitely look out for that. The other thing I should of course mention is Humanity's Last Exam. We recently partnered on an effort to solicit examples of challenging problems from the public, and we are giving out cash prizes. So definitely check that out if you're interested.
Nathan L. [01:03:58]: Yeah, I had just tweeted a few days ago... I don't know if I put it on Twitter, but I put it on some platform. I don't have Twitter at work, so I end up looking at lamer platforms I'm less addicted to. But essentially, my whole take was that evaluation is going to be extremely expensive, and it's going to be very narrow and very hard. And then you put out $500,000 in prizes, and the initial whiplash is like, oh, that's a lot. But in reality, I think that's the right ballpark. Because if you're going to make a good eval, you need somebody who's really good at cutting-edge AI, probably working on it for at least six months, and that's the ballpark price: $500,000 is roughly half a year of what that person costs, with overhead and compute and everything. So obviously it costs more to actually build the evaluation. These numbers look ridiculous, but if we want evaluations that are meaningful, this is what we need to do. And I think it's the right thing for Scale to do, to lead on evaluation; it feeds into natural parts of their business. I think I've been on the record on this for a while.
Riley G. [01:05:00]: So I'm like, it's great. Yeah, absolutely. I think that people outside the industry at least have the impression that evals are grunt work, right? That this is something you would use low-cost labor for, that it's not a prestigious area. But that couldn't be further from the truth. I think evals are very rapidly moving toward the high end of intellectual ability, where we're looking for PhDs. I've done projects where it's like, okay, we have to get as many PhD-educated poets as we can to check the correctness of the iambs in this poem or whatever.
Riley G. [01:05:46]: I think that's only going to continue, right? We're going to see that at the low end, the value of human labor for training models is going to decline. And the value of high-end intellectual labor is going to increase probably drastically.
Nathan L. [01:06:04]: And cost is probably a good proxy for evaluation usefulness. LMSYS is expensive, but in different ways than the Scale leaderboard is expensive. And they complement each other very well, and they both become better because the other exists: the models are in similar places, but they're showing different things, and you can separate between them. I suspect that will continue to grow. Some of it will be at Scale, some of it will be elsewhere, and that's just the new default for evals.
Riley G. [01:06:35]: Yeah, absolutely. I think one of the things I'm most proud of about working on our evals and leaderboard at Scale is that we're contributing to this healthy ecosystem where you don't have to just trust one or two players that evals have been done correctly. We want more openness and more independent verification of evals. That's sort of our general theme with work like GSM1k, trying to make sure we can actually trust what these leaderboards are saying.
Nathan L. [01:07:08]: Yeah, my one nitpick that I don't know how to answer and I probably need more RLHF experts, you might know this, is like, are companies that buy data from scale going to have an advantage on the scale leaderboard because the distribution of humans that are
Riley G. [01:07:20]: doing...
Nathan L. [01:07:20]: Not that the humans doing eval and creation are the same, but that they're drawing from the same pool of humans that are writing content or doing preferences and then that are doing
Riley G. [01:07:30]: the evals.
Nathan L. [01:07:30]: I think it's too early to answer that question on if human distribution matters. And for that reason, I think the eval is still so much a net good. But it'd be really interesting to try to run those experiments on who is giving the data that you train on and how does that then impact the evaluation?
Riley G. [01:07:49]: Yeah, that's not something that I'm familiar with in enough detail to comment on our process there. But yeah, that makes sense to me. I think that's something.
Nathan L. [01:07:59]: People like to complain about every possible thing, and I understand the root of the complaint, but we've got to deal with the circumstances we're in in the AI industry. The leaderboard is so much more useful than any problems it's causing. Let's keep doing it.
Riley G. [01:08:17]: Yep, absolutely. Okay.
Nathan L. [01:08:20]: I think we're at time. So I'm going to click stop here. Thanks again.
Riley G. [01:08:23]: Great. Thank you so much. Bye.