
Interviewing Louis Castricato of Synth Labs and Eleuther AI on RLHF, Gemini Drama, DPO, founding Carper AI, preference data, reward models, and everything in between

Interconnects


Navigating Reinforcement Learning: From PPO to DPO

This chapter explores the evolution of Reinforcement Learning from Human Feedback (RLHF), focusing on the shift from Proximal Policy Optimization (PPO) to Direct Preference Optimization (DPO). It covers the intricacies of preference data, the methodological differences between the two approaches, and the effect of small test sets on measured model performance. The discussion also touches on ethical concerns, benchmarking complexities, and safety measures in AI training and deployment.
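For context on the PPO-to-DPO shift named above: DPO (Rafailov et al., 2023) optimizes the policy directly on preference pairs, where y_w is the preferred and y_l the rejected completion for prompt x, replacing PPO's separately trained reward model with an implicit reward given by the log-probability ratio against a frozen reference policy. A sketch of the standard published loss, with β controlling the strength of the KL-style regularization toward the reference model and σ the logistic function:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$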

