Hey there, welcome to this special edition of ThursdAI. This episode features an interview with Nous Research, a group of folks who fine-tune open source large language models to make them better. If you're interested in hearing how fine-tuning an open source model works, dataset preparation, context scaling and more, tune in!
You will hear from Karan, Teknium, and LDJ from Nous Research, and Enrico, who worked alongside them.
To clarify, Enrico goes in depth into a method called RoPE scaling, a clever hack that significantly extends the context length of LLaMA models, and his project LLongMa, an extended version of LLaMA with an 8,000-token context window.
The first voice you will hear is Alex Volkov, the host of ThursdAI, who doesn't usually have a lisp, but for some reason, during the recording, Twitter Spaces decided to mute all the S sounds.
Links and acknowledgments:
* Nous Research - https://nousresearch.com/ (@nousresearch)
* Redmond Puffin 13b - First LLaMa 2 Finetune
* LLongMa - LLaMa finetune with 8K context (by Enrico, emozilla and KaioKenDev)
* Nous-Hermes-Llama2-13b-GPTQ - Hermes Finetune was released after the recording 🎊
Psst, if you like this, why don’t you subscribe? Or if you are subscribed, consider a paid subscription to support #ThursdAI
Show transcription with timestamps:
Alex Volkov - targum.video (@altryne)[00:00:55] Yeah. That's awesome. So I guess with this, maybe, Karan, if you are able to, can you talk about Nous Research, how it started and what you guys are doing, and then we'll dive into, you know, Hermes and Puffin and the methods and all of it.
karan (@karan4d)[00:01:16] Absolutely. Nous Research. I mean, I myself and many others of us are just, like, enthusiasts who were fine-tuning models like, you know, GPT-J or GPT-2. And, you know, we're all on Twitter, we're all on Discord, and we kind of just found each other and had this same mentality: we wanna make these models, we wanna kinda take the power back from people like OpenAI and Anthropic. We want stuff to be able to run easily for everyone. And a lot of like minds started to show up.
karan (@karan4d)[00:01:50] I think Teknium initially joining Nous Research, him kinda showing up, and himself, I, and others working on compiling the Hermes dataset was really what came to attract people when Hermes came out. I think we just have a really strong and robust, like, data curation thesis in terms of that. And I think we have just some of the most talented people who have come to join us and just volunteer and work with us on stuff. And I absolutely must say, I can see in the listeners our compute provider, Redmond AI.
karan (@karan4d)[00:02:30] And, you know, none of these models would be possible without Redmond's generous sponsorship for us to be able to deliver these things lightning fast, you know, without making us jump through a bunch of hoops. Just a total pleasure to work with. So I have to shill and say, you know, I highly recommend everyone check out Redmond, because they really make our projects possible.
Alex Volkov - targum.video (@altryne)[00:02:52] Absolutely. So shout out to Redmond AI, and folks, give them a follow. They're the only square avatar in the audience. Go check them out. And, Karan, thanks for that. I wanna just do a mic check for Teknium. Teknium, can you speak now? Can I hear you?
Teknium (e/λ) (@Teknium1)[00:03:08] Yeah. My phone died right when you were introducing me earlier.
Alex Volkov - targum.video (@altryne)[00:03:10] Yep. What's up? It happens sometimes on Twitter Spaces. Welcome, Teknium. So briefly, going back to the question, I don't know if you heard it: besides the commercial license and kind of the context window, what caught your eye in LLaMA 2, at least the base model, before you guys started? Or have you, like the other guys, not had a second to play with the base model and dove into fine-tuning directly?
Teknium (e/λ) (@Teknium1)[00:03:35] Yeah. The only thing that really caught my eye was the chat model and how horribly RLHF'd it was.
Alex Volkov - targum.video (@altryne)[00:03:41] Yeah. I've seen some conversations about that, and kind of about the RLHF as well. Okay. So now that we've introduced Nous Research, I wanna talk to you guys about what you're cooking. Right? We've seen the Hermes model before this, and I loved it as one of the, you know, best fine-tunes that I've seen, at least, and one of the best performing ones. Could you guys talk about the process to get to the Hermes model, the previous one, and then give us hints about what's coming soon?
karan (@karan4d)[00:04:16] Teknium, you got this one, man.
Teknium (e/λ) (@Teknium1)[00:04:22] Yeah. Basically, I saw Alpaca and I wanted to remake it with GPT-4, and from there I just pretty much exclusively included anything that was GPT-4 only, and that was the beginning of the thesis for that. Going forward, though, we still have a lot of low quality data, I think, in the Hermes dataset that can be cleaned out, and then there's a lot of new datasets that have come out that I wanna start merging in there. I also wanna move to something like ChatML or even the Vicuna format so that we can do some multi-turn stuff. It's not very great at long chat right now.
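To make the format mentioned here concrete, below is a minimal sketch of ChatML-style multi-turn formatting, assuming OpenAI's `<|im_start|>` / `<|im_end|>` convention; the exact prompt template any given Hermes or Puffin release uses is defined by its model card, so treat this as illustrative only.

```python
# Illustrative ChatML-style formatting for multi-turn chat data.
# The special tokens below follow OpenAI's ChatML convention; whether a
# particular fine-tune uses them is defined by that model's card, not here.
def to_chatml(messages):
    """messages: list of {"role": "system"|"user"|"assistant", "content": str}"""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # End with an open assistant turn so the model generates the next reply.
    return "\n".join(out) + "\n<|im_start|>assistant\n"

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize RoPE scaling in one sentence."},
]))
```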
Alex Volkov - targum.video (@altryne)[00:05:03] Yeah.
karan (@karan4d)[00:05:03] Within the Hermes dataset, you know, a lot of it is publicly available stuff that's particularly GPT-4. Of course, Teknium's massive GPTeacher dataset. We also have a bunch of GPT-4 data we had generated that we didn't necessarily release just yet, as well as an instruction set that's particularly focused on tasks like Python, transformers, linguistics, a very small dataset of that. That's inside Hermes that, you know, we don't really talk about much, but we figured we'd put some exposure on it right now on the Spaces. And yeah.
Alex Volkov - targum.video (@altryne)[00:05:42] That's awesome. And so the previous Hermes was released on top of LLaMA 1, and for many folks, you know, obviously, they couldn't use it for commercial purposes. And now, with the models that you guys release, are you thinking about their license? Could you talk about, like, the availability of folks using them in a commercial setting now that, you know, the base of it is commercially available?
LDJ (@Dogesator)[00:06:07] Mhmm. I think we have Puffin licensed as MIT. I'll have to double-check on our own model. I think that's right, Karan, right? Or Tek?
karan (@karan4d)[00:06:18] Yeah. I think so, either that or Apache 2.0. Like, if the base model is commercially usable, you know, with the stuff we put out you're good to go. It's -- Yeah.
LDJ (@Dogesator)[00:06:29] So, and, like, in our announcement I put in kind of, you know, one of the main things: it's commercially available. As far as I think, yeah, I'm pretty sure it's the first commercially available Nous model that's released. And a big difference in data from Hermes is the fact that, like Tek was saying, Hermes is pretty much all single-turn data. And surprisingly it can do pretty decent at multi-turn conversations when you actually use it. But then Puffin is almost kind of, like, a 180, where the vast majority is really long-context multi-turn data.
LDJ (@Dogesator)[00:07:09] And, oh, can you guys hear me? Okay. Something's up with that. Okay. Yeah. So Puffin is a vast majority multi-turn data, GPT-4 specifically, and a lot of it is actually real human conversations with GPT-4 that go on for, some of them, 4k, 6k context, like, even all the way up to the max 8k context length of GPT-4. And then we took those few thousand conversations of real humans interacting with GPT-4. And then after that, I'm not sure if you've... a lot of people have probably heard of Camel AI.
LDJ (@Dogesator)[00:07:46] So they have the physics, biology, chemistry, and mathematics datasets. And within those, there's a bunch of subtopics that you can go through. And I pretty much spent a good few days curating, just handpicking the right subtopics, like differential geometry, logic problems, optimization problems, a bunch of different GPT-4 examples and responses from those different subtopics. And then I specifically added those in certain ways to the Puffin dataset.
Alex Volkov - targum.video (@altryne)[00:08:17] Awesome. So just recapping for the audience maybe: the Puffin model, I think the official name is Redmond Puffin 7B or, sorry, 13B. Yes. This is the model that you guys fine-tuned, and one of the first, if maybe not the first, fine-tune of LLaMA v2 that's now publicly available, like you said, maybe with an MIT license, on Hugging Face, and I think you even added the GGML quantized version. Correct? Mhmm. So folks can go and download that and already start playing with this. And so, first of all, thank you for contributing to open source. That's great to see. And the speed with which you guys fine-tuned this is also great to see.
Alex Volkov - targum.video (@altryne)[00:08:55] And now that we've introduced this, maybe this is repeating a bit, but could you speak about the difference? So the difference is in the dataset, in the tasks that you fine-tune on? Like, what is the actual difference between Hermes, or the Hermes that's coming out, and the Puffin model? What would people use them for differently? That's the question.
Teknium (e/λ) (@Teknium1)[00:09:21] The Puffin model will definitely be better at multi-turn stuff. That's for sure. Yeah.
nisten (@nisten)[00:09:28] So if you want to do anything like OpenAI... I'll paste the link above to the GGML version of it, because I'm gonna test it thoroughly. But I really think, because you guys have used GPT-4, high quality, multi-turn conversations, this can have actual, like, practical use for whoever wants to use it, either as, like, something that tells you about the documentation on a site or walks a user through it. In other words, this should be better than Hermes for, like, customer service stuff, which is just one example.
nisten (@nisten)[00:10:08] Anyway, yeah, I'm gonna try it. I'll paste the link above.
karan (@karan4d)[00:10:14] It's likely better for production use alongside, like, stuff that you have with a retrieval pipeline, like with LangChain, etcetera. Like, I would believe that, you know, or just for talking, of course. But, you know, even though with this LIMA technique of small numbers of examples we can get, like, a really good model that does really well...
karan (@karan4d)[00:10:41] The thing about the Hermes dataset, just its size and the various types of data and topics that are in there, I think you get a totally different, like, role play or storytelling experience or completion experience with Hermes. Personally, I feel that way.
Alex Volkov - targum.video (@altryne)[00:11:01] Awesome.
Teknium (e/λ) (@Teknium1)[00:11:01] And on that note, another thing about the Puffin dataset is that it does go up to, like, 8K, and Enrico here has been doing a ton of work on extending LLaMA's context.
Alex Volkov - targum.video (@altryne)[00:11:13] Right. So I wanna give an introduction, then introduce Enrico and talk about this real quick. Right? LLaMA version 1 was released with, again, 2,000 tokens in the context window. And then many folks, including Kaiokendev and Emozilla, right, and some other folks, I think, were involved in bringing some of the quote-unquote tricks about what eventually ended up being named RoPE scaling, if I'm not mistaken. And we followed this, and we talked about it on a previous ThursdAI. And LLaMA v2 was released with 4,000 tokens in the context window.
Alex Volkov - targum.video (@altryne)[00:11:52] And, you know, we're now so used to kind of Claude and the 16k GPT-3.5 that 4,000 didn't seem like a lot. And then many folks were wondering, and meanwhile Enrico was working on, whether or not the RoPE scaling method would apply to the next LLaMA, and it looks like it did. And so I wanna introduce Enrico, uh, Enrico Shippole. I hope I'm saying this right. Welcome to the stage. Hopefully you can unmute and this Space works for you. And the second fine-tune that I saw released was also backed by Nous, the Nous Research, and this was the extended version, what's called LLongMa.
Alex Volkov - targum.video (@altryne)[00:12:28] So, Enrico, welcome to the stage, and feel free to introduce yourself, your affiliation with Nous, and LLongMa with the extended context window.
Enrico Shippole (@EnricoShippole)[00:12:38] Hello. So I'm actually an independent researcher. I'm sponsored by Stability AI, EleutherAI, and a few other different organizations, including Nous now. I work with different people like Tanishq from MedARC, Aran Komatsuzaki, who also is from EleutherAI and DuckAI, John Ney from Nomos AI. So I have a lot of affiliations with a bunch of different organizations, including Together. We're starting a project right now with them.
Alex Volkov - targum.video (@altryne)[00:13:13] That's so great to hear, and welcome to ThursdAI. Can you talk to us a little bit about kind of the RoPE scaling method, and how you were able to, like, fine-tune this quickly, and how the results look so far? I wasn't able to run this myself, but hopefully, yeah, talk to us about it.
Enrico Shippole (@EnricoShippole)[00:13:34] Okay. So initially, the thing is, I actually was hoping that Emozilla, Bowen, and Kaiokendev would all have been able to make it, because it was kinda like an equal parts effort on, like, all fronts from each of us. Initially, I had trained some PaLM models at 8,000 context length about 4 months ago based on the xPos paper, which did rotary embedding scaling initially. They were one of the first people to do it. They based their methodology off of Ofir Press's ALiBi.
Enrico Shippole (@EnricoShippole)[00:14:11] I would imagine that most people are pretty familiar with Ofir Press and his work on the ALiBi positional bias that's been used in a wide range of models now. So Emozilla and I came into contact based off of the work that he had seen me doing with the PaLM models, scaling those to 8,000 context length in pretraining, not fine-tuning. So what we had initially done is basically take a section of C4 and different datasets that had examples that were all over 8,000 context length, and pretrained on them packed together
Enrico Shippole (@EnricoShippole)[00:14:50] with a beginning-of-string and end-of-string token to help with, like, the attention masking portion of that. After he had seen that, Emozilla actually came into contact with Kaiokendev, I believe that's how you pronounce it. Kaiokendev had also been following Ofir Press's research. He had started working on his own version of scaling the rotary embeddings, I believe based off of both ALiBi and xPos.
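As a rough illustration of the packing described here, the sketch below assumes a generic Hugging Face-style tokenizer with BOS/EOS token ids; it is not the exact pipeline used for those pretraining runs.

```python
# Illustrative sketch of packing long documents into fixed-length training
# sequences separated by BOS/EOS tokens (assumed tokenizer attributes), so the
# attention masking / loss can respect document boundaries. Not the exact
# pipeline used for the PaLM or LLongMa runs.
def pack_examples(docs, tokenizer, seq_len=8192):
    packed, buf = [], []
    for doc in docs:
        ids = [tokenizer.bos_token_id]
        ids += tokenizer.encode(doc, add_special_tokens=False)
        ids += [tokenizer.eos_token_id]
        buf.extend(ids)
        while len(buf) >= seq_len:
            packed.append(buf[:seq_len])  # one full training sequence
            buf = buf[seq_len:]
    return packed
```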
Enrico Shippole (@EnricoShippole)[00:15:22] And what he found is that by scaling the max position embeddings and the rotary embedding from something like 2048, which you would initially train with, up to 8000 or 8192, and by applying, like, an interpolation to the encoding, by scaling basically the positional index in the rotary embedding, you were able to essentially turn down the frequency window in RoPE by, like, a factor of 0.25.
Enrico Shippole (@EnricoShippole)[00:16:01] The scaling depends on the length that you're trying to extrapolate to and the initial context length that the model was trained with. So if you were training with LLaMA 2, which had a context window of 4096, and you wanted to do the linear interpolation positional scaling to something like 8192, then you would use a scaling factor of 0.5. If you were trying to do it from 2048, which is what the original LLaMA was trained with, and you wanted to scale it to 8192, then you would use a scaling factor of 0.25.
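To make those numbers concrete, here is a minimal sketch of linear position interpolation in a rotary embedding, using the convention above where the factor is the trained length divided by the target length; the function and variable names are illustrative, not the actual LLongMa patch.

```python
import torch

# Minimal sketch of linear RoPE interpolation: positions are compressed by
# scale = trained_ctx / target_ctx (0.5 for 4096 -> 8192, 0.25 for 2048 -> 8192),
# so an 8192-token sequence is mapped back into the position range the model
# was pretrained on. Illustrative only, not the actual LLongMa patch.
def rope_cos_sin(seq_len, dim, trained_ctx=2048, target_ctx=8192, base=10000.0):
    scale = trained_ctx / target_ctx                      # e.g. 0.25
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float() * scale             # the interpolation step
    freqs = torch.outer(t, inv_freq)                      # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()                           # used by attention as usual
```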
Enrico Shippole (@EnricoShippole)[00:16:39] So basically, after we had done all of this, Meta released a paper around the same time that Kaiokendev released his blog. They both found very similar findings. They showed in the Meta paper that you only had to fine-tune for 1,000 steps with the linear positional interpolation scaling to be able to get the benefit of doing a full pretrain at a context window of 8192.
Enrico Shippole (@EnricoShippole)[00:17:13] So this is actually, like, a big step, because it shows that you no longer need to pretrain right off the bat at a longer context length. You're able to do the fine-tuning on essentially a lower computational budget and still be able to get, like, the greater results of the longer context window. I know a lot of the major AI companies, just from my work and personal research with many of them, had been doing staged scaling of the context window during training.
Enrico Shippole (@EnricoShippole)[00:17:46] So when pretraining, they would basically separate the initial examples from a dataset into multiple stages.
Enrico Shippole (@EnricoShippole)[00:17:54] So anything that is under the window of 2048, you'd separate from the initial dataset, then you'd take things between 2048 and 4096, then 4096 and 8192, and you would basically chunk the dataset into those different parts. You'd first train on the 2048 chunk of the data, then you would train on the data between 2048 and 4096, and then you would do the same thing from 4096 to 8192, or if you want, scale that to 16k or 32k context length. But what we have shown now, with both the Meta paper and this work, is that you don't even need to go through that extensive pretraining and staged process; you can just go from a context length of 2048 to 8192.
Enrico Shippole (@EnricoShippole)[00:18:47] You scale the rotary embeddings by whatever factor you want to use. So like I was saying, if you're going from 2048 to 8192, you'd be using a scaling factor of 0.25. It only needs two lines of code to be able to do that. In the LLongMa post, I provided an example of scaling the rotary embeddings. The code was written by Emozilla, or Jeff.
Enrico Shippole (@EnricoShippole)[00:19:15] After all these experiments, we then came into contact with Bowen, who had worked a lot on the dynamic NTK scaling with Emozilla, and he had also done NTK-by-parts, which we're currently training a lot of models on. So we have the LLongMa 1 models trained on the OpenLLaMA series, like the suite of those models, that use the linear interpolation scaling.
Enrico Shippole (@EnricoShippole)[00:19:45] We now have the LLaMA 2 models, or the LLongMa 2 suite, which is what we're calling it, again trained with the linear interpolation scaling. And then we have another suite of models coming out very soon that uses the NTK-by-parts dynamic scaling. That was really specialized by Bowen, so I do not wanna speak on his behalf. It'd probably be good to get him to talk about it in another one of these.
Alex Volkov - targum.video (@altryne)[00:20:14] Absolutely. So let's get in touch after this and set it up. Thank you for the very in-depth explanation, because we did cover kind of the RoPE scaling and how Kaiokendev, on the image boards or wherever he started this, wrote it up in his blog post, and how it iterated from there. So it's great to actually hear from the folks who are doing this. Just for the audience, I've attached Enrico's tweet about LLongMA 2, which is currently trained at 8K context length.
Alex Volkov - targum.video (@altryne)[00:20:47] And, Enrico, you told us that we may see even double that. Could you talk about the next version?
Enrico Shippole (@EnricoShippole)[00:20:56] Okay. So the initial training process of doing this up to a context length of 8192 can be done, basically, with DeepSpeed ZeRO-2 and activation checkpointing. And you're able to fit the model on an A100 80-gigabyte node. Now, we are working on the process of scaling it both to 16k and 32k. This requires a different methodology during training; you either need to use DeepSpeed ZeRO-3 or fully sharded data parallelism.
Enrico Shippole (@EnricoShippole)[00:21:35] Both of those are very similar, for people who aren't aware. Basically, you're just sharding the optimizer states and the model states across, like, different nodes. You can also use things like tensor parallelism to help with the scaling as well. And then we're going to be basically just adjusting the scaling factor again. We've already collected a large quantity of data at 16k context length, and we're going to be doing the fine-tuning to 16k and releasing those models soon. All of this compute is sponsored by Stability AI.
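As a concrete illustration of that setup, here is a minimal sketch of a DeepSpeed ZeRO stage-2 configuration combined with activation (gradient) checkpointing; the batch sizes and flags are placeholders, not the settings used for the LLongMa runs, and ZeRO-3 or FSDP would be the analogous setup for 16k and beyond.

```python
# Illustrative DeepSpeed ZeRO stage-2 config plus activation (gradient)
# checkpointing, the combination described above for fitting 8k-context
# fine-tuning on an 80GB A100 node. Values are placeholders, not the actual
# LLongMa training settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                      # shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# With a Hugging Face model, activation checkpointing is typically enabled via
# model.gradient_checkpointing_enable(), and the dict above is passed to
# deepspeed.initialize(...) or to the HF Trainer as the deepspeed config.
```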
Enrico Shippole (@EnricoShippole)[00:22:12] They've been very generous with helping with a lot of the independent research.
Alex Volkov - targum.video (@altryne)[00:22:17] So I wanna shout out Stability AI for not only giving, you know, the world Stable Diffusion, but also participating in this kind of next wave of AI. Many folks kinda coined the Stability AI moment when they released Stable Diffusion, I wanna say 1.4, back then, almost a year ago now, and many folks are saying the same about the LLaMA 2 release now that it's commercially open source and folks can start, like, doing things with it, you know, for-profit companies can join in. So we definitely wanna shout out Stability for the effort here. And, Enrico, thank you. And, folks, please follow Enrico, and stay tuned.
Alex Volkov - targum.video (@altryne)[00:22:56] I wanna ask Karan and Teknium and other folks from Nous about the efforts that Enrico was talking about, the longer context windows. How would they kinda interplay with the stuff that you're working on with Hermes and with Puffin? Are the efforts interchangeable? Are we gonna see them building on top of each other?
karan (@karan4d)[00:23:16] So I think LDJ can definitely speak to this, but I'd like to happily say that once we did LLongMa 1 on the first LLaMA generation of models, we already had Puffin 2k, 4k, and 8k for that -- yeah -- already prepared and ready. So as the LLongMa models for 13B are released, we will also be doing equivalent Puffin fine-tunes, and potentially Hermes fine-tunes. We can talk a little bit more about the future of Hermes a little bit later, though.
LDJ (@Dogesator)[00:23:51] Yeah. I mean, I was pretty much going to say the same thing, but kind of elaborate on that, about how it was before with LLongMa v1 and everything. During the development of LLongMa, there was actually, like, you know, of course, me, Enrico, who usually just goes by conceptofmind, and Emozilla. Like, we've all kinda been rubbing shoulders a lot together and just kinda working closely, you know, in the same Discord and whatnot. And it's like, hey, you know, we're working on this, like, experimental LLongMa thing. Like, hey, you wanna try, like, fine-tuning on it? And then the plan just kind of ended up being like, okay, we're gonna have this Puffin thing.
LDJ (@Dogesator)[00:24:31] The Puffin dataset already contains a ton of high-context conversational data from GPT-4 and, like, high quality human data. So it's like the perfect fit to have something that's high-context capable be fine-tuned on that. And then LLaMA 2 came out, and it's like, oh, yeah, let's get this out ASAP, and then we'll figure out what we're gonna do later.
Alex Volkov - targum.video (@altryne)[00:24:58] Yeah. Great. And it's just great to see, you know, how many opportunities like this there are with open source, where the stuff that we're able to run and iterate on now is building on top of each other. It's just incredible, and this is maybe a watershed moment. And I wanna thank all of you for being here. I wanna let the other folks who are usually here on ThursdAI ask you a question or two as Nous visitors. Yam and Nisten, if you have a question for Nous or for Enrico, go ahead. Let's start with Yam.
Alex Volkov - targum.video (@altryne)[00:25:29] Yam, I know if you ask the super deep technical stuff, it will fly over the audience's heads; I'd keep that for the DMs with LDJ and Enrico. But yeah, of course, the stuff that we haven't covered and interesting stuff from Nous, feel free; as it pertains to LLaMA 2, it's gonna be very interesting, I think, for everyone.
nisten (@nisten)[00:25:47] Just to quickly clarify, you guys fine-tuned the plain model, right? Not the chat one.
Teknium (e/λ) (@Teknium1)[00:25:55] Yep. Okay. Yep. The base model. We wouldn't fine-tune that model, the chat one, at all.
Alex Volkov - targum.video (@altryne)[00:26:00] Actually, yeah, to maybe continue this question, sorry, Karan, for interrupting, just one sec. There are models that were released by Meta, where you have to, like, register and get the email and everything, and then they put some stuff on Hugging Face, and those models were delineated with, like, dash HF. Have you guys used the Hugging Face one or the Meta one, and do you guys know the difference? I heard from somebody that, like, maybe one doesn't work as well. Yeah.
Teknium (e/λ) (@Teknium1)[00:26:30] The one on Hugging Face is in fp16 and the original LLaMA 2 models are in bf16, but we tested the difference between the two models at Carper, and there's such a negligible difference in their quality that it's irrelevant. We trained on the Hugging Face fp16 ones, but in bf16.
Alex Volkov - targum.video (@altryne)[00:26:52] Sorry, yeah, Karan, for interrupting. Go ahead.
karan (@karan4d)[00:26:56] No. All good.
Alex Volkov - targum.video (@altryne)[00:26:58] I totally forgot what I was going to say before I interrupted. Okay. Nisten, if you have a follow-up question for Karan, feel free, and if not, then, Yam, if you have anything that you wanna ask the fine folks from Nous, feel free as well.
Yam Peleg (@Yampeleg)[00:27:17] Yeah. Sure. First, thank you for what you're doing, guys. You're really making a difference for everyone. There aren't many demos online, so anyone that didn't try Hermes, I highly encourage you to try it. I don't know why there aren't demos. Okay, I do know why, they cost money, but just try it. Okay? And now I've got a question, because from my experience, if you train on the open datasets of Hermes, you get a significantly lower quality model. Now, I'm fine if you don't release datasets, don't get me wrong.
Yam Peleg (@Yampeleg)[00:27:54] I just wanted to ask, is there anything else besides the data that is different? What tips can you give for, I don't know, someone else that wants to train a high quality model, besides having high quality data?
Teknium (e/λ) (@Teknium1)[00:28:08] Everyone understands this, yeah: the hyperparameters can make a key difference. LDJ knows very well, because we had to do a ton of different tests to find our hyperparameters for the Puffin model. I'm not sure if those are on the model card for Hermes; if they're not, I can put them up. And Karan can probably talk about the Nous datasets that weren't made public.
karan (@karan4d)[00:28:38] Yeah. We've got, like, maybe around 50k items of data, versus, like, 300k total instructions there, that are not released. And to be frank with you, about 45k of them are just more GPT-4, like, Alpaca-style instructions. The 5,000 or so, like, 4,500 of them, compose this dataset we've been working on that, you know, at this point, I'm pretty comfortable talking about; we call it the pdactyl dataset.
karan (@karan4d)[00:29:14] I won't speak on everything that's in it, but essentially, and I don't know if this is the thing that made the big difference, it's, like, the one place where I guess we deviate from just using the open datasets and more GPT-4 instructions: it's got some transformers instructions, some linguistics instructions, some Calculus 1 instructions, etcetera. It seems to be pretty good.
Teknium (e/λ) (@Teknium1)[00:29:41] Also, Yam, do you have links or anything to the models that were trained with just the makeup of the datasets that were public from Hermes? Because I haven't actually seen that before.
Yam Peleg (@Yampeleg)[00:29:57] Again, can you repeat that? I didn't hear.
Teknium (e/λ) (@Teknium1)[00:29:58] Do you have any links to the models that were trained with just the open datasets from Hermes that you could share with me later?
Yam Peleg (@Yampeleg)[00:30:06] No, no. It's just from my experiments -- oh, okay -- on training. Pretty much following the same idea of let's take only GPT-4 data from all the open datasets, and the model that you get is different, for sure. And it might be the hyperparameters, you know.
Teknium (e/λ) (@Teknium1)[00:30:25] Another thing that we did too is pretty extensive, like, cleaning. We did do deduplication. We removed things like URLs; like, any response that had a URL in it, we removed, in case it was, like, a hallucinated URL. There were, like, maybe 8 different filtering processes too that might have made our data quality higher.
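As a rough illustration of that kind of cleaning, the sketch below does exact-match deduplication and drops responses containing URLs; the actual Hermes pipeline reportedly used several more filters, so treat this as a simplified example.

```python
import re

# Illustrative cleaning pass in the spirit described above: exact-match
# deduplication plus dropping any example whose response contains a URL
# (to avoid training on possibly hallucinated links). Not the actual
# Hermes pipeline, which used several additional filters.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def clean(examples):
    seen, kept = set(), []
    for ex in examples:                      # ex: {"instruction": str, "response": str}
        key = (ex["instruction"].strip(), ex["response"].strip())
        if key in seen:                      # drop exact duplicates
            continue
        seen.add(key)
        if URL_RE.search(ex["response"]):    # drop responses containing URLs
            continue
        kept.append(ex)
    return kept
```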
LDJ (@Dogesator)[00:30:48] And "as an AI language model"?
nisten (@nisten)[00:30:51] For anybody -- what do you say -- for anybody in the audience, hyperparameters are just like the settings on the oven. So it looks here like the ingredients were all okay, but Yam messed something up with the settings, and the model came out half-baked.
LDJ (@Dogesator)[00:31:08] So we're gonna have to check that out.
LDJ (@Dogesator)[00:31:10] I'm a big proponent personally of hyperparameter optimization being underrated right now, like, in -- yeah -- the current space. And that's something I've kind of focused on a lot, specifically for things like Puffin, and just trying to help others around Nous optimize what they're doing. And even just something like what you just said about the settings for the oven: I mean, double the amount of time you're putting something in the oven, and it's not gonna come out twice as good. It's not even gonna come out 10% better. It's gonna come out worse. You know?
LDJ (@Dogesator)[00:31:45] Although it depends, like, what your baseline is for how much time you're putting it in the oven, and all these different variables that are kind of dependent on each other and affect each other. So it's definitely something you kind of have to build an intuition about to some degree. And then on the other end, I really feel like there has to be more investment, more time and energy invested into actual tools that make hyperparameter optimization easier for people that are doing these things.
Yam Peleg (@Yampeleg)[00:32:13] Yeah. Yeah. And the thing is that the models are really big, so it's really expensive to run them. So you have a trade-off of how much compute you're investing in searching hyperparameters rather than actually using it for training. But I completely agree. So one last question, actually.
Teknium (e/λ) (@Teknium1)[00:32:33] Actually, one thing before we go on. Something great about the Puffin dataset is that it's just, like, 3,000 or so examples, I believe. And so it makes tuning a lot less expensive, because you can finish the whole training in just a couple of hours. So, like, with Hermes, if we wanted to try full ablations, and dozens of them, it would take weeks to do.
LDJ (@Dogesator)[00:32:55] Yeah. Well, to be fair, it's not like it only takes a couple hours on one GPU. We use A100 80-gigabyte GPUs. So, yeah.
Teknium (e/λ) (@Teknium1)[00:33:04] Courtesy of Redmond.
Alex Volkov - targum.video (@altryne)[00:33:05] Thank you, Redmond.
Enrico Shippole (@EnricoShippole)[00:33:08] Mhmm. I should also probably clarify that when doing the context length extrapolation, we're doing it on 1,000,000,000 tokens and 64 80-gigabyte A100s.
Yam Peleg (@Yampeleg)[00:33:20] Oof. Mhmm.
Alex Volkov - targum.video (@altryne)[00:33:23] Yeah. Yam is getting overexcited. Alright, folks. I wanna maybe ask Karan one last thing, and then we'll move on to the regular ThursdAI update cadence. But I will say that, like, folks from Nous Research and Enrico and some others here, thank you so much for coming up and giving us kind of the insights into how this actually happens. LLaMA 2 just released, you know, a few days ago, and you guys are already pumping out, like, open source fine-tuned models. And it's great to see. And just so you know, there's always a stage for you here to come and announce things.
Alex Volkov - targum.video (@altryne)[00:33:53] And if you do wanna announce, like, a release or something, maybe just, you know, right now, Karan and Teknium and some folks, I would love to hear, like, when the next Hermes is coming?
karan (@karan4d)[00:34:06] Before we say that, I just would like to clarify something about Hermes. So we have the original Hermes dataset on LLaMA 2 as something that we will release, but also a sequel to the Hermes dataset, Hermes 2. There will be a distinction between these two, and you'll see the former come out first and the latter come out after. But as for release, etcetera, I will absolutely let Teknium take the stage with those final words.
Teknium (e/λ) (@Teknium1)[00:34:36] So the training is nearly done. At least, it was about 2.8 epochs out of 3 a few hours ago, so it might be done already. Before I release it though... unlike Puffin, we wanted Puffin out, like, the same day that LLaMA 2 came out, so we didn't run any benchmarks, and we had to put all the compute we had on Hermes immediately after we were done with that. So we don't have any compute to do any benchmarks on Puffin until Hermes is done.
Teknium (e/λ) (@Teknium1)[00:35:06] But before I release Hermes, I do wanna do, like, a full range of benchmarks and stuff like that to make sure everything's good and have a pretty detailed model card, but that should probably only take the rest of tonight at the most. So probably tomorrow morning would be when Hermes comes out.
Alex Volkov - targum.video (@altryne)[00:35:22] That's awesome, folks. You heard it here first, and definitely follow Teknium, Karan, Enrico, LDJ, and the rest of the Nous Research folks, and stay tuned. Enrico, go ahead.
Enrico Shippole (@EnricoShippole)[00:35:34] Yes. I just wanted to piggyback off of Teknium's comment a little bit. So we did do pretty extensive evaluation of the LLongMa 2 8K models. We ran different things on perplexity using GovReport and a couple of other datasets to make sure that the length extrapolation in the context was working properly. We did passkey retrieval. We also did a lot of extensive human evaluation, which took a little bit. I had wanted to get the LLongMa 2 8K models out yesterday, but we decided to push it back one day.
Enrico Shippole (@EnricoShippole)[00:36:08] And what we were doing is we were feeding in research papers and seeing if it could pull out, like, relevant pieces of information from across the context length. And so far, it has been quite successful. So we're still running more evals, but the ones so far have shown that there's been, like, no performance degradation, no matter what context length you're basically using with these extended models.
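For readers unfamiliar with the passkey retrieval eval mentioned above, here is a minimal sketch of how such a prompt is typically constructed; the filler text, passkey format, and insertion depth are illustrative, not the exact harness used for these evals.

```python
import random

# Minimal sketch of a passkey-retrieval prompt: a random passkey is buried at
# an arbitrary depth inside filler text, and the model passes if it can repeat
# the passkey back. Filler wording and lengths are illustrative, not the exact
# harness used to evaluate the LLongMa 2 models.
def make_passkey_prompt(n_filler_lines=400, seed=0):
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is bright.\n"
    lines = [filler] * n_filler_lines
    insert_at = rng.randint(0, n_filler_lines)
    lines.insert(insert_at, f"The passkey is {passkey}. Remember it.\n")
    prompt = "".join(lines) + "What is the passkey? The passkey is"
    return prompt, passkey

prompt, passkey = make_passkey_prompt()
# A long-context model should complete the prompt with `passkey`.
```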
Alex Volkov - targum.video (@altryne)[00:36:32] That sounds great. And now that, you know, LLongMa 2 is out and the next versions are gonna come out as well, I'm sure that some other folks will also contribute to this research and tell you, like, about their own experiences and vibes. So, yeah, I wanna thank you folks. Again, this has been very illuminating, and I'm very glad to have you. And, obviously, the stage is yours whenever you want to come here, and we appreciate you. You guys are welcome to stay tuned and kinda chime in on the rest of the updates. And with that, I think, for folks in the audience, we're moving to the next thing.