How to Mix Up Public Data Sets to Make Your Model Better
I was thinking about unlabeled datasets for unsupervised learning or self-supervised learning, right? Like, that is something that we are trying to wrap our heads around: Common Crawl, Stack Overflow, arXiv, the books. And as far as I can tell, nobody has a straight answer as to what the data mix is, and everyone just kind of experiments. Yeah, I get the sense that OpenAI doesn't want to encourage that anymore. They don't have fine-tuning for 3.5 and 4. But each of those had a unique sort of flavor of this data under the hood that might actually work quite well for your use case. So one example that I've used recently in some work is the