Speaker 2
is like only so many more things to sell at HR. Yeah. Oh my
Speaker 1
God. I mean, I think we have enough marketing chatbots, thank you. Right. No, I mean, there are lots of areas, and we touched on some of them. I think the whole world of animal models is super exciting. I'd love to have better, higher-fidelity assays that can still be run in a plate. You know, the, what do they call it, like if you have your full micro tumor environment in a well? Insanely finicky right now. But, you know, there are only so many targets and tissues, so there's probably someone that can fix that, right? And I think it would be incredibly high value. I mean, there is a lot of lab automation left to be desired. And then maybe finally, coming from tech, most of the shit I built was on top of great stuff other people built and made available for free. And in biology, everything's proprietary. I mean, you buy a large liquid handler, and you're like, I would like to run TXTL on it. And they're like, no, no, that's proprietary. And you're like, wait, you make the liquid handler, right? So we should have a new metric in biology: mean time to fun, right? And if mean time to fun is a year everywhere, then we're all going to be super slow, right? So let's not do that. And anything that isn't actually competitive, let's share, right? I think we're all going to go a lot faster.
Speaker 2
Yeah, no, I love that. I mean, I think one thing that would also be really interesting for our listeners is maybe just contextualizing the advances that are happening in AI bio now within the larger gen AI space. Obviously these are separate spaces with separate approaches to modeling, though you guys use transformers. From the outside, AI bio is obviously having a big moment at the same time that the gen AI wave is happening. To what extent do you look at advances in the broader AI world and find them relevant to the work in AI bio? And to what extent is it like, look, these are pretty separate domains, and ultimately we look at what other drug companies are doing, or what researchers in the bio space are doing?
Speaker 1
I mean, I think it's both, right? So there's really cool work happening in the industry. There's clearly a bunch of leaders that have decided to invest a lot of money and come up with really, really valuable work that we can learn from. There's lots of tricks, and most of us came from tech on the research side. And yeah, it's always surprised me: ESM from Facebook, and ProGen from Salesforce, and AlphaFold from DeepMind. And it's like, where's the rest? Obviously, modulo the Baker lab. But no, I mean, there's tricks everywhere. And it's a little bit like the point I just made: it isn't about a model, right? Typically, any model you would get in a preprint is either two or three people from a company that spent six to twelve months on something, or a postdoc or a PhD that tried to do something. And so it's never the whole thing, it's a subset. And so you need to be able to extract what the relevant parts are, what the insights from this paper are, and then adapt really fast to get to something. We once started out with something that looked like ESM. At this point, it's just very far away from that. But that was an evolution, not a revolution.
Speaker 2
What advances are you paying attention to right now, like on the research side?
Speaker 1
Yeah, I mean, there are a couple of things I would like these models to be able to do. One is adding more context. Right now we obviously have labeled data from experiments, but I'd love to be able to add experimental context, like, I changed my buffers. So context windows should be wider and should be able to accept more types of information. That's an area I'm watching. Second, and we alluded to it a little bit at the start, I'd love to be able to marry structural constraints to language-based or language-diffusion-based models. I'm not long structure, and I can comment on why, but I do think structural priors are very helpful, because humans interpret molecules through structures, right? It's a little bit like how simulated wind tunnels have these little thingies that go over the wing. They don't actually exist, but they're incredibly helpful for the person designing the wing. So can we have language models, and those types of models, sort of embed structural priors without relying only on structural information? We've been very deliberately avoiding any structural information in our models today. So that's an area I'm looking at. I'd also love to see how we marry vocabulary-based language models, like the ones we're mostly using today, to full-atom models, in a way that's actually meaningful, where you can still learn across different, never-observed amino acids. Then, we continue to invest a lot of time and effort in in-domain versus out-of-domain. Honestly, if you're not thinking about whether the thing is just spitting out bullshit, and you have no way to qualify that, you're out of luck. And lastly, I'd love to see more benchmarks, because most benchmarks are not relevant to the industry.
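To make the "add experimental context" point above concrete: one simple way to feed conditions like buffer or temperature into a sequence model is to prepend special condition tokens before the residue tokens. A minimal toy sketch; the vocabulary and token names here are entirely hypothetical, not any real model's tokenizer:

```python
# Toy sketch of conditioning a protein sequence model on experimental
# context by prepending special tokens. All token names are made up.

# The 20 standard amino acids get ids 0..19.
AA_VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

# Reserve ids after the amino acids for hypothetical context tokens.
CONTEXT_VOCAB = {
    "<buffer=PBS>": 20,
    "<buffer=HEPES>": 21,
    "<temp=25C>": 22,
    "<temp=37C>": 23,
}

def encode(sequence: str, context: list[str]) -> list[int]:
    """Prepend context tokens, then encode each residue."""
    ids = [CONTEXT_VOCAB[c] for c in context]
    ids += [AA_VOCAB[aa] for aa in sequence]
    return ids

tokens = encode("MKV", ["<buffer=HEPES>", "<temp=37C>"])
# tokens -> [21, 23, 10, 8, 17]
```

The model then sees the conditions as an ordinary prefix, so the same attention machinery that relates residues to each other can relate them to the experimental setup; widening the context window just buys room for more such tokens.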
Benchmarks are now starting to come. But whatever you read about human-language models is like, oh, Claude is 0.2% better on [insert dataset], right? And in biology it's like, we have CASP. Let's do that once a year, right?
Speaker 2
In fairness, I feel like benchmarks are a problem in every model domain, but maybe they're particularly a problem in bio.
Speaker 1
There's a weird thing in biology. For human language, the tasks are pretty well understood, and we have a pretty large corpus of training data on the internet. In biology, and I notice this when you work with academia, for example, even if they have something interesting, they'll keep it a secret, because they might sell it to some pharma company later. And so most of the data that's out there, by definition, isn't very valuable, or somebody actively decided it's not valuable.
Speaker 2
I guess if it was valuable, someone would be keeping it private and trying to sell it.
Speaker 1
Exactly. We're really biased towards irrelevant parts of sequence space.
Speaker 2
It's so interesting, because I feel like in the general LLM space, it's been the academics that have actually been incentivized to create a lot of these benchmarks.
Speaker 1
You can't patent math exams, right?
Speaker 2
I mean, I was too intrigued not to dig in a little bit. So you said you're not long structure. It does seem like a lot of the great advances of the last years have been on the structure side. Maybe just elaborate on that a little bit.
Speaker 1
Yeah, so where I think structure is very helpful is in the very earliest stages. If you have a hard-to-drug target, I could totally see generating based on a known structure. But I think structures have a couple of very strong limitations as well. First of all, the training data is very limited, and there are lots of very relevant proteins that don't like their picture taken, like membrane-bound proteins, for example. So you get biases that are even worse than the biases you get in sequence-based models, and certainly relevant targets that are just out of scope. Second, the cost and turnaround time to get one additional data point from cryo-EM is just absolutely glacial, and I don't think we're going to do 96-well cryo-EM anytime soon. So given that we know you run out of domain very fast in this stuff, how are you going to update your model as you realize there are certain liabilities there? That's problematic. Third, they often translate very poorly to the functions people care about. Like, can you read expression from a structure? I can't, right? And I don't think anyone can. And then finally, as generators they are typically quite restricted, in the sense that the point cloud that constrains what you can generate into, and find a valid pose for, is an assumption. So that's on structure in particular. If you flip it around, we've tried the opposite, which is: are our models capable of understanding structure as a concept, right? And what we found there is, if you just drop a different head on it, it can make a distance matrix, right? The structure is encoded in one of these things, without the restrictions. And again, I do think structural priors are insanely helpful, and so we need to find a way to marry both of these. But I think one of our customers was joking. We were like, oh, we should try this new sort of structural thing. It's going to be fun.
And they were like, how much is this going to cost to run? And we were like, yeah, well, let's do some math, right? And they were like, for that we could have two llama immunizations for one binder. Which was a fair point, right? Very interesting.
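The "drop a different head on it" idea above can be sketched in a few lines: take per-residue embeddings from a sequence model, build pairwise features, and apply a small head that outputs a symmetric, non-negative distance matrix. A toy illustration with random stand-ins; the embeddings and the linear head here are made up and untrained, not the actual model discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-residue embeddings from a pretrained sequence
# model (toy sizes: L residues, hidden dimension d).
L, d = 8, 16
H = rng.normal(size=(L, d))

# Pairwise features: concatenate the embeddings of residues i and j.
Hi = np.repeat(H[:, None, :], L, axis=1)   # (L, L, d)
Hj = np.repeat(H[None, :, :], L, axis=0)   # (L, L, d)
pair = np.concatenate([Hi, Hj], axis=-1)   # (L, L, 2d)

# A linear "head"; in a real probe this would be trained against
# measured residue-residue distances while the backbone stays frozen.
W = rng.normal(size=(2 * d,)) * 0.1
logits = pair @ W                          # (L, L)

# Symmetrize and map to non-negative predicted distances (softplus).
sym = 0.5 * (logits + logits.T)
dist = np.log1p(np.exp(sym))
```

The point of such a probe is that only the small head ever sees structural labels; if it can recover a plausible distance matrix, the structural information was already latent in the sequence model's embeddings.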