Speaker 3
All right. So I'm going to read the first couple of paragraphs here to get us started. Med-PaLM is a large language model (LLM) designed to provide high quality answers to medical questions. Okay.
Speaker 2
So we're already off on a bad foot here because large language
Speaker 3
models provide synthetic text. They're synthetic text extruding machines, not question answering machines. Med-PaLM harnesses the power of Google's large language models, which we have aligned to the medical domain and evaluated using medical exams, medical research, and consumer queries. All right, I'm going to hold back, because I want to hear your
Speaker 2
critique of that before I say the things I have
Speaker 3
to say. Our first version of Med-PaLM, preprinted in late 2022 and published in Nature in July 2023, was the first AI system to surpass
Speaker 2
the pass mark on US Medical Licensing Exam (USMLE) style questions. Med-PaLM also generates accurate,
Speaker 3
helpful long-form answers to consumer health questions,
Speaker 2
as judged by panels of physicians and
Speaker 2
What do we think? I just want to thank the technology companies for helping make a point that I've been trying to make for years. So in the house of medicine, the US Medical Licensing Exam has three step exams: Step 1, Step 2, Step 3. And Step 1 scores, for the longest time, until the exam recently became pass/fail again, were used as a gatekeeper for which specialties you could apply to, which was deeply ironic, because when the exam was first designed it was really meant as a pass/fail check of your basic medical knowledge, not as a gatekeeper. And so for years, even before any of these models came out, I've been saying that these exams, as we know from previous studies, do not represent the ability to practice medicine. They represent the ability to answer test questions, and that's just for humans, of course, right? We can delve into all the issues of using exam-type questions to try to evaluate LLMs. But even before you bring LLMs into the mix, this has been something that I think has been hard for people to grasp, that it's not the same. There are definitely people who get it, but the exam was used as a gatekeeper, and studies have shown that these exam scores didn't actually correlate with clinical skills or the ability to practice medicine. They were very gatekeeping, essentially. But now that large language models can quote "answer" these questions, people are realizing, oh yeah, this is definitely not the practice of medicine, because it's not. So that's the one positive thing that has come out of that.
Speaker 1
One question I have for you, Roxana: they say this surpasses the pass mark on the exam. And I want to know, holding in our minds both the gatekeeping function of these exams, and also that in the longer text in the preprint they say this is the commonly quoted pass mark. So I'm curious what kind of evaluation metric this even is, if they're using it ad hoc, without there being agreement within a particular domain. Yeah, so I'm curious on your thoughts on that.
Speaker 2
I mean, first of all, the exam scoring, I don't remember the exact mechanics of it, but I think there is some curve to it, which is why the pass mark has changed. There's definitely been drift in the exam scores as people score quote "higher." It's now pass/fail again, thankfully. But yeah, I don't know what it means to say that an LLM can quote "pass" some exam, especially when we don't know anything about the training data. There's a lot of material on USMLE-style questions online. A lot of it. There's no way to know how much train-test leakage there is, because as we know, companies don't release what data they train on. And so these could literally be the same or very, very similar questions. And of course, these questions are written to be very clear cut. And real medicine is never, I mean, I wish it were, but it's not clear cut. There's so much uncertainty. There's so much gathering of information over time. People don't come in a nice little package like, you know, "25-year-old presents with this kind of pain, and the exam findings are perfectly aligned with this diagnosis, and the lab results are also, surprisingly, perfectly aligned with this diagnosis." That's really not how the world works. I mean, that's how the exam question world works. Yeah. Yeah.
Speaker 3
So we're back to the lack of construct validity here, right? We can talk about how useful it is to ask people these questions, certainly not in a specialty gatekeeping kind of way, but like, what is the function of this exam when people take it? But that's an entirely separate question from what its function is for this kind of work. And if you look in the longer paper, which we'll get to, they start talking about it as a benchmark. And it's like, this wasn't developed to be a benchmark for anything in machine learning. And I kept being astonished by the talk of passing, of surpassing the pass mark on USMLE-style questions. So it's not actually the exam, right? Right. It's things in the same style. And they do talk in the Nature paper a little bit about checking whether there was overlap between their training data and the particular questions. And we should go look at it, but they said a 25-word sliding window. So they were looking for, like, verbatim the same 25 words in a row, which doesn't seem like a very thorough check for overlap to me, right? You could have something that had a synonym in there, or a slightly different word order,
Speaker 2
and I don't think it would have been flagged as having been in the training data.
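To make concrete what a verbatim 25-word sliding-window check does and doesn't catch, here is a minimal sketch in Python. This is our own illustration, assuming "overlap" means the same 25 lowercased words appearing consecutively in both a test question and a training document; the function names and the toy example are hypothetical and are not Google's actual contamination pipeline.

```python
# Minimal sketch of a verbatim 25-word sliding-window overlap check.
# Assumption: "overlap" = the same 25 consecutive lowercased words appear
# in both the question and some training document.

def word_ngrams(text: str, n: int = 25) -> set[tuple[str, ...]]:
    """Return every run of n consecutive lowercased words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlap(question: str, training_corpus: list[str], n: int = 25) -> bool:
    """Flag the question only if some n-word window matches verbatim."""
    q_ngrams = word_ngrams(question, n)
    return any(q_ngrams & word_ngrams(doc, n) for doc in training_corpus)

# The weakness discussed above: one swapped synonym breaks every 25-word
# window, so a near-duplicate question slips through unflagged.
original = "A 25 year old woman presents with sharp chest pain " * 3
paraphrase = original.replace("sharp", "stabbing")
print(flag_overlap(original, [original]))    # True: verbatim copy is caught
print(flag_overlap(paraphrase, [original]))  # False: near-duplicate is missed
```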
Speaker 3
But even aside from that, it's like, this doesn't tell us anything interesting about the large language model, because that's not what the actual exam is designed for, let alone questions in the style of that exam. All right. There's one other thing in this first paragraph that I wanted to highlight, which is they talk about this being aligned to the medical domain. Right. And because of the way the word alignment gets used in the AI safety discourse, that's a huge red flag for me. Yeah, completely. And it's throughout these
Speaker 1
papers. They go into that a bit more in the preprint, where they basically say, and this is again a continual thing that we come back to, and this is more about their methodological contribution, but they say in the longer sentence: however, human evaluation revealed that further work was needed to ensure the AI output, including long-form answers to open-ended questions, are safe and aligned with human values and expectations in this safety-critical domain. And then the parenthetical is "a process generally referred to as alignment." This is also admitting one of the weaknesses. I was at a big sociology conference this past weekend, the American Sociological Association, and I was telling someone who works in demography, which is kind of the study of many things, many of them about health disparities, modeling birth rates and death rates, that kind of modeling. I told them about alignment and their head started to spin, and they were like, this is a thing? They're saying you can go ahead and try to have a machine aligned with a unified set of values? I mean, we could go on and on about alignment, but it's just kind of wild that this thing about what counts as "expectations" and quote "safety critical domains" is treated as if there were one agreed-upon set of answers. Within particular professional associations, or rather within a discipline, and within the way disciplines convene in professional associations, this is a contested field, for one, and it stays contested as things change and people change. And the second thing is that these contests crystallize into institutions. So yeah, the fact that alignment is here, and that they're using it like this, they're wrapping it into a particular sort of agenda.
Speaker 3
Yeah, right, this is something I need to lift up from the comments here because it's hilarious.
Speaker 2
Abstract Tesseract says, I
Speaker 3
feel like, by this logic, if I took the answer key to an exam, cut out individual sentences, and tossed those clippings into a fishbowl, that fishbowl would also be qualified to practice medicine.
Speaker 2
And then another commenter says, and it'd be three days until that fishbowl was hired to evaluate insurance claims data. Oh, gosh. Can I make a comment on insurance claims? Please. To me, this is the most frightening, terrifying thing that we need to be discussing, because it's already happening, right? Because people have been saying, hey, we can use these models. First of all, I want to acknowledge that doctors are overburdened by paperwork and our system is strained, and I recognize that we need solutions to help us. And people have been using models to write, for example, appeals to insurance denial letters, and they look them over, but of course there's automation bias. And I can understand it, because doing those appeals is a huge pain. Really, structurally, we need to think about policies that regulate this better, because insurance companies are just denying things all the time, inappropriately. But the flip side has been happening too, where it's come out that, maybe not large language models, but definitely AI algorithms are being used by insurance companies. There was a great report out in STAT about this, about a company that was using an algorithm to deny care to patients, to deny coverage of care. And no human was getting involved. Patients didn't even know this was being used to deny them care. And so to me, it's terrifying that this is already happening. And I don't think those companies worry about the accuracy of the models and how much harm is being caused when they use them to make these decisions. And I think we all have to speak up on this, because it's already happening. There's no, I mean, there's no regulation there. It's already happening. Yeah. Yeah.
Speaker 3
And when we talk to regulators and ask for
Speaker 2
transparency, right, you said patients didn't even know this was happening, they should know. And
Speaker 3
recourse, right, when this happens, yeah, those are totally key. I'm terrified, right along with you. All right. We're only in the introduction here. I've got some things highlighted, but Roxana, is there something that you would like to take us to in this document to
Speaker 2
particularly rant on? Scroll down. Let's scroll down to the introduction. Yeah.
Speaker 3
This graph, I mean, it's hilarious
Speaker 2
to me. Right. We're seeing a graph of the performance of other models on this exam, and then we see Med-PaLM and Med-PaLM 2, and it's just like a huge bar on the graph that's so much better. I,
Speaker 1
It's just incredible. This is like a Fox News style graph where the x-axis makes no sense. PubMedGPT is December '22, and so is Med-PaLM 2; they separated it out. There's, uh, an approximate medical pass mark line on there. And it kind of grows. It's unfortunately just very, um, very phallic, I'm just going to say, because it just grows from Med-PaLM 1 to Med-PaLM 2. And it makes the growth from, like, 50%, which should be the kind of natural midpoint of this, seem much greater. So yeah, if you're, like, in a car listening to this, check it out when you get home. It's pretty ridiculous.
Speaker 2
I think, yeah. I was going to say, talking about quality: how do you evaluate answer quality? Because, you know, anytime you have human raters rating something, you have the subjectivity of the human and the biases of the human. And how do you grade the quality of a medical answer? Like, how do you do that? Yeah.
Speaker 3
Yeah. So here they're just doing the multiple choice. But later they get into this, and they've got the weird axes that we should get to. But before we go past it, I want to dog-ear a couple of things here. So they say, quote, letting generative AI move beyond the limited pattern spotting of earlier AIs and into the creation of novel expressions of content, from speech to scientific modeling. So "novel expressions of content" is synthetic media, and that is the last thing that I want in the practice of medicine. Right? I don't want random stuff extruded from these machines. And the scientific modeling, like, yes, data science in lots of scientific fields is a real thing. It could be
Speaker 2
quite useful. Scientific modeling
Speaker 3
is a thing, but not LLMs.
Speaker 2
Like, that's not scientific modeling.
Speaker 3
So I had to note that. Yeah.
Speaker 1
Yeah. Yeah. No, I mean, the kind of "novel expressions of content," um, yeah, like the making up of citations. Yeah. All right. And then below their
Speaker 3
sample USMLE-style question, they say, quote, answering the question accurately requires the reader to understand symptoms, examine findings from a patient's tests, perform complex reasoning about the likely diagnosis, and ultimately pick the right answer for what disease, test, or treatment is most appropriate. So that is maybe what the test is trying to evoke in human test takers, but that's not what the LLMs are doing here. Right? Answering this question, for an LLM, requires extruding text that matches one of the multiple choice inputs. Period. Right? This is a huge misrepresentation. All right. But then, now, on to what Roxana was talking about, the way they evaluate the long-form answers, and there are some more confusing graphs here. So they talk about high quality answer traits and potential answer risks. And these graphs are horizontal bar graphs that have something that looks like error bars in them, which I don't fully understand. And then on the left there's some gold, which represents Med-PaLM 2, the middle is gray and labeled "tie," and the right is blue and labeled "physician." And these all add up to a hundred. So the idea is that apparently Med-PaLM 2's answers and the physicians' answers were rated on these same criteria, and the question is which one was rated higher?
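For readers trying to picture those stacked bars, here is a tiny Python sketch of how a side-by-side preference evaluation turns into three percentages that sum to 100. This is our own toy illustration, assuming raters compared a Med-PaLM 2 answer and a physician answer to the same question and picked a winner or a tie; the ratings data below are made up, and whatever the apparent error bars in the actual figure represent is not reproduced here.

```python
# Toy sketch: aggregating side-by-side preference ratings into the three
# stacked percentages (Med-PaLM 2 / tie / physician) described above.
from collections import Counter

# Hypothetical ratings for one axis, e.g. "better reflects consensus".
ratings = ["med_palm_2", "physician", "tie", "med_palm_2", "tie",
           "med_palm_2", "physician", "med_palm_2", "tie", "med_palm_2"]

counts = Counter(ratings)
total = len(ratings)
for label in ("med_palm_2", "tie", "physician"):
    # Each bar segment is the share of ratings for that outcome; the three
    # segments necessarily sum to 100%.
    print(f"{label:>10}: {100 * counts[label] / total:.1f}%")
```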
Speaker 1
Um, yeah, it's such a weird thing. Why would you present data this way? I was just puzzling over this. You know, in their defense, in the paper, although this is sort of the marketing page and no one from the tech press is really going to dig into the paper, they do have evaluations that look more like a standard evaluation, with an error bar and a point estimate. And basically the summary in the paper is, if you're looking at the physician raters rating all these different things, which we'll go into in a bit, which I really want to go into, especially the potential answer risks ones: the physician and Med-PaLM 2 answers come out effectively equivalent for all the high quality answer traits. But the physicians do significantly better, or rather the place where there is the biggest delta is the one that says "no inaccurate or irrelevant information," on which Med-PaLM 2 gives a lot more crap, basically, compared to physicians. However, physicians omit more information on that kind of rating, and it's nearly a tie on "more evidence of demographic bias." So, first off, I've got a lot to say, because the way that rating, or kind of content analysis, happens in computer science drives me up the wall. I mean, the kinds of things that computer scientists often think they can have human raters evaluate with some kind of exactitude, as Roxana was basically saying, is wild, and they're doing it for a particular sort of practice of a particular domain. You know, what does it mean to say this answer is supported by consensus, or what does it mean to rate possible harm extent? How does this play out, like, what kind of validity does this have internally to the field of medicine, and what kind of validity does it have for the kind of evaluation that clinicians actually do? And so the panels themselves: they have this expert physician panel that is somewhat limited; it was pretty small from what I saw. And then they had people who, I'm assuming, are Mechanical Turk workers or some sort of crowdsourced workers, because they are all located in India while the physicians are all either in the US or the UK. Um, okay, hold on, I misspoke. So the physician raters were pooled from 15 individuals, six based in the US, some based in the UK, and five based in India. And then the layperson raters were six raters, all based in India. My assumption, given this, is that they effectively put this rating task on the same platform, probably a crowdsourcing platform, asked whether anybody had kind of a medical background, and then let them do the task, given how unspecific they are when talking about them. But they don't really talk about what other knowledge bases these people are drawing on, or think through what other kinds of biases they may have. And, you know, they don't really talk about the kind of testing and piloting that you really need for any kind of quality rating work. So yeah, this sort of thing is just, it's just...
So I had to get into the preprint to see what was going on, because I knew it was going to just annoy me, to heck and back.
Speaker 2
Yeah, can I inject some, like, experience from real medicine on this? You know, I think when you're in the space, you kind of begin to understand some things. Like you were saying, they're very, very sparse on the details of who the raters were. So I am a practicing dermatologist. I am board certified in dermatology, and I did one year of internal medicine. But if you asked me to rate questions that have to do with, like, a cardiology problem, I am not going to know what the latest and greatest is in cardiology. I'm just not, because medicine is a very specialized domain. And so who your rater is and what their experience is does matter. The second thing is, a lot of medicine, I mean, sorry to say this, but it's just true, is not fully evidence based. When you go into training, you sort of learn how things are done at your training institution, and you see the cases there. For example, I trained on the West coast, so I saw very little Lyme disease, because that's not something that's prevalent over here. But not only that, there are differences in what medications people will go for, even between Stanford and UCSF. There can be differences in how we think a skin disease should be managed, because each institution has its own world expert in that disease, and there isn't, you know, good randomized controlled trial data. So we're basically going off of expert opinion here. And so you have variation even in the same regional area, between two major academic centers, in how a disease is approached. Never mind between, like, the US and India and the UK. And I've sometimes even had colleagues who are dermatologists in other countries ask my opinion, and the drugs they have on formulary are different. They have similar mechanisms, but they're different drugs than what we use. And so who is to say what the consensus is, right? And what is the right answer now, or whether the information being omitted is right or wrong? Now, I mean, there are things that are obviously, flagrantly wrong, and if I saw something like that, that would be a problem. But then, with regards to demographic bias, I would say that many physicians don't even understand or know their own biases. And so it's hard to say how you would rate that if you don't even know your own biases, or if you haven't specifically had training in what medical bias looks like.
Speaker 1
Yeah, just on the specialties: they say in the paper that specialty expertise spanned family medicine and general practice, internal medicine, cardiology, respiratory, nothing more specific than "respiratory," pediatrics, and surgery. And that seems like a remarkable spread to me. I mean, I'd imagine if you wanted to turn this around on computer scientists, you'd say, well, you know, we got six computer scientists, one from architecture, one from, what's it called, formal analysis, another from programming languages, and one from machine learning, and we asked them if this description of a machine learning architecture was correct. I mean, you're all computer scientists, right? Yeah.