Date: October 28, 2024
Reference: Woelfle T, et al. Benchmarking human–AI collaboration for common evidence appraisal tools. J Clin Epidemiol. September 2024.
Guest Skeptic: Dr. Laura Walker is an Associate Professor of Emergency Medicine and the vice chair for digital emergency medicine at the Mayo Clinic. In addition to finding ways to use technology in emergency department (ED) care, she is interested in how health systems work and how patients move through the healthcare system. Previously, she has been an ED medical director, quality chair, and regional hospital director.
Case: The Mayo Clinic Department of Emergency Medicine is planning its next journal club. The department was recently boosted by the addition of Dr. Chris Carpenter, who created the amazing journal club at Washington University, to its faculty. A resident has been assigned to report on the PRISMA quality checklist for the systematic review and meta-analysis (SRMA) they will discuss. She has been playing around with ChatGPT (GPT-3.5) from OpenAI and wonders if it could do this task quickly and, more importantly, accurately.
Background: In recent years, large language models (LLMs), such as GPT-4 and Claude, have shown remarkable potential in automating and improving various aspects of medical research. One intriguing area of exploration is their ability to assist in critical appraisal, a cornerstone of evidence-based medicine (EBM). Critical appraisal involves evaluating the quality, validity, and applicability of studies using structured tools like PRISMA, AMSTAR and PRECIS-2.
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses): This is a set of guidelines designed to help authors improve the reporting of systematic reviews and meta-analyses. It includes a checklist of 27 items that focus on the transparency and completeness of the review process, such as the identification, screening, and inclusion of studies, the synthesis of results, and how potential biases are addressed.
AMSTAR (A Measurement Tool to Assess Systematic Reviews): This is a tool used to evaluate the methodological quality of systematic reviews of healthcare interventions. It consists of a checklist of 11 items that assess the robustness of a systematic review's design and execution. The tool covers key areas like the use of comprehensive search strategies, inclusion criteria, methods for assessing the risk of bias, and the appropriateness of data synthesis.
PRECIS-2 (Pragmatic-Explanatory Continuum Indicator Summary): This is an assessment tool used to evaluate how "pragmatic" or "explanatory" a randomized controlled trial (RCT) is. It is designed to help researchers design trials that better align with their research goals, whether they aim to inform real-world clinical practice (pragmatic) or control for as many variables as possible to test an intervention under ideal conditions (explanatory). The tool uses nine domains (e.g., eligibility criteria, recruitment, primary outcome, etc.) to rate how closely the trial conditions resemble real-world clinical settings.
Traditionally, these tasks have been manual, often requiring significant expertise and time to ensure accuracy. However, as LLMs have evolved, their ability to interpret and analyze complex textual data presents a unique opportunity to enhance the efficiency of these appraisals. Research into the accuracy of LLMs, when used for appraising clinical trials and systematic reviews, is still in its early stages but holds promise for the future of automated medical literature assessment.
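To make the idea concrete, the appraisal task described above could, in principle, be framed as a series of structured prompts: one per checklist item, each paired with the relevant text from the manuscript. The sketch below is purely illustrative; the helper function, item wording, and excerpt are assumptions for demonstration, not the method used in the study under review.

```python
# Illustrative sketch: how an LLM might be asked to appraise a single
# PRISMA checklist item. The helper name, item text, and excerpt are
# hypothetical examples, not the approach used by Woelfle et al.

def build_appraisal_prompt(item: str, manuscript_excerpt: str) -> str:
    """Assemble a yes/no appraisal prompt for one checklist item."""
    return (
        "You are assisting with critical appraisal of a systematic review.\n"
        f"Checklist item: {item}\n"
        f"Manuscript excerpt:\n{manuscript_excerpt}\n"
        "Answer 'yes' or 'no', then briefly justify your answer."
    )

prompt = build_appraisal_prompt(
    item="Does the title identify the report as a systematic review?",
    manuscript_excerpt="Title: Intervention X for condition Y: "
                       "a systematic review and meta-analysis.",
)
print(prompt)

# In practice, each prompt would be sent to an LLM service and the
# model's answers tallied across all checklist items, then compared
# against human raters to estimate accuracy.
```

Repeating this for each of PRISMA's 27 items would yield a full machine-generated checklist, which is essentially the kind of output the benchmarking study compared against human appraisers.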
Given that assessing the validity and quality of clinical research is essential for ensuring that decisions are based on reliable evidence, this development raises important questions. Can these advanced AI tools perform critical appraisals with the same accuracy and reliability as human experts? More importantly, how can they augment human decision-making in medical research?
Clinical Question: Can large language models accurately assess critical appraisal tools when compared to human raters?