Holistic Evaluation of Large Language Models for Medical Tasks with MedHELM

By Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi Haredasht, Ivan Lopez, Asad Aali, Gabriel Tse, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, Ethan Goh, Dong-han Yao, Brian Soetikno, Eduardo Reis, Sergios Gatidis, Vasu Divi, Robson Capasso, Rachna Saralkar, Chia-Chun Chiang, Jenelle Jindal, Tho Pham, Faraz Ghoddusi, Steven Lin, Albert S. Chiou, Hyo Jung Hong, Mohana Roy, Michael F. Gensheimer, Hinesh Patel, Kevin A. Schulman, Dev Dash, Danton Char, Lance Downing, Francois Grolleau, Kameron Black, Bethel Mieso, Aydin Zahedivash, Wen-wai Yim, Harshita Sharma, Tony Lee, Hannah Kirsch, Jennifer Lee, Nerissa Ambers, Carlene Lugtu, Aditya Sharma, Bilal Mawji, Alex Alekseyev, Vicky Zhou, Vikas Kakkar, Jarrod Helzer, Anurang Revri, Yair Bannett, Roxana Daneshjou, Jonathan Chen, Emily Alsentzer, Keith Morse, Nirmal Ravi, Nima Aghaeepour, Vanessa Kennedy, Akshay Chaudhari, Thomas Wang, Sanmi Koyejo, Matthew P. Lungren, Eric Horvitz, Percy Liang, Michael A. Pfeffer, Nigam H. Shah

Nature Medicine

2026 Vol. 32 Pages 943–951.

Operations, Information & Technology

While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, we present a clinician-validated taxonomy that organizes medical AI applications into five categories mirroring real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These categories encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, we provide a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, we systematically compare nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini) using an automated LLM-jury evaluation method, in which multiple AI evaluators assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.
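
To make the jury and win-rate ideas in the abstract concrete, the following is a minimal Python sketch, not the MedHELM implementation. It assumes that each juror returns a numeric rating, that ratings are averaged into one score per model and benchmark, and that a win rate is the fraction of head-to-head comparisons (against every other model, on every benchmark) that a model wins, with ties counted as half. The function names, model names, benchmark names and ratings are all hypothetical.

```python
from statistics import mean


def jury_score(ratings):
    """Average the ratings that several juror models gave one output.

    Assumption: jurors return numbers on a shared scale (e.g. 1 to 5).
    """
    return mean(ratings)


def mean_win_rate(scores):
    """Compute a head-to-head win rate for each model.

    scores: {model: {benchmark: score}}
    For each benchmark, a model is compared with every other model;
    a win counts 1.0, a tie 0.5, a loss 0.0. The win rate is the
    average of these outcomes over all benchmarks and opponents.
    """
    models = list(scores)
    # Only compare on benchmarks that every model was scored on.
    benchmarks = set.intersection(*(set(scores[m]) for m in models))
    rates = {}
    for model in models:
        outcomes = []
        for bench in benchmarks:
            for other in models:
                if other == model:
                    continue
                mine, theirs = scores[model][bench], scores[other][bench]
                if mine > theirs:
                    outcomes.append(1.0)
                elif mine == theirs:
                    outcomes.append(0.5)
                else:
                    outcomes.append(0.0)
        rates[model] = mean(outcomes)
    return rates


# Hypothetical ratings from a three-member jury (illustrative only).
scores = {
    "model_a": {"bench_1": jury_score([4, 5, 4]), "bench_2": jury_score([3, 3, 4])},
    "model_b": {"bench_1": jury_score([3, 4, 3]), "bench_2": jury_score([4, 4, 5])},
    "model_c": {"bench_1": jury_score([2, 3, 2]), "bench_2": jury_score([2, 2, 3])},
}

rates = mean_win_rate(scores)
print(rates)
```

In this toy data, model_a and model_b each win three of their four comparisons and model_c wins none, which illustrates how two models can share the same win rate while leading on different benchmarks.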