Stanford's New Clinical LLM Benchmark
Because clinical tasks are not multiple-choice USMLE questions.
We were all amazed when LLMs first hit the scene – and, even back in the olden days of GPT-3, they could perform passably (pun intended!) well on various medical licensing examinations.
Of course, anyone who has trained in medicine knows USMLE questions are constructed such that they primarily test associative recall or pathognomonic pattern matching. While the distance seemingly traveled over the past few years can appear astounding, it has obvious face validity that a statistical prediction engine would be able to rank the correct answer above a handful of others.
One candidate for the “next generation” – Stanford's MedHELM, a varied set of tasks upon which to test evolving LLMs. These range from questions requiring calculations based on a clinical case, to prediction from clinical notes, to summarization, and beyond. The benchmark summary figure looks like this:
The benchmark authors go on to comment on some of the limitations of the evaluation metrics and room for improvement – their primary concern being the ideal fashion in which to evaluate natural language output in place of BERTScore-F1.
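For readers unfamiliar with the metric, BERTScore-F1 compares a model's free-text output to a reference text by matching contextual token embeddings rather than exact words. A minimal sketch of how such a score might be computed, using the open-source bert-score Python package – the example texts are illustrative and not drawn from MedHELM itself:

```python
# Minimal sketch: scoring a model-generated discharge summary against a
# clinician-written reference with BERTScore-F1 (illustrative texts only).
from bert_score import score

candidates = [
    "Patient admitted with community-acquired pneumonia, treated with "
    "ceftriaxone and azithromycin, discharged on oral antibiotics."
]
references = [
    "Admitted for CAP; improved on ceftriaxone/azithromycin and was "
    "discharged home to complete a course of oral antibiotics."
]

# Returns per-pair precision, recall, and F1 tensors; F1 is the number
# typically reported. rescale_with_baseline maps raw scores onto a more
# interpretable range.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"BERTScore-F1: {F1.item():.3f}")
```

The limitation the authors gesture at is visible even in a toy example like this: a high embedding-overlap score indicates semantic similarity, not that the output is clinically faithful or complete.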
An interesting step on the path towards better, more robust evaluation that more closely replicates actual clinical medicine.