There’s been a bit of chatter about OpenEvidence’s medically tuned LLM reaching a perfect score on USMLE licensing examination questions, improving on the benchmark of 97% set by GPT-5.
Most of the chatter is: “who cares if an LLM can get a perfect score on a multiple choice test, that’s nothing like treating patients!”
However, this misses the other point – revealed in this paper from early this year – that LLMs can frequently identify the correct answer simply from subtle cues in how the answer choices themselves are constructed.
These authors created a fictional organ, the Glianorex, whose functions involve emotional regulation. They used LLMs to generate an entire, internally consistent, fictional embryology, anatomy, and physiology for this organ. Then, they fed this material back into an LLM to draft multiple-choice questions in the style of medical licensing examinations. Finally, they administered these questions to a handful of human clinicians and to a test set of various LLMs.
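For readers who want the shape of that pipeline, here is a minimal sketch of how the generate-then-test loop could look. This is my own illustration, not the authors' code; the `ask_llm()` wrapper, the prompts, and the item format are all hypothetical placeholders.

```python
# Illustrative sketch only -- not the paper's actual code. ask_llm() is a
# hypothetical stand-in for whatever chat-completion API you prefer.

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call; replace with a real API."""
    return f"[model response to: {prompt[:50]}...]"  # placeholder so the sketch runs

# 1. Have an LLM invent an internally consistent "textbook" for the fictional organ.
glianorex_chapter = ask_llm(
    "Write a detailed, internally consistent textbook chapter on the Glianorex, "
    "a fictional organ involved in emotional regulation: embryology, anatomy, "
    "histology, and physiology."
)

# 2. Feed that fictional knowledge back in to draft board-style items.
exam_items = ask_llm(
    "Using only the chapter below, write USMLE-style multiple-choice questions, "
    "each with one keyed answer and three distractors.\n\n" + glianorex_chapter
)

# 3. Administer the identical items to human clinicians and to a panel of LLMs,
#    then compare accuracy against the random-guessing baseline.
```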
The humans scored ~25% on the questions about the Glianorex. It is, after all, a fictional organ, so the best that could be expected is the performance of random guessing. The LLMs, however, scored ~67% on average. Since there is no actual valid medical knowledge to bring to bear, their performance advantage must come from subtle cues distinguishing the correct answers from the incorrect ones. Those cues in answer construction were demonstrated further when LLMs were fed just the answer choices, without the question stems, and still managed nearly 50% correct.
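That "answers only" probe is easy to picture in code. The sketch below is, again, an assumption-laden illustration rather than the study's methodology: it shuffles the answer options, withholds the stem, and checks how often a model still lands on the keyed answer. Both `ask_llm()` and the item format are hypothetical.

```python
import random

def answers_only_accuracy(items, ask_llm):
    """Fraction of items where a model picks the keyed answer from options alone.

    `items` is assumed to be a list of dicts like
    {"options": ["...", "...", "...", "..."], "answer": "..."};
    `ask_llm` is any callable that takes a prompt string and returns text.
    """
    correct = 0
    for item in items:
        options = item["options"][:]
        random.shuffle(options)  # remove positional cues, keep only stylistic ones
        prompt = (
            "No question stem is provided. Which of these answer options is most "
            "likely the correct one on a medical licensing exam?\n"
            + "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
            + "\nReply with the number only."
        )
        reply = ask_llm(prompt)
        digit = next((ch for ch in reply if ch.isdigit()), None)
        if digit is not None and 1 <= int(digit) <= len(options):
            if options[int(digit) - 1] == item["answer"]:
                correct += 1
    return correct / len(items)

# With four options, blind guessing lands at 1/4 = 25%; a score near 50% on
# options alone suggests the keyed answers carry detectable stylistic cues.
```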
This probably indicates that some of the baseline performance of LLMs on medical licensing examinations derives from similar cues, although these data can only suggest that as a hypothesis rather than establish it as fact.
Most importantly, I love the sample questions provided:
And:
“Emotional Intensity Disease!”