The LLM As QA Judge
A good low-risk administrative application, but still a ways to go.
Reviewing cases for “missed opportunities for diagnosis” is a commonly applied quality assurance lookback method in the emergency department. Cases are identified for review using “eTriggers” such as 72-hour returns, escalations in care after admission, etc.
The review process itself, however, is laborious – digesting clinical notes, lab results, and physiology, only to produce low frequencies of true missed opportunities. Sounds like the perfect job for an LLM!
And then:
Compared with a two-physician review panel, the various LLMs, all a few months old now, were able to extract varying percentages of “true positives” while including varying percentages of “false positives”. However, “true positives” are so infrequent that the burden of “false positives” tips the scales most dramatically and, as can be seen above, most cases flagged by the LLM are unlikely to represent missed opportunities for diagnosis. Sensitivity can be sacrificed for yield, so the low sensitivities seen here are not terribly concerning. Even with poor sensitivity, though, the specificity remains inadequate.
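To make the arithmetic behind that point concrete, here is a minimal sketch of how positive predictive value (the share of flagged cases that are real missed opportunities) depends on specificity when true positives are rare. The function name and every number below are illustrative assumptions, not figures from the study.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Fraction of flagged cases that are true missed opportunities."""
    true_pos = prevalence * sensitivity                    # real cases that get flagged
    false_pos = (1.0 - prevalence) * (1.0 - specificity)   # non-cases that get flagged
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration: 2% of triggered
    # cases are true missed opportunities, and the LLM catches half of them.
    for spec in (0.80, 0.90, 0.99):
        ppv = positive_predictive_value(
            prevalence=0.02, sensitivity=0.50, specificity=spec
        )
        print(f"specificity={spec:.2f} -> PPV={ppv:.3f}")
```

Under these assumed inputs, the flagged pile is about 5% real at 80% specificity, about 9% at 90%, and only reaches roughly half real at 99%. That is why a modest sensitivity is tolerable but a middling specificity is not.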
The issue isn’t so much that these LLMs are a “few months old” now, but that they are configured as one-shot prompts rather than as a set of organized tool-using agents retrieving and organizing the data to best evaluate it in the chosen framework. If anything, it is surprising these LLMs performed as well as they did, considering the limited sophistication applied. Better work can be done!

