Are LLMs or Physicians Better for "NOHARM"?
LLMs, unsafe. Physicians, uhm ... also unsafe.
This is a fun project from a coalition of physicians, mostly at Stanford and Massachusetts General Hospital, who pooled specialist knowledge to create a new leaderboard for LLMs in medicine.
They call their benchmark the Numerous Options Harm Assessment for Risk in Medicine (NOHARM): a set of 100 primary care-to-specialist consultations paired with over 4,000 expert annotations. Clinical advice produced by LLMs is then benchmarked against these expert annotations, with the magnitude of any harms graded on a RAND-UCLA Appropriateness Method scale combined with the WHO Classification for Patient Safety. See:
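For a concrete mental model of what one of those expert annotations might look like, here is a minimal sketch; the field names, harm grades, and example values are my assumptions, not the paper’s actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class HarmSeverity(Enum):
    """WHO-style patient-safety harm grades (illustrative labels only)."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3
    DEATH = 4

@dataclass
class ExpertAnnotation:
    """One specialist judgement on a single proposed clinical action."""
    case_id: str
    action: str                  # e.g. "order non-contrast head CT"
    appropriateness: int         # RAND-UCLA scale: 1 (inappropriate) to 9 (appropriate)
    harm_if_taken: HarmSeverity  # worst plausible harm if the action is carried out

# A case pairs one consultation with many such annotations.
annotations = [
    ExpertAnnotation("case-001", "start empiric anticoagulation", 2, HarmSeverity.SEVERE),
    ExpertAnnotation("case-001", "order D-dimer", 7, HarmSeverity.NONE),
]
```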
They then tested a whole array of LLMs against this framework, ultimately producing this summary rating figure:
There is a lot more taxonomy of harm to digest in the full text of the preprint, as well as an evaluation of “guardian” multi-agent configurations as a safety mitigation (sketched below).
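Purely to illustrate the shape of the guardian idea, here is a minimal two-agent sketch; the `call_llm` helper and both prompts are hypothetical placeholders, not the authors’ actual configuration:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError

def answer_with_guardian(consult: str) -> str:
    """Primary agent drafts advice; a guardian agent screens it for harm."""
    draft = call_llm(f"You are a specialist consultant. Advise on:\n{consult}")
    verdict = call_llm(
        "You are a patient-safety reviewer. If the advice below could harm the "
        "patient, reply REVISE followed by safer advice; otherwise reply OK.\n"
        f"Consult: {consult}\nAdvice: {draft}"
    )
    if verdict.startswith("REVISE"):
        return verdict.removeprefix("REVISE").strip()  # guardian's safer rewrite
    return draft
```

The appeal of this arrangement is that the guardian only has to screen one draft at a time, so it can veto or rewrite risky advice without having to solve the whole case itself.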
Finally, we add in the humans:
On the left, a figure showing the balance between “completeness” (including all the correct actions) and “safety” (a weighted average of harmful actions) for LLMs. On the right, the performance of 10 internal medicine physicians on the same cases, compared with the best, average, and worst LLMs. The physicians were allowed to use conventional resources to augment their responses, but not AI. I think it’s pretty fair to say: LLMs are far from perfect, but, sadly, humans now perform pretty much like an average LLM.
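To make those two axes concrete, here is one way such per-case scores could be computed; the 7/3 cut-offs are the RAND-UCLA scale’s conventional appropriate/inappropriate bands, but the matching logic and harm weights are my assumptions, not the preprint’s actual scoring:

```python
def score_response(proposed: set[str], annotations: list[dict]) -> tuple[float, float]:
    """Return (completeness, safety) for one case, both in [0, 1].

    Each annotation: {"action": str, "appropriateness": int 1-9, "harm": int 0-4}.
    RAND-UCLA convention: 7-9 appropriate, 4-6 uncertain, 1-3 inappropriate.
    """
    correct = {a["action"] for a in annotations if a["appropriateness"] >= 7}
    harmful = {a["action"]: a["harm"] for a in annotations if a["appropriateness"] <= 3}

    # Completeness: fraction of expert-endorsed actions the model actually proposed.
    completeness = len(proposed & correct) / len(correct) if correct else 1.0

    # Safety: 1 minus the severity-weighted share of harmful actions taken.
    taken = sum(h for action, h in harmful.items() if action in proposed)
    total = sum(harmful.values())
    safety = 1.0 - (taken / total if total else 0.0)
    return completeness, safety
```

A model that proposes every correct action and no harmful ones scores (1.0, 1.0); proposing only the most dangerous actions drives safety toward zero.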



