Maybe I'm missing something, but if these diagnostic models are just calling general LLMs, isn't NEJM exactly the kind of material the LLMs would be trained on, potentially including the very diagnostic cases they're being tested against? I've seen papers where the authors deliberately created test scenarios that couldn't have been in the training data, but that doesn't seem to be the case here.
Even if they are protecting against that particular source of confounding, the more esoteric or rare the disease, the more likely both your training and testing materials are to consist of a small number of case reports and derivatives of those case reports. As a result, I'd expect the model to perform well against those testing materials, but I see no reason to think it would perform nearly as well on an actual patient presentation.
Pretty sure the NEJM CPCs aren't in the training set – I've seen too many other publications using them as test cases for it to be an invalid measure.
I anticipate we'll see a lot more of these benchmarks trying to simulate human-AI interaction by having the diagnostic model interact with another LLM role-playing the patient, as a scalable, repeatable approximation of the back-and-forth you'd have with a real patient.
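For concreteness, here's a minimal sketch of what such a role-play harness might look like. This is purely illustrative: `call_llm` is a hypothetical placeholder for whatever chat-completion client you'd plug in, and the prompts and turn limit are assumptions, not anything from a published benchmark.

```python
# Hypothetical sketch of an LLM-vs-LLM diagnostic benchmark loop.
# `call_llm` is a placeholder for a chat-completion client: it takes a
# system prompt plus a message history and returns the model's reply.

def call_llm(system_prompt: str, messages: list[dict]) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

PATIENT_SYSTEM = (
    "You are role-playing a patient. Answer only what is asked, "
    "using these case details: {case}"
)
DOCTOR_SYSTEM = (
    "You are a diagnostician. Ask one question per turn. "
    "When confident, reply 'FINAL DIAGNOSIS: <diagnosis>'."
)

def run_case(case_details: str, max_turns: int = 10) -> str:
    """Simulate a back-and-forth between a diagnostician model and a patient model."""
    doctor_view: list[dict] = []   # conversation as the diagnostician sees it
    patient_view: list[dict] = []  # the same exchange from the patient's side

    for _ in range(max_turns):
        # Diagnostician asks the next question (or commits to a diagnosis).
        question = call_llm(DOCTOR_SYSTEM, doctor_view)
        if question.startswith("FINAL DIAGNOSIS:"):
            return question
        doctor_view.append({"role": "assistant", "content": question})
        patient_view.append({"role": "user", "content": question})

        # Patient model answers from the hidden case details.
        answer = call_llm(PATIENT_SYSTEM.format(case=case_details), patient_view)
        patient_view.append({"role": "assistant", "content": answer})
        doctor_view.append({"role": "user", "content": answer})

    return "NO DIAGNOSIS WITHIN TURN LIMIT"
```

The appeal is that the case details stay hidden until elicited, so you're scoring the questioning strategy as well as the final answer, and the whole loop can be rerun cheaply across many cases.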