DeepMind AMIE in Real-World Action
Not quite ready, but not quite far, either.
This is a further evaluation (Google blog here) of the previously covered Google AMIE project – a Gemini-powered chatbot with guardrails, meant to serve as an information-gathering healthcare tool with a human in the loop.
This is the overall setup:
Unfortunately, it’s not really a head-to-head comparison. Everyone who participated in the study was exposed to AMIE – there was no parallel group receiving a pre-visit human chat. The evaluation, then, compares the pre-visit AI summary against the clinician’s follow-up plan from the subsequent urgent care visit – and the clinicians at those visits had already seen the AMIE summary, so the comparison isn’t independent.
The chatbot was “fine”:
Independent clinical review indicated AMIE generated a generally reasonable set of diagnoses, as well as grossly acceptable management plans – rating higher than humans in some cases, though humans were still preferred more often. The domains in which AMIE suffered most fell broadly under “logistics” or “care navigation”: its plans underperformed on practicality and cost-effectiveness.
There are limitations, positives, and negatives to take away here – and the overall gist is that this chatbot probably needs a little more refinement to become clinically acceptable with a human in the loop, and it is definitely not ready to shed human oversight at all. Some bits are harder to fix – the LLM’s attempts to build a differential diagnosis, for example – but the superficial logistics can likely be improved by injecting more local management context into the framework.
Importantly, it’s a nice baby step toward “real world” use – not just exam questions, simulations, or marketing promises.