AI Passes Medical Exams But Still Misses Subtle Diagnoses

The AI That Passed the Medical Exam But Couldn't Read the Patient

In a pilot study that should make every doctor sit up straighter, researchers at Google asked specialists to compare answers from an AI called Med-PaLM 2 with answers from generalist physicians. The specialists preferred the AI's answers 65 percent of the time (Singhal et al., 2025). That number is striking enough to demand a pause.

But here is the part that gets buried: when the same specialists compared Med-PaLM 2's answers with answers from other specialists, the human experts still won. The AI was not the best in the room. It was better than the generalist, but not the specialist. And that difference, that gap between passing an exam and practicing medicine, is where this story actually lives.

How an AI Learned to Ace the USMLE

Med-PaLM 2 is not your father's chatbot. It is a large language model built on Google's PaLM 2, then fine-tuned on medical text and prompted to reason aloud. Singhal and colleagues tested it on the MedQA dataset, which contains questions in the style of the United States Medical Licensing Examination. The AI scored 86.5 percent, more than 19 percentage points above its predecessor Med-PaLM (Singhal et al., 2025).

That score is high enough to pass the real exam. It is also higher than what the average human test taker scores. But the USMLE is a multiple-choice test. It asks you to pick the right answer from a list. Real medicine does not work that way.

The researchers knew this. So they built a more challenging evaluation. They created a framework where physicians rated the AI's answers across nine clinical axes, including accuracy, reasoning, and safety. On eight of those nine axes, physicians preferred Med-PaLM 2's answers over answers from other physicians (Singhal et al., 2025). That is not a typo. The AI beat human doctors on their own turf, at least when it came to answering questions.

The Test That Caught the AI Sleeping

Here is where the story gets interesting. The researchers did something smart. They built adversarial datasets designed to probe the AI's limitations. These were not standard exam questions. They were edge cases, rare presentations, and subtle variations that would trip up a model trained on textbook patterns.

On these adversarial questions, Med-PaLM 2 showed significant improvements over its predecessor, with a P value below 0.001 (Singhal et al., 2025). That is statistically solid. But the fact that the researchers felt the need to build adversarial datasets at all tells you something important. They expected the AI to struggle with the kind of diagnostic nuance that separates a good doctor from a great one.

Consider a patient who comes in with chest pain. The textbook answer is heart attack. The AI will nail that. But what if the patient is a 28-year-old woman with anxiety, and the pain is sharp and positional? That might be pericarditis. Or costochondritis. Or a pulmonary embolism. The difference matters. The treatments differ, and so do the stakes.

The AI knows the textbook. But does it know the patient?

The Specialist Problem

The pilot study using real-world medical questions is the most revealing part of this paper. The researchers took actual clinical questions, the kind a doctor might ask a colleague during a consult, and had Med-PaLM 2 answer them. Then they asked specialists and generalists to compare the AI's answers with answers from human physicians.

The results were nuanced. Specialists preferred the AI's answers to generalist answers 65 percent of the time (Singhal et al., 2025). That is a win for the AI, but it is also a commentary on the state of generalist medicine. If an AI can outperform a generalist, what does that say about the training and support we give primary care doctors?

But when the comparison shifted to specialist versus AI, the humans won. Specialists preferred their own kind. The AI was not better than the expert. It was better than the non-expert. That distinction matters because most medical decisions in the real world are made by generalists: emergency room physicians, internists, family doctors. They are the ones who see the undifferentiated patient, the one who has not yet been sorted into a specialty bucket.

The AI can help them. But it cannot replace the specialist who has seen a thousand cases of a rare disease and can recognize the pattern in a way no textbook captures.

How They Built the Thing

The methodology behind Med-PaLM 2 is worth understanding because it explains both the success and the failure. The team used a combination of base LLM improvements, medical domain fine-tuning, and two new strategies: ensemble refinement and chain of retrieval (Singhal et al., 2025).

Ensemble refinement means the model generates multiple candidate answers, each with its own reasoning, then reads those drafts back and produces a refined answer. It is like asking ten doctors for an opinion and then weighing what they said before committing. Chain of retrieval means the model fetches relevant medical information before answering, rather than relying solely on what it memorized during training. This is closer to how a human doctor works. You do not store every fact in your head. You know where to look.
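To make the idea concrete, here is a minimal sketch of ensemble refinement, assuming a two-stage reading of the technique: sample several drafts, have the model answer again with those drafts in view, then take the most common refined answer. The `ask_model` function is a hypothetical stand-in that fakes a model with random choices; it is not Med-PaLM 2's interface, and the draft counts and probabilities are illustrative assumptions rather than values from the paper.

```python
import random
from collections import Counter

OPTIONS = ["A", "B", "C", "D"]

def ask_model(question, drafts=None, rng=None):
    """Hypothetical stand-in for a language model call (an assumption).

    Returns (reasoning, answer). A real system would send a prompt to an
    LLM here; this toy version only imitates the shape of the interaction.
    """
    rng = rng if rng is not None else random
    if drafts:
        # With earlier drafts in view, usually side with their majority.
        majority = Counter(a for _, a in drafts).most_common(1)[0][0]
        answer = majority if rng.random() < 0.9 else rng.choice(OPTIONS)
    else:
        # Toy behaviour: option "B" is favoured, with noise.
        answer = "B" if rng.random() < 0.6 else rng.choice(OPTIONS)
    return f"reasoning leading to {answer}", answer

def ensemble_refine(question, n_drafts=11, n_refinements=5, seed=0):
    rng = random.Random(seed)
    # Stage 1: sample several independent reasoning drafts.
    drafts = [ask_model(question, rng=rng) for _ in range(n_drafts)]
    # Stage 2: answer again with all drafts in view, several times over.
    refined = [
        ask_model(question, drafts=drafts, rng=rng)[1]
        for _ in range(n_refinements)
    ]
    # Final answer: plurality vote over the refined answers.
    return Counter(refined).most_common(1)[0][0]

if __name__ == "__main__":
    print(ensemble_refine("Which option best explains the chest pain?"))
```

The point of the second stage is that the model does not just count votes. It gets to read the competing lines of reasoning before it commits, which is the part that resembles a doctor weighing colleagues' opinions.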

The fine-tuning data came from medical question answering datasets such as MedQA. But the model also picked up the structure of medical reasoning. It was prompted to explain its thinking, not just spit out an answer. This is critical because medicine is not only about knowing the right answer. It is about knowing why that answer is right and why the alternatives are wrong.

The human evaluation framework involved physicians rating the AI's answers on accuracy, reasoning, safety, and other axes. The fact that physicians preferred the AI on eight of nine axes is a genuine achievement. But it is also a reflection of the evaluation itself. The physicians were judging answers to questions, not interactions with patients. In a clinical encounter, the doctor does not just answer a question. They ask follow-up questions. They observe body language. They notice the patient is holding their breath when they describe the pain.
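A pairwise preference evaluation of this kind boils down to a tally per axis. The sketch below shows that bookkeeping under stated assumptions: the ratings are invented toy data, only the three axes named above are used, and the helper names `preference_rates` and `axes_won` are hypothetical. It is not the paper's analysis, which also involved significance testing.

```python
from collections import defaultdict

# Each record: (axis, which answer the rater preferred). Toy data only,
# not results from the study.
ratings = [
    ("accuracy", "ai"), ("accuracy", "ai"), ("accuracy", "physician"),
    ("reasoning", "ai"), ("reasoning", "tie"), ("reasoning", "ai"),
    ("safety", "physician"), ("safety", "ai"), ("safety", "physician"),
]

def preference_rates(records):
    """Return, per axis, the share of ratings for each outcome."""
    counts = defaultdict(lambda: {"ai": 0, "physician": 0, "tie": 0})
    for axis, choice in records:
        counts[axis][choice] += 1
    rates = {}
    for axis, c in counts.items():
        total = sum(c.values())
        rates[axis] = {k: v / total for k, v in c.items()}
    return rates

def axes_won(rates, side="ai", other="physician"):
    """List the axes on which `side` was preferred more often than `other`."""
    return [axis for axis, r in rates.items() if r[side] > r[other]]

if __name__ == "__main__":
    rates = preference_rates(ratings)
    print(axes_won(rates))  # ['accuracy', 'reasoning'] for the toy data
```

Note what the tally cannot capture: every record is a judgment about a written answer. Nothing in this data structure has room for a follow-up question or a patient's body language.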

What the AI Still Misses

The paper is honest about limitations. The adversarial dataset results, while improved, still showed gaps. The AI struggled with questions that required integrating multiple pieces of information that were not explicitly connected in the training data. It also struggled with questions that required understanding a patient's social context, like whether they can afford a medication or whether they have a support system at home.

These are not minor issues. They are the difference between a correct answer and a correct diagnosis. A patient with diabetes who cannot afford insulin does not need a textbook recommendation. They need a doctor who understands their reality and can find a solution that works within it.

The AI also showed limitations in long-form medical question answering (Singhal et al., 2025). When the question required a detailed explanation, the AI's answers sometimes sounded confident but contained subtle errors. This is the same problem that plagues all large language models. They are fluent. They are not always accurate.

The Research Does Not Prove AI Is Ready for the Clinic

Let me be precise about what this paper does and does not show. It shows that Med-PaLM 2 can pass a medical exam, generate answers that physicians prefer to other physicians' answers, and improve on its predecessor across multiple metrics. It shows that in a controlled setting, the AI can match the safety of physician answers, at least as rated by other physicians.

What it does not show is that the AI can practice medicine. The study did not test the AI in actual clinical workflows. It did not measure patient outcomes. It did not evaluate whether the AI's answers lead to better or worse care when implemented. The safety ratings came from physicians reading the AI's answers, not from observing real patient interactions.

This is not a flaw in the paper. It is a limitation the authors acknowledge. But it is the most important limitation to understand. Passing a test is not the same as practicing a profession. A pilot who passes the written exam still needs hundreds of hours of flight time before they can carry passengers. A doctor who passes the USMLE still needs residency, and often fellowship, before they can practice independently.

The AI has passed the written exam. It has not done the residency.

What This Actually Means

The Singhal et al. (2025) paper is a genuine step forward. It demonstrates that large language models can reach a level of medical knowledge that is competitive with human physicians. But the details of the results tell a more specific story: the preference for specialists over the AI, the gaps in adversarial testing, the limitations in long-form reasoning.

Here is what the research actually means for medicine, for patients, and for the doctors who will use these tools.

▸The AI is ready to be a second opinion, not the first. The data shows the AI outperforms generalists but not specialists. This suggests the most useful role for the technology is as a triage tool. A generalist can consult the AI for a differential diagnosis, then escalate to a specialist when the AI's answer seems uncertain or when the case is complex. The AI is a force multiplier for the non-expert, not a replacement for the expert.

▸Medical education needs to change. If an AI can answer USMLE questions better than most humans, the exam is no longer a valid test of clinical skill. Medical schools and licensing boards need to shift toward evaluating how doctors interact with AI tools, not just what they know. The skill of the future is knowing when to trust the AI and when to override it.

▸The adversarial dataset is the most important part of the paper. The fact that the researchers built specific tests to probe the AI's weaknesses, and that the AI still showed gaps, tells us where to focus. Rare diseases, atypical presentations, and patients who do not fit the textbook profile will remain the domain of human expertise. The AI is excellent at the common. It is not yet good at the uncommon.

▸Safety is not the same as accuracy. The study found that physicians rated the AI's answers to be as safe as human answers. But safety in a controlled evaluation is not the same as safety in a chaotic emergency room at 3 AM. Real-world safety depends on the AI's ability to recognize when it does not know something, to ask for clarification, and to defer to human judgment. These are skills this study did not test, and the current generation of AI has not reliably shown them.

▸The specialist's value just went up. If the AI can handle the routine cases, the specialist's job shifts toward the complex, the rare, and the human. The doctor who can integrate social context, understand a patient's values, and make decisions under uncertainty becomes more valuable, not less. The AI takes over the memorization. The doctor takes over the judgment.

The Singhal et al. (2025) paper is a milestone, but it is a milestone on a long road. The AI passed the exam. It has not yet passed the test of the patient sitting in front of it, scared and confused, looking for someone who understands not just the disease, but the person who has it. That test is still waiting. And it is the only one that matters.

References

[1] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, et al. (2025). Toward expert-level medical question answering with large language models. Nature Medicine.