When AI Doctors Outperform Human Physicians on Clinical Exams

The Doctor Will See You Now. The AI Will Take the Exam.

On a multiple-choice test designed to measure whether someone is ready to practice medicine, a machine just crushed the curve. In 2023, a team from Google Research led by Karan Singhal fed a large language model called Flan-PaLM a battery of questions from the US Medical Licensing Exam and related benchmarks. The model scored 67.6 percent on MedQA, the dataset of USMLE-style questions. That is more than 17 percentage points higher than any previous AI system had managed (Singhal et al., 2023). For context, the typical passing score for human medical students hovers around 60 percent.

But here is the strange part. When the same researchers asked human physicians to evaluate the model's answers for qualities like comprehension, reasoning, and potential for harm, the doctors spotted problems the automated scoring missed. The AI could pick the right answer. It could not always explain why the wrong ones were dangerous.

This is the paradox at the heart of modern medical AI. The machines are acing the tests. They are failing the bedside manner.

What the AI Actually Did

Singhal and his colleagues did not just throw questions at a chatbot. They built a new benchmark called MultiMedQA, which combines six existing medical question datasets plus a fresh one called HealthSearchQA that captures real questions people type into Google. The goal was to measure not just textbook knowledge but practical clinical reasoning across domains including professional medicine, research, and consumer health queries.

The model they tested was PaLM, a 540-billion-parameter language model, and its instruction-tuned variant Flan-PaLM. To get the best performance, the researchers used a combination of prompting strategies. They did not just ask the model to answer. They showed it a few worked examples, asked it to reason step by step, and sampled several reasoning paths before settling on the most common answer. This approach pushed accuracy on every multiple-choice dataset to state-of-the-art levels. On MedMCQA, a dataset of Indian medical exam questions, the model hit 57 percent. On PubMedQA, which requires reading biomedical abstracts, it reached 79 percent (Singhal et al., 2023).
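One way to combine step-by-step reasoning with a check across alternative answers is self-consistency voting: sample several reasoning chains for the same question and keep the answer most of them land on. The sketch below is an illustration under assumptions, not the authors' code. The function name and the sample answers are hypothetical, and it assumes the final answer letter has already been extracted from each sampled chain.

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Return the most common final answer and the share of samples that chose it."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Hypothetical final answers pulled from five sampled reasoning chains.
samples = ["B", "B", "C", "B", "A"]
answer, agreement = self_consistency_vote(samples)
print(answer, agreement)
```

The agreement share is a rough signal of how stable the answer is: a question where the chains split evenly deserves more suspicion than one where they all agree.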

These numbers matter because they represent a genuine leap. Prior to this work, the best AI systems struggled to break 50 percent on MedQA. The jump from roughly 50 percent to 67.6 percent is not incremental. It is the difference between a system that would fail the exam and one that clears the roughly 60 percent passing mark.

Why Multiple Choice Is Not Enough

Here is where the story gets interesting. The researchers did not stop at automated scoring. They recruited a panel of clinicians to read the model's answers and rate them along multiple axes: factuality, comprehension, reasoning, possible harm, and bias. This is rare in AI research. Most papers report a single accuracy number and declare victory. Singhal and his team wanted to know whether the model could actually think like a doctor.
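To make the idea of rating along multiple axes concrete, here is a minimal sketch of how such reviews could be tallied. Everything in it is an assumption for illustration: the axis names follow the list above, each review is reduced to a yes/no flag per axis, and the ratings are invented rather than taken from the paper.

```python
AXES = ("factuality", "comprehension", "reasoning", "possible_harm", "bias")


def flag_rates(ratings):
    """Fraction of reviewed answers flagged as problematic on each axis."""
    if not ratings:
        raise ValueError("need at least one rating")
    return {
        axis: sum(1 for r in ratings if r[axis]) / len(ratings)
        for axis in AXES
    }


def make_rating(*flagged_axes):
    """Build one review; axes not listed are treated as unflagged."""
    return {axis: axis in flagged_axes for axis in AXES}


# Hypothetical reviews of four model answers (True means a problem was flagged).
ratings = [
    make_rating(),
    make_rating("reasoning"),
    make_rating("reasoning", "possible_harm"),
    make_rating("factuality"),
]

for axis, rate in flag_rates(ratings).items():
    print(f"{axis}: {rate:.2f}")
```

A single accuracy number would hide exactly what this table exposes: an answer can be marked correct and still be flagged for weak reasoning or possible harm.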

The results were sobering. While the model outperformed earlier systems, it still fell short of human clinicians on every dimension except one. The model was better at recalling specific facts from medical textbooks. But when it came to understanding the context of a patient's situation, reasoning through differential diagnoses, or recognizing when an answer might cause harm, the humans were clearly superior (Singhal et al., 2023).

This gap matters because medicine is not a multiple choice test. Real patients do not present with neatly labeled options. They describe symptoms in messy, contradictory language. They forget details. They lie. They are scared. A doctor needs to know not just the right answer but when to say "I need more information" or "This could be something dangerous."

The model could not do that reliably.

How They Fixed It Without Starting Over

The researchers then tried something clever. Instead of retraining the massive model from scratch, which would require enormous computational resources and medical data that is hard to obtain, they used a technique called instruction prompt tuning. This is a parameter-efficient method that aligns a pretrained model to a new domain using just a few examples.

Think of it like giving a brilliant but unfocused student a few sample exam questions with detailed explanations of what makes a good answer. The student does not need to relearn medicine. They just need to understand the format and expectations of the test.
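The core mechanic of prompt tuning can be shown with a toy: keep the model's weights frozen and train only a small "soft prompt" vector that is fed in alongside the input. The sketch below is a deliberately tiny stand-in, not Med-PaLM's setup. The frozen "model" is a hypothetical logistic scorer, the weights and examples are invented, and in a real system the soft prompt would be a sequence of embedding vectors prepended to the input of a frozen transformer.

```python
import math

# Frozen "model" weights: these are never updated during tuning.
FROZEN_W_PROMPT = (0.8, -0.5)  # weights applied to the soft prompt slots
FROZEN_W_INPUT = (1.2,)        # weights applied to the input features


def predict(soft_prompt, x):
    """Probability of label 1 given the soft prompt and input x."""
    logit = sum(w * p for w, p in zip(FROZEN_W_PROMPT, soft_prompt))
    logit += sum(w * xi for w, xi in zip(FROZEN_W_INPUT, x))
    return 1.0 / (1.0 + math.exp(-logit))


def log_loss(soft_prompt, examples):
    """Mean negative log-likelihood over the examples."""
    total = 0.0
    for x, y in examples:
        p = predict(soft_prompt, x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(examples)


def tune_prompt(examples, steps=200, lr=0.5):
    """Gradient descent on the soft prompt only; model weights stay fixed."""
    soft_prompt = [0.0, 0.0]
    for _ in range(steps):
        for x, y in examples:
            err = predict(soft_prompt, x) - y  # d(loss)/d(logit) for log loss
            soft_prompt = [
                p - lr * err * w for p, w in zip(soft_prompt, FROZEN_W_PROMPT)
            ]
    return soft_prompt


# Three hypothetical exemplars: (input features, desired label).
examples = [([0.2], 0), ([0.5], 0), ([2.0], 1)]

before = log_loss([0.0, 0.0], examples)
tuned = tune_prompt(examples)
after = log_loss(tuned, examples)
print(f"loss before: {before:.3f}, after: {after:.3f}")
```

Only two numbers are learned here, which is the point of the analogy: the student's knowledge (the frozen weights) is untouched, and the handful of exemplars only adjusts how the question is framed.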

The resulting model, called Med-PaLM, showed marked improvement on the human evaluation metrics. Comprehension scores went up. Reasoning got better. The model learned to flag its own uncertainty more often. But even after tuning, it remained inferior to clinicians on most metrics (Singhal et al., 2023). The gap narrowed. It did not close.

What the Model Still Cannot Do

This is the part of the story that gets lost in the hype. The paper is careful to document what the model does not do well. It sometimes produces answers that are factually correct but clinically inappropriate. For example, it might recommend a treatment that is technically indicated but dangerous for a specific patient because of an interaction the model did not consider. It can also produce answers that are internally consistent but based on a misunderstanding of the question.

The researchers also found that the model's performance varied dramatically depending on how the question was phrased. A small change in wording could drop accuracy by 10 or 15 percentage points. Human doctors do not have this fragility. They can handle a question asked in different ways because they understand the underlying concepts, not just the surface pattern.

Perhaps most concerning, the model showed signs of bias. It performed worse on questions about certain demographic groups, echoing patterns seen in other language models trained on internet text. The researchers note that "human evaluation reveals key gaps" in the model's ability to handle questions that involve race, gender, or socioeconomic status (Singhal et al., 2023).

The Open Question Nobody Wants to Ask

Here is the question that keeps me up at night. If a model can score 67.6 percent on a medical licensing exam, and if that number keeps going up with each new generation of models, at what point do we trust it? The paper shows that accuracy alone is a poor proxy for clinical competence. But the trend line is clear. Models are getting better. Human performance is not.

The authors do not answer this question. They do not even really ask it. But their data forces the issue. If a model can outperform the average medical student on a standardized test, and if it can be tuned to match clinicians on some dimensions of human evaluation, then the line between "AI assistant" and "AI doctor" starts to blur. The paper suggests that models have "potential utility" in medicine. That is an understatement. The potential is enormous. So are the risks.

What This Actually Means

▸Test performance is not clinical performance. The 67.6 percent score on MedQA is impressive, but it measures pattern matching, not understanding. A model that scores well on exams could still give dangerous advice in a real clinical setting. Do not confuse benchmark accuracy with medical competence.

▸Human evaluation is essential and expensive. The Singhal team's decision to use clinician reviewers is the gold standard. Any company claiming their AI can practice medicine should be required to show human evaluation results, not just automated scores. If they cannot or will not, treat their claims with skepticism.

▸Instruction tuning works, but it is not magic. Med-PaLM improved significantly with just a few examples, but it still fell short of humans. This suggests that better prompts alone may not be enough, and that matching human clinical reasoning could require deeper changes to the models themselves.

▸Bias is baked into the data. The model's worse performance on certain demographic questions is not a bug. It is a feature of training on internet text. Any clinical AI must be rigorously tested for bias before deployment, and the burden of proof should be on the developers.

▸The doctor's role is changing, not disappearing. If AI can handle exam-style knowledge retrieval, then human physicians can focus on what they do best: reasoning through uncertainty, building trust with patients, and making judgment calls that no algorithm can. The question is not whether AI will replace doctors. It is whether doctors will use AI well.

References

[1] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, et al. (2023). Large language models encode clinical knowledge. Nature.