ChatGPT Nearly Passed the Medical Licensing Exam

The Exam That Almost Broke

On a Wednesday afternoon in early 2023, a group of researchers at a large academic medical center did something that felt almost illicit. They opened a browser tab, logged into ChatGPT, and began feeding it questions from the United States Medical Licensing Exam. Not the easy ones. The hard ones. The ones that make medical students cry in library stairwells.

What happened next was not supposed to happen.

The AI had never been trained specifically for medicine. It had never sat through a single lecture on pathophysiology. It had never touched a patient. And yet, when the scores came in, ChatGPT had performed at or near the passing threshold on all three parts of the USMLE. Not a landslide. A near pass.

Tiffany H. Kung and her colleagues, a team based largely at the health care startup AnsibleHealth, published the results in PLOS Digital Health (Kung et al., 2023). The paper has since accumulated over 3,400 citations. But the number that matters most is this: 60 percent. That was roughly ChatGPT's overall accuracy across the three exams. The passing threshold is also roughly 60 percent.

The machine scraped by.

But here is the thing about scraping by on the USMLE. It means you command the basic science of medicine at the level expected of a new physician. It means you can reason through a differential diagnosis. It means you can interpret lab values and recognize patterns in clinical vignettes. It means you are doing far better than guessing.

Kung and her team did not train the model. They did not fine-tune it. They did not give it a single hint. They simply copied questions from publicly available USMLE practice materials, pasted them into the chat window, and recorded what came back. The AI answered 376 questions from Step 1, 345 from Step 2CK, and 346 from Step 3. It landed at or near the passing mark on all three.

The authors were careful to note what this did not mean. ChatGPT is not ready to see patients. It cannot examine a person. It cannot ask follow-up questions. But the fact that it could reach the passing range at all, with no specialized training, suggested something that made the medical education establishment nervous.

The exam that has terrorized generations of medical students had just been matched, or very nearly matched, by a chatbot.

How They Tricked an AI Into Taking the Boards

The methodology was elegant in its simplicity. Kung and her team collected 1,067 multiple-choice questions from USMLE practice resources. These were not the official exam questions, which are locked behind strict copyright and security protocols. They were publicly available practice questions that mimic the real thing.

Each question was presented to ChatGPT exactly as written. The researchers did not rephrase anything. They did not provide additional context. They did not tell the AI it was taking a medical exam. They simply asked it to answer.

This matters because ChatGPT is not a medical AI. It is a general-purpose language model trained on a vast corpus of internet text. That corpus likely spans everything from encyclopedia entries and forum threads to medical journals, fan fiction, and recipe blogs. The model has never been optimized for clinical reasoning. It has never been told that its answers could affect whether a future doctor knows how to treat a heart attack.

The researchers evaluated two things: accuracy and concordance. Accuracy was straightforward. Did ChatGPT pick the right answer? Concordance was more interesting. They asked ChatGPT to explain its reasoning for each answer, and then had human evaluators judge whether the explanation made sense and was internally consistent.
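The two measures can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' scoring code: the `GradedItem` record, its field names, and the four sample items are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GradedItem:
    """One graded question. Every field here is a hypothetical stand-in."""
    chosen: str       # answer letter the model picked
    correct: str      # answer letter from the answer key
    concordant: bool  # reviewers judged the explanation coherent and consistent


def accuracy(items):
    """Fraction of items where the model picked the keyed answer."""
    if not items:
        raise ValueError("no items to score")
    return sum(item.chosen == item.correct for item in items) / len(items)


def concordance(items):
    """Fraction of items whose explanation reviewers rated as consistent."""
    if not items:
        raise ValueError("no items to score")
    return sum(item.concordant for item in items) / len(items)


# Made-up sample data, for illustration only.
items = [
    GradedItem("A", "A", True),
    GradedItem("C", "B", True),   # wrong answer, yet a self-consistent explanation
    GradedItem("D", "D", True),
    GradedItem("B", "B", False),  # right answer, but the explanation does not hold up
]

print(f"accuracy:    {accuracy(items):.0%}")
print(f"concordance: {concordance(items):.0%}")
```

The second and fourth sample items show why the two numbers are tracked separately: a model can be wrong but coherent, or right for reasons that do not hold together.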

The accuracy numbers were striking. On Step 1, which tests basic science knowledge, ChatGPT got 63.6 percent correct. On Step 2CK, which tests clinical knowledge, it got 57.8 percent. On Step 3, which tests clinical management, it got 62.4 percent. Two of the three scores cleared the roughly 60 percent passing threshold. Step 2CK fell just short of it.
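As a quick arithmetic check, the per-step figures can be combined into a single weighted average. The sketch below uses only the question counts and percentages quoted in this article, and it assumes the 60 percent threshold is an approximation, as the text says. The names `STEPS`, `PASSING`, and `overall_accuracy` are made up for the example.

```python
# Figures as quoted in this article: (questions answered, fraction correct).
STEPS = {
    "Step 1": (376, 0.636),
    "Step 2CK": (345, 0.578),
    "Step 3": (346, 0.624),
}
PASSING = 0.60  # approximate passing threshold cited in the article


def overall_accuracy(steps):
    """Average accuracy weighted by how many questions each step contributed."""
    total = sum(n for n, _ in steps.values())
    correct = sum(n * acc for n, acc in steps.values())
    return correct / total


def steps_below(steps, threshold):
    """Names of the steps whose accuracy falls under the threshold."""
    return [name for name, (_, acc) in steps.items() if acc < threshold]


for name, (n, acc) in STEPS.items():
    status = "below" if acc < PASSING else "at or above"
    print(f"{name}: {acc:.1%} on {n} questions ({status} the threshold)")

print(f"Overall: {overall_accuracy(STEPS):.1%}")
```

On these numbers the weighted average comes out a little above 61 percent, which squares with the "roughly 60 percent" overall figure, while Step 2CK on its own sits under the line.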

But here is where it gets weird. ChatGPT was not equally good at everything. It was significantly better at questions that required reasoning than questions that required memorization. The authors found that the model excelled at questions involving clinical reasoning, pathophysiology, and differential diagnosis. It struggled with questions that demanded specific factual recall, like drug dosages or rare disease presentations.

This is the opposite of what you might expect. Medical students spend years memorizing facts. ChatGPT, by contrast, had never deliberately studied anything. It had simply absorbed patterns from text. It could often reason its way through a clinical problem even when its recall of specific facts was shaky.

That is either deeply impressive or deeply unsettling, depending on your perspective.

What the Machine Got Right

The researchers dug into the specific types of questions where ChatGPT performed best. The pattern was clear: the model was strongest on questions that required synthesis of information.

Consider a typical USMLE question. It might describe a 55-year-old man with chest pain, shortness of breath, and a history of smoking. The answer choices include myocardial infarction, pulmonary embolism, pneumonia, and pericarditis. A human student has to integrate the symptoms, risk factors, and physical exam findings to arrive at the most likely diagnosis.

ChatGPT did this well. It could hold multiple pieces of information in its context window and weigh them against each other. It could recognize that chest pain plus smoking plus shortness of breath points toward a heart attack more than a lung infection. It could treat pericarditis as less likely when the vignette described pain that did not change with position.

This kind of reasoning is what medical schools spend years teaching. And here was a machine that could do it without ever having seen a patient.

The concordance analysis was even more revealing. When ChatGPT explained its reasoning, human evaluators rated those explanations as highly coherent and medically sound. The AI did not just guess the right answer. It could articulate a plausible chain of reasoning that led to that answer. In many cases, its explanations were indistinguishable from what a human medical student might write.

But there was a catch. The researchers noted that ChatGPT sometimes produced explanations that were internally consistent but factually wrong. It would generate a beautiful logical argument for the wrong diagnosis. This is a known problem with large language models. They are extremely good at sounding confident, even when they are completely mistaken.

The Questions That Broke It

ChatGPT did not perform equally well in every subject. The researchers broke down performance by medical specialty, and the results were revealing.

The model performed best on questions related to pathology, pharmacology, and microbiology. These are subjects with clear patterns and well-defined relationships. A bacterium causes a specific disease. A drug binds to a specific receptor. The logic is linear.

It performed worst on questions related to surgery, obstetrics, and pediatrics. These are subjects that require procedural knowledge, developmental context, and an understanding of physical anatomy that cannot be fully captured in text. You cannot learn to tie a suture by reading about it. You cannot understand the mechanics of childbirth from a textbook alone.

This makes intuitive sense. ChatGPT has never held a scalpel. It has never delivered a baby. It has never examined a child. Its knowledge is entirely textual, and some aspects of medicine resist textual representation.

But there was another category of questions that stumped the AI. Questions that required understanding of social context, patient preferences, or ethical considerations. A question about how to counsel a patient who refuses treatment. A question about cultural factors that affect disease prevalence. ChatGPT could not handle these.

The model would give technically correct answers that were socially tone-deaf. It would recommend treatments without considering cost. It would suggest interventions without accounting for patient autonomy. These are not failures of medical knowledge. They are failures of judgment about the person in front of you.

What the Study Does Not Prove

It is important to be precise about what Kung and her colleagues actually demonstrated. They showed that ChatGPT can pass a multiple choice exam. That is not the same as showing it can practice medicine.

The USMLE is a proxy for clinical competence, not a measure of it. It tests knowledge and reasoning in an artificial environment. There are no real patients. No time pressure. No distractions. No one is crying. No one is dying. The stakes are imaginary.

The authors themselves were careful to hedge. They wrote that their results suggest "large language models may have the potential to assist with medical education, and potentially, clinical decision making." Note the double hedge. "May have the potential." That is academic language for "we are not sure yet."

What the study does not address is whether ChatGPT can apply its knowledge in real time, under real conditions, with real consequences. Some estimates attribute hundreds of thousands of deaths every year to medical error. An AI that is right 60 percent of the time is not ready for the clinic. It is not even ready for the classroom, not really.

There is also the question of generalizability. The study used practice questions, not actual exam questions. Practice questions can be easier than the real thing. They are also more likely to appear in the training data. ChatGPT may have seen some of these questions before, or questions very similar to them. That kind of contamination is hard to rule out completely from the outside, because the model's training data is not public.
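One common heuristic for screening this kind of leakage is to measure how many of a question's word sequences also appear verbatim in some reference text. The sketch below is an assumption-laden illustration, not something the study is described as doing here: the function names, the 5-word window, and the sample strings are all made up, and the check only works against text you actually have in hand.

```python
import re


def ngrams(text, n=5):
    """Set of lowercase word n-grams in the text (punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, reference, n=5):
    """Share of the question's n-grams that also occur in the reference text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    return len(question_grams & ngrams(reference, n)) / len(question_grams)


# Hypothetical strings, for illustration only.
question = "A 55 year old man presents with chest pain and shortness of breath"
leaked = ("Forum post: a 55 year old man presents with chest pain "
          "and shortness of breath. What is the diagnosis?")
unrelated = "Preheat the oven and whisk two eggs with sugar until pale"

print(overlap_fraction(question, leaked))
print(overlap_fraction(question, unrelated))
```

A high fraction flags a question worth a closer look. A low one proves little, since a paraphrased copy would slip straight past an exact-match check like this.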

And then there is the problem of hallucination. Large language models are known to invent facts with complete confidence. A medical AI that hallucinates could kill someone. The study did not test for this systematically.

The Deeper Implication Nobody Is Talking About

Here is what the study actually reveals, and it is more interesting than whether ChatGPT can pass a test.

The USMLE is designed to test whether a human being has acquired the knowledge and reasoning skills necessary to practice medicine safely. It is a high-stakes exam that takes years to prepare for. Medical students spend tens of thousands of dollars on prep courses. They sacrifice sleep, relationships, and mental health.

And a general-purpose chatbot came within reach of passing it with no preparation at all.

This does not mean medical education is obsolete. But it does mean that the exam itself may be measuring something that is not uniquely human. The ability to answer multiple choice questions about medical facts and clinical reasoning is now demonstrably achievable by a machine. That should make us ask what else the exam is missing.

The parts of medicine that ChatGPT cannot do are the parts that matter most. The physical exam. The therapeutic relationship. The ability to sit with a patient in silence. The judgment to know when the textbook does not apply. The humility to say "I do not know."

Kung and her colleagues did not set out to critique medical education. But their study does exactly that, inadvertently. If a machine can reach the passing range on the USMLE, then the USMLE is not testing what we thought it was testing. It is testing pattern recognition and text-based reasoning, which are the parts of medicine that are easiest to automate.

The hard parts remain.

What This Actually Means

▸ Medical exams will need to be rethought. If a general-purpose AI can reach the passing range on the USMLE, a multiple-choice score alone is a weaker signal of human clinical competence than it used to be. Future exams will need to put more weight on skills that machines cannot yet replicate: physical examination, procedural competence, ethical reasoning, and the ability to navigate uncertainty.

▸ AI will become a study tool, not a replacement. The authors suggest that ChatGPT could help medical students by explaining concepts, generating practice questions, and providing feedback. This is probably the most immediate application. Not replacing doctors, but helping them learn faster.

▸ Clinical decision support is coming, but slowly. The study shows that AI can reason about medical problems, but it is not reliable enough to trust without human oversight. The most likely near-term use is as a second opinion, a tool that suggests possibilities that a human clinician can then verify or reject.

▸ The bottleneck is not knowledge, it is judgment. ChatGPT has demonstrated that a good deal of medical knowledge can be encoded and retrieved by a machine. What it cannot do is apply that knowledge in context, with empathy, and with an understanding of the individual patient. That is where human doctors will remain essential.

▸ The bar for AI in medicine just got higher. Now that we know a general chatbot can perform at the level of the licensing exam, the question shifts from "can AI do this?" to "should AI do this?" The answer requires more than technical capability. It requires a conversation about safety, ethics, and the nature of healing itself.

Kung and her team did not solve medical education. They did something more interesting. They showed that the test we use to measure medical competence may not be measuring what we think it is. And that is a finding worth discussing over coffee, or over a stiff drink, depending on whether you are a medical student or a medical educator.

References

[1] Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, et al. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health.