The Exam That ChatGPT Passed Without Breaking a Sweat

In February 2023, a group of researchers at Yale University did something that would have sounded absurd a decade ago. They sat down at a computer, opened a browser tab, and asked a chatbot to answer questions written for the United States Medical Licensing Examination. Not a simplified version. Real exam-style questions, including a sample set released by the exam's own creators.
The chatbot was ChatGPT, a language model that had been released to the public only three months earlier. It had never studied medicine. It had never seen a patient. It had no body, no hands, no stethoscope. But when the researchers fed it 189 questions from the National Board of Medical Examiners (NBME) Free 120 question set, the model scored 64.4 percent on the Step 1 portion and 57.8 percent on Step 2 (Gilson et al., 2023).
Step 1 is the exam that medical students spend two years preparing for. The passing threshold is roughly 60 percent. ChatGPT cleared it.
This is not a story about a robot becoming a doctor. It is a story about what happens when a machine that has never touched a human body learns to reason like someone who has.
How They Tested the Machine

The researchers, led by Aidan Gilson at Yale University, used two separate question banks. The first came from AMBOSS, a commercial platform that medical students use to prepare for the USMLE. The second was the NBME Free 120 questions, which are released by the exam's own creators and closely mirror the actual test.
Each question was a standard multiple choice item. ChatGPT received the question text and the answer options as a prompt. The model generated its answer, along with an explanation. The researchers then compared its performance against two earlier language models: GPT-3 and InstructGPT (Gilson et al., 2023).
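To make that procedure concrete, here is a minimal sketch of how one multiple choice item might be turned into a prompt and then graded. It is an illustration only: the article does not give the exact prompt wording the researchers used, so the `MultipleChoiceItem` class, the `format_prompt` and `is_correct` helpers, and the sample question are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """One exam-style question (hypothetical structure, not the study's)."""
    stem: str       # the vignette and the question being asked
    options: dict   # option letter -> option text
    answer: str     # letter of the keyed correct answer


def format_prompt(item: MultipleChoiceItem) -> str:
    """Render the question stem followed by its lettered answer options."""
    lines = [item.stem, ""]
    for letter in sorted(item.options):
        lines.append(f"{letter}. {item.options[letter]}")
    return "\n".join(lines)


def is_correct(item: MultipleChoiceItem, chosen_letter: str) -> bool:
    """Compare the letter read off the model's reply against the answer key."""
    return chosen_letter.strip().upper() == item.answer


# Made-up item for illustration; it is not taken from AMBOSS or the NBME.
example = MultipleChoiceItem(
    stem="Which vitamin deficiency causes scurvy?",
    options={
        "A": "Vitamin A",
        "B": "Vitamin B12",
        "C": "Vitamin C",
        "D": "Vitamin D",
    },
    answer="C",
)

prompt = format_prompt(example)
print(prompt)
```

The prompt text would be pasted into the chatbot, and whoever grades the reply would pass the chosen letter to `is_correct`. The free-text explanation that comes back with the letter is what the researchers went on to analyze separately.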
The results were uneven but telling. On the AMBOSS questions, ChatGPT scored 44 percent on Step 1 and 42 percent on Step 2. These are failing scores. But on the NBME questions, which are more representative of the actual exam, the model hit 64.4 percent and 57.8 percent. The difference matters: AMBOSS questions are harder, and the model's performance dropped sharply as difficulty increased (Gilson et al., 2023).
Still, the NBME scores are the ones that matter. Averaged across all four data sets, ChatGPT outperformed the earlier InstructGPT model by 8.15 percent. GPT-3, the predecessor that had wowed the world in 2020, performed about as well as random chance (Gilson et al., 2023).
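The percentages quoted above are plain ratios of correct answers to questions asked, and it can help to see them side by side against the passing line. The sketch below assumes per-data-set counts that are consistent with the article's figures (189 NBME questions, 200 AMBOSS questions, and the four reported percentages). The exact split is an assumption made for illustration, not something stated in the text.

```python
# Assumed (correct, total) counts, chosen to be consistent with the
# percentages and question totals quoted in the article.
datasets = {
    "AMBOSS Step 1": (44, 100),
    "AMBOSS Step 2": (42, 100),
    "NBME Step 1": (56, 87),
    "NBME Step 2": (59, 102),
}

PASS_THRESHOLD = 60.0  # approximate passing threshold, in percent


def accuracy_percent(correct: int, total: int) -> float:
    """Share of questions answered correctly, as a percentage to one decimal."""
    return round(100 * correct / total, 1)


results = {name: accuracy_percent(c, n) for name, (c, n) in datasets.items()}

for name, pct in results.items():
    status = "at or above" if pct >= PASS_THRESHOLD else "below"
    print(f"{name}: {pct}% ({status} the roughly 60% threshold)")
```

Under these assumed counts, only the NBME Step 1 set lands above the threshold, which is why that single number carries so much of the headline.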
What the Model Got Right

The researchers did not just count correct answers. They analyzed the text that ChatGPT produced alongside each response, looking for three things: whether the model gave a logical justification for its answer, whether it used information that was present in the question, and whether it brought in external knowledge.
The results were striking. In 100 percent of the NBME responses, ChatGPT provided a logical justification for its answer selection. In 96.8 percent of all questions, the model used information that was explicitly present in the question stem (Gilson et al., 2023).
This is not trivial. Medical exam questions are often layered with clues. A patient's age, a lab value, a medication side effect. The model was not guessing randomly. It was reading the question, identifying the relevant details, and constructing a chain of reasoning that a human could follow.
The third metric was the most revealing. When ChatGPT answered correctly, it was far more likely to bring in information that was not present in the question. For incorrect answers, the presence of external information dropped by 44.5 percent on Step 1 and 27 percent on Step 2 (Gilson et al., 2023).
This makes intuitive sense. When you know the answer, you can explain why. When you are guessing, you stick to what is in front of you. ChatGPT, it turns out, behaves the same way.
Why This Is Different from What Came Before
Language models have been answering medical questions for years. But they have been bad at it. GPT-3, released in 2020, scored near random chance on the same USMLE questions. InstructGPT, which was fine-tuned to follow instructions, did better but still fell short.
ChatGPT is different mainly because of how it was trained. It is built on OpenAI's GPT-3.5 series, a successor to the 175-billion-parameter GPT-3. (Parameters are the learned numerical weights inside the network, a measure of the model's size rather than of how much text it has read.) But more importantly, it was trained with reinforcement learning from human feedback. Humans rated its responses, and the model learned to produce answers that people found helpful, accurate, and coherent.
The result is a model that does not just retrieve facts. It reasons through problems. It explains its logic. It sounds like a person who knows what they are talking about, even when it is wrong.
Gilson et al. (2023) put it plainly: "ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering." That is academic understatement for "this thing is different."
The Hard Questions It Still Cannot Answer
The study has limits, and the authors are transparent about them. The sample size is small: 189 NBME questions and 200 AMBOSS questions. The model was tested on multiple choice items only. The USMLE also includes computer-based simulations and clinical skills assessments. ChatGPT cannot touch a patient, cannot hear a heart murmur, cannot see a rash. It is a text-based system operating on text-based problems.
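One way to see why the small sample matters is to put a confidence interval around the headline Step 1 score. The sketch below uses the Wilson score interval, a standard method for proportions. This is not an analysis the article attributes to the researchers, and the count of 56 correct out of 87 is an assumption chosen to match the reported 64.4 percent.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Assumed count: 56 of 87 NBME Step 1 questions correct (about 64.4 percent).
low, high = wilson_interval(56, 87)
print(f"Plausible range for the true score: {low:.1%} to {high:.1%}")
```

With these assumed counts, the interval runs from the mid-50s to the low 70s, so it includes values below the roughly 60 percent passing line. A pass on 87 questions is suggestive, not conclusive.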
There is also the question of memorization. ChatGPT was trained on a vast corpus of internet text, which almost certainly includes medical exam questions and explanations. It is possible that the model is not reasoning so much as retrieving answers it has seen before. The AMBOSS questions come from a commercial platform, which makes them less likely to have appeared in that training text, but the NBME Free 120 questions are released publicly, so the possibility remains.
And then there is the deeper problem. Even when ChatGPT is correct, it gives no reliable signal that it is correct. Nothing in its output separates a confident right answer from a confident wrong one. It can produce a fluent explanation for a wrong answer with the same tone and structure as a correct one. The researchers found that the model gave logical justifications for 100 percent of its NBME responses, even the ones that were wrong (Gilson et al., 2023).
This is the paradox of large language models. They sound like experts. They sound like they know. But they are pattern matching machines, not thinking beings. They can pass a test without understanding what they have passed.
What This Means for Medical Education
The authors of the study are not interested in replacing doctors. They are interested in how a tool like ChatGPT could change the way doctors are trained.
Medical education is built on memorization. Students spend years committing facts to memory: drug interactions, disease presentations, lab value thresholds. The USMLE is designed to test this knowledge. But if a language model can pass the exam without studying, what does that say about the exam?
Gilson et al. (2023) suggest that ChatGPT could serve as an interactive tutor. A student could ask it to explain a concept, generate practice questions, or walk through a differential diagnosis. The model would not replace the student's own learning, but it could accelerate it.
There is precedent for this. Medical students already use question banks, flashcards, and video lectures. ChatGPT is another tool in that ecosystem. The difference is that it talks back. It explains itself. It adapts to the user's questions in real time.
The researchers also note that the model's ability to provide logical justification for its answers makes it uniquely suited for education. A student who gets a question wrong can see not just the correct answer, but the reasoning behind it. That is more valuable than a score.
What This Does Not Prove
This study is one data point. It does not prove that ChatGPT is ready for clinical use. It does not prove that language models understand medicine. It does not prove that the USMLE is obsolete.
What it shows is narrower and more interesting. It shows that a machine can reach the equivalent of a passing score for a third-year medical student on a set of written exam questions, using only the text of the question and the knowledge encoded in its training data. That is a remarkable achievement for a language model. It is also a reminder that passing a test is not the same as practicing medicine.
The USMLE tests factual knowledge and clinical reasoning. It does not test empathy, judgment, or the ability to hold a patient's hand. ChatGPT can pass the exam. It cannot do any of those other things.
What This Actually Means
- Medical students should start using ChatGPT as a study tool. The model can explain concepts, generate practice questions, and walk through reasoning. It is not a replacement for textbooks or lectures, but it is a free, always-available tutor that can answer questions in plain language.
- The USMLE may need to evolve. If a language model can pass Step 1, the exam is testing something that machines can do. The next generation of assessments may need to focus on skills that are uniquely human: communication, physical examination, clinical judgment under uncertainty.
- Medical educators should treat this as a wake-up call. The study by Gilson et al. (2023) shows that ChatGPT gave a logical justification in 100 percent of its NBME responses, including the wrong ones. Students can interact with the model to check their own reasoning, as long as they remember that a justification is not proof of a correct answer. The lecture hall is no longer the only place to learn.
- Patients will eventually encounter these models. A chatbot that can pass the USMLE will be used by people seeking medical information. Doctors need to understand what these tools can and cannot do, so they can guide patients toward reliable use and away from dangerous misunderstandings.
- The bar for what counts as intelligence is shifting. A model that scores 64.4 percent on a medical licensing exam is not a doctor. But it is also not a parlor trick. It is a machine that can reason through complex problems and explain its thinking. That is new. That matters.
References
- [1] Aidan Gilson, Conrad Safranek, Thomas Huang, Vimig Socrates, et al. (2023). How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education.
