The Problem With Medical AI Is Not What You Think

A few years ago, a team of researchers trained an AI to detect pneumonia from chest X-rays. The model performed spectacularly in the lab. Then they deployed it in a hospital, and it promptly fell apart. The reason was not bad code or faulty sensors. The model had learned to recognize the metal tags that radiology technicians place on patients' left sides. In the training data, nearly all patients with pneumonia had those tags. The AI was reading the hospital's workflow, not the disease.
This is the kind of story that makes doctors skeptical of machine learning. And they are right to be skeptical. But the problem is not that AI cannot work. The problem is that nobody can tell whether a given model was built properly, tested honestly, or validated on the right patients. The published papers describing these models have been, to put it bluntly, a mess.
Now a group of 34 researchers led by Gary S. Collins at the University of Oxford has released a new set of guidelines designed to fix that. The TRIPOD+AI statement, published in the BMJ, updates a 2015 checklist that aimed to standardize how researchers report prediction models. The original TRIPOD guidelines were useful but incomplete. They were written before machine learning became the dominant method for building clinical prediction tools. The new version, TRIPOD+AI, covers both traditional regression models and modern AI systems (Collins et al., 2024). It is 27 items long, and it is the closest thing medicine has to a universal code of conduct for predictive algorithms.
The stakes are high. Clinical prediction models are used every day to decide who gets a biopsy, who goes to the ICU, who receives an expensive drug. If the models are flawed, patients get hurt. If the models are opaque, nobody can tell whether they are flawed. TRIPOD+AI does not solve the technical challenges of building better AI. It solves a more fundamental problem: making sure that when someone claims a model works, they tell you exactly how they know.
Why Medical Prediction Models Keep Failing in the Real World

The pneumonia X-ray story is not an outlier. It is a pattern. Over the past decade, researchers have documented dozens of cases where AI models that looked perfect in published papers failed when tested on new data. In 2019, a study in Nature Medicine found that most deep learning models for medical imaging had not been validated on data from different hospitals or different patient populations. In 2021, a review of 500 AI studies in pathology found that fewer than 5% reported any kind of external validation.
The problem is not malice. It is sloppy reporting. Researchers often leave out critical details: how they handled missing data, how they split training and test sets, whether they tuned hyperparameters before or after seeing the test results. These omissions make it impossible to reproduce the work or assess its reliability. A model that achieves 95% accuracy in one paper might achieve 55% in another hospital, and nobody knows why because the original paper did not describe the data well enough to spot the difference.
Collins and his colleagues saw this coming. The original TRIPOD checklist, published in 2015, was a response to the same problem in traditional statistical models. But machine learning introduced new complexities. Neural networks can have millions of parameters. They can overfit to spurious correlations like the metal tag on a chest X-ray. They can learn shortcuts that work in one dataset but fail in another. The 2015 checklist did not account for these issues. The new one does.
The 27 Questions Every AI Paper Should Answer

The TRIPOD+AI checklist is not a technical specification. It is a list of reporting requirements, organized into six sections: title and abstract, introduction, methods, results, discussion, and other information. Each item asks a specific question that the paper must answer. Some are obvious. Some are surprising. All of them are designed to prevent the kind of ambiguity that allows flawed models to slip through peer review.
The Title Must Name the Method
Item 1 requires that the title identify the type of prediction model used. If the paper uses machine learning, the title must say so. If it uses regression, the title must say that. This seems trivial, but it is not. A 2022 analysis of 300 prediction model papers found that 40% did not mention the modeling method in the title or abstract. Readers had to dig into the methods section to learn whether the model was a simple logistic regression or a deep neural network. That is unacceptable for a paper that might influence clinical decisions.
The Data Source Must Be Traceable
Items 4 through 7 cover the data. Where did it come from? How was it collected? Were there eligibility criteria? How many patients were excluded, and why? These questions matter because prediction models are exquisitely sensitive to the population they are trained on. A model trained on data from a single academic medical center may not work in a community hospital. A model trained on data from 2015 may not work in 2024, because clinical practice changes. Without detailed descriptions of the data, readers cannot assess whether the model applies to their own patients.
The Outcome Must Be Clearly Defined
Item 8 asks for a clear definition of the outcome being predicted. This sounds obvious, but it is often fudged. What counts as "readmission within 30 days"? Does it include planned readmissions? Does it include patients who die before they can be readmitted? Different definitions produce different models. If the definition is vague, the results are meaningless.
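To see how much the definition matters, consider a minimal Python sketch. Everything in it is invented for illustration: the `Discharge` record, its field names, the five-patient cohort, and the three definitions. None of it comes from the TRIPOD+AI statement itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Discharge:
    # All fields are hypothetical, chosen only to illustrate the point.
    days_to_readmission: Optional[int]  # None if the patient was never readmitted
    planned: bool                       # the readmission was scheduled in advance
    died_within_30_days: bool           # died without being readmitted


def readmitted_any(d: Discharge) -> bool:
    """Definition A: any readmission within 30 days counts."""
    return d.days_to_readmission is not None and d.days_to_readmission <= 30


def readmitted_unplanned(d: Discharge) -> bool:
    """Definition B: only unplanned readmissions within 30 days count."""
    return readmitted_any(d) and not d.planned


def rate(cohort: List[Discharge],
         definition: Callable[[Discharge], bool],
         exclude_deaths: bool = False) -> float:
    """Share of eligible discharges that meet the chosen outcome definition."""
    eligible = [d for d in cohort
                if not (exclude_deaths and d.died_within_30_days)]
    return sum(definition(d) for d in eligible) / len(eligible)


# A made-up cohort of five discharges.
cohort = [
    Discharge(12, False, False),    # unplanned readmission on day 12
    Discharge(25, True, False),     # planned readmission on day 25
    Discharge(None, False, True),   # died before any readmission
    Discharge(None, False, False),  # never readmitted
    Discharge(45, False, False),    # readmitted, but after day 30
]

rate_a = rate(cohort, readmitted_any)                             # 2 of 5
rate_b = rate(cohort, readmitted_unplanned)                       # 1 of 5
rate_c = rate(cohort, readmitted_unplanned, exclude_deaths=True)  # 1 of 4

print(rate_a, rate_b, rate_c)
```

The same five patients yield an outcome rate of 40%, 20%, or 25% depending on which definition is applied. A model trained against one definition is predicting a different thing from a model trained against another, which is why the paper has to say which one it used.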
The Model Must Be Described in Enough Detail to Rebuild It
Items 10 through 14 cover the model itself. How were predictors chosen? How was the model trained? How were hyperparameters tuned? Was there any form of feature selection, and if so, was it done inside or outside the cross-validation loop? These details matter because machine learning models are not black boxes. They are recipes. If the recipe is incomplete, nobody can replicate the dish.
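The question about feature selection is easier to appreciate with a toy experiment. The sketch below uses only the Python standard library and pure-noise data, so no honest procedure should do much better than a coin flip. The data generator, the one-feature threshold classifier, and all function names are assumptions made for illustration, not a method described in the guideline.

```python
import random


def make_noise_data(n=60, p=200, seed=0):
    """Pure-noise features with balanced labels that are unrelated to them."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


def class_means(X, y, rows, j):
    """Mean of feature j within each class, computed on the given rows only."""
    v0 = [X[i][j] for i in rows if y[i] == 0]
    v1 = [X[i][j] for i in rows if y[i] == 1]
    return sum(v0) / len(v0), sum(v1) / len(v1)


def select_feature(X, y, rows):
    """Pick the feature whose class means differ most on the given rows."""
    def gap(j):
        m0, m1 = class_means(X, y, rows, j)
        return abs(m1 - m0)
    return max(range(len(X[0])), key=gap)


def fit_predict(X, y, train, test, j):
    """Midpoint-threshold classifier on feature j, fitted on train rows only."""
    m0, m1 = class_means(X, y, train, j)
    mid = (m0 + m1) / 2
    return [int((X[i][j] > mid) == (m1 > m0)) for i in test]


def cross_validate(X, y, k=5, select_inside=True):
    """k-fold accuracy, selecting the feature inside or outside the loop."""
    rows = list(range(len(X)))
    folds = [rows[f::k] for f in range(k)]
    # Selecting outside the loop lets the held-out rows influence the choice.
    leaked = None if select_inside else select_feature(X, y, rows)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in rows if i not in held_out]
        j = select_feature(X, y, train) if select_inside else leaked
        preds = fit_predict(X, y, train, test, j)
        correct += sum(pred == y[i] for pred, i in zip(preds, test))
    return correct / len(rows)


X, y = make_noise_data()
honest = cross_validate(X, y, select_inside=True)
leaky = cross_validate(X, y, select_inside=False)
print(f"selection inside the loop:  {honest:.2f}")
print(f"selection outside the loop: {leaky:.2f}")
```

When the feature is chosen once on all the data, the held-out rows have already helped pick it, so the cross-validated accuracy tends to look better than chance even though nothing real is there to learn. Two papers can both say "5-fold cross-validation" and mean these two very different procedures, and only a complete methods section tells the reader which.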
Performance Must Be Reported With Confidence Intervals
Items 15 through 18 cover performance evaluation. The model must be tested on data that was not used for training. Performance metrics must include confidence intervals, not just point estimates. And the paper must report not just the average performance, but how performance varies across subgroups. A model that works well for white patients may fail for Black patients. A model that works well for young patients may fail for elderly ones. If the paper does not report subgroup analyses, the reader cannot know.
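As a rough illustration of what such reporting involves, the following Python sketch computes accuracy with a percentile bootstrap interval, overall and by subgroup. The held-out results, the age-based subgroup labels, and the choice of the percentile bootstrap are all assumptions for the sake of the example. TRIPOD+AI asks that uncertainty be reported; it does not prescribe this particular method.

```python
import random


def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that agree."""
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy (one simple choice of many)."""
    rng = random.Random(seed)
    stats = sorted(
        accuracy(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical held-out results: (subgroup, predicted, actual).
results = (
    [("under_65", 1, 1)] * 45 + [("under_65", 0, 0)] * 45
    + [("under_65", 1, 0)] * 10
    + [("65_plus", 1, 1)] * 12 + [("65_plus", 0, 0)] * 12
    + [("65_plus", 0, 1)] * 16
)


def report(rows):
    """Return (point estimate, (lower, upper)) for a set of result rows."""
    pairs = [(pred, truth) for _, pred, truth in rows]
    return accuracy(pairs), bootstrap_ci(pairs)


overall = report(results)
by_group = {
    group: report([r for r in results if r[0] == group])
    for group in ("under_65", "65_plus")
}

print("overall ", overall)
for group, (point, (lo, hi)) in by_group.items():
    print(f"{group:9s} accuracy={point:.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

In this made-up test set the overall accuracy is about 81%, which hides a split of 90% in the younger group and 60% in the older one. The smaller subgroup also gets a wider interval, which is exactly the information a single headline number leaves out.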
The Model Must Be Made Available
Item 23 requires that the model be made available for independent evaluation. This could mean publishing the code, the weights, or the equations. It could mean providing a web-based calculator. The key point is that other researchers must be able to test the model on their own data. Without this, the claims in the paper are essentially unverifiable.
What the Checklist Does Not Do
The TRIPOD+AI statement is not a certification of quality. It does not say that a model is good or bad. It only says that the paper describing the model is complete enough for others to evaluate it. A paper can follow every item on the checklist and still describe a terrible model. The checklist is about transparency, not performance.
This distinction matters because some critics have worried that TRIPOD+AI could be used as a rubber stamp. "We followed the guidelines, so our model must be valid." That is not what the authors intend. The checklist is a floor, not a ceiling. It ensures that the paper contains the information needed for critical appraisal. It does not perform that appraisal for you.
There is also a practical tension. The checklist is long. Twenty-seven items, each with subitems, add up to a lot of text. Some journals have limited word counts for methods sections. Researchers may need to use supplementary materials or online repositories to comply fully. The authors acknowledge this and encourage the use of supplementary files, but the burden is real. A checklist that is too burdensome may be ignored.
Why This Matters for Patients and Doctors
The ultimate goal of TRIPOD+AI is not to help researchers publish better papers. It is to help clinicians make better decisions. When a doctor uses a prediction model to decide whether to start a patient on blood thinners or send them for a CT scan, that doctor needs to trust the model. Trust requires transparency. If the model's paper does not say where the data came from, how the outcome was defined, or how performance was measured, the doctor has no basis for trust.
The same logic applies to regulators. The FDA has approved hundreds of AI-based medical devices, but the evidence supporting many of them is thin. A 2023 analysis found that only 12% of FDA-approved AI devices had been validated in a prospective study. The rest relied on retrospective analyses of existing data, often from the same institution that developed the model. TRIPOD+AI does not change FDA policy, but it gives regulators a standard to hold companies to. If a company claims its model works, it must provide the evidence in a format that allows independent verification.
What This Actually Means
The TRIPOD+AI statement is not a revolution. It is a refinement. It takes a process that was already in place and updates it for a new technological era. But refinements matter. The difference between a model that works and a model that fails is often a matter of details. The checklist forces those details into the open.
Here is what this means in practice:
- If you are a researcher, you should use the TRIPOD+AI checklist before you submit your next paper. Not after. Before. Build the checklist into your study design, not just your writing. If an item asks for a description of how missing data were handled, decide that in advance, not after you see the results.
- If you are a reviewer or editor, you should require that submitted papers include a completed TRIPOD+AI checklist. The authors provide one. Use it. If a paper claims to follow the guidelines but does not provide the checklist, flag it.
- If you are a clinician, you should ask whether the prediction tools you use were developed according to TRIPOD+AI standards. If the answer is no, or if the answer is unclear, treat the tool with caution. Demand evidence that it works in patients like yours.
- If you are a patient, you should know that the models influencing your care are becoming more transparent. That does not mean they are perfect. It means you can ask harder questions. And you should.
- If you are a journalist, stop writing about AI breakthroughs without checking whether the underlying paper follows reporting guidelines. If a paper does not report confidence intervals or external validation, that is the story. Not the accuracy number. The missing information.
The metal tag on the chest X-ray was a subtle signal. It took a sharp researcher to notice it. But the reason it stayed hidden as long as it did was not subtle. It was sloppy reporting. The training data was not described well enough for anyone to spot the confound. TRIPOD+AI does not prevent every failure. But it makes them harder to hide. That is a start.
References
- [1] Gary S. Collins, Karel G.M. Moons, Paula Dhiman, Richard D. Riley (2024). TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ.
