Why AI in Healthcare Needs Regulation Before It's Too Late

The Algorithm That Wants to Diagnose You

In March 2023, a new kind of doctor started seeing patients. No white coat, no medical degree, no Hippocratic oath. Just a text box and a probability engine trained on a vast slice of the internet. GPT-4 had arrived, and within weeks, people were asking it to read their lab results, interpret their skin rashes, and explain why their chest hurt.

Some of its answers were brilliant. Some were confidently wrong. And nobody was watching.

That is the central tension Bertalan Meskó and Eric Topol identify in their 2023 paper for npj Digital Medicine: large language models are entering healthcare faster than regulators can keep up, and the consequences of letting them run unmonitored are not hypothetical (Meskó & Topol, 2023). The authors, both physicians and digital health researchers, argue that we are facing a regulatory gap that could harm patients in ways that traditional medical devices never could.

Here is the thing that surprised me: GPT-4 is not just a better chatbot. It can now read text embedded in images and analyze the context of those images. That means it can look at a patient's X-ray, read the radiologist's notes, and generate a summary that sounds authoritative. It can do insurance preauthorization. It can write clinical documentation. It can answer patient questions about their own data. All of this without being designed for any of those tasks.

The paper makes a clear case: these models are fundamentally different from the AI tools that regulators already oversee. They are not trained on specific medical datasets with labeled outcomes. They are trained on language, on patterns, on the statistical shape of human text. And that means their failures are not failures of accuracy alone. They are failures of context, of safety, of privacy.

What Makes LLMs Different from Every Other Medical AI

Most people assume that if a technology is called "AI" in healthcare, it must have been tested, validated, and approved. That is true for some AI. But not for large language models.

Meskó and Topol draw a sharp distinction. The AI tools that currently have FDA clearance are narrow. They do one thing, like detecting diabetic retinopathy from retinal scans or flagging suspicious nodules on CT images. They are trained on carefully curated medical data. Their performance is measured against gold standards. They fail in predictable ways.

LLMs are the opposite. They are broad. They do everything. They are trained on the open web, which includes Reddit threads, Wikipedia articles, medical textbooks, and YouTube comments. They have no built-in understanding of what is true versus what is statistically plausible. They generate text that sounds like a person who has read everything but understands nothing.

The authors point to a specific risk: GPT-4 can read text in images and analyze the context of those images. This is not just an upgrade. It is a fundamental change in capability. A model that can read a patient's handwritten medication list, interpret it, and then generate advice is operating at a level of complexity that existing regulatory frameworks were never designed to handle.

The Three Things That Could Go Wrong

Meskó and Topol organize their concerns around three categories: safety, ethics, and privacy. Each one deserves its own look.

Safety: The Hallucination Problem

LLMs hallucinate. That is a technical term for when the model generates something that sounds true but is completely fabricated. In a chatbot that writes poetry, this is a feature. In a system that tells a patient whether their symptoms warrant an ER visit, it is a disaster.

The paper notes that GPT-4's ability to support multiple medical tasks brings "risks from mishandling results it provides to varying reliability to a new level" (Meskó & Topol, 2023). The problem is not that the model is always wrong. It is that it is sometimes wrong, and you cannot tell when. It does not express uncertainty like a human doctor would. It does not say "I am not sure about this." It just produces text with the same tone of confidence whether it is reciting a well-established fact or inventing a plausible fiction.

Ethics: Whose Values Are Embedded?

Unlike a blood test or an MRI machine, LLMs carry the values of their training data. That data is mostly English-language internet content, which means it reflects the biases, assumptions, and blind spots of the people who produce that content. The authors argue that without regulatory oversight, there is no way to ensure these models align with medical ethics rather than just statistical patterns.

Consider a patient asking about pain management. The model might generate advice that reflects cultural biases about pain tolerance, or it might default to the most statistically common treatment rather than the one best suited to the individual. In a clinical setting, these choices have consequences.

Privacy: The Data That Never Leaves

This is the one that keeps me up at night. When a patient types their symptoms into a chatbot, that data does not just disappear. It gets sent to servers, processed, and potentially stored. The authors warn that current LLMs do not have built-in mechanisms for protecting patient data in the way that HIPAA-compliant systems do. And because these models can be trained on user interactions, a question a patient asks may become part of the training data for the next version of the model.

For most patients, there is no meaningful consent process for this. No obvious opt-out. No easy way for a patient to know whether their private health information is being used to make the model better at answering questions for someone else.

What the Authors Actually Recommend

Meskó and Topol do not just identify problems. They offer practical recommendations for what regulators should do. Here is what they propose:

▸ Require transparency about what data the model was trained on and how it was filtered
▸ Mandate that LLMs clearly indicate when they are providing medical information versus general information
▸ Establish a framework for continuous monitoring, not just premarket approval
▸ Create standards for how models handle patient data, including deletion protocols
▸ Ensure that models cannot be used for clinical decision-making without human oversight
▸ Develop a system for reporting and tracking adverse events caused by LLM-generated advice

The authors are careful to say they do not want to stifle innovation. They want to channel it. The goal is to let LLMs fulfill their "exciting and transformative potential" without causing harm (Meskó & Topol, 2023).

Why This Paper Matters Now

The paper was published in 2023, but the situation has only become more urgent. Since then, more LLMs have been released. More hospitals have started pilot programs. More patients have started using these tools on their own, without any oversight at all.

The authors make a point that is easy to miss: the regulation of GPT-4 and generative AI in medicine is a "timely and critical challenge" precisely because the technology is so powerful (Meskó & Topol, 2023). If we wait until something goes wrong, the damage will already be done. If we regulate too aggressively, we might block innovations that could save lives.

The paper is not a warning against AI. It is a warning against the absence of guardrails.

What the Research Does Not Prove

It is important to be honest about the limits of this paper. Meskó and Topol do not present new experimental data. They do not run a trial where they compare GPT-4's diagnostic accuracy against human doctors. They do not measure how often patients follow bad advice from an LLM.

What they do is synthesize existing evidence and make a normative argument. They say that the current trajectory is dangerous and that regulatory action is needed. That is a policy argument, not a scientific finding. It is persuasive because it is grounded in what we already know about how LLMs work and how medical regulation works.

The open question is whether regulators will act fast enough. The paper cannot answer that. It can only make the case.

The Hardest Problem Nobody Is Talking About

There is one issue that the paper raises implicitly but does not fully explore. LLMs are not just tools that doctors use. They are tools that patients use directly. A patient can go home, open a browser, and ask an LLM to interpret their test results before they ever see their physician. That patient has no way of knowing whether the model is hallucinating. They have no way of knowing that the model was not designed for this task.

This is the regulatory blind spot. Medical devices are regulated according to their intended use, as declared by the company that makes them. But LLMs are not marketed as medical devices. They are sold as general-purpose tools that happen to be good at answering questions. The patient who uses one for medical advice falls outside the FDA's device oversight as it currently works.

Meskó and Topol argue that this needs to change. But they do not pretend it will be easy.

What This Actually Means

The paper leaves you with a clear set of implications, not just for regulators but for anyone who might use these tools.

▸ If you are a patient, do not trust an LLM for medical advice unless you can verify every claim it makes against a reliable source. The model does not know when it is wrong.
▸ If you are a clinician, do not use an LLM for clinical documentation or decision support without understanding what data it was trained on and what safeguards are in place. Your liability does not go away because a machine made the error.
▸ If you are a hospital administrator, do not deploy LLM tools without a clear policy on data privacy, adverse event reporting, and human oversight. The regulatory framework is coming. It is better to be ahead of it.
▸ If you are a regulator, the clock is ticking. The paper makes a specific, actionable case for what oversight should look like. The technology is already in use. The question is whether you will catch up before someone gets hurt.
▸ If you are a developer, build transparency into your product from the start. The authors argue that regulation should not damage the potential of LLMs. The best way to ensure that is to make safety features part of the design, not an afterthought.

The paper ends with a vision: regulatory oversight that lets medical professionals and patients use LLMs "without causing harm or compromising their data or privacy" (Meskó & Topol, 2023). That is the goal. Whether we reach it depends on whether we treat this as the urgent problem it is, not as a future concern to be dealt with later.

The algorithms are already here. They are already being used. And nobody is watching.

References

[1] Bertalan Meskó, Eric J. Topol (2023). The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digital Medicine.