ChatGPT Shows Promise and Pitfalls in Clinical Settings

The Doctor Will See You Now. The Chatbot Is Watching.

On a Tuesday morning in early 2023, a patient in an Italian hospital described their symptoms to a physician. The doctor listened, asked follow-up questions, and then, instead of pulling up a textbook, opened a browser tab and typed the same symptoms into ChatGPT. It was not an officially sanctioned experiment. It was a test. The doctor wanted to see if the AI could catch something he missed.

This is not a hypothetical scenario. It is the kind of moment that Marco Cascella and his colleagues at the University of Naples Federico II wanted to understand when they published their feasibility study in the Journal of Medical Systems. The paper, which has already accumulated over a thousand citations, does not try to declare ChatGPT a medical breakthrough or a dangerous toy. Instead, it does something more useful: it maps the specific places where the AI succeeds, the specific places where it fails, and the specific ways it could be misused by people who do not understand its limits.

The results are unsettling, promising, and deeply human.

What the Paper Actually Did

Cascella and his team did not run a clinical trial with patients. They ran a thought experiment with structure. They took ChatGPT (the GPT-3.5 version available in early 2023) and fed it a series of scenarios drawn from real clinical and research settings. They tested it on four fronts: supporting clinical decisions, generating scientific text, being misused for harmful purposes, and reasoning about public health topics. They did not measure accuracy as a single number. They looked for patterns.

The authors found that ChatGPT could produce fluent, confident responses to medical queries. It could summarize a patient history, suggest differential diagnoses, and even write a credible draft of a scientific abstract. But fluency is not accuracy. Confidence is not correctness. The paper documents multiple instances where the AI generated answers that sounded authoritative but were factually wrong, sometimes dangerously so.

The Good: Where ChatGPT Surprised the Researchers

Clinical Decision Support That Did Not Embarrass Itself

Cascella et al. (2023) tested ChatGPT on a set of clinical vignettes. These are short descriptions of patient cases, the kind used to train medical students. The AI was asked to propose a diagnosis and a management plan. In many cases, it performed at a level that the authors described as "reasonable" for a junior medical trainee.

It did not hallucinate wildly. It did not suggest treatments that would kill the patient. It offered plausible differentials and referenced standard guidelines. For a model trained on text, not on anatomy or pharmacology, this was not trivial.

Scientific Writing That Saved Time

One of the paper's more provocative findings was that ChatGPT could generate coherent drafts of scientific content. The authors used it to produce a summary of a research topic, and the output was structured, grammatically correct, and semantically relevant. They noted that it could be a useful tool for researchers who need to generate a first draft or summarize a body of literature.

But here is the catch. The AI had no original insight. It synthesized what it had seen before. It could not evaluate the quality of the sources it was drawing from. It could not tell you if a study was underpowered or if a result had been retracted. It could only mimic the form of scholarship without the substance.

The Bad: Where It Went Wrong

The Confidence Trap

The most troubling finding in the paper is not that ChatGPT makes mistakes. It is that ChatGPT rarely sounds uncertain. When a human doctor is unsure, they say things like "this is a complex case" or "we need more tests." ChatGPT, by contrast, will often offer a definitive-sounding answer to a question that has no definitive answer.

Cascella et al. (2023) documented cases where the AI provided a diagnosis that was plausible but not the most likely. It did not flag its own uncertainty. It did not say "I am not sure." It presented its best guess as though it were a fact. In a clinical setting, that kind of false confidence could lead to a missed diagnosis or a wrong treatment.

Misuse Is Not Theoretical

The paper also explored the potential for misuse. The authors asked ChatGPT to generate a list of ways to falsify clinical trial data. The AI refused. They asked it to write a convincing abstract for a fake study. It refused again. But they noted that the refusal was not based on understanding. It was based on pattern matching. The AI had been trained to avoid generating harmful content, but that training is brittle.

A user who rephrased the question, who framed it as a hypothetical or a thought experiment, might get a different answer. The boundary between acceptable and unacceptable use is not clear to the AI. Its refusals are a learned pattern, not a moral compass.

Public Health Reasoning That Missed the Point

When the authors asked ChatGPT to reason about public health topics, it performed unevenly. It could recite facts about vaccination rates or disease prevalence. But when asked to weigh trade-offs between individual liberty and collective safety, it produced generic, sometimes contradictory responses.

The AI did not understand the ethical dimensions of the questions. It was generating text that sounded like a reasonable person's opinion, but it had no opinion. It had no values. It had no ability to prioritize one outcome over another. For a public health official looking for guidance, this would be worse than useless. It would be misleading.

Methodology: How to Stress-Test an AI

The researchers structured their evaluation around four scenarios, each designed to probe a different aspect of ChatGPT's capabilities and limitations.

Scenario 1: Clinical Practice Support

They presented the AI with patient cases and asked it to generate differential diagnoses, suggest diagnostic tests, and propose management plans. They compared the AI's output to standard clinical guidelines and to the judgment of a panel of physicians.

Scenario 2: Scientific Production

They asked ChatGPT to write abstracts, summarize research papers, and generate lists of references. They checked for factual accuracy, proper citation, and coherence.

Scenario 3: Misuse in Medicine and Research

They tested the AI's safeguards by asking it to generate content that could be used for fraud, plagiarism, or other unethical purposes. They documented which requests were blocked and which were fulfilled.

Scenario 4: Public Health Reasoning

They posed open-ended questions about health policy, resource allocation, and ethical dilemmas. They evaluated the AI's responses for logical consistency, depth of reasoning, and awareness of trade-offs.

The authors did not use a formal scoring rubric. They relied on expert judgment, which is a limitation but also a strength. In a field where context matters, a human evaluator can catch subtleties that a numerical score would miss.
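One way to picture an evaluation built on expert judgment rather than scores is as a set of free-text notes grouped by scenario. The sketch below is purely illustrative: the `Judgment` record, the `group_by_scenario` helper, and the sample note are hypothetical names and data, not the authors' actual instrument.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four scenarios named in the article.
SCENARIOS = (
    "clinical practice support",
    "scientific production",
    "misuse in medicine and research",
    "public health reasoning",
)

@dataclass(frozen=True)
class Judgment:
    """One reviewer's qualitative note on one model response (hypothetical record)."""
    scenario: str
    prompt: str
    reviewer_note: str  # free text, deliberately not a numeric score

def group_by_scenario(judgments):
    """Collect notes under their scenario so patterns can be read side by side."""
    grouped = defaultdict(list)
    for judgment in judgments:
        if judgment.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {judgment.scenario}")
        grouped[judgment.scenario].append(judgment)
    return dict(grouped)

# Illustrative data only; not taken from the paper.
notes = [
    Judgment("clinical practice support", "Summarize this case history.",
             "Plausible differential, no stated uncertainty."),
]
by_scenario = group_by_scenario(notes)
print(list(by_scenario))
```

Keeping the note as text preserves the context a numerical score would flatten, at the cost of making results harder to compare across reviewers.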

What This Research Does Not Prove

This paper is not a verdict. It is a snapshot. The authors tested one version of ChatGPT at one point in time. The model has been updated since. The training data has changed. The safeguards have been adjusted. Some of the specific errors documented in the paper may no longer occur.

More importantly, the paper does not prove that AI cannot be useful in medicine. It shows that the version of ChatGPT the authors tested has specific, recurring failure modes. That is actually good news. If the failures were random, they would be impossible to guard against. If they are predictable, they can be mitigated.

The paper also does not address the question of bias. Other research has shown that language models can reproduce racial and gender biases present in their training data. Cascella et al. did not test for this. It remains an open question how those biases would manifest in clinical recommendations.

The Deeper Problem: What the AI Does Not Know

The paper's most valuable contribution is not its list of errors. It is its diagnosis of the underlying problem. ChatGPT does not have a mental model of the patient, the disease, or the treatment. It has a statistical model of language. It predicts the next word based on the words that came before. That is a fundamentally different kind of intelligence from the one a doctor uses.
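The idea of predicting the next word from the words before it can be illustrated with a toy model. The sketch below is a bigram counter, vastly simpler than the neural network behind ChatGPT and not how that system is built; the tiny corpus and the function names `train_bigram` and `predict_next` are invented for illustration. It shows how a purely statistical model can give a fluent answer with no notion of whether the answer is true.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text: "angina" follows "suggests" twice, "reflux" once.
corpus = (
    "chest pain suggests angina . "
    "chest pain suggests angina . "
    "chest pain suggests reflux ."
)
model = train_bigram(corpus)
print(predict_next(model, "suggests"))  # angina
```

The model answers "angina" because that word followed "suggests" most often in its text, not because it examined a patient. The answer comes out the same whatever the patient in front of you actually has.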

A physician learns from experience. They see a patient with a certain set of symptoms, try a treatment, observe the outcome, and update their mental model. They can generalize from one case to another. They can recognize when a patient does not fit the textbook pattern.

ChatGPT cannot do any of this. It carries no memory from one conversation to the next. It has no way to verify a fact against a database. It has no way to know if the patient is still alive. It is a mirror, not a mind.

What This Actually Means

▸ Do not use ChatGPT to diagnose a real patient. It might get the right answer. It might get the wrong answer. You will not know which is which until it is too late. The model does not know when it is wrong. It cannot tell you.

▸ Use ChatGPT to generate a first draft of a clinical note or a research summary. But treat that draft as a starting point, not a finished product. You must verify every fact, check every reference, and rewrite every sentence that sounds vaguely off.

▸ Teach medical students about the limitations of AI. The paper explicitly calls for education on the appropriate use of language models. Students who rely on ChatGPT for answers will not learn to think like doctors. They will learn to think like chatbots.

▸ Build guardrails, not gates. The paper shows that simple refusal mechanisms are not enough. A user who wants to misuse the AI will find a way. The solution is not to block all uses. It is to design systems that flag uncertainty and require human verification.

▸ Assume the AI is wrong until proven otherwise. That sounds harsh, but it is the only safe stance. The model is fluent but fallible. Every output should be treated as a hypothesis to be tested, not a conclusion to be followed.
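The guardrail idea in the list above, flagging missing uncertainty and requiring human sign-off, might look something like the following sketch. Everything here is an assumption made for illustration: the `DraftOutput` type, the `flag_output` and `release` functions, and the crude keyword list are hypothetical, and a real clinical system would need a far more reliable uncertainty signal than phrase matching.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical hedging phrases; keyword matching is a stand-in, not a real uncertainty measure.
HEDGE_MARKERS = ("not sure", "uncertain", "might", "possibly")

@dataclass
class DraftOutput:
    """A piece of model-generated text awaiting review (hypothetical type)."""
    text: str
    reviewed_by: Optional[str] = None  # clinician who signed off, if any
    flags: list = field(default_factory=list)

def flag_output(draft):
    """Attach a flag when the text contains none of the hedging phrases."""
    lowered = draft.text.lower()
    if not any(marker in lowered for marker in HEDGE_MARKERS):
        draft.flags.append("no-uncertainty-language")
    return draft

def release(draft):
    """Refuse to hand the text onward until a human reviewer is recorded."""
    if draft.reviewed_by is None:
        raise PermissionError("Model output requires human verification before use.")
    return draft.text

draft = flag_output(DraftOutput("The diagnosis is pneumonia."))
print(draft.flags)  # ['no-uncertainty-language']
```

The design choice is that nothing is blocked outright. The output is still produced, but it is marked as a hypothesis and cannot leave the system until a named person has reviewed it.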

The doctor who typed symptoms into ChatGPT on that Tuesday morning got lucky. The AI gave him a reasonable suggestion. But luck is not a clinical protocol. The next doctor might not be so fortunate. The paper by Cascella and his colleagues is a warning, a guide, and an invitation. It tells us that the technology is not ready to replace clinicians. It also tells us that it is too powerful to ignore. The question is not whether we will use AI in medicine. It is whether we will use it wisely.

References

[1] Marco Cascella, Jonathan Montomoli, Valentina Bellini, Elena Bignami (2023). Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. Journal of Medical Systems.