Dr. Ahmed Zayed, MD · General practitioner (12+ years) · Clinical AI builder (SAFE-Triage, Hathor) · Founder, ZayedMD · May 12, 2026 · 14 min


AI Hallucinations in Clinical Practice: Why “Confident Wrong” Is the Real Risk

Why AI 'confident wrong' fabrications are the real clinical risk, and what the human-in-the-loop workflow actually needs to look like.


Medically reviewed by Dr. Ahmed Zayed, MD · Last updated May 31, 2026

A discharge summary arrives in a physician’s inbox at 11pm. Sixty-two-year-old man, post-op day three, COPD on home oxygen, recovering from an elective procedure. The document looks immaculate. Clean headings, a tidy “Hospital Course” paragraph, plausible dosing for his inhaled corticosteroid combination, the right disposition section. Two paragraphs in, one line stands out: *Patient instructed to follow up with pulmonology at the Mayo Clinic*. We are in Cairo. He has never been to Minnesota. The AI scribe pulled “Mayo Clinic” out of a probability cloud, most likely because it was a high-frequency string in its training data for the phrase *follow up with pulmonology*. Had the physician signed the document without reading carefully, the discharge instructions sent home with the patient would have included a referral to a hospital in another country.

That is the real risk profile of AI in clinical care today. Not glaring failures. Plausible-sounding fictions, dressed in the formatting of a real document, embedded in workflows where physicians are trained to skim once and sign.

This is also the part of the AI safety conversation that gets the least attention. The headlines are dominated by autonomous diagnostic agents and image-classification accuracy. However, the actual risk landing in clinic inboxes today is more boring and more dangerous: AI-generated text that reads like every other clinical document, with a handful of details that are quietly invented. This post covers what AI hallucinations actually look like in clinical work, why they trigger automation bias in trained physicians, what the FDA’s lifecycle framework does and does not cover, and what an essential human-in-the-loop workflow looks like at the bedside.

The Statistical Nature of AI Confabulation in Clinical Contexts

In the technical literature on large language models, a “hallucination” is any output that is not grounded in the model’s input or in verifiable fact. The model produces text that is statistically plausible. It follows the patterns the model learned during training. However, the specific claims are wrong, and the model has no way of knowing that they are wrong. There is no internal flag. No warning bubble. No asterisk in the output. The same probability machinery that produces accurate sentences produces hallucinated sentences. From the model’s point of view they are the same kind of output.

In a clinical context, hallucinations show up in three flavors that matter to a working physician.

The invented clinical detail

A scribe summarizing a consultation visit fabricates a heart murmur the clinician never auscultated. Or assigns a heart rate that was never measured. Or attributes a symptom to the wrong system. The patient may have presented with shortness of breath, and the AI summary then describes “exertional dyspnea worsening over three weeks” because that phrase pattern lives in millions of training-set notes. The clinician said something simpler. The model upgraded the description to match the surrounding documentation pattern.

The invented citation

AI-assisted research tools confidently produce reference lists where some of the citations exist, some are real papers attributed to the wrong authors, and at least one is a journal article that was never written. Published evaluations of large-language-model output in medical and biomedical settings have found that a substantial fraction of generated citations are fabricated outright, and that the models produce DOIs that do not resolve to anything. The same problem persists in current-generation tools when used naively.
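One way to make the citation check routine is to pull every DOI-shaped string out of a generated reference list and run each one through a lookup before trusting it. The sketch below is a minimal illustration under stated assumptions, not a validated tool: `extract_dois`, `unverified_dois`, and the `resolves` callback are hypothetical names, the pattern is only a loose approximation of DOI syntax, and the lookup itself (a DOI resolver query or a library database search, for example) is left to the caller. The DOIs in the usage lines are made up.

```python
import re
from typing import Callable

# Loose approximation of DOI syntax: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text: str) -> list[str]:
    """Return DOI-shaped strings found in text, with trailing punctuation removed."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]


def unverified_dois(text: str, resolves: Callable[[str], bool]) -> list[str]:
    """Return the DOIs for which the caller-supplied lookup finds no record."""
    return [doi for doi in extract_dois(text) if not resolves(doi)]


# Usage with a stand-in lookup; both DOIs are invented for illustration.
confirmed = {"10.1234/example.001"}
references = "Ref 1. doi:10.1234/example.001. Ref 2. https://doi.org/10.9999/invented-002."
print(unverified_dois(references, lambda doi: doi in confirmed))
```

A DOI that resolves only proves that some paper exists at that address. It still has to be opened and compared against the authors, title, and claim the tool attributed to it.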

The invented workflow

A discharge summary recommends follow-up at a hospital across the world. A medication schedule references a brand name that does not exist in your formulary. A return-precaution paragraph mentions a symptom that does not apply to the patient’s procedure. The fabrication is not in the medicine itself but in the operational instructions that surround the medicine.

All three share an essential feature. The fabricated content is plausible. It uses the right register, the right vocabulary, the right structure. A physician scanning quickly cannot tell the fabricated details from the real ones.

Confabulation vs. Refusal: The Safety Difference

There is an essential distinction in how modern AI systems can fail, and it is worth getting clear on the terminology because the safety implications run in opposite directions.

A **refusal** is when the model declines to answer. *I cannot determine the appropriate dose without more information about the patient’s renal function*. This is a safety feature. The model is uncertain and it tells you it is uncertain. You, as the clinician, get a clear signal that this is the kind of question the tool cannot answer reliably and that you need to fall back on your own training or another resource.

A **confabulation** is when the model produces a plausible-sounding answer despite uncertainty. The same dose question gets back a confident answer with a specific number, presented in the same calm declarative voice the model uses for things it actually knows. The model is not hedging. It is not flagging uncertainty. From the user’s point of view, a confabulated answer is indistinguishable from a correct one.

The reason this distinction matters is that we, as clinicians, have learned to distrust hedging. When a colleague says *I think it might be*, we know to look harder. When the same colleague says *It is X*, we update toward believing them. AI models that have been tuned for fluency and helpfulness lean toward the second mode. The reinforcement-learning-from-human-feedback training that makes them feel useful in conversation also pushes them toward confident-sounding answers, because users reward confidence.

Refusal is a safety feature that user-experience optimization has quietly trained out of these models. Confabulation is the failure mode that remains, and it is the part of the AI safety story that should keep clinicians up at night, not the science-fiction headlines.

Why “Confident Wrong” Triggers Automation Bias

Automation bias is the well-described tendency of humans to over-trust computer-generated output, especially when the output looks professional. The phenomenon has been studied since the 1990s in aviation, process control, and radiology. The effect is well replicated in non-clinical settings, and the underlying mechanism (a learned mapping from “looks professional” to “is competent”) plausibly carries over into clinical reading habits. In some studies of decision-support workflows, experienced operators are no less vulnerable than less experienced ones, and there is reason to suspect the same pattern holds for physicians reading AI-generated documentation. Healthcare-specific empirical work on the question is still emerging.

Three features of AI-generated clinical text combine to make automation bias particularly dangerous.

Formatting fluency

Modern language models produce documents that look like the documents physicians have been trained to read. The headings are right. The paragraph structure is right. The clinical vocabulary is right. The pacing is right. A clinician scanning a discharge summary at the end of a 12-hour shift is not pattern-matching against the medical content. They are pattern-matching against *does this look like a normal discharge summary*. AI-generated text passes that pattern match easily.

Uniformity

Human writers vary. One physician’s notes are terse, another’s are verbose, a third uses a particular phrase nobody else uses. AI output is internally consistent. The summary you got yesterday and the summary you got today read like the same writer, because they were the same writer. That uniformity is comforting. However, it also removes the small surprises that would otherwise prompt closer reading.

Absence of provenance

A traditional electronic-health-record entry tells you who wrote what and when. An AI-assisted document blurs that line. The clinician signs it, but who actually wrote which sentence? When something is wrong, the human-in-the-loop workflow assigns blame to the signer, but the signer cannot easily tell which sentences they should have caught.

The result is a workflow in which a physician is asked to function as a quality-control reviewer for a writer they cannot interrogate, on documents that are designed to look correct, at the end of a shift when their attention is most depleted. This is a system set up to fail.

The FDA’s 2025 Lifecycle Guidance: A Regulatory Backstop

The U.S. Food and Drug Administration has been working on a regulatory framework for AI-enabled medical devices since 2019. The current centerpiece is the Lifecycle Management approach, formalized in the agency’s recent guidance on Predetermined Change Control Plans for AI/ML-Based Software as a Medical Device. The framework addresses one specific problem: AI models change after deployment, while traditional medical-device review assumes a static product.

A few features of the framework matter for working clinicians.

Predetermined Change Control Plans

For AI-enabled medical devices that fall under FDA jurisdiction and whose manufacturers intend to modify them over time, the agency’s PCCP guidance describes how a Predetermined Change Control Plan should be structured. The guidance is nonbinding and the framework applies to device-scoped tools, not to every AI product used in a clinical setting (an AI scribe used as an administrative documentation tool, for example, may sit outside this scope). When a PCCP is in place for a covered device, it describes the kinds of model updates that can ship under the existing authorization and the kinds that require a new submission. The plan must describe the methods, the validation approach, and the kinds of changes covered.

Post-market monitoring

Post-market monitoring is now a structural expectation, not a courtesy. Manufacturers are expected to track real-world performance and report meaningful deterioration. For clinicians, this means there should be a feedback channel: a way to report a hallucination back to the vendor so it gets counted in the post-market record. If your vendor does not offer that channel, you are using a tool whose makers cannot learn from your experience.

Transparency about training data

The guidance also pushes for transparency about training data and known limitations. A device cleared for use in adult emergency-department triage was not necessarily validated for use in pediatric primary care. A scribe trained on a North American documentation corpus may produce North American defaults when used in a Middle Eastern clinic. The Mayo Clinic line in my opening anecdote is exactly that kind of training-data leak.

The framework is a regulatory backstop. However, it is not a replacement for clinician vigilance. The cleared status of a device tells you that a manufacturer cleared a regulatory bar at one point in time. It does not tell you that the device produced accurate output for the patient sitting in front of you on this visit, on this day, at this hour.

Practical Tactics for the Review-Edit-Sign Workflow

The literature and my own clinical experience converge on a small set of practical tactics. None of them are exotic. Let’s look at the ones that work in a real clinic, in real time, on a real shift.

Read the document. Do not skim it.

When you sign an AI-generated note, you are saying you wrote it. Read it the way you would read a note written by a trainee. If your review pace is faster than your read pace for a self-authored note of comparable length, you are not reviewing. You are rubber-stamping.

Triangulate against the source data

A scribe summary should agree with the medication list, the problem list, and the most recent vital signs. A research summary should be checkable against the abstracts of the cited papers. Build small habits of cross-checking, such as glancing at the medication list before signing and confirming that the dosing in the AI summary matches what is actually on the chart. The cost is seconds per check. The saved cost on a hallucinated detail can be enormous.
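For teams that want to automate part of this cross-check, the comparison can be sketched in a few lines. The example below assumes the medications have already been extracted from both the AI summary and the chart into simple name-to-dose mappings, which is the hard part and is not shown. The function names and the placeholder drugs are hypothetical.

```python
def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def medication_discrepancies(summary: dict[str, str], chart: dict[str, str]) -> list[str]:
    """List medications whose summary entry is missing from, or disagrees with, the chart."""
    chart_by_name = {_norm(name): dose for name, dose in chart.items()}
    findings = []
    for name, dose in summary.items():
        charted = chart_by_name.get(_norm(name))
        if charted is None:
            findings.append(f"{name}: in summary, not on chart")
        elif _norm(charted) != _norm(dose):
            findings.append(f"{name}: summary '{dose}' vs chart '{charted}'")
    return findings


# Placeholder entries, not real prescribing data.
ai_summary = {"Drug A": "10 mg once daily", "Drug B": "5 mg twice daily"}
chart_list = {"drug a": "10 mg once daily", "Drug B": "5 mg once daily"}
for finding in medication_discrepancies(ai_summary, chart_list):
    print(finding)
```

This only checks the direction that matters for invented details: what the summary claims against what the chart holds. It does not flag a charted medication that the summary left out, and an exact string comparison will raise false alarms on harmless rewording, so the output is a prompt for the clinician to look, not a verdict.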

Use structured prompts that force grounding

When you ask an AI tool a clinical question, structure the prompt so the tool has to point at sources. *Summarize the indications for SGLT2 inhibitors in heart failure with preserved ejection fraction. Cite the trial data* is a different prompt from *tell me about SGLT2 inhibitors*. The first is harder to confabulate against, because confabulation requires the model to invent both an answer and a citation, and inventing a citation that resolves is harder than inventing a fact.
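A reusable template makes this habit easier to keep. The sketch below goes one step further than the example prompt above by handing the model numbered source excerpts and giving it explicit permission to decline, which is an assumption about how a clinic might choose to work and not a feature of any particular tool. The function name and the wording of the instructions are hypothetical.

```python
def grounded_prompt(question: str, sources: list[str]) -> str:
    """Build a prompt that restricts the answer to numbered sources and permits refusal."""
    if not sources:
        raise ValueError("at least one source excerpt is required")
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(sources, start=1))
    return (
        "Answer using only the numbered sources below.\n"
        "After every claim, cite the supporting source number in brackets.\n"
        "If the sources do not answer the question, reply exactly: "
        "INSUFFICIENT EVIDENCE.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}"
    )


# The excerpts would come from abstracts or guidelines the clinician has pulled.
print(grounded_prompt(
    "What are the indications for SGLT2 inhibitors in HFpEF?",
    ["<abstract of trial 1>", "<abstract of trial 2>"],
))
```

A prompt like this narrows the space for confabulation but does not close it. The bracketed citations still need to be checked against the excerpts they point to.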

Adopt a Review-Edit-Sign workflow

When the only motor action available to the clinician is approval or rejection, the cognitive default is approval. When the workflow includes an editing step in the middle, the clinician’s hand is forced to engage with the content. Even small edits change the cognitive register from reviewer to author. That shift is essential. It is also the single most reliable behavioral guardrail I have seen in my own practice.
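In software terms, the difference between Review-Sign and Review-Edit-Sign is a small state machine in which signing is not reachable directly from review. The sketch below is illustrative only: `DraftNote`, `Stage`, and the method names are hypothetical and do not describe any vendor's product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    REVIEW = "review"
    EDIT = "edit"
    SIGNED = "signed"


@dataclass
class DraftNote:
    text: str
    stage: Stage = Stage.REVIEW
    edit_reasons: list[str] = field(default_factory=list)

    def submit_edit(self, new_text: str, reason: str) -> None:
        """Record the clinician's edit pass; not allowed once the note is signed."""
        if self.stage is Stage.SIGNED:
            raise RuntimeError("signed notes cannot be edited")
        self.text = new_text
        self.edit_reasons.append(reason)
        self.stage = Stage.EDIT

    def sign(self) -> None:
        """Block a direct review-to-sign jump; an edit pass must come first."""
        if self.stage is not Stage.EDIT:
            raise RuntimeError("complete the edit step before signing")
        self.stage = Stage.SIGNED


note = DraftNote("AI-generated draft text")
note.submit_edit("Clinician-corrected text", reason="removed invented referral site")
note.sign()
print(note.stage)
```

In this version an edit pass that leaves the text unchanged still unlocks signing. A stricter design could require a recorded change or a per-section confirmation, at the cost of more friction on notes that were correct to begin with.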

Use a secondary check for high-stakes outputs

Some health systems are deploying a second AI model whose only job is to flag inconsistencies in the first model’s output. A critic model, in other words. The technique does not catch every error. However, it catches the easy ones, and it shifts the human attention budget toward the hard ones. The architecture is similar to the second-radiologist read in mammography: redundancy in service of safety.
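A secondary check does not have to be a second language model to be useful. The sketch below is a deliberately simple rule-based stand-in, not a critic model: it looks for "follow up with ... at ..." instructions and flags any destination that does not appear in a local referral directory. The function name, the phrase pattern, and the directory contents are all assumptions made for illustration.

```python
import re

# Matches "follow up with <service> at [the] <destination>" within one sentence.
FOLLOW_UP = re.compile(r"follow up with [^.]*? at (?:the )?([^.,;\n]+)", re.IGNORECASE)


def flag_unknown_facilities(note: str, local_directory: set[str]) -> list[str]:
    """Return follow-up destinations that mention no facility in the local directory."""
    known = [name.lower() for name in local_directory]
    flagged = []
    for match in FOLLOW_UP.finditer(note):
        destination = match.group(1).strip()
        if not any(name in destination.lower() for name in known):
            flagged.append(destination)
    return flagged


directory = {"Example General Hospital"}  # hypothetical local referral list
note = (
    "Patient instructed to follow up with pulmonology at the Mayo Clinic. "
    "Follow up with cardiology at Example General Hospital in two weeks."
)
print(flag_unknown_facilities(note, directory))
```

A check like this would have caught the referral line in the opening anecdote, but only because that error happens to fit a pattern someone thought to write down. It says nothing about invented clinical details, which is where a model-based critic or a careful human read has to take over.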

Report what you catch

A hallucination that you intercept and correct disappears from the record. The vendor never sees it. The next clinician using the same tool gets the same risk. Build a habit of reporting, even informally, to the vendor or the institutional team responsible for AI quality. The post-market surveillance system the FDA framework wants to build is only as good as the signal that gets fed into it. Every clinician who reports a hallucination is protecting the next clinician who would have signed it.
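Reporting is easier to sustain when the report has a fixed shape. The sketch below shows one possible record, with the three categories taken from the flavors described earlier in this article. The field names are hypothetical and no vendor or regulator prescribes this format. The text fields should carry de-identified excerpts only.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class HallucinationReport:
    tool_name: str
    document_type: str           # e.g. "discharge summary"
    category: str                # "clinical detail", "citation", or "workflow"
    generated_text: str          # de-identified excerpt of what the tool wrote
    corrected_text: str          # what the clinician changed it to
    caught_before_signing: bool
    reported_on: str             # ISO 8601 date

    def to_json(self) -> str:
        """Serialize the report for sending to a vendor or an internal AI quality team."""
        return json.dumps(asdict(self), indent=2)


report = HallucinationReport(
    tool_name="<scribe product name>",
    document_type="discharge summary",
    category="workflow",
    generated_text="follow up with pulmonology at the Mayo Clinic",
    corrected_text="follow up with pulmonology at <local clinic>",
    caught_before_signing=True,
    reported_on=date.today().isoformat(),
)
print(report.to_json())
```

Even a shared spreadsheet with these columns gives the institution a count of how often each tool invents details, which is the signal post-market monitoring depends on.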

The combination of these tactics is the closest thing to a well-rounded clinical AI safety routine I have seen work in practice. None of them is sufficient on its own. Together, they form an essential operational discipline that compensates for the part of the system that cannot be trusted to flag its own errors.

Conclusion

AI scribes and AI-assisted clinical tools are going to be part of medicine for a long time. The productivity gain is real. The documentation quality on the human-supervised cases is real. The time these tools give back to direct patient care is real. The same scribe that almost sent the COPD patient to Minnesota has, on a different day, pulled together a careful and accurate summary of a 90-minute consultation that would have taken me 25 minutes to write by hand.

The point of this article is not to argue against the tools. It is to argue against the pattern of trust they invite. The output of these systems looks like the output of a competent colleague. It is not produced by a competent colleague. It is produced by a probability machine that has no idea what is true. The clinician, you and me and the resident on call tonight, is the only point in the workflow where truth gets enforced.

That is the real safety conversation. Not whether to use AI in clinical practice. How to remain the backstop while using it.

—

Key Clinical Takeaways

AI hallucinations in clinical text are plausible-sounding fabrications, not glaring errors. The risk profile is “confident wrong,” not “obviously broken.” That makes them harder to catch under shift conditions, and harder to flag back to the vendor afterward.

Confabulation differs from refusal in an essential way. A refusal is the safety feature. A confabulation is the failure mode that remains after fluency-tuning, and it is the one that should keep clinicians up at night.

Automation bias makes well-formatted documents harder to scrutinize. The better the formatting, the less the reviewing clinician engages with the content. Experience is not a reliable protection: in some studies of decision-support workflows, experienced operators were no less vulnerable than less experienced ones.

The FDA Lifecycle Management framework is a regulatory backstop, not a replacement for clinician vigilance. Cleared status is a snapshot, not a guarantee. Use the post-market reporting channel when your vendor offers one, and ask for one when it does not.

Human-in-the-loop is a workflow, not a slogan. Read the document. Triangulate against source data. Use Review-Edit-Sign instead of Review-Sign. Report the hallucinations you catch.

—

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? *FAccT ’21*.

U.S. Food and Drug Administration. *Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence/Machine Learning (AI/ML)-Enabled Device Software Functions* (Guidance for Industry, 2024–2025).

Cummings, M. L. (2004). Automation Bias in Intelligent Time Critical Decision Support Systems. *AIAA 1st Intelligent Systems Technical Conference*.


—


Dr. Ahmed Zayed, MD

Licensed physician and clinical AI specialist. Founder and Editor-in-Chief of ZayedMD, a physician-led medical publication covering clinical AI, neurology, metabolic health, and evidence-based patient guidance.