Dr. Ahmed Zayed, MD · General practitioner (12+ years) · Clinical AI builder (SAFE-Triage, Hathor) · Founder, ZayedMD · May 27, 2026 · 7 min


The Statistical Nature of AI Confabulation and the Physician Backstop

A physician's read on how AI hallucinations actually show up in clinic, and why 'confident wrong' is the failure mode we don't talk about enough.


Medically reviewed by Dr. Ahmed Zayed, MD · Last updated May 28, 2026

A discharge summary arrives in my inbox at 11pm. Sixty-two-year-old man, post-op day three, COPD on home oxygen. The document looks immaculate: clean headings, a tidy hospital-course paragraph, plausible dosing for his inhaled steroid combination. Two paragraphs in, I notice the line: *Patient instructed to follow up with pulmonology at the Mayo Clinic*.

We are in Cairo. He has never been to Minnesota.

The AI scribe pulled “Mayo Clinic” out of a probability cloud, most likely because it was a high-frequency string in its training data after the phrase *follow up with pulmonology*. Had I signed without reading carefully, the patient would have gone home with discharge instructions referring him to a hospital in another country.

This is not an isolated glitch from Western software. Just last week, a colleague here in Cairo passed me a radiology summary for a routine abdominal ultrasound. The report was perfectly structured in the local style, but it mentioned a “follow-up scheduled at the Kasr Al-Ainy specialist clinic” for a procedure the patient had undergone three years earlier. The colleague had used an AI assistant to “clean up” his dictation and did not check the output before hitting print. The model simply filled in the most statistically likely local facility to make the report feel complete.

That near-miss is the part of the AI safety conversation that does not make headlines. The actual risk landing in clinic inboxes today is boring and dangerous: plausible-sounding text that reads like every other clinical document, with a handful of details that are quietly invented.

The Statistical Nature of AI Confabulation

In the technical literature on large language models, a “hallucination” is any output that is not grounded in the model’s input or in verifiable fact. The model produces text that is statistically plausible. It follows the patterns learned during training. However, the specific claims are wrong, and the model has no way of knowing that they are wrong. There is no internal flag. No warning. No asterisk in the output.

The same probability machinery that produces accurate sentences produces hallucinated sentences. From the model’s point of view, they are the exact same kind of output.
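A toy sketch of that point, in Python. The continuation counts are invented for illustration and do not come from any real model or corpus, and `complete` is a hypothetical helper. A real language model samples from a learned distribution over tokens rather than looking up counts, but the shape of the problem is the same: the selection step sees frequencies, not the patient's chart.

```python
from collections import Counter

# Invented counts standing in for how often each continuation followed
# the prompt in some training corpus. Illustrative numbers only.
CONTINUATION_COUNTS = {
    "follow up with pulmonology at": Counter({
        "the Mayo Clinic": 120,
        "the outpatient clinic": 45,
        "Kasr Al-Ainy": 30,
    }),
}


def complete(prompt: str) -> tuple[str, float]:
    """Return the most frequent continuation and its relative frequency.

    Note what is missing: no argument carries the patient's location or
    history, so the choice cannot depend on either.
    """
    counts = CONTINUATION_COUNTS[prompt]
    text, count = counts.most_common(1)[0]
    return text, count / sum(counts.values())


continuation, probability = complete("follow up with pulmonology at")
print(f"{continuation} (p={probability:.2f})")
```

The function returns a wrong answer for a Cairo patient through exactly the same code path it would use to return a right answer for a Minnesota patient. Nothing in the output marks the difference.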

In a clinical context, this shows up in three flavors that matter to a working physician.

The first is the invented clinical detail. Your AI scribe summarizes a 20-minute consultation and writes that the patient described “exertional dyspnea worsening over three weeks,” because that phrase pattern lives in millions of training-set notes. The patient said something simpler. The model upgraded the description to match the surrounding documentation pattern.

The second is the invented citation. AI-assisted research tools confidently produce reference lists where two of the citations exist, two are real papers attributed to the wrong authors, and one is a journal article that was never written. The DOIs the model produces sometimes do not resolve to anything.

The third is the invented workflow. A discharge summary recommends follow-up at a hospital across the world, or a local facility like Kasr Al-Ainy that does not fit the patient’s actual history. A medication schedule references a brand name that does not exist in your formulary. A return-precaution paragraph mentions a symptom that does not apply to the patient’s procedure.

All three share an essential feature. The fabricated content is plausible. It uses the right register, the appropriate vocabulary, and the expected structure. A physician scanning quickly cannot tell the fabricated details from the real ones.

The Critical Difference Between Refusal and Confabulation

Modern AI systems fail in two distinct ways, and the safety implications run in opposite directions.

A **refusal** is when the model declines to answer. *I cannot determine the appropriate dose without more information about renal function*. This is a safety feature. The model is uncertain and tells you so. You get a clear signal that this is the kind of question the tool cannot answer reliably.

A **confabulation** is when the model produces a plausible-sounding answer despite uncertainty. The same dose question gets back a confident number, presented in the same calm declarative voice the model uses for things it actually knows. The model is not hedging. It is not flagging uncertainty. From the user’s point of view, a confabulated answer is indistinguishable from a correct one.

Refusal is the safety feature that user-experience optimization has quietly trained out of these models. Confabulation is the failure mode that remains. It is this failure mode, rather than the science-fiction scenarios that dominate the news, that should keep clinicians up at night.

Automation Bias and the Trained Physician

Automation bias is the well-described tendency of humans to over-trust computer-generated output, especially when the output looks professional. The phenomenon has been studied since the 1990s in aviation, in process control, and in radiology. Experience and training do not reliably protect against it. In some studies, more experienced clinicians are more susceptible than less experienced ones, because they have learned, correctly, that a well-formatted document usually reflects competent work.

Modern language models produce documents that look like the documents you have been trained to read. The headings are right. The paragraph structure is right. The clinical vocabulary is right. The pacing is right. A clinician scanning a discharge summary at the end of a 12-hour shift is not pattern-matching against the medical content. They are pattern-matching against whether the text looks like a normal discharge summary. AI-generated text passes that pattern match easily.

This visual familiarity creates a dangerous trap.

Proper Nouns and Over-Specified Details as Red Flags

Clinical hallucinations are rarely exotic. They hide in the mundane details that require attention you may not have at the end of a long shift.

**The proper-noun anomaly.** Hospital and facility names like Kasr Al-Ainy, along with drug brand names, are the most common place for a hallucination to hide. The model fills them from training-set frequency rather than the actual patient context. Read every proper noun in an AI-generated note as suspect until verified.

**The over-specified detail.** When the AI summary describes something with more precision than the actual clinical encounter generated, that precision was invented. The patient said “I get short of breath when I climb stairs.” The AI summary says “exertional dyspnea worsening over three weeks.” That is not a transcription. It is a generation, and the three-week timeline came from somewhere other than the visit.

**The plausible citation that does not resolve.** If your AI tool gives you a reference list, click through the DOIs. Real ones resolve to real papers. Fabricated ones land you on a 404 page or a different paper entirely. There is no in-between.
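For a long reference list, the click-through can be partly scripted. The sketch below is one possible approach, not a vetted tool: the function names are hypothetical, the DOI pattern is a deliberately loose syntax check, and it assumes the public `https://doi.org/` resolver answers an unknown DOI with HTTP 404. Network failures are left to raise, and some publisher sites reject automated requests, so any status other than 404 is treated only as "resolves".

```python
import re
import urllib.error
import urllib.request

# Loose syntactic pattern: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(doi: str) -> bool:
    """Cheap syntax check; passing it does not mean the DOI exists."""
    return bool(DOI_PATTERN.match(doi.strip()))


def doi_status(doi: str, timeout: float = 10.0) -> int:
    """Ask the doi.org resolver about a DOI and return the HTTP status code."""
    request = urllib.request.Request(
        f"https://doi.org/{doi.strip()}", method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep just the code.
        return err.code


def triage_reference(doi: str, status_lookup=doi_status) -> str:
    """Sort a citation into a bucket for the human reviewer."""
    if not looks_like_doi(doi):
        return "malformed"
    if status_lookup(doi) == 404:
        return "does not resolve"
    return "resolves, still confirm title and authors"
```

The last bucket is worded that way on purpose. A DOI that resolves can still point to a different paper entirely, so the script only removes the obvious fabrications; matching title and authors against the citation remains a human step.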

Practical Tactics for the Review-Edit-Sign Workflow

The literature converges on a small set of tactics that force engagement with the text.

Read the document, do not skim it. When you sign an AI-generated note, you are saying you wrote it. If your review pace is faster than your read pace for a self-authored note of comparable length, you are not reviewing. You are rubber-stamping.

Triangulate against source data. A scribe summary should agree with the medication list, the problem list, and the most recent vital signs. Spending a few seconds per check saves the enormous cost of a hallucinated detail you would have otherwise signed.

Adopt a Review-Edit-Sign workflow, not Review-Sign. When the workflow includes an editing step in the middle, your hand is forced to engage with the content. Even small edits change the cognitive register from reviewer to author. This shift is essential and remains the single most reliable behavioral guardrail in real practice.

Report what you catch. The vendor never sees a hallucination you intercept and correct. The next clinician using the same tool gets the same risk. Build a habit of reporting, even informally. The post-market surveillance the regulatory framework wants to build is only as good as the signal that gets fed into it.

The Clinician as the Backstop for Truth

AI scribes and AI-assisted clinical tools are going to be part of medicine for a long time. The productivity gain is real. The same scribe that almost sent the COPD patient to Minnesota has, on a different day, pulled together a careful and accurate summary of a 90-minute consultation that would have taken me 25 minutes to write by hand.

The point is not to argue against the tools. It is to argue against the pattern of trust they invite. The output of these systems looks like the output of a competent colleague. It is not produced by a competent colleague. It is produced by a probability machine that has no idea what is true. The clinician is the only point in the workflow where truth gets enforced.

The real safety conversation is about how to remain the backstop while using AI, rather than whether to use it at all.

—

*The full deep-dive on this topic, including the FDA’s 2025 Lifecycle Management framework and the operational discipline of human-in-the-loop workflows, is on ZayedMD: [AI Hallucinations in Clinical Practice: Why “Confident Wrong” Is the Real Risk](https://zayedmd.com/blog/ai-hallucinations-clinical-practice/).*

Dr. Ahmed Zayed, MD

Licensed physician and clinical AI specialist. Founder and Editor-in-Chief of ZayedMD, a physician-led medical publication covering clinical AI, neurology, metabolic health, and evidence-based patient guidance.
