
AI Hallucinations in Clinical Prescribing: A Risk-Stratified Field Guide for Practising Clinicians

A clinician-focused, evidence-anchored field guide to the specific failure modes that arise when LLMs are used for prescription drafting, drug-interaction checks, and dose calculation. Covers fabricated references, ambient-scribe hallucinations, antimicrobial errors, and what the peer-reviewed evidence says about safe deployment.

May 27, 2026 · Updated May 28, 2026
Medically reviewed by Dr. Ahmed Zayed, MD · Last updated May 28, 2026

There is a particular kind of quiet moment that happens after you have asked an LLM for a drug-interaction check or a dose recommendation. The answer comes back fast, it sounds authoritative, and it usually cites a reference or two. The temptation is to accept it and move on. If you have been practising long enough to remember when UpToDate first changed prescribing workflows, you probably already know the right question to ask next: how do I know this is true?

The 2023 to 2026 peer-reviewed evidence on large-language-model prescribing has now matured to the point where we can answer that question with specifics rather than intuition. We know which drug classes are highest-risk for AI hallucinations. We know which prescribing contexts produce the worst error rates. We know that the headline numbers in vendor demos do not match the open-ended free-text performance clinicians actually encounter. The literature is uneven, but the gaps are themselves informative, and the peer-reviewed findings give you a working risk-stratification framework you can use tomorrow morning.

This post covers what AI hallucinations in clinical prescribing actually look like, where the strongest evidence sits, and how to deploy LLMs in your prescribing workflow without putting patients at preventable risk.

What is an AI hallucination in the prescribing context?

A hallucination in clinical prescribing is not the same as a hallucination in casual chatbot use. It is a confidently stated, plausibly written, factually wrong drug-related output from a large language model. The mechanism is well understood. LLMs are trained to predict the next token, not to verify pharmacological fact. When they do not know an answer, they generate something that looks like an answer. In a casual context this is annoying. In a prescribing context it is a patient-safety event waiting to be documented.

The specific failure modes break down into four clinically meaningful categories. The first is fabricated facts, such as a wrong dose, a wrong indication, or a wrong half-life. The second is fabricated references, where the LLM cites a journal article that does not exist or attributes a real article to the wrong authors. The third is missed contraindications, where a clinically important warning is silently omitted. The fourth is fabricated interactions, where an LLM invents a drug-drug interaction that has no evidence base, or misses a real one. Each of these has now been measured in the peer-reviewed literature.

Why “exam pass” does not mean “safe prescriber”

It is essential to separate two things that vendor demos often conflate: performance on multiple-choice examinations and performance on open-ended prescribing questions. LLMs can score impressively on multiple-choice clinical pharmacology questions. van Nuland and colleagues, writing in the Journal of Clinical Pharmacology in 2024 (van Nuland M et al., 2024), reported ChatGPT correctly answering 79% of 264 factual clinical-pharmacy MCQs, beating the pharmacist baseline of 66%. Stroop and colleagues, in the British Journal of Clinical Pharmacology in 2025 (Stroop A et al., 2025), then tested GPT-4 on open-ended clinician-submitted pain-therapy questions covering drug interactions, dosages, and contraindications. Performance on free-text prescribing was meaningfully worse than on MCQ benchmarks. The exam-pass headline does not survive contact with the prescribing pad.

How often do LLMs fabricate the references behind a prescribing recommendation?

This question has the cleanest number in the literature. Bhattacharyya and colleagues, writing in Cureus in 2023 (Bhattacharyya M et al., 2023), asked ChatGPT-3.5 to generate 115 references across 30 short medical papers. Of those references, 47% were fabricated, another 46% were authentic but inaccurate, and only 7% were both authentic and accurately cited. Incorrect PMIDs were the most common error component, present in 93% of the papers. This single statistic should change how you treat any LLM-generated evidence in a prescribing context. If your AI assistant tells you a recommendation is supported by Smith et al. 2022, then on these figures the probability that Smith et al. 2022 exists as cited is roughly one in fourteen (7%).

What this means in practice

Treat any LLM-generated citation as a hypothesis to verify, not as evidence to act on. If you are using an LLM to draft an answer for a patient or a colleague and the answer includes references, you are responsible for resolving those references against PubMed or the publisher record before you stand behind the underlying claim.
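One way to make that verification step routine is to script the first pass. The sketch below is a minimal illustration, not a validated tool. It assumes the NCBI E-utilities ESummary endpoint and its JSON layout (a `result` object keyed by PMID, carrying an `error` field when no record exists). The function names, the word-overlap measure, and the 0.8 threshold are illustrative choices and do not come from any of the studies cited here.

```python
import json
import re
import urllib.request
from typing import Optional

# NCBI E-utilities ESummary endpoint for PubMed (response layout is assumed, see lead-in).
ESUMMARY_URL = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    "?db=pubmed&retmode=json&id={pmid}"
)


def fetch_pubmed_summary(pmid: str) -> Optional[dict]:
    """Return the ESummary record for a PMID, or None if PubMed reports no record."""
    with urllib.request.urlopen(ESUMMARY_URL.format(pmid=pmid), timeout=10) as resp:
        payload = json.load(resp)
    record = payload.get("result", {}).get(pmid)
    if record is None or "error" in record:
        return None
    return record


def _words(text: str) -> set:
    """Lower-case a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_overlap(cited_title: str, record_title: str) -> float:
    """Jaccard overlap between the cited title and the title on the PubMed record."""
    a, b = _words(cited_title), _words(record_title)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_citation(cited_title: str, record: Optional[dict],
                      threshold: float = 0.8) -> str:
    """Sort a citation into one of three buckets; the threshold is an illustrative guess."""
    if record is None:
        return "possibly fabricated: PMID not found"
    if title_overlap(cited_title, record.get("title", "")) < threshold:
        return "authentic PMID, but title does not match: check manually"
    return "PMID and title match: still read the paper"
```

In use, you would pass the PMID the LLM supplied to `fetch_pubmed_summary` and hand the result to `classify_citation` together with the cited title. A match only tells you the paper exists as cited. It does not tell you the paper supports the claim, so the reading step stays with you.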

How well do general-purpose LLMs perform as drug-interaction checkers?

Poorly, and the evidence is direct. Al-Ashwal and colleagues, in Drug, Healthcare and Patient Safety in 2023 (Al-Ashwal FY et al., 2023), compared ChatGPT-3.5, ChatGPT-4, Bing AI, and Google Bard against conventional drug-interaction databases across 255 drug pairs. Sensitivity for clinically relevant drug-drug interactions (DDIs) was poor across all four chatbots. Roosan and colleagues, in the Journal of the American Pharmacists Association in 2023 (Roosan D et al., 2023), evaluated ChatGPT-4 in medication therapy management and found the model did surface drug interactions but did not recommend specific dosages. That distinction matters. An LLM that flags interactions can be a useful prompt to investigate further. An LLM that recommends doses without surfacing the underlying interaction logic is a safety hazard.
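For readers who want to run the same kind of comparison on their own formulary, sensitivity here means the share of reference-confirmed interactions that the chatbot actually flagged. The sketch below shows the bookkeeping under simple assumptions. The drug pairs are placeholders, the function name is hypothetical, and a pair the chatbot never answered is counted as not flagged. It does not reproduce the method or the data of the studies above.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def screening_metrics(reference: Dict[Pair, bool],
                      chatbot: Dict[Pair, bool]) -> Dict[str, float]:
    """Compare chatbot interaction flags with a reference database, pair by pair."""
    tp = fp = tn = fn = 0
    for pair, truly_interacts in reference.items():
        flagged = chatbot.get(pair, False)  # unanswered pairs count as "not flagged"
        if truly_interacts and flagged:
            tp += 1
        elif truly_interacts:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "missed_interactions": fn,
    }


# Placeholder pairs only; these are not real interaction data.
reference = {
    ("drug_A", "drug_B"): True,
    ("drug_A", "drug_C"): True,
    ("drug_B", "drug_C"): False,
    ("drug_C", "drug_D"): True,
}
chatbot = {
    ("drug_A", "drug_B"): True,
    ("drug_A", "drug_C"): False,
    ("drug_B", "drug_C"): False,
}
metrics = screening_metrics(reference, chatbot)
print(metrics)
```

The count of missed interactions is reported alongside the ratios because, for a safety check, each missed pair is a patient-level risk, not only a percentage point.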

The clinical implication

Do not use a general-purpose LLM as your primary DDI check. Use Lexicomp, Micromedex, Stockley’s, or your hospital’s formulary-integrated tool. An LLM can be a useful adjunct for surfacing possible interactions you might not have thought to look for, but the verification step has to happen in a reference database.

What does the evidence say about antimicrobial prescribing?

Antimicrobial prescribing is one of the highest-risk prescribing categories for LLM deployment, and the peer-reviewed evidence is consistent. De Vito and colleagues, in Infection in 2024 (De Vito A et al., 2024), tested ChatGPT against infectious-disease residents and specialists on 96 antibiogram-based cases covering endocarditis, pneumonia, intra-abdominal infections, and bloodstream infections. ChatGPT underperformed specialists on prescriptive accuracy, particularly on antibiogram interpretation. Tao and colleagues, in Annals of Biomedical Engineering in 2024 (Tao Z et al., 2024), evaluated ChatGPT-4 antibiotic recommendations covering choice, dose, and duration in vulnerable populations. Only 38.1% of responses were comprehensive and correct, and 11.9% contained outright errors.

Why antimicrobial prescribing is particularly hard for LLMs

Antibiotic choice depends on the local antibiogram, patient-specific factors such as allergies and renal function, the suspected source of infection, and the institutional stewardship policy. None of these are reliably available to a general-purpose LLM, which is reasoning from its training data rather than your hospital’s susceptibility patterns. The result is that LLMs default to textbook-canonical answers that may be wrong for your patient or your local resistance landscape. Treat any LLM antibiotic recommendation as a starting point for the conversation with your ID consultant, not as a prescription you can write.

What about dose calculation in narrow-therapeutic-index drugs?

This is where the evidence is thinnest and the risk is highest. Tezcan and colleagues, in Digital Health in 2026 (Tezcan H et al., 2026), retrospectively compared GPT-4 weekly warfarin dose adjustments against cardiologist prescriptions in 180 patients with out-of-range INRs. Roughly 74% of GPT-4 recommendations were within ±1 mg per week of the cardiologist dose, and 84% within ±2 mg per week. The authors explicitly framed their results as hypothesis-generating, not safety-validating, and called for prospective physician-supervised trials before any clinical use. That framing is essential to read carefully. A 74% “close enough” rate sounds good until you remember that the remaining 26% includes the patients for whom the dose was meaningfully wrong.
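The percentages are easier to weigh as patient counts. The short calculation below converts the two agreement figures into approximate numbers of patients outside each tolerance band. The counts are estimates derived from the rounded percentages quoted above, not figures taken from the paper, and the function name is illustrative.

```python
def patients_outside_tolerance(total_patients: int, pct_within: float) -> int:
    """Approximate number of patients whose recommendation fell outside a tolerance band."""
    return round(total_patients * (100 - pct_within) / 100)


TOTAL_PATIENTS = 180  # patients with out-of-range INRs in the study described above

outside_1mg = patients_outside_tolerance(TOTAL_PATIENTS, 74)  # beyond ±1 mg per week
outside_2mg = patients_outside_tolerance(TOTAL_PATIENTS, 84)  # beyond ±2 mg per week

print(f"Beyond ±1 mg/week: about {outside_1mg} of {TOTAL_PATIENTS} patients")
print(f"Beyond ±2 mg/week: about {outside_2mg} of {TOTAL_PATIENTS} patients")
```

On these rounded inputs, roughly 47 patients received a recommendation more than 1 mg per week away from the cardiologist dose, and roughly 29 were more than 2 mg per week away.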

Where the evidence is essentially absent

There is no peer-reviewed accuracy study of ChatGPT, GPT-4, Gemini, or Claude on pediatric weight-based dosing as of the search date (2026-05-19). There is no peer-reviewed study on LLM accuracy for renal- or hepatic-function dose adjustment. There is no peer-reviewed accuracy study for vancomycin, methotrexate, chemotherapy regimens, or pregnancy-teratogen contraindication checking. These gaps are not just academic. They are categories of prescribing where the harm of a wrong answer is high and where you have essentially no published evidence about how well an LLM performs. The honest default is to assume LLMs are not yet safe for these decisions without human verification.

What is happening with ambient AI scribes and prescribing artifacts?

This is the deployment context most clinicians will encounter first, because ambient scribes are spreading through health systems faster than dedicated prescribing AI. Taylor and colleagues, in JMIR Medical Informatics in 2026 (Taylor SL et al., 2026), reported a UC Davis pragmatic prospective pilot involving 31 physicians and 7,545 ambient-AI-scribe notes. Physician review of 356 notes identified the following error profile.

  • Accidental omissions in 18% of notes.
  • Hallucinations in 11.5% of notes.
  • Accidental inclusions in 9.3% of notes.
  • Bias in 1.1% of notes.
  • Errors rated severity 4 or 5, meaning serious or imminent harm potential, in 5.3% of notes.

The most striking finding may be the behaviour data. The median proportion of AI words edited by physicians was only 9%, and 14.9% of notes were left entirely unedited. Leung and colleagues, in JMIR Medical Informatics in 2025 (Leung TI et al., 2025), provide the editorial-level context, cataloguing hallucination and omission concerns and the limited evidence on patient-safety outcomes across the ambient-scribe category.
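Because only 356 notes were reviewed, each of those percentages carries sampling uncertainty. The sketch below converts the reported rates into approximate note counts and attaches a Wilson score interval to each. It assumes every percentage uses the 356 reviewed notes as its denominator and treats notes as independent observations, which is a simplification given that many notes come from the same physicians. The counts are back-calculated from rounded percentages, not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


REVIEWED_NOTES = 356
reported_pct = {
    "omissions": 18.0,
    "hallucinations": 11.5,
    "inclusions": 9.3,
    "bias": 1.1,
    "severity 4 or 5": 5.3,
}

for label, pct in reported_pct.items():
    count = round(REVIEWED_NOTES * pct / 100)  # back-calculated, approximate
    low, high = wilson_interval(count, REVIEWED_NOTES)
    print(f"{label}: ~{count} notes, 95% CI {low:.1%} to {high:.1%}")
```

The interval matters most for the rarer categories, where a handful of notes either way moves the estimate noticeably.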

The specific prescribing failure mode

The category of ambient-scribe error that matters most for prescribing is the auto-generated or auto-inserted medication entry. When an LLM transcribes a clinical encounter and infers a medication change that the physician did not actually intend, the artifact lands in the note and, sometimes, in the medication-administration record. The Taylor data tell you that in a real deployment this is happening at clinically meaningful rates and that physicians are not catching it as reliably as the vendor marketing suggests.
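A simple safeguard against this failure mode is to compare the medication entries in the AI-drafted note with the orders the clinician actually confirmed, and to surface any mismatch before sign-off. The sketch below illustrates the idea with plain set differences. The function names and data shapes are hypothetical, the entries are placeholders, and exact string matching is a deliberate simplification, since a real system would need coded medication identifiers from the EHR.

```python
from typing import Dict, List, Set


def normalise(entry: str) -> str:
    """Lower-case an entry and collapse whitespace so trivial differences do not matter."""
    return " ".join(entry.lower().split())


def reconcile_medications(note_entries: List[str],
                          confirmed_orders: List[str]) -> Dict[str, Set[str]]:
    """Return entries that appear on one side only, for clinician review."""
    note = {normalise(e) for e in note_entries}
    orders = {normalise(e) for e in confirmed_orders}
    return {
        "in_note_not_ordered": note - orders,   # possible inferred or hallucinated entry
        "ordered_not_in_note": orders - note,   # possible omission from the note
    }


# Placeholder entries only.
draft_note = ["Drug A 10 mg daily", "drug B  5 mg nightly"]
confirmed = ["drug a 10 mg daily", "Drug C 20 mg daily"]

review = reconcile_medications(draft_note, confirmed)
print(review)
```

Anything in `in_note_not_ordered` is a candidate for exactly the artifact described above: a medication change the scribe wrote down that the physician never ordered.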

What does the evidence say about EHR-integrated prescribing copilots?

Here the gap between vendor claims and peer-reviewed evidence is the widest. PubMed yields no controlled prescribing-error trial of Epic-integrated GPT-4, Microsoft DAX Copilot prescribing features, Suki, or Augmedix as of the search date. The closest peer-reviewed analog is Ong and colleagues, in Cell Reports Medicine in 2025 (Ong JCL et al., 2025), who ran a prospective cross-over study of a retrieval-augmented LLM clinical decision support system on 91 prescribing-error scenarios across 16 specialties. They tested three configurations. The pharmacist-plus-LLM configuration outperformed both the LLM-alone configuration and, in some scenarios, the pharmacist-alone configuration. The LLM-alone configuration was inferior to humans.

The practical reading

Claim safety for AI prescribing only where there is prospective, controlled, human-in-the-loop evidence. Everything else is preliminary, regardless of the brand on the user interface. If your hospital is being pitched an EHR-integrated prescribing copilot, ask for the prospective error-rate data with the actual deployment configuration in your environment. If the vendor cannot produce it, treat the deployment as a pilot and document accordingly.

What about geriatric prescribing and deprescribing?

This is a use case where structured tools plus an LLM appear to work better than an LLM alone. Kulenovic and Lagumdzija-Kulenovic, in Studies in Health Technology and Informatics in 2025 (Kulenovic A, Lagumdzija-Kulenovic A, 2025), paired ChatGPT 4.0 with a structured potentially-inappropriate-medication detection tool. The LLM alone was insufficient against STOPP and Beers criteria without the rule-based scaffold. Socrates and colleagues, in JMIR Aging in 2025 (Socrates V et al., 2025), described an LLM pipeline for deprescribing opportunities in older emergency-department patients. The work is promising but retrospective. The pattern is consistent. Use structured tools as the scaffold. Use the LLM to surface candidate opportunities. Use the clinician to make the decision.

What is the bias signal in opioid prescribing?

Omar and colleagues, in a medRxiv preprint in 2025 (Omar M et al., 2025), evaluated 10 large language models across 1,000 acute-pain vignettes with 34 sociodemographic variations. The work demonstrated socio-demographic bias in LLM opioid recommendations. The caveat is that this is a preprint rather than a peer-reviewed article. The clinical implication holds nonetheless. LLMs reflect the biases in their training data. In a prescribing category where racial and socioeconomic disparities are already extensively documented, deploying an unmonitored LLM into the opioid-prescribing pathway risks amplifying those disparities, not reducing them.

How should you actually stratify the risk in your own practice?

Let’s take a look at a practical framework based on the evidence above.

Highest-risk drug classes

Anticoagulants, antimicrobials when antibiogram interpretation is involved, opioids, chemotherapy, and any narrow-therapeutic-index agent where the evidence is essentially absent. Treat LLM output for these classes as a hypothesis only.

Highest-risk patient populations

Older adults on polypharmacy, pediatrics where weight-based dosing is required, pregnancy where teratogenicity matters, and any patient with renal or hepatic impairment requiring dose adjustment. The peer-reviewed accuracy evidence in these populations is essentially absent, which is itself a risk-stratification finding.

Highest-risk prescribing contexts

  • Open-ended free-text queries (Stroop 2025, much worse than MCQ benchmarks would suggest).
  • Requests for evidence or references (Bhattacharyya 2023, with 47% fabricated).
  • Ambient-scribe-generated medication entries (Taylor 2026, with 15% of notes never edited).
  • Chatbot-based DDI screening as a primary check (Al-Ashwal 2023, with poor sensitivity).

Lowest-risk prescribing contexts

  • Structured prompts with explicit reference databases attached, such as a retrieval-augmented system against Lexicomp.
  • Pharmacist-in-the-loop verification (Ong 2025).
  • Rule-based deprescribing scaffolds with the LLM surfacing candidates (Kulenovic 2025).
  • Documentation review of LLM-generated content with explicit clinician edit, not pass-through acceptance.
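The framework above can be written down as a small lookup, which is one way to make it auditable inside a department. The sketch below is an illustration only. The category labels, the tier names, and the rule that any single high-risk flag overrides a lower-risk context are assumptions made for the example, not a validated triage tool, and every tier still requires clinician verification.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical shorthand labels for the categories listed in this section.
HIGH_RISK_DRUG_CLASSES = {
    "anticoagulant", "antimicrobial_antibiogram", "opioid",
    "chemotherapy", "narrow_therapeutic_index",
}
HIGH_RISK_POPULATIONS = {
    "older_adult_polypharmacy", "pediatric", "pregnancy",
    "renal_impairment", "hepatic_impairment",
}
HIGH_RISK_CONTEXTS = {
    "free_text_query", "reference_request",
    "ambient_scribe_medication_entry", "chatbot_primary_ddi_check",
}
LOWER_RISK_CONTEXTS = {
    "retrieval_augmented_reference_db", "pharmacist_in_the_loop",
    "rule_based_scaffold", "clinician_edited_documentation",
}


@dataclass(frozen=True)
class PrescribingQuery:
    drug_class: str
    population: str
    context: str


def risk_flags(query: PrescribingQuery) -> List[str]:
    """List every high-risk category the query falls into."""
    flags = []
    if query.drug_class in HIGH_RISK_DRUG_CLASSES:
        flags.append(f"high-risk drug class: {query.drug_class}")
    if query.population in HIGH_RISK_POPULATIONS:
        flags.append(f"high-risk population: {query.population}")
    if query.context in HIGH_RISK_CONTEXTS:
        flags.append(f"high-risk context: {query.context}")
    return flags


def risk_tier(query: PrescribingQuery) -> str:
    """Conservative rule: any high-risk flag wins, even in a lower-risk context."""
    if risk_flags(query):
        return "high"
    if query.context in LOWER_RISK_CONTEXTS:
        return "lower"
    return "unclassified"


example = PrescribingQuery("anticoagulant", "adult_general", "pharmacist_in_the_loop")
print(risk_tier(example), risk_flags(example))
```

The override rule is a deliberate choice for the example: a pharmacist in the loop reduces risk, but the sketch does not let that downgrade a query involving an anticoagulant or a pediatric patient.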

What about the postmarket surveillance question?

This is the gap I want every clinician reading this article to register. There is no published peer-reviewed MedWatch or FDA Adverse Event Reporting System analysis of AI-prescribing-attributable adverse events as of the search date. The postmarket-surveillance infrastructure for AI-mediated prescribing harm has not yet been operationalized in the indexed literature. This means that when an AI-mediated prescribing error happens in your clinic and contributes to patient harm, the reporting pathway and the aggregated signal are essentially absent. ZayedMD’s position is that this gap needs to close. Until it does, individual clinicians and pharmacy-and-therapeutics committees are the surveillance system, and incident reports against AI tools should be filed with the same seriousness as any other prescribing-error report.

Conclusion

Large language models are now embedded in prescribing workflows whether we deliberately deployed them or not. The peer-reviewed evidence does not support a blanket prohibition, and it does not support uncritical adoption. It supports a more clinically useful position. There are prescribing contexts where the evidence justifies cautious, human-in-the-loop use of LLMs as adjuncts to existing reference tools. There are prescribing contexts, including pediatric dosing, renal-adjusted dosing, narrow-therapeutic-index drugs, and antibiogram-driven antimicrobial choice, where the peer-reviewed accuracy evidence is either essentially absent or unfavourable, and the responsible default is to assume LLMs are not yet safe. There is a documented fabricated-citation rate of 47% in the Bhattacharyya 2023 reference study, which alone is enough to change how you treat any AI-generated evidence in a prescribing context. If you build a risk-stratified habit of asking “what does the evidence say about LLM accuracy in this prescribing context?” before you accept an AI suggestion, you will be practising at the edge of the evidence rather than chasing the technology after the harm has been done.


References

  1. Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High rate of fabricated and inaccurate references in ChatGPT-generated medical content. Cureus 2023; 15(5): e39238. doi:10.7759/cureus.39238 (PMID: 37337480)
  2. Taylor SL et al. Hallucinations, omissions, and editing behaviour in an ambient AI scribe deployment: a UC Davis prospective pilot. JMIR Medical Informatics 2026; 14: e86474. doi:10.2196/86474 (PMID: 41996389)
  3. Al-Ashwal FY, Zawiah M, Gharaibeh L, Abu-Farha R, Bitar AN. Evaluation of chatbots for drug interaction screening: ChatGPT-3.5, ChatGPT-4, Bing AI, and Google Bard versus conventional databases. Drug, Healthcare and Patient Safety 2023; 15: 137-147. doi:10.2147/DHPS.S425858 (PMID: 37750052)
  4. Huang X, Estau D, Liu X, Yu Y, Qin J, Li Z. ChatGPT vs licensed clinical pharmacists across prescription review, ADR recognition, and drug counselling tasks. British Journal of Clinical Pharmacology 2023; 89(1): 244-250. doi:10.1111/bcp.15896 (PMID: 37626010)
  5. Roosan D et al. ChatGPT-4 in medication therapy management: surfaces drug interactions but does not recommend specific dosages. Journal of the American Pharmacists Association 2023; 64(2): 422-428. doi:10.1016/j.japh.2023.11.023 (PMID: 38049066)
  6. Ong JCL et al. Pharmacist-plus-LLM versus LLM-alone in a prospective crossover study of 91 prescribing-error scenarios across 16 specialties. Cell Reports Medicine 2025; 6(7): 102323. doi:10.1016/j.xcrm.2025.102323 (PMID: 40997804)
  7. De Vito A et al. ChatGPT versus infectious-disease residents and specialists on antibiogram-based cases. Infection 2024; 52(5): 1957-1964. doi:10.1007/s15010-024-02350-6 (PMID: 38995551)
  8. Tao Z et al. ChatGPT-4 antibiotic recommendations in vulnerable populations: only 38.1% comprehensive and correct. Annals of Biomedical Engineering 2024; 53(2): 410-422. doi:10.1007/s10439-024-03600-2 (PMID: 39133388)
  9. Stroop A et al. GPT-4 on open-ended clinician-submitted pain-therapy questions: dosage, interaction, and contraindication performance. British Journal of Clinical Pharmacology 2025; 91(7): 1822-1830. doi:10.1002/bcp.70036 (PMID: 40066678)
  10. van Nuland M et al. ChatGPT on 264 factual clinical-pharmacy multiple-choice questions versus pharmacist baseline. Journal of Clinical Pharmacology 2024; 64(9): 1095-1100. doi:10.1002/jcph.2443 (PMID: 38623909)
  11. Tezcan H et al. GPT-4 weekly warfarin dose adjustments versus cardiologist prescriptions in 180 patients with out-of-range INRs. Digital Health 2026; 12: 20552076251412985. doi:10.1177/20552076251412985 (PMID: 41509866)
  12. Gao Y et al. ChatGPT fabrication of plausible-but-wrong drug-indication links across 2,694 true and 5,662 false associations. Annals of Biomedical Engineering 2024; 52(7): 1947-1957. doi:10.1007/s10439-023-03385-w (PMID: 37855948)
  13. Leung TI et al. The ambient AI scribe evidence base: hallucinations, omissions, and patient-safety outcome gaps. JMIR Medical Informatics 2025; 13: e80898. doi:10.2196/80898 (PMID: 40749188)
  14. Kulenovic A, Lagumdzija-Kulenovic A. ChatGPT 4.0 with structured potentially-inappropriate-medication detection: STOPP and Beers performance. Studies in Health Technology and Informatics 2025; 322: 248-252. doi:10.3233/SHTI250067 (PMID: 40200464)
  15. Socrates V et al. LLM pipeline for deprescribing opportunities in older emergency-department patients. JMIR Aging 2025; 8: e69504. doi:10.2196/69504 (PMID: 39679140)
  16. Wang L et al. Network meta-analysis of LLM accuracy across 168 studies and 35,896 questions: humans still outperform on top-1 and top-3 diagnosis. Journal of Medical Internet Research 2025; 27: e64486. doi:10.2196/64486 (PMID: 40305085)
  17. Gérard A et al. GPT-4o, Gemini Advanced, Le Chat, and DeepSeek R1 on the European Prescribing Exam. British Journal of Clinical Pharmacology 2025; 91(9): 2384-2392. doi:10.1002/bcp.70137 (PMID: 40495266)
  18. Omar M et al. Sociodemographic bias in LLM opioid recommendations across 10 models and 1,000 acute-pain vignettes. medRxiv 2025 (preprint, not peer-reviewed). doi:10.1101/2025.03.04.25323396 (PMID: 40093243)
  19. Reis ZSN et al. Structured calibrated prompts reduce ambiguity in e-prescription instructions versus generic prompts. Mayo Clinic Proceedings: Digital Health 2024; 2(4): 642-654. doi:10.1016/j.mcpdig.2024.09.006
  20. STAT News Opinion. Using AI in addiction medicine could be particularly risky. Companion ZayedMD article: AI in Addiction Medicine (2026-05-19).

PubMed search and metadata were retrieved via a parallel research sub-agent using the NCBI E-utilities API on 2026-05-19. All DOI links resolve to publisher-hosted full text or abstract. Five clinically important evidence gaps were identified during research. No peer-reviewed LLM accuracy data exist for (1) pediatric weight-based dosing, (2) renal- or hepatic-function dose adjustment, (3) vancomycin / methotrexate / chemotherapy regimens, (4) pregnancy-teratogen contraindication checking, or (5) prospective prescribing-error RCTs of any branded EHR-integrated copilot.

Dr. Ahmed Zayed, MD

Licensed physician and clinical AI specialist. Founder and Editor-in-Chief of ZayedMD, a physician-led medical publication covering clinical AI, neurology, metabolic health, and evidence-based patient guidance.