Dr. Ahmed Zayed, MD · General practitioner (12+ years) · Clinical AI builder (SAFE-Triage, Hathor) · Founder, ZayedMD · May 17, 2026 · 10 min


Why AI Triage Systems Keep Failing — And What a Physician Built Instead

A physician explains why ESI and NEWS2 alone would have sent a heart attack patient home, and how SAFE-Triage's layered architecture caught it. 97.2% exact ESI agreement, zero critical under-triage, built for Egyptian emergency departments.


Medically reviewed by Dr. Ahmed Zayed, MD · Last updated May 28, 2026 · Editorial standards


He is 55 years old, diabetic, and complaining of mild heartburn.

In Arabic, he says: حرقان في المعدة خفيف — mild burning in the stomach. He does not look distressed. His vital signs are borderline normal. His NEWS2 score is zero. Standard scoring logic would classify him as ESI level 4: low acuity, one resource needed, sent to the waiting area.

Standard scoring logic would miss his heart attack.

This is a well-known teaching case in emergency medicine, famous precisely because it demonstrates where both NEWS2 and ESI, used in isolation, can fail. A diabetic male over 50 presenting with GI-type discomfort carries a documented silent myocardial infarction (MI) risk pattern that standard deterministic scoring is not designed to surface. This is why we built SAFE-Triage the way we did: not as a single-score system, but as a layered architecture that separates what AI does well from what deterministic rules do safely.

It is also the case we used in the SAFE-Triage presentation.


The Egyptian ED: What the Numbers Actually Show

Before I explain the system, you need to understand what it was built for.

Egypt has 1.4 hospital beds per 1,000 people. The global average is 2.9. Average transport time from incident to emergency department is 189.7 minutes — by which point many trauma patients have already passed their window for intervention. In Egyptian public hospitals, audits have found that NEWS2 and ESI compliance is effectively absent in most settings. The equipment is not consistently available. Nursing staff operate under severe cognitive load.

There is also a language problem that most AI triage tools do not address: patients in Egyptian emergency departments describe their symptoms in Egyptian Arabic dialect, not standard clinical Arabic and certainly not English. A system built on English-language training data, or even on Modern Standard Arabic, will misunderstand a meaningful fraction of what it hears.

And then there is infrastructure. The internet goes down. Regularly. Any AI triage system that requires a cloud connection to function is a system that fails exactly when it is most needed.

86.1% of Egyptian ED staff have experienced verbal violence from patients or families. 34.3% have experienced physical violence. The trigger, documented repeatedly, is perceived arbitrary waiting — the sense that triage decisions are random. Objective, transparent, explainable triage directly reduces this. That is not a side benefit of SAFE-Triage. It is a primary design requirement.

What the Research Shows About AI Triage

The AI triage debate is full of enthusiasm. The actual data carry a clearer and more sober message.

A 2024 comparative study in the Journal of Medical Internet Research tested multiple large language models against trained ED triage staff and untrained doctors. Result: GPT-4 performed comparably to untrained doctors — not to trained triage staff. The failure was systematic: LLMs tended toward overtriage, while untrained humans tended toward undertriage. Neither is safe.

A 2025 systematic review confirmed: AI triage improves speed and reduces cognitive burden, but high-acuity case safety still depends on trained human validation at every step.

JAMA Network Open documented in 2026 that overtriage and undertriage remain active safety failures in EDs even with fully trained staff and no AI involved. This is the baseline. Not a solved problem.

The pattern across the literature is consistent: an LLM placed in a decisional position — where it assigns acuity directly — is not a safe architecture for emergency triage. Not yet. The question is what a safe architecture looks like.

The Three Ways AI Triage Fails

1. It cannot reliably surface atypical presentations. LLMs find the statistically probable answer. For a 55-year-old diabetic with heartburn, the probable answer is GERD. The improbable-but-deadly answer is STEMI. An LLM without explicit red-flag retrieval — and without being forced to check for it regardless of how the patient presents — will miss it.

2. It hallucinates. LLMs generate plausible outputs even when data is missing. A model asked about a vital sign that was never recorded will produce a value consistent with the training distribution. In triage, a hallucinated normal oxygen saturation is a mechanism for misclassification. A 2024 scoping review on LLMs in emergency medicine identified this as a primary safety concern for clinical deployment.

3. It is not built for your population or language. Most AI triage systems are trained on Western, English-language datasets. A 2024 systematic review noted significant performance degradation on out-of-distribution populations. An Egyptian ED patient describing symptoms in Cairene dialect is about as far from those training distributions as possible.
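The first failure mode can be made concrete with a small sketch. It assumes a hypothetical `Presentation` record and one hand-written rule derived from the teaching case above (diabetic, over 50, GI-type complaint). The names and the keyword set are illustrative assumptions, not SAFE-Triage's actual rule base, and the rule is simplified for readability. The point is only that the check runs for every patient, however benign the most probable diagnosis looks.

```python
from dataclasses import dataclass

# Illustrative keyword set (assumption): complaints that can mask cardiac pain.
GI_TYPE_COMPLAINTS = {"heartburn", "epigastric pain", "indigestion", "nausea"}


@dataclass(frozen=True)
class Presentation:
    age: int
    diabetic: bool
    complaint: str  # normalised chief complaint, e.g. "heartburn"


def red_flags(p: Presentation) -> list[str]:
    """Return red-flag labels. Called for every patient, never skipped."""
    flags = []
    # Pattern from the teaching case: diabetic, over 50, GI-type discomfort.
    if p.diabetic and p.age > 50 and p.complaint in GI_TYPE_COMPLAINTS:
        flags.append("possible silent MI: atypical presentation")
    return flags


patient = Presentation(age=55, diabetic=True, complaint="heartburn")
print(red_flags(patient))
```

A probabilistic model may still rank GERD first for this patient. An explicit rule like this one does not rank anything; it simply refuses to let the pattern pass unnoticed.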

The Architecture: AI Extracts → Rules Decide → Humans Confirm

SAFE-Triage uses a layered architecture with a clear division of responsibility. Each component does what it is demonstrably good at. No AI component is placed in a decisional position.

The AI extraction layer

A patient arrives and describes their complaint — in Arabic, English, or mixed code-switching. Voice input is handled by Google’s Chirp 2 ar-EG speech recognition model. Text extraction runs through Gemini 2.5-flash via Google Vertex AI, which identifies the chief complaint, associated symptoms, and candidate red-flag patterns, then outputs a structured feature set.
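To illustrate what a "structured feature set" might look like, here is a minimal sketch. It assumes the extractor returns JSON, and the field names (`chief_complaint`, `symptoms`, `red_flag_candidates`) are hypothetical rather than taken from the SAFE-Triage codebase. One design choice is also an assumption on my part: the schema deliberately has no vital-sign fields, so a model that volunteers an unmeasured value is rejected instead of trusted.

```python
import json
from dataclasses import dataclass, field

# Hypothetical schema: the extractor may describe the complaint,
# but it is not allowed to supply vital signs.
ALLOWED_KEYS = {"chief_complaint", "symptoms", "red_flag_candidates"}


@dataclass
class ExtractedFeatures:
    chief_complaint: str
    symptoms: list[str] = field(default_factory=list)
    red_flag_candidates: list[str] = field(default_factory=list)


def parse_extraction(raw_json: str) -> ExtractedFeatures:
    """Validate model output before anything downstream sees it."""
    data = json.loads(raw_json)
    unknown = set(data) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected fields from extractor: {sorted(unknown)}")
    return ExtractedFeatures(**data)


raw = (
    '{"chief_complaint": "heartburn", '
    '"symptoms": ["mild epigastric burning"], '
    '"red_flag_candidates": ["atypical ACS"]}'
)
features = parse_extraction(raw)
print(features.chief_complaint, features.red_flag_candidates)
```

Under this sketch, vital signs would enter the pipeline only from measured sources, which keeps a hallucinated "normal" value out of the scoring path.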

The Arabic NLP layer draws on a curated lexicon of 2,101 Arabic keywords, including 1,858 Egyptian-dialect terms and variants, mapped to 6,370 SNOMED-CT concepts with ICD-10 cross-referencing. This bilingual coverage — from Egyptian colloquial Arabic to standardised medical terminology — is, to our knowledge, the first openly described system of its kind.
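A lexicon of this kind can be pictured as a phrase-to-concept lookup. The sketch below is a toy version under stated assumptions: it uses two entries only, plain substring matching, and placeholder concept IDs that are not real SNOMED-CT or ICD-10 codes. The real lexicon's structure and matching strategy are not described here, so treat every name as hypothetical.

```python
# Hypothetical lexicon entries. The IDs are placeholders,
# NOT real SNOMED-CT or ICD-10 codes.
LEXICON = {
    # "burning in the stomach" (phrase from the case above)
    "حرقان في المعدة": {
        "snomed": "SNOMED-PLACEHOLDER-1",
        "icd10": "ICD10-PLACEHOLDER-1",
    },
    # "shortness of breath"
    "ضيق في التنفس": {
        "snomed": "SNOMED-PLACEHOLDER-2",
        "icd10": "ICD10-PLACEHOLDER-2",
    },
}


def match_concepts(utterance: str) -> list[dict]:
    """Return the mapping for every lexicon phrase found in the utterance."""
    return [
        {"phrase": phrase, **codes}
        for phrase, codes in LEXICON.items()
        if phrase in utterance
    ]


# The patient's own words: "mild burning in the stomach".
print(match_concepts("حرقان في المعدة خفيف"))
```

Real dialect text needs more than substring matching (spelling variants, diacritics, code-switching), which is presumably why the lexicon carries so many variants per concept.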

A second large open-weight model, Gemma 4 27B-IT, deployed via Vertex AI Model Garden, serves as a shadow fallback reviewer — what we think of as the “Chief Resident” check on the primary extraction.

The deterministic rules engine

The actual triage decision is made by a deterministic Python rules engine implementing ESI v5 Decision Points A through D and NEWS2 thresholds, encoded from the AHRQ ESI v5 Handbook and stored in a local SQLite database. This is what runs when the internet goes down. The rules do not change. They cannot be overridden downward by any AI component. A NEWS2 score of 7 or above triggers ESI 1 unconditionally. The safety floor is hard-coded.
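The safety floor is easiest to understand as code. The sketch below scores NEWS2 using the bands of the published NEWS2 chart (SpO2 scale 1) as I have transcribed them, then applies the floor described above. It is an illustration, not the SAFE-Triage engine: the function names are hypothetical, the thresholds live in Python lists rather than SQLite, and any real use would need the bands verified against the official chart.

```python
INF = float("inf")

# Each list holds (inclusive upper bound, score) pairs in ascending order.
RESP = [(8, 3), (11, 1), (20, 0), (24, 2), (INF, 3)]
SPO2 = [(91, 3), (93, 2), (95, 1), (INF, 0)]  # SpO2 scale 1
SBP = [(90, 3), (100, 2), (110, 1), (219, 0), (INF, 3)]
PULSE = [(40, 3), (50, 1), (90, 0), (110, 1), (130, 2), (INF, 3)]
TEMP = [(35.0, 3), (36.0, 1), (38.0, 0), (39.0, 1), (INF, 2)]


def band(value, bands):
    """Return the score of the first band whose upper bound covers value."""
    for upper, score in bands:
        if value <= upper:
            return score
    raise ValueError("no band matched")


def news2(resp_rate, spo2, on_oxygen, systolic_bp, pulse, alert, temp_c):
    """Aggregate NEWS2 score from seven measured parameters."""
    return (
        band(resp_rate, RESP)
        + band(spo2, SPO2)
        + (2 if on_oxygen else 0)
        + band(systolic_bp, SBP)
        + band(pulse, PULSE)
        + (0 if alert else 3)
        + band(temp_c, TEMP)
    )


def apply_safety_floor(rule_esi: int, news2_score: int) -> int:
    """NEWS2 of 7 or above forces ESI 1; otherwise keep the rules-engine level."""
    return 1 if news2_score >= 7 else rule_esi


# Hypothetical vitals for the heartburn patient: everything in the normal bands.
score = news2(16, 97, False, 128, 84, True, 36.8)
print(score, apply_safety_floor(4, score))
```

With these illustrative vitals the score is zero and the floor does nothing, which is exactly the problem the opening case describes: the floor catches physiological deterioration, not atypical presentations. That gap is what the other layers exist to cover.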

The asynchronous QA layer

MedGemma 4B-IT performs asynchronous quality-assurance review — it flags atypical patterns for human attention. Critically, it does not modify triage acuity. It does not have a decisional role. It is an attention-direction system: it surfaces cases that the deterministic layer classified as lower acuity but which carry clinical patterns worth a second look.
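One way to enforce "flags, never decisions" in code is to make the decision object immutable and have the reviewer return a separate list. This is a sketch under assumptions: `TriageDecision` and `qa_review` are hypothetical names, and treating ESI 3 to 5 as "lower acuity" is my simplification, not a documented SAFE-Triage threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the QA layer cannot mutate the decision
class TriageDecision:
    patient_id: str
    esi_level: int


def qa_review(decision: TriageDecision, atypical_patterns: list[str]) -> list[str]:
    """Return attention flags for a human; the decision itself is untouched."""
    # Assumption: only lower-acuity decisions (ESI 3-5) need a second look.
    if decision.esi_level >= 3 and atypical_patterns:
        return [f"review {decision.patient_id}: {p}" for p in atypical_patterns]
    return []


decision = TriageDecision(patient_id="demo-1", esi_level=4)
flags = qa_review(decision, ["diabetic with GI-type complaint"])
print(flags, decision.esi_level)
```

The reviewer's only output channel is the flag list, so even a badly wrong model can at worst waste a clinician's attention.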

In testing against 17 critical or borderline KTAS cases, MedGemma flagged 12 of 17 for further review (71%). Gemma 4 27B-IT resolved 6 of 17 in its shadow-reviewer role with no regressions. These are trial-pipeline findings, not live-system claims — but they show the layered architecture performing exactly as intended.

The physician confirmation gate

Every triage decision is confirmed by a clinician before it affects patient flow. The AI can escalate acuity. It cannot lower it. And the physician can override anything.
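The three rules in that paragraph fit in a few lines. The function below is a hypothetical sketch of the gate, not the project's actual code, and how an AI escalation is wired into the pipeline is an assumption. Remember that a lower ESI number means higher acuity, so "escalate" means taking the smaller number.

```python
from typing import Optional


def resolve_acuity(
    rule_esi: int,
    ai_suggested_esi: Optional[int] = None,
    physician_esi: Optional[int] = None,
) -> int:
    """Combine the layers. Lower ESI number means higher acuity."""
    level = rule_esi
    if ai_suggested_esi is not None:
        # AI may escalate (smaller number) but can never lower the level.
        level = min(level, ai_suggested_esi)
    if physician_esi is not None:
        # The confirming clinician has the final word in either direction.
        level = physician_esi
    return level


print(resolve_acuity(4, ai_suggested_esi=2))  # AI escalation is accepted
print(resolve_acuity(2, ai_suggested_esi=4))  # AI downgrade is ignored
print(resolve_acuity(2, physician_esi=3))     # physician override wins
```

In this sketch, a result produced without `physician_esi` should be read as provisional, since nothing reaches patient flow until a clinician confirms it.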

What the Benchmark Shows

On the MIMIC-IV-Ext Triage Instruction Corpus (MIETIC, n=36), an expert-validated, ESI-aligned benchmark, SAFE-Triage achieved:
– 35/36 exact ESI agreement (97.2%)
– 36/36 within-one-level agreement (100%)
– 0/36 critical under-triage

The single discordant case was safe over-triage: the system predicted ESI 2 for a case the expert labelled ESI 3. Higher acuity, not lower. The system erred on the side of caution.
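For readers who want to see how these three numbers relate, here is the arithmetic as a sketch. The label lists are synthetic placeholders shaped only to match the reported counts (35 exact matches plus one case predicted ESI 2 against an expert ESI 3); they are not the MIETIC case distribution. The definition of critical under-triage used here (expert ESI 1 or 2, system ESI 3 or lower acuity) is my assumption for illustration.

```python
def agreement_metrics(predicted, expert):
    """Exact, within-one-level and critical under-triage rates."""
    pairs = list(zip(predicted, expert))
    n = len(pairs)
    exact = sum(p == e for p, e in pairs)
    within_one = sum(abs(p - e) <= 1 for p, e in pairs)
    # Assumed definition: expert says ESI 1-2 but the system assigns ESI 3-5.
    critical_under = sum(e <= 2 and p >= 3 for p, e in pairs)
    return {
        "exact": exact / n,
        "within_one": within_one / n,
        "critical_under_triage": critical_under / n,
    }


# Synthetic placeholder labels, NOT the real benchmark data.
expert = [3] * 36
predicted = [3] * 35 + [2]  # the one discordant case: over-triage by one level

metrics = agreement_metrics(predicted, expert)
print({name: round(value * 100, 1) for name, value in metrics.items()})
```

Note that the one disagreement moves toward higher acuity, so it costs exact agreement but cannot count as under-triage under any reasonable definition.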

The Arabic mirror of the same benchmark produced an identical confusion matrix.

For comparison: published studies of ESI-trained nurses on standard scenarios show exact agreement of 59.2% (Mistry et al., 2018) and 59.6% (Jordi et al., 2015). SAFE-Triage’s constrained-AI design appears to reduce the variability long documented in scenario-based human triage, though this is an indirect comparison, not a head-to-head trial.

The KTAS cross-protocol stress test (1,262 cases) yielded 37.8% exact agreement, 81.6% within-one-level, and 1.3% critical under-triage. This is expected: KTAS and ESI are different protocols. The point of this test was to measure robustness under protocol mismatch, not to validate performance. The within-one rate holding above 80% under a completely different triage standard is a meaningful signal.

These are retrospective benchmark results. SAFE-Triage has not been deployed in clinical use. A prospective validation study in an Egyptian emergency department is the necessary next step.

The Egyptian UHI Connection

Egypt is mid-transition to Universal Health Insurance — a system that will, for the first time, require standardised, auditable documentation of clinical decisions across Egyptian hospitals. SAFE-Triage produces ESI-graded, SNOMED-coded, ICD-10-assigned, timestamped triage records, with BigQuery audit logging designed for GAHAR ICD.03 alignment. This is not incidental to the Egyptian healthcare context. It is what the UHI transition demands.
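As a rough picture of what such a record could contain, here is a minimal sketch. The field names are hypothetical, the codes are placeholders, and nothing here reflects the actual BigQuery schema or the specific requirements of GAHAR ICD.03. It only shows the shape: one timestamped, coded, attributable entry per confirmed decision.

```python
import json
from datetime import datetime, timezone


def build_audit_record(esi_level, snomed_code, icd10_code, confirmed_by):
    """Assemble one audit entry for a clinician-confirmed triage decision."""
    return {
        # Timezone-aware UTC timestamp in ISO 8601 format.
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "esi_level": esi_level,
        "snomed_ct": snomed_code,
        "icd10": icd10_code,
        "confirmed_by": confirmed_by,
    }


record = build_audit_record(
    esi_level=2,
    snomed_code="SNOMED-PLACEHOLDER",  # placeholder, not a real code
    icd10_code="ICD10-PLACEHOLDER",    # placeholder, not a real code
    confirmed_by="clinician-001",      # hypothetical identifier
)
print(json.dumps(record, indent=2))
```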

How We Got to Harvard

I am a physician. I am also a freelance medical writer and AI consultant on Upwork — nights and weekends. That freelance income funded the API and infrastructure costs for SAFE-Triage. Not a grant. Not investor money. Writing.

The Harvard Health Systems Innovation Lab Hackathon 2026 Cairo hub was hosted at the American University in Cairo, in partnership with Orange Egypt. My team was selected and accepted. The Harvard HSIL program brings the health systems innovation competition directly into the regional contexts where these problems live — and the Cairo hub meant competing as an Egyptian team, solving an Egyptian problem, in front of people who understand the Egyptian healthcare system.

Getting selected was validation that the problem framing was right.

The project has since been submitted to the Triagegeist Competition and the MedGemma Competition (Google, 2026). The live demonstration is accessible at safe-triage-ai.web.app. The source code is at github.com/DrAhmed7887/safe-triage-project.

Why the Architecture Matters

The contribution SAFE-Triage claims is not a performance number. The performance numbers are real, but they come from 36 benchmark cases and deserve appropriate humility. The contribution is the architectural pattern itself: a constrained-AI design in which generative AI handles what it is good at — language understanding, dialect coverage, feature extraction — while deterministic rules retain the safety-critical decisional authority.

That pattern is transferable. Any clinical setting considering LLM-assisted decision support faces the same tension: the LLM is powerful, but putting it in a decisional position creates risks that are not fully manageable yet. The answer is not to avoid AI. The answer is to be precise about what AI does and what rules do.

AI Extracts. Rules Decide. Humans Confirm.

The 55-year-old diabetic man who says “mild heartburn” in Arabic deserves the same chance at having his heart attack caught as any patient anywhere. That is what this was built for.

Dr. Ahmed Zayed, MBBCh is a physician and healthcare AI builder. SAFE-Triage is a research-stage prototype. It has not been deployed in clinical use. The benchmark data presented are retrospective results, not clinical outcomes. Live demo: safe-triage-ai.web.app. Source: github.com/DrAhmed7887/safe-triage-project. Accepted and selected for the Harvard Health Systems Innovation Lab Hackathon 2026 (Cairo hub, AUC / Orange Egypt).

Sources

Masanneck L et al. Triage Performance Across LLMs, ChatGPT, and Untrained Doctors. J Med Internet Res. 2024;26:e53297. PMID: 38875696.
Wang C et al. Patient Triage and Guidance in EDs Using LLMs. J Med Internet Res. 2025;27:e71613. PMID: 40374171.
Yi N et al. Effects of AI on ED Triage: Systematic Review. J Nurs Scholarsh. 2025. PMID: 39262027.
Preiksaitis C et al. LLMs in Emergency Medicine: Scoping Review. JMIR Med Inform. 2024. PMID: 38728687.
Hoffmann JA et al. Overtriage and Undertriage in EDs. JAMA Netw Open. 2026. PMID: 41874504.
Olawade DB et al. Human in the Loop AI in Healthcare. Int J Med Inform. 2026. PMID: 41740273.
Zachariasse JM et al. Performance of Triage Systems: Systematic Review. BMJ Open. 2019. PMID: 31142524.
Mistry B et al. Multicenter Assessment of ESI Reliability. Ann Emerg Med. 2018. [From SAFE-Triage Academic Review Brief v1.2]
Jordi K et al. ESI Accuracy in Swiss Hospitals. Swiss Med Wkly. 2015. [From SAFE-Triage Academic Review Brief v1.2]

Try SAFE-Triage

Live demo and open source code: see the layered architecture in action.

Live demo: safe-triage-ai.web.app
Source code: github.com/DrAhmed7887/safe-triage-project

Dr. Ahmed Zayed, MD

Licensed physician and clinical AI specialist. Founder and Editor-in-Chief of ZayedMD, a physician-led medical publication covering clinical AI, neurology, metabolic health, and evidence-based patient guidance.

