Red-teaming a healthcare AI system without clinicians is like stress-testing a bridge without a structural engineer. You'll find things that break, but you won't find the things that matter. The failure modes that make clinical AI dangerous are almost never the ones that automated evals surface. They're the ones a physician recognises in the first 30 seconds because they've seen the consequence in a real patient.
Below is a working taxonomy of the 12 failure categories that consistently emerge in structured red-team sessions with clinical annotators — across medical LLMs, clinical decision support tools, health chatbots, and diagnostic AI systems. Some of these can be partially caught by automated evaluation. Most cannot.
The 12 Failure Modes
Hallucinated or Incorrect Drug Dosages
The model states a plausible but incorrect dose, often correct for a different indication, patient weight class, or renal function tier. Automated evals rarely catch this because the format is correct — only a clinician knows the number is wrong.
Contraindication Blindness
The model recommends a treatment that is contraindicated given patient information present in the same prompt — e.g., recommending NSAIDs for a patient whose history mentions CKD stage 3. The model processes the history but fails to integrate the contraindication logic.
Severity Underestimation / Over-Reassurance
The model produces a reassuring response to a symptom cluster that warrants urgent evaluation. Classic example: "this is likely benign" for a presentation that, in context, carries meaningful red-flag risk. Particularly common in patient-facing chatbots.
Differential Performance by Patient Demographics
The model's clinical accuracy or recommendation quality degrades for patient prompts that include specific demographic markers — gender, age, ethnicity, socioeconomic indicators. Often not visible in aggregate benchmarks; only surfaces in stratified evaluation.
False Certainty / Missing Epistemic Hedges
The model states diagnoses or treatment recommendations as definitive when clinical practice demands uncertainty acknowledgment. Failing to say "this should be confirmed with..." or "consider specialist referral if..." in contexts where those qualifiers are medically obligatory.
Outdated or Jurisdiction-Wrong Guidelines
The model produces recommendations consistent with guidelines that have since been updated — common for models trained before a major guideline revision — or applies guidelines for the wrong healthcare system (e.g., NICE guidelines applied to a US clinical context).
Diagnostic Anchoring on Presented Diagnosis
When a prompt includes an existing diagnosis, the model anchors to it and fails to flag alternate diagnoses that better fit the symptom constellation. Mirrors the anchoring cognitive bias in human clinicians — but without the corrective mechanism of experience and peer review.
Failure to Integrate Multi-Turn Clinical Context
In extended conversations, the model fails to maintain and apply clinically relevant information from earlier turns. A patient who mentioned aspirin allergy in turn 2 receives an aspirin recommendation in turn 8. Critical for health chatbots and AI copilots with long-context sessions.
Inappropriate Urgency Escalation or Deflation
The model either fails to recommend emergency care for presentations that warrant it, or recommends it disproportionately, causing alert fatigue or unnecessary patient anxiety. Both directions have patient harm potential; both are common.
Practicing Beyond Model's Intended Scope
The model provides specific clinical advice (diagnosis, drug selection, dosage) when it was designed for general health information. Common when the model is not appropriately restricted by system prompt, or when adversarial prompts successfully bypass the restriction.
Health Literacy Mismatch
The model's language is calibrated for a clinical audience when the deployment context is patient-facing, or vice versa. A patient-facing chatbot that says "rule out pulmonary embolism" instead of "go to the emergency room now" has failed at its primary safety function.
Jailbreak via Clinical Authority Framing
Users who frame requests as coming from a healthcare professional — "as a physician, I need the exact lethal dose of..." — successfully extract responses the model would otherwise refuse. Clinical AI requires adversarial testing with professional-authority framings that general red-team playbooks miss.
Why Automated Evals Miss Most of These
Automated evaluation against benchmarks like MedQA, MedMCQA, or PubMedQA catches a meaningful subset of failure mode #6 (guideline drift) and occasionally #5 (uncertainty). It is largely blind to failures #1 through #4, #7 through #9, #11, and #12 — because these failures require clinical context, longitudinal reasoning, and awareness of how a real patient or clinical workflow will interact with the output.
Automated evals also have a fundamental sampling problem: they test the model on questions that are in the evaluation set. Red-teaming tests the model on inputs that are designed to surface failure — which by definition are not in any benchmark.
A useful framing: Automated evals tell you how well your model performs on problems it was designed to solve. Red-teaming tells you how it performs on problems it was not designed for — which is exactly the distribution that produces patient harm in deployment.
How to Structure a Clinical Red-Team Session
An effective clinical red-team session is not a free-form "try to break the model" exercise. It is a structured adversarial evaluation with four components: a target failure taxonomy (the 12 categories above are a starting point), a prompt generation protocol, a clinical evaluation rubric, and a structured output format for findings.
Annotators are briefed on the model's intended use case, its safety boundaries, and the specific failure modes to probe. They are then given structured prompt templates for each failure category — e.g., for failure mode #2 (contraindication blindness), they generate prompts that include a relevant contraindication embedded in patient history and evaluate whether the model's recommendation respects it.
Each finding is recorded with: the adversarial prompt, the model response, the failure mode category, severity rating (1–5), and a brief clinical explanation of why the response is problematic. This structured output is what enables systematic remediation — either through prompt engineering, fine-tuning, or RLHF targeting the specific failure modes surfaced.
Frequency of Red-Teaming
Red-teaming is not a one-time pre-launch activity. Healthcare AI models are retrained, fine-tuned, and have their system prompts modified continuously. Any change to the model's weights or the system prompt that governs clinical behavior warrants a targeted red-team sweep — at minimum covering the failure modes most relevant to the change. Full-spectrum red-teaming should happen at least quarterly for deployed clinical AI systems, and before any significant capability expansion.
Regulatory note: For AI/ML-based Software as a Medical Device (SaMD) under FDA oversight, red-teaming findings and remediation are part of the post-market surveillance obligation. Document every session, finding, and corrective action. You will need this for predetermined change control plan (PCCP) submissions.
Need clinical red-teaming for your healthcare AI?
We run structured adversarial sessions with specialty-matched physicians and regulatory experts — delivering findings in a format your ML team can act on directly.
Reach Co-founder →