Human Evidence for FDA SaMD Submissions: What AI Teams Need to Collect Before They Apply

Most AI teams building towards FDA clearance for a Software as a Medical Device (SaMD) understand the technical requirements reasonably well: analytical validation, algorithm testing, software documentation. What consistently surprises them is how much of the submission package consists of human-generated evidence — evidence that cannot be produced by automated testing and that takes time to generate correctly.

This article covers the four categories of human evidence the FDA looks for in AI/ML SaMD submissions, what each category requires in practice, and how to generate them systematically rather than retrofitting them at submission time.

Disclaimer: This article reflects publicly available FDA guidance as of July 2025. It is not legal or regulatory advice. Engage qualified regulatory affairs counsel before any FDA submission. Regulatory frameworks evolve, and your specific device classification determines your specific requirements.

Why Human Evidence Is Central to AI SaMD Submissions

The FDA's framework for AI/ML-based SaMD, articulated in the 2021 AI/ML-Based SaMD Action Plan, the 2023 draft marketing submission recommendations for Predetermined Change Control Plans, and ongoing guidance updates, places significant weight on demonstrating that a device performs safely and effectively in the hands of its intended users, with real clinical data, in realistic conditions. Algorithmic performance on held-out test sets is necessary but not sufficient.

The FDA's concern is the gap between laboratory performance and real-world clinical performance — a gap that is particularly wide for AI systems because they can perform very differently when deployed in the complexity of actual clinical workflows versus the controlled conditions of a test dataset.

The Four Categories of Human Evidence

1. Clinical Validation Study

The core of most AI SaMD submissions is a clinical validation study that demonstrates the device achieves its intended clinical outcome — not just that the algorithm achieves a certain AUC on a validation set. For diagnostic AI, this typically means a reader study: a cohort of target-specialty clinicians evaluates patient cases with and without the AI's assistance, and the study measures whether the AI improves clinical decision-making.

Key design requirements the FDA will scrutinise:

Intended use population: The patient cohort must reflect the demographic and clinical characteristics of the intended use population — not a convenience sample skewed toward the training data distribution.
Intended use clinicians: Reader study participants must be representative of the clinicians who will actually use the device — not only academic specialists if the intended use is community hospital settings.
Reference standard: The ground truth used to evaluate the AI's outputs must be established by an appropriate clinical gold standard — independent expert panel adjudication, not the AI's own outputs used recursively.
Statistical powering: The study must be powered to detect the minimum clinically meaningful difference, not just any detectable difference.
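
To make the powering requirement concrete, the sketch below estimates how many reference-standard-positive cases per arm would be needed to detect a given improvement in reader sensitivity. It is a deliberate simplification: it uses the textbook normal-approximation formula for two independent proportions, whereas a real reader study is usually a multi-reader, multi-case design whose sample size also depends on reader variability and case correlation. The function name `cases_per_arm` and the example sensitivities are illustrative assumptions, not values taken from any FDA guidance.

```python
from math import ceil
from statistics import NormalDist


def cases_per_arm(p_unaided: float, p_aided: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate positive cases per arm for a two-sided test of two proportions.

    Assumes independent arms and a normal approximation; a real MRMC
    reader study needs a design-specific calculation from a statistician.
    """
    if p_unaided == p_aided:
        raise ValueError("The two proportions must differ.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_unaided * (1 - p_unaided) + p_aided * (1 - p_aided)
    effect = p_aided - p_unaided                   # minimum meaningful difference
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical sensitivities: 0.80 unaided, 0.88 or 0.84 with AI assistance.
    print(cases_per_arm(0.80, 0.88))
    print(cases_per_arm(0.80, 0.84))
```

Halving the difference you want to detect roughly quadruples the number of cases needed, which is why the minimum clinically meaningful difference should be agreed before the cohort is assembled.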

2. Usability Engineering Evidence (Human Factors)

FDA requires human factors (HF) evidence for AI SaMD that demonstrates the device's user interface does not introduce use errors that could lead to patient harm. For AI tools this is particularly important because users may over-rely on (automation bias) or under-utilise (alert fatigue) AI outputs in ways that create harm even when the algorithm is technically accurate.

A summative human factors study (which FDA guidance calls human factors validation testing) involves representative end users interacting with the device in a simulated use environment, completing a set of critical and essential tasks. Evaluators observe and record use errors, near-misses, and successful task completions. The FDA expects this study to be run on the final, production-equivalent design; earlier formative studies are where the interface is iterated before design freeze.

AI-specific HF considerations: For AI decision support tools, the critical tasks include how clinicians interpret AI outputs in the context of discordant cases — where the AI recommendation conflicts with the clinician's initial judgment. How the AI communicates uncertainty, confidence scores, and the basis for its recommendation directly affects the human factors outcome.

3. Real-World Performance Monitoring Data (Post-Market)

For AI/ML SaMD with a Predetermined Change Control Plan (PCCP) — which is the pathway for AI models that will be retrained or updated post-market — the FDA expects a defined real-world performance monitoring (RWPM) protocol as part of the submission. This protocol specifies: what performance metrics will be monitored, how frequently, what thresholds trigger a required response, and how adverse events related to AI performance will be captured and reported.

The human evidence component of RWPM includes: clinician feedback mechanisms built into the clinical workflow, structured adverse event capture linked to AI outputs, and periodic human expert review of flagged cases. These are not passive processes — they require active participation from clinical staff, which means clinical workflow integration must be planned from the design stage.
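
One way to keep a monitoring protocol specific is to write its metrics, thresholds, review cadence, and committed responses as data rather than prose. The following is a minimal sketch under stated assumptions: the `MetricRule` structure, the metric names, the threshold values, and the response descriptions are all hypothetical placeholders, and a real protocol would come from your regulatory and clinical teams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    metric: str             # name of the monitored metric
    floor: float            # value below which a response is required
    review_every_days: int  # how often the metric is reviewed
    response: str           # action the protocol commits to on a breach


def triggered_responses(rules, observed):
    """Return (metric, response) pairs for every rule whose floor is breached.

    A metric missing from `observed` is treated as a breach, since a gap
    in monitoring is itself something the protocol should surface.
    """
    actions = []
    for rule in rules:
        value = observed.get(rule.metric)
        if value is None or value < rule.floor:
            actions.append((rule.metric, rule.response))
    return actions


# Hypothetical protocol values, for illustration only.
RULES = [
    MetricRule("sensitivity", 0.85, 30, "expert panel review of flagged cases"),
    MetricRule("clinician_agreement_rate", 0.70, 30, "human factors follow-up"),
]

if __name__ == "__main__":
    latest = {"sensitivity": 0.84, "clinician_agreement_rate": 0.75}
    for metric, response in triggered_responses(RULES, latest):
        print(f"{metric}: {response}")
```

Writing the rules this way forces each one to name a threshold and a human response, which is the level of specificity the monitoring protocol is expected to reach.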

4. Clinical Expert Panel Review of Algorithm Outputs

Many submissions include a structured expert panel review of algorithm outputs on a representative sample of cases — particularly for use cases where the algorithm's outputs cannot be directly compared to a clear ground truth. This panel review serves multiple purposes: it establishes the reference standard for the validation study, provides qualitative evidence of the algorithm's clinical reasoning patterns, and surfaces failure modes that quantitative metrics miss.

Panel composition matters to the FDA. The panel should be independent of the developing organisation, should represent the relevant specialties and practice settings of the intended use population, and should operate under a structured protocol that includes inter-rater reliability measurement.
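
As an illustration of inter-rater reliability measurement, the sketch below computes Cohen's kappa for two panelists labelling the same cases. This is one common statistic, chosen here for brevity; it is an assumption that it suits your protocol, and panels with more than two raters often use a multi-rater statistic such as Fleiss' kappa instead. The labels in the usage example are made up.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of cases.")
    n = len(rater_a)
    # Proportion of cases on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels from two panelists on eight cases.
    a = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
    b = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
    print(cohens_kappa(a, b))
```

Raw percent agreement (75% in this example) overstates reliability when one label dominates, which is why a chance-corrected statistic is usually reported alongside it.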

Common Evidence Gaps at Submission Time

Based on FDA feedback letters and publicly available review documents, the most common human evidence deficiencies in AI SaMD submissions are:

Demographically limited clinical validation cohort — validation data skewed toward specific age groups, geographic regions, or disease severities that don't reflect the intended use population. The FDA increasingly requires stratified performance data across demographic subgroups.
Reader study clinicians unrepresentative of intended use — using academic specialists as readers when the device will be used in community settings, or vice versa.
Missing human factors evidence for AI-specific use errors — HF studies that test general device usability but don't specifically probe automation bias, over-reliance, and the handling of AI-clinician discordance.
Insufficient post-market surveillance plan detail — RWPM protocols that describe monitoring in principle but don't specify the human review mechanisms, frequency, or thresholds with enough specificity to be enforceable.
No prospective evidence of real-world clinical impact — submissions that demonstrate algorithmic accuracy but lack evidence that the AI actually changes clinical decisions in ways that improve patient outcomes.
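
The first gap above, stratified performance across subgroups, can be checked early with a simple tabulation. The sketch below reports sensitivity per subgroup with a Wilson score confidence interval, so that small subgroups show up as wide intervals rather than as misleadingly precise point estimates. The subgroup labels, the input layout, and the choice of the Wilson interval are illustrative assumptions; your statistical analysis plan may specify a different method.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive.")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def sensitivity_by_subgroup(cases):
    """cases: iterable of (subgroup, detected) for reference-standard-positive cases.

    Returns {subgroup: (sensitivity, (ci_low, ci_high))}.
    """
    tally = defaultdict(lambda: [0, 0])  # subgroup -> [detected, total]
    for subgroup, detected in cases:
        tally[subgroup][0] += int(detected)
        tally[subgroup][1] += 1
    return {g: (d / n, wilson_interval(d, n)) for g, (d, n) in tally.items()}


if __name__ == "__main__":
    # Hypothetical cohort: 50 positive cases aged 18-40, only 10 aged 65+.
    cases = ([("18-40", True)] * 45 + [("18-40", False)] * 5
             + [("65+", True)] * 8 + [("65+", False)] * 2)
    for group, (sens, (lo, hi)) in sensitivity_by_subgroup(cases).items():
        print(f"{group}: sensitivity={sens:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

A wide interval for an under-represented subgroup is an early signal that the validation cohort needs more cases from that group before submission.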

How to Generate This Evidence Systematically

The teams that navigate FDA submission most smoothly treat human evidence generation as a continuous process that begins during development, not as a documentation exercise that begins at pre-submission. A practical sequence:

Design the clinical validation study during algorithm development — before your test set is finalised. The validation cohort composition determines what performance data you'll have at submission.
Run formative human factors studies early — not just the summative study required for submission. Formative HF studies inform UI design choices that are expensive to change later.
Build the expert panel review into your development cadence — quarterly expert panel reviews of algorithm outputs generate the longitudinal evidence of algorithm behaviour that demonstrates stability and controlled performance across time.
Instrument for post-market monitoring from deployment day one — clinician feedback loops, structured adverse event capture, and performance dashboards are much easier to implement at initial deployment than to retrofit into an established clinical workflow.

Timeline reality check: A properly designed clinical validation study with appropriate statistical power takes 6–18 months to execute for most AI SaMD use cases. Teams that start this process at pre-submission find themselves 12+ months from clearance. Teams that start during development find themselves 3–6 months from clearance at pre-submission.

Building your FDA submission evidence package?

We run structured expert panel reviews, clinical reader studies, and human factors usability sessions — with documentation designed to meet FDA expectations for AI/ML SaMD submissions.
