Most healthcare AI teams that come to us have already run a failed RLHF attempt. The data exists — a few hundred preference pairs collected from a handful of physicians — but it's noisy, inconsistent, and the reward model trained on it performs worse than a simple supervised baseline. The root cause is almost always the same: the annotator brief.
What you tell a clinician before a preference-ranking session determines everything about the quality of data that comes out. Get it wrong and you're not collecting preference data — you're collecting a mix of recency bias, specialty opinion, formatting preference, and whatever the physician happened to be thinking about that day. Get it right, and a surprisingly small cohort of annotators can produce the signal that meaningfully improves clinical AI safety and utility.
Why Clinical RLHF Is Different
Standard RLHF annotation asks human raters to compare two model outputs and indicate which is better. This works reasonably well for general tasks where "better" is intuitive: more helpful, more factual, less harmful. Clinical AI introduces a problem: physicians disagree substantively, and each can be right, because the best answer depends on specialty, practice setting, and patient population.
A hospitalist reviewing an AI triage recommendation thinks very differently from an emergency physician. The hedged, detailed prognosis language an oncologist prefers is exactly what a primary care physician's patient cannot process. Neither is wrong. But if you pool their preferences without controlling for this variation, you produce a reward model that satisfies no one.
Key principle: Clinical RLHF is not about finding the single "correct" answer — it's about capturing coherent, internally consistent preference signals from well-defined annotator cohorts, then training separate reward models (or conditioning a single model) on each cohort's signal.
The Four Components of a Working Annotator Brief
1. Task Framing — What the Physician Is Actually Evaluating
The most common mistake is asking physicians to judge which output is "better" without defining better. Before any annotation begins, the brief must specify the exact evaluation lens:
- Safety lens: Which output is less likely to cause patient harm if followed verbatim?
- Clinical accuracy lens: Which output is more consistent with current evidence-based guidelines?
- Communication lens: Which output would a patient with moderate health literacy understand and act on correctly?
- Completeness lens: Which output omits fewer clinically relevant considerations?
Different use cases require different lenses. A clinical decision support model needs the safety and accuracy lenses. A patient-facing chatbot needs the communication lens. Running a single RLHF pass with no lens specification produces data that's a noisy average of all four, and useful for none of them.
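One way to make the lens explicit is to encode it in the annotation configuration rather than leaving it to the briefing call. The sketch below is illustrative only: the use-case keys, the `annotation_passes` helper, and the choice to run one pass per lens are assumptions, not a prescribed setup.

```python
from enum import Enum


class Lens(Enum):
    """Evaluation lenses from the brief; each value is the question shown to the annotator."""
    SAFETY = "Which output is less likely to cause patient harm if followed verbatim?"
    ACCURACY = "Which output is more consistent with current evidence-based guidelines?"
    COMMUNICATION = ("Which output would a patient with moderate health literacy "
                     "understand and act on correctly?")
    COMPLETENESS = "Which output omits fewer clinically relevant considerations?"


# Hypothetical use-case keys; the lens assignments follow the text above.
LENSES_BY_USE_CASE = {
    "clinical_decision_support": [Lens.SAFETY, Lens.ACCURACY],
    "patient_facing_chatbot": [Lens.COMMUNICATION],
}


def annotation_passes(use_case: str) -> list[dict]:
    """Return one annotation pass per lens (an assumption) so lenses are never blended."""
    if use_case not in LENSES_BY_USE_CASE:
        # Refuse to start annotation without an explicit lens.
        raise ValueError(f"No lens defined for use case {use_case!r}")
    return [
        {"use_case": use_case, "lens": lens.name.lower(), "question": lens.value}
        for lens in LENSES_BY_USE_CASE[use_case]
    ]
```

Failing loudly on an unknown use case means no pair can be collected before someone has decided what "better" means for it.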
2. Rubric Design — Anchoring Subjective Judgment
Even with a defined lens, physician preference is highly variable without anchoring examples. A rubric provides 3–5 concrete worked examples of what constitutes a clear preference (A much better than B), a weak preference (A slightly better than B), and a tie. Each worked example should be drawn from the actual domain of your model — not hypotheticals.
The rubric should also explicitly list disqualifying factors — outputs that are automatically marked as unsafe regardless of comparative quality. For clinical AI this typically includes: incorrect drug names, dosages outside standard ranges, categorical statements where uncertainty should be expressed, and any output that could discourage seeking in-person care for a serious symptom.
3. Calibration Tasks — Finding Outliers Before Production
Run every annotator through 10–15 "gold standard" preference pairs before they touch production data. These are pairs where expert consensus exists, verified by a senior clinical advisor, and the correct preference is unambiguous. Annotators who fail more than 3 of 15 calibration items (a pass rate below 80%) are either misunderstanding the task framing or bringing a specialty perspective that's mismatched to the target use case.
At Dritiva: We require a minimum calibration pass rate of 80% before an annotator enters production. Annotators who fail the threshold are either re-briefed or reassigned to a cohort better matched to their specialty. We never mix calibration-failed annotators into the same batch as calibration-passed annotators — the signal contamination is difficult to recover from.
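The calibration gate is simple to automate. Here is a minimal sketch, assuming each annotator's answers and the gold labels are stored as dictionaries keyed by pair ID. The function names and data shapes are hypothetical; only the 80% threshold comes from the text.

```python
def calibration_pass_rate(answers: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold-standard pairs where the annotator matched the consensus label."""
    if not gold:
        raise ValueError("gold set is empty")
    # A missing answer counts as a miss.
    correct = sum(1 for pair_id, label in gold.items() if answers.get(pair_id) == label)
    return correct / len(gold)


def triage_annotators(
    all_answers: dict[str, dict[str, str]],
    gold: dict[str, str],
    threshold: float = 0.80,
) -> tuple[list[str], list[str]]:
    """Split annotators into (passed, failed) so the two groups never share a batch."""
    passed, failed = [], []
    for annotator_id, answers in all_answers.items():
        rate = calibration_pass_rate(answers, gold)
        (passed if rate >= threshold else failed).append(annotator_id)
    return passed, failed
```

With 15 gold pairs, an annotator who matches 12 passes and one who matches 11 is routed to re-briefing or reassignment.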
4. Disagreement Handling — What to Do When Clinicians Conflict
Disagreement is not an error to be resolved — it is data. When two physicians from the same specialty, with the same evaluation lens, produce opposing preferences on the same pair, that pair is genuinely ambiguous for your model's use case. There are three valid strategies:
- Exclude and flag: Remove the pair from training data and add it to an "edge-case" set for qualitative review. Works when the disagreement rate is low (<15%).
- Majority vote with confidence weighting: Assign the majority preference but reduce the reward signal weight in proportion to the level of disagreement. Works when you have ≥5 annotators per pair.
- Specialty-stratified labels: Keep the pair in training with two labels — one per specialty cohort — and train a reward model that conditions on annotator specialty. Works best for multi-specialty clinical AI.
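The first two strategies can be combined in a few lines. This is a sketch under stated assumptions: the vote margin as the weight, and excluding pairs with fewer than five votes or an even split, are illustrative choices rather than a fixed rule, and `resolve_pair` is a hypothetical helper.

```python
from collections import Counter


def resolve_pair(votes: list[str], min_annotators: int = 5) -> dict:
    """Majority vote with a confidence weight; pairs with no usable majority are flagged.

    votes: one "A" or "B" per annotator, all from the same cohort and lens.
    Any other value (for example a tie rating) is ignored here.
    """
    counts = Counter(votes)
    a, b = counts["A"], counts["B"]
    total = a + b
    if total < min_annotators or a == b:
        # Too few votes or an even split: route to the edge-case set.
        return {"action": "exclude_and_flag", "label": None, "weight": 0.0}
    label = "A" if a > b else "B"
    # Weight is the vote margin: 1.0 when unanimous, smaller as disagreement grows.
    weight = abs(a - b) / total
    return {"action": "keep", "label": label, "weight": weight}
```

A 4-to-1 split keeps the pair with the majority label at a weight of 0.6, while a unanimous pair keeps full weight.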
Data Format — Output That's Actually Usable
RLHF preference data for clinical AI should be structured to support DPO (Direct Preference Optimization), a widely used training method for preference fine-tuning. A minimal DPO-compatible record contains the prompt, the preferred output, the rejected output, and the annotation metadata.
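As an illustration, one such record could be built and serialized to a JSONL line as below. The `prompt` / `chosen` / `rejected` keys follow a common convention in DPO tooling. The metadata field names and all values are placeholder assumptions, not a fixed schema.

```python
import json

record = {
    # Core DPO triple: the prompt plus the preferred and dispreferred outputs.
    "prompt": "<clinical prompt shown to the model>",
    "chosen": "<output the annotator preferred>",
    "rejected": "<output the annotator did not prefer>",
    # Metadata: field names and values are illustrative assumptions.
    "metadata": {
        "annotator_specialty": "emergency_medicine",
        "lens": "safety",
        "preference_strength": "strong",  # e.g. "strong" or "weak"
        "calibration_passed": True,
    },
}

# One JSON object per line is what makes the file JSONL.
jsonl_line = json.dumps(record, ensure_ascii=False)
```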
The metadata fields — specialty, lens, preference strength, calibration status — are not optional. They're what allow you to filter, stratify, and diagnose reward model failures after training. Raw preference pairs with no metadata are a liability: you cannot tell which signal is trustworthy and which is noise.
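To show why the metadata matters, here is a rough sketch of filtering and stratifying a JSONL file before training. It assumes each record carries a `metadata` object with `annotator_specialty`, `lens`, and `calibration_passed` keys. Those names and the `stratify` helper are hypothetical.

```python
import json
from collections import defaultdict


def stratify(jsonl_lines, lens: str) -> dict[str, list[dict]]:
    """Group calibration-passed records for one lens by annotator specialty."""
    by_specialty = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        meta = rec.get("metadata", {})
        # Records without trustworthy metadata are dropped, not guessed at.
        if meta.get("calibration_passed") is True and meta.get("lens") == lens:
            by_specialty[meta.get("annotator_specialty", "unknown")].append(rec)
    return dict(by_specialty)
```

The same grouping is what makes it possible to trace a reward model failure back to a specific cohort or lens after training.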
Practical Scale: How Many Pairs Do You Need?
The honest answer is: fewer than you think for a first useful model, and more than you think for a safe one. For an initial domain fine-tune with DPO, 500–2,000 high-quality specialty-matched pairs typically outperform 10,000 mixed-quality pairs. For a model approaching clinical deployment — where safety is load-bearing — plan for 5,000–15,000 pairs across multiple specialties, with adversarial and edge-case pairs deliberately oversampled.
Common mistake: Scaling preference data collection before fixing the annotator brief. Every pair collected with a broken brief is not neutral — it actively trains the reward model in the wrong direction. Fix the brief first, run a 50-pair pilot, evaluate reward model outputs, then scale.
What This Looks Like in Practice
A well-structured clinical RLHF engagement typically follows this sequence: a two-hour brief and calibration session per annotator cohort → a 50-pair pilot collection → reward model training on the pilot data → qualitative evaluation of model outputs by the clinical advisor → adjustment of the brief or rubric if needed → full-scale collection. The pilot step is where most problems surface and where fixing them costs the least.
The goal is not to collect data as fast as possible. It's to collect the minimum amount of data that produces a reward model whose outputs a senior clinician would not be embarrassed to put in front of a patient.
Ready to build your clinical RLHF pipeline?
We scope annotator cohorts, design rubrics, and run calibration — then deliver DPO-ready JSONL within 48 hours of project start.