Building a Medical SFT Dataset: Quality vs. Volume

The intuition that more data is always better dies hard — even in the face of substantial evidence to the contrary. For medical supervised fine-tuning (SFT), the relationship between dataset size and model quality is non-linear in ways that have significant practical implications for how you invest your annotation budget.

The short version: 1,000 carefully constructed, physician-written instruction-response pairs routinely outperform 100,000 crowdsourced or model-generated pairs for clinical instruction fine-tuning. This is not a niche finding — it reflects something fundamental about how SFT works on top of strong foundation models, and about what "quality" actually means in a medical annotation context.

Why Volume Fails in Medical SFT

SFT on a capable foundation model is essentially teaching the model to imitate a target distribution of responses. If that target distribution is noisy — inconsistent formatting, factual errors, hedging in the wrong places, missing clinical nuance — the model learns to imitate the noise. Scaling a noisy dataset makes a noisier model, not a better one.

Medical crowdsourcing compounds this problem. General-purpose annotation platforms have no mechanism to verify clinical credentials. An annotator who answers medical questions quickly and consistently earns a good platform reputation — regardless of whether their answers are clinically accurate. The result is a dataset that is internally consistent but medically unreliable. The model trains on it, produces confident-sounding outputs, and the confidence is part of what it learned.

The confident-but-wrong problem: A model trained on a large, low-quality medical dataset will produce responses that sound authoritative — because confident tone was overrepresented in the training data. This is worse than a model that correctly hedges its uncertainty. Confidence miscalibration is one of the hardest failure modes to fix post-training.

What Makes a Medical SFT Example High Quality

A high-quality medical SFT example has five properties, and an example is only valuable when all five are present:

Factual accuracy: The response is consistent with current evidence-based guidelines for the relevant specialty and patient population.
Appropriate uncertainty: The response hedges correctly — not excessively (which makes it useless) and not insufficiently (which makes it dangerous). A physician who knows when to say "this requires clinical judgment" is producing better training data than one who always gives a definitive answer.
Task alignment: The response answers the specific question asked, not a slightly easier version of it. Instruction following is a learnable behaviour; teaching it requires examples where the response actually follows the instruction.
Communication calibration: The response is pitched at the right level for the intended user — clinical language for a clinician-facing tool, plain language for a patient-facing one. Mixing registers within a dataset is a common source of deployment-time failure.
Safe framing: The response includes appropriate safety qualifiers, referral signals, and disclaimers where a clinician would include them — not formulaic legal boilerplate, but genuine clinical safety language.

A crowdsourced annotator can produce examples that meet one or two of these criteria reliably. A credential-verified, task-calibrated physician — given a well-designed task brief — routinely meets all five.
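One way to make the "all five" rule enforceable is to record a reviewer verdict per criterion and reject any example that misses one. The following is a minimal sketch; the `QualityReview` class and its field names are hypothetical, chosen here to mirror the five criteria above, and are not a standard schema.

```python
from dataclasses import dataclass, fields


@dataclass
class QualityReview:
    """One reviewer's verdict on a single example (hypothetical schema)."""
    factual_accuracy: bool
    appropriate_uncertainty: bool
    task_alignment: bool
    communication_calibration: bool
    safe_framing: bool

    def failed_criteria(self) -> list[str]:
        # Names of the criteria the reviewer marked as not met.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passes(self) -> bool:
        # An example is kept only when all five criteria are met.
        return not self.failed_criteria()


# Example: clinically sound, but pitched at the wrong reading level.
review = QualityReview(
    factual_accuracy=True,
    appropriate_uncertainty=True,
    task_alignment=True,
    communication_calibration=False,
    safe_framing=True,
)
print(review.passes(), review.failed_criteria())
```

Recording which criterion failed, rather than a single pass/fail flag, makes it easier to see whether rejections cluster around one weakness in the task brief.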

How to Spec the Task for Clinical SFT Annotation

The task specification is where most medical SFT projects go wrong. A poorly specified task produces inconsistent data even from excellent annotators. The spec must define:

The instruction distribution

What types of questions will the model receive in deployment? A clinical decision support tool receives different instructions than a patient-facing chatbot or a medical coding assistant. The instruction distribution in your SFT dataset should mirror — or intentionally oversample the harder end of — your deployment distribution. Generic medical question-answering datasets are not a substitute for deployment-matched data.
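A quick check for deployment mismatch is to compare the share of each instruction category in the dataset against its expected share in deployment. The sketch below assumes each instruction has already been labelled with a category; the category names, the shares, and the 5-point tolerance are illustrative assumptions, not recommended values. Oversampling is allowed, so only under-represented categories are flagged.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share of the labelled instructions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def distribution_gaps(dataset_labels, deployment_shares, tolerance=0.05):
    """Map each under-represented category to its shortfall.

    A category is flagged when its dataset share is below its deployment
    share by more than `tolerance`. Over-represented categories are ignored.
    """
    dataset_shares = category_shares(dataset_labels)
    gaps = {}
    for cat, target in deployment_shares.items():
        actual = dataset_shares.get(cat, 0.0)
        if target - actual > tolerance:
            gaps[cat] = round(target - actual, 3)
    return gaps


# Hypothetical numbers for illustration only.
deployment = {"drug_interaction": 0.40, "triage": 0.35, "dosing": 0.25}
dataset = ["drug_interaction"] * 60 + ["triage"] * 15 + ["dosing"] * 25
print(distribution_gaps(dataset, deployment))
```

In this made-up example, triage questions make up 35% of expected traffic but only 15% of the dataset, so they are the only category flagged.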

The response format

Length, structure (prose, bulleted list, or step-by-step), use of headers, inclusion of references, and reading level all need to be specified and applied consistently. Format inconsistency in training data produces format inconsistency at inference time, which is a reliability problem even when the underlying facts are correct.
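Format rules are the easiest part of a task brief to check automatically. The sketch below assumes a hypothetical `FormatSpec` with a word-count range, a prose-only rule, and a required `References:` section; a real brief would define its own rules, and a check like this only catches mechanical violations, not clinical ones.

```python
import re
from dataclasses import dataclass


@dataclass
class FormatSpec:
    """Hypothetical format rules taken from a task brief."""
    min_words: int = 40
    max_words: int = 250
    allow_bullets: bool = False
    require_reference_section: bool = True


def format_violations(response: str, spec: FormatSpec) -> list[str]:
    """Return human-readable format violations (empty list if none)."""
    problems = []
    n_words = len(response.split())
    if n_words < spec.min_words:
        problems.append(f"too short: {n_words} words")
    if n_words > spec.max_words:
        problems.append(f"too long: {n_words} words")
    # Lines starting with "-", "*" or "•" are treated as bullets.
    if not spec.allow_bullets and re.search(r"^\s*[-*•]\s+", response, re.MULTILINE):
        problems.append("bullets used in a prose-only format")
    if spec.require_reference_section and "References:" not in response:
        problems.append("missing 'References:' section")
    return problems


spec = FormatSpec(min_words=5, max_words=50)
sample = "- Start with the lowest effective dose.\n- Recheck renal function in one week."
print(format_violations(sample, spec))
```

Running a check like this before examples reach clinical review keeps physician time focused on content rather than layout.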

Edge case handling

What should the model say when it doesn't know? When the question is outside its intended scope? When the patient presentation is ambiguous? These edge cases need explicit examples in the training data, with physician-written responses that model the correct behaviour — not just the easy cases where there is a clear correct answer.
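Edge-case coverage can be tracked by tagging examples and checking that every required tag appears at least a minimum number of times. The following sketch assumes each example is a dictionary with an `edge_case_tags` list; that field and the three tag strings are hypothetical names for the three situations described above.

```python
from collections import Counter

# Hypothetical tag names for the three edge cases discussed in the text.
REQUIRED_EDGE_CASES = ("unknown_answer", "out_of_scope", "ambiguous_presentation")


def missing_edge_cases(examples, min_per_tag=1):
    """Return required edge-case tags with fewer than `min_per_tag` examples."""
    counts = Counter(
        tag for ex in examples for tag in ex.get("edge_case_tags", [])
    )
    return [tag for tag in REQUIRED_EDGE_CASES if counts[tag] < min_per_tag]


examples = [
    {"id": 1, "edge_case_tags": ["out_of_scope"]},
    {"id": 2, "edge_case_tags": []},
    {"id": 3, "edge_case_tags": ["unknown_answer", "out_of_scope"]},
]
print(missing_edge_cases(examples))
```

Here no example covers an ambiguous presentation, so that tag is reported as missing.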

The Right Scale for Clinical SFT

For a domain-specific fine-tune on top of a capable foundation model (anything in the GPT-4 / Claude / Gemini class), the practical ranges are:

Proof of concept: 200–500 high-quality physician-written examples. Enough to demonstrate that the model can be steered toward a target clinical communication style.
Production fine-tune (single specialty): 1,000–3,000 examples covering the full range of the target task distribution, with deliberate oversampling of high-stakes and edge-case scenarios.
Multi-specialty or broad clinical AI: 5,000–15,000 examples, stratified by specialty, task type, clinical consequence level, and demographic distribution of the target patient population.
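Stratification with deliberate oversampling can be planned before annotation starts. The sketch below splits a target example count across strata in proportion to deployment share multiplied by an oversampling factor. The stratum names, shares, and factors are hypothetical illustrations, and the weighting scheme is one simple option rather than an established method.

```python
def allocate_examples(total, strata):
    """Split `total` examples across strata.

    `strata` maps a stratum name to (deployment_share, oversample_factor).
    Each stratum's weight is share * factor, normalised to sum to 1.
    """
    weights = {name: share * factor for name, (share, factor) in strata.items()}
    norm = sum(weights.values())
    plan = {name: int(total * w / norm) for name, w in weights.items()}
    # Examples lost to rounding down go to the highest-weight stratum.
    largest = max(weights, key=weights.get)
    plan[largest] += total - sum(plan.values())
    return plan


# Hypothetical strata: routine questions dominate deployment traffic,
# but high-stakes and edge-case questions are oversampled.
strata = {
    "routine": (0.70, 1.0),
    "high_stakes": (0.20, 2.0),
    "edge_case": (0.10, 3.0),
}
plan = allocate_examples(2000, strata)
print(plan)
```

With these made-up factors, edge cases are 10% of expected traffic but receive roughly a fifth of the annotation budget.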

Budget allocation principle: If you're choosing between 500 physician-written examples and 10,000 crowdsourced examples for the same cost, take the 500. The physician-written examples will produce a better model and a safer one. The 10,000 crowdsourced examples tend to produce a model that can look better on automated metrics yet performs worse in clinical evaluation.

Quality Control for Clinical SFT Data

Even physician annotators produce variable quality without a quality control layer. The QC process for medical SFT annotation should include: annotator calibration before production (test tasks with known-good answers), inter-annotator agreement measurement on a random sample (5–10% of total), clinical advisor review of all high-stakes examples, and systematic checks for factual errors, format violations, and safety language omissions.
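The agreement step can be sketched in a few lines: draw a reproducible random sample for double annotation, then compute an agreement statistic on the paired labels. The example below uses Cohen's kappa, which is one common choice for two annotators with categorical labels; the text above does not prescribe a specific statistic, and the function names and pass/fail labels are illustrative.

```python
import random
from collections import Counter


def qc_sample(example_ids, fraction=0.05, seed=0):
    """Draw a reproducible random sample of example ids for double annotation."""
    ids = list(example_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
print(len(qc_sample(range(1000), fraction=0.05)))
print(round(cohens_kappa(a, b), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most examples receive the same label. What threshold counts as acceptable is a project decision and should be set with the clinical advisor.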

The QC overhead is real — it adds 20–30% to the time cost of annotation. But a dataset that hasn't been through this process will require multiple fine-tuning iterations and clinical evaluations to diagnose what went wrong. The QC investment is cheaper than the rework.

Need a clinical SFT dataset built to spec?

We design the task brief, select and calibrate specialty-matched physicians, run QC, and deliver datasets in your training format — within timelines your ML team can actually work with.

