Medical Text Classification: 12 Experiments from TF-IDF to LLMs
I recently spent a few weeks building a text classification system for medical correspondence. The task: take OCR’d clinic letters and classify them into ~49 categories — specialties like Cardiology and Ophthalmology, administrative types like Discharge Summaries, diagnostic categories like Echocardiogram results.
I ran 12 experiments. Here’s what actually mattered.
The dataset
| Property | Value |
|---|---|
| Source | OCR'd PDF medical clinic letters |
| Total samples | ~14,700 raw → 13,672 after filtering |
| Classes | 49 letter types (after merging) |
| Min samples/class | 35 |
| Split | 70 / 10 / 20 (stratified) |
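A hedged sketch of how the 70/10/20 stratified split above can be produced with two calls to sklearn's `train_test_split` (the function name `split_70_10_20` and the `texts`/`labels` variables are stand-ins, not the project's actual code):

```python
# Sketch: 70/10/20 stratified split via two passes of train_test_split.
from sklearn.model_selection import train_test_split

def split_70_10_20(texts, labels, seed=42):
    # First carve off the 20% test set, stratified by label.
    X_rest, X_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed)
    # 10% of the total is 12.5% of the remaining 80%.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratification matters here because the rarest class has only 35 samples; an unstratified split could leave a class nearly absent from validation.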
Experiment 1: TF-IDF baseline
TfidfVectorizer (unigrams + bigrams, 50k features) with a LinearSVC. Result: ~91% accuracy.
A good sanity check. Most of the classification signal lives in simple lexical features — certain words and phrases are strong indicators of letter type. But bag-of-words can’t capture word order or context, so there’s a ceiling.
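The Experiment 1 pipeline is roughly the following sketch; the documents and labels here are toy stand-ins for the real corpus:

```python
# Sketch of the TF-IDF baseline: unigrams + bigrams, 50k features, LinearSVC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LinearSVC(),  # a linear model works well on sparse lexical features
)

docs = ["echocardiogram shows normal LV function",
        "discharge summary following admission",
        "cardiology clinic review letter",
        "patient discharged home, summary attached"]
labels = ["Echocardiogram", "Discharge", "Cardiology", "Discharge"]
clf.fit(docs, labels)
print(clf.predict(["routine cardiology clinic follow-up"]))
```

The whole baseline is a few lines, which is exactly why it makes a good first experiment.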
The biggest win: fixing the labels
Before trying fancier models, I looked at the label set. Some categories were synonymous (“Nephrology” and “Renal”), some were ambiguous, and a few were too vague to be useful. After merging and cleaning:
The result: roughly +5pp in accuracy, from touching zero model code.
This turned out to be the single largest improvement in the entire project.
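In code, the merge pass can be as simple as a canonicalisation map applied before training. This is a sketch: the "Renal" → "Nephrology" merge comes from the text above, while "Heart Clinic" is an invented example, and the real map covered the full label set:

```python
# Sketch: fold synonymous category names into one canonical label.
MERGE_MAP = {
    "Renal": "Nephrology",         # synonyms from the text above
    "Heart Clinic": "Cardiology",  # illustrative, not a real category
}

def canonical(label: str) -> str:
    return MERGE_MAP.get(label, label)

labels = ["Nephrology", "Renal", "Cardiology"]
merged = [canonical(l) for l in labels]
```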
Experiment 2: DistilBERT
Fine-tuned distilbert-base-uncased — 4 epochs, batch size 16, learning rate 2e-5, max 512 tokens. Matched the ~96% mark and became the baseline for everything else.
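A minimal sketch of that setup with Hugging Face transformers, using the hyperparameters stated above. Dataset loading and tokenization are elided, and names like `build_trainer` are my own, not the project's:

```python
# The stated Experiment 2 hyperparameters, collected in one place.
HPARAMS = {"model": "distilbert-base-uncased", "epochs": 4,
           "batch_size": 16, "lr": 2e-5, "max_length": 512}

def build_trainer(train_ds, val_ds, num_labels=49):
    # Imported lazily so the config above stands on its own.
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    model = AutoModelForSequenceClassification.from_pretrained(
        HPARAMS["model"], num_labels=num_labels)
    args = TrainingArguments(
        output_dir="distilbert-letters",
        num_train_epochs=HPARAMS["epochs"],
        per_device_train_batch_size=HPARAMS["batch_size"],
        learning_rate=HPARAMS["lr"],
    )
    # train_ds / val_ds are assumed to be datasets already tokenized
    # with truncation to HPARAMS["max_length"] tokens.
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=val_ds)
```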
Experiments 3–6: Bigger and fancier models
I tried several model variations, hoping to push past 96%:
- ClinicalBERT & BioClinicalBERT — models pre-trained on clinical notes. No improvement. The task doesn’t rely on deep clinical terminology — it relies on structural cues like headers, greetings, and clinic names.
- Longformer at 1024 tokens — maybe more context would help? No. Most letters are identifiable from their first page. The letter type shows up early.
- Hierarchical classification — broad category first, then fine-grained. Added complexity without benefit. The 49-class space is already well-separated.
[Figure: model comparison, Top-1 accuracy, scale 93%–97%. ★ = baseline. The scale starts at 93% to make differences visible; all models cluster within ~0.5pp of each other.]
Experiments 7–9: LLM relabeling and distillation
This is where things got interesting — and humbling.
LLM relabeling
I used a large language model (via batch API) to independently classify all ~13,700 filtered samples. The LLM agreed with the original labels only 85.7% of the time.
Training DistilBERT on LLM-assigned labels:
[Figure: effect of LLM relabeling on training, shown as "trained on → evaluated against" pairs. LLM labels are systematically different, not better.]
The LLM’s labels weren’t wrong in a random way. They were systematically different. The LLM doesn’t know how a healthcare organization internally categorises its own correspondence. It applies its own logic, which doesn’t match the operational reality.
Consensus relabeling
Only change a label when both the trained BERT and the LLM agree the original is wrong. Out of 9,500+ training samples, only 4 met this criterion. BERT memorises its training labels almost perfectly, so it virtually never disagrees with them on in-sample data.
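The consensus rule is simple to state in code. This is a sketch with hypothetical names; `original`, `bert_pred`, and `llm_pred` stand for parallel lists of labels:

```python
# Sketch of the consensus rule: change a label only when BERT and the
# LLM both disagree with the original AND agree with each other.
def consensus_relabel(original, bert_pred, llm_pred):
    relabeled = []
    for orig, b, l in zip(original, bert_pred, llm_pred):
        if b == l and b != orig:    # both propose the same different label
            relabeled.append(b)
        else:
            relabeled.append(orig)  # otherwise keep the original
    return relabeled
```

The rule is sound but, as noted above, near-useless in practice: a model evaluated on its own training data almost never votes against the labels it memorised.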
Soft knowledge distillation
Used the LLM’s top-5 predictions with confidence scores as soft targets — blended loss: α × CE(hard labels) + (1-α) × KL(soft labels ‖ student logits).
The soft KL loss stayed flat at ~3.5 across all epochs. LLM confidence scores are too noisy for effective distillation.
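The blended objective above can be written out directly. A numpy sketch, where `teacher_probs` stands for the LLM's renormalised top-5 confidences and the function names are illustrative:

```python
# Sketch: alpha * CE(hard label) + (1 - alpha) * KL(teacher || student).
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def blended_loss(student_logits, hard_label, teacher_probs, alpha=0.5):
    p = softmax(student_logits)
    ce = -np.log(p[hard_label])      # cross-entropy on the hard label
    mask = teacher_probs > 0         # KL summed only where the teacher has mass
    kl = np.sum(teacher_probs[mask] *
                np.log(teacher_probs[mask] / p[mask]))
    return alpha * ce + (1 - alpha) * kl
```

The KL term is zero exactly when the student distribution matches the teacher's; a flat ~3.5 across epochs means the student never moved toward the LLM's soft targets.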
Experiments 10–11: Cleanlab (the second big win)
Cleanlab uses confident learning to find likely mislabeled samples. I ran 3-fold cross-validation to get out-of-sample predictions, then flagged samples where the model confidently disagreed with the label.
142 out of 9,500 training samples flagged (1.5%). Manual inspection confirmed ~99% were genuinely mislabeled — the model was right, the ground truth was wrong.
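Cleanlab's confident learning has more machinery than this, but the core flagging idea can be approximated by hand: get out-of-fold predicted probabilities, then flag samples where the model confidently disagrees with the given label. A sketch with an illustrative classifier and threshold (the real pipeline used DistilBERT, not logistic regression):

```python
# Rough stand-in for cleanlab-style flagging via 3-fold out-of-fold
# predictions. Classifier and threshold are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_label_issues(X, y, threshold=0.9):
    clf = LogisticRegression(max_iter=1000)
    proba = cross_val_predict(clf, X, y, cv=3, method="predict_proba")
    pred = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    # Flag: out-of-fold prediction is confident AND disagrees with the label.
    flagged = np.where((pred != y) & (conf >= threshold))[0]
    return flagged, pred
```

Relabeling rather than removing is then one line on top of this: `y[flagged] = pred[flagged]`.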
[Figure: noisiest classes by mislabel rate.]
The key insight: instead of removing flagged samples, I relabeled them with the model’s prediction. I also ran cleanlab on the test set, finding 62 mislabeled test samples.
[Figure: cleanlab relabeling vs baseline, Top-1 / Top-3 / Top-5 accuracy.]
Most of the model’s “errors” were actually correct predictions being penalised by wrong ground truth.
The production model
Final approach: run cleanlab on the entire dataset (train + val + test), relabel all 212 flagged samples (1.6% of the data), train a fresh DistilBERT on the full corrected dataset for 6 epochs.
[Figure: training loss curve over epochs.]
Estimated production performance: ~98% top-1, ~99% top-3, ~99.4% top-5.
The full picture
| # | Experiment | Top-1 | Top-3 | Notes |
|---|---|---|---|---|
| 1 | TF-IDF + LinearSVC | ~91% | — | Before label merging |
| 2 | Label merging | ~96% | — | +5pp — biggest single gain |
| 3 | DistilBERT | 95.8% | 98.1% | Baseline for all further work |
| 4–5 | ClinicalBERT variants | ~96% | — | No improvement |
| 6 | Longformer 1024 | ~96% | — | Longer context didn't help |
| 7 | Hierarchical | <96% | — | More complexity, no gain |
| 8 | LLM relabeled | 86.2% | 95.4% | LLM labels diverge too much |
| 9 | Soft distillation | 95.3% | 97.5% | Soft loss didn't converge |
| 10 | Cleanlab remove | 95.9% | 97.7% | Small top-1 gain |
| 11 | Cleanlab relabel | 98.1% | 99.1% | True performance (corrected eval) |
Key takeaways
Label quality > model architecture. The two biggest accuracy gains came from label merging (+5pp) and cleanlab corrections (+2pp). Model changes contributed +0pp.
Simple models can be enough. DistilBERT — a general-purpose, distilled transformer — matched or beat every domain-specific and larger model I tried.
LLM labels are not automatically better. An LLM doesn't know your domain's labeling conventions. Its "corrections" may be internally consistent but operationally wrong.
Your model might be better than your metrics say. If ~1.6% of your labels are wrong, your accuracy metric has a ~1.6% noise floor. Cleanlab helped me see past that.
Invest in data quality tooling early. I wish I'd run cleanlab before any model experiments. It would have saved me from chasing model improvements that were actually label noise.