
Medical Text Classification: 12 Experiments from TF-IDF to LLMs


I recently spent a few weeks building a text classification system for medical correspondence. The task: take OCR’d clinic letters and classify them into ~49 categories — specialties like Cardiology and Ophthalmology, administrative types like Discharge Summaries, diagnostic categories like Echocardiogram results.

I ran 12 experiments. Here’s what actually mattered.

The dataset

Property             Value
-------------------  -------------------------------------
Source               OCR'd PDF medical clinic letters
Total samples        ~14,700 raw → 13,672 after filtering
Classes              49 letter types (after merging)
Min samples/class    35
Split                70 / 10 / 20 (stratified)
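For reference, a 70 / 10 / 20 stratified split can be produced with two calls to scikit-learn's `train_test_split`. A sketch, with an illustrative function name and seed rather than the project's actual code:

```python
from sklearn.model_selection import train_test_split

def stratified_70_10_20(texts, labels, seed=42):
    """Split into 70% train, 10% val, 20% test, stratified by label."""
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed)
    # 10% of the total is 1/8 of the remaining 80%, hence test_size=0.125
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.125, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratifying both calls keeps all 49 classes represented at roughly the same rate in train, validation, and test.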

Experiment 1: TF-IDF baseline

TfidfVectorizer (unigrams + bigrams, 50k features) with a LinearSVC. Result: ~91% accuracy.

A good sanity check. Most of the classification signal lives in simple lexical features — certain words and phrases are strong indicators of letter type. But bag-of-words can’t capture word order or context, so there’s a ceiling.
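The whole baseline is a few lines of scikit-learn. A minimal sketch on toy data (the toy letters and labels are invented for illustration; the real pipeline trained on the full corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for OCR'd clinic letters
docs = [
    "cardiology clinic review ejection fraction stable",
    "ophthalmology clinic review visual acuity stable",
    "cardiology follow up echocardiogram result",
    "ophthalmology follow up cataract surgery",
]
labels = ["Cardiology", "Ophthalmology", "Cardiology", "Ophthalmology"]

# Unigrams + bigrams, capped at 50k features, linear SVM on top
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LinearSVC(),
)
clf.fit(docs, labels)
print(clf.predict(["echocardiogram shows normal ejection fraction"]))
```

Terms like "echocardiogram" and "ejection fraction" only ever appear in one class, which is exactly the kind of lexical signal that gets this model to ~91% on the real data.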

The biggest win: fixing the labels

Before trying fancier models, I looked at the label set. Some categories were synonymous (“Nephrology” and “Renal”), some were ambiguous, and a few were too vague to be useful. After merging and cleaning:

Before merging    91%
After merging     96%

+5pp from touching zero model code.

This turned out to be the single largest improvement in the entire project.

Experiment 2: DistilBERT

Fine-tuned distilbert-base-uncased — 4 epochs, batch size 16, learning rate 2e-5, max 512 tokens. Matched the ~96% mark and became the baseline for everything else.

Experiments 3–6: Bigger and fancier models

I tried several model variations, hoping to push past 96%:

  • ClinicalBERT & BioClinicalBERT — models pre-trained on clinical notes. No improvement. The task doesn’t rely on deep clinical terminology — it relies on structural cues like headers, greetings, and clinic names.
  • Longformer at 1024 tokens — maybe more context would help? No. Most letters are identifiable from their first page. The letter type shows up early.
  • Hierarchical classification — broad category first, then fine-grained. Added complexity without benefit. The 49-class space is already well-separated.

Model comparison (Top-1 accuracy)

TF-IDF (merged)      ~96%
DistilBERT ★         95.8%
ClinicalBERT         ~95.6%
BioClinicalBERT      ~95.5%
Longformer 1024      ~95.5%
Hierarchical         <95.5%

★ = baseline. All models cluster within ~0.5pp of each other.

Experiments 7–9: LLM relabeling and distillation

This is where things got interesting — and humbling.

LLM relabeling

I used a large language model (via batch API) to independently classify all 13,700 samples. The LLM agreed with the original labels only 85.7% of the time.

Training DistilBERT on LLM-assigned labels:

Effect of LLM relabeling on training ("trained on → evaluated against")

Original → Original    96.1%
LLM → Original         86.2%
LLM → LLM              93.2%

LLM labels are systematically different, not better.

The LLM’s labels weren’t wrong in a random way. They were systematically different. The LLM doesn’t know how a healthcare organization internally categorises its own correspondence. It applies its own logic, which doesn’t match the operational reality.

Consensus relabeling

Only change a label when both the trained BERT and the LLM agree the original is wrong. Out of 9,500+ training samples, only 4 met this criterion. BERT memorises its training labels almost perfectly, so it virtually never disagrees with them on in-sample data.
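The consensus rule itself is tiny (array names here are hypothetical, not the project's code):

```python
import numpy as np

def consensus_relabel(original, bert_pred, llm_pred):
    """Relabel only where BERT and the LLM agree with each other
    and both disagree with the original label."""
    original = np.asarray(original)
    bert_pred = np.asarray(bert_pred)
    llm_pred = np.asarray(llm_pred)
    consensus = (bert_pred == llm_pred) & (bert_pred != original)
    relabeled = original.copy()
    relabeled[consensus] = bert_pred[consensus]
    return relabeled, consensus
```

Because `bert_pred` on in-sample data almost always equals `original`, the consensus mask is nearly empty, which is exactly why only 4 samples were changed.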

Soft knowledge distillation

Used the LLM’s top-5 predictions with confidence scores as soft targets — blended loss: α × CE(hard labels) + (1-α) × KL(soft labels ‖ student logits).

Baseline          95.76%
Soft distilled    95.32%

The soft KL loss stayed flat at ~3.5 across all epochs. LLM confidence scores are too noisy for effective distillation.
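For reference, the blended objective looks like this in PyTorch (a simplified sketch: α is a free hyperparameter, and in the real run the teacher distribution came from the LLM's renormalised top-5 scores):

```python
import torch
import torch.nn.functional as F

def blended_loss(student_logits, hard_labels, teacher_probs, alpha=0.5):
    """alpha * CE(hard labels) + (1 - alpha) * KL(teacher || student).

    teacher_probs: per-sample probability distribution over classes,
    e.g. the LLM's top-k confidences spread over a 49-class vector.
    """
    ce = F.cross_entropy(student_logits, hard_labels)
    # F.kl_div expects log-probabilities for the student input
    log_student = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(log_student, teacher_probs, reduction="batchmean")
    return alpha * ce + (1 - alpha) * kl
```

With α = 1 this reduces to plain cross-entropy; the flat ~3.5 KL term means the student never moved toward the teacher distribution.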

Experiments 10–11: Cleanlab (the second big win)

Cleanlab uses confident learning to find likely mislabeled samples. I ran 3-fold cross-validation to get out-of-sample predictions, then flagged samples where the model confidently disagreed with the label.

142 out of 9,500 training samples flagged (1.5%). Manual inspection confirmed ~99% were genuinely mislabeled — the model was right, the ground truth was wrong.

Noisiest classes by mislabel rate

Discharge        9.1%
Paediatrics      7.2%
Physio           6.8%
Sexual Health    4.6%
OB/GYN           4.3%

The key insight: instead of removing flagged samples, I relabeled them with the model’s prediction. I also ran cleanlab on the test set, finding 62 mislabeled test samples.
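The actual run used cleanlab's confident-learning machinery (`cleanlab.filter.find_label_issues` on out-of-fold probabilities); the flag-then-relabel idea reduces to something like this sketch, where the fixed threshold is a crude stand-in for cleanlab's calibrated, per-class cutoffs:

```python
import numpy as np

def flag_and_relabel(labels, oof_probs, threshold=0.9):
    """Flag samples where the out-of-fold prediction confidently
    disagrees with the given label, then relabel with the prediction.

    labels:    (n,) given integer labels
    oof_probs: (n, k) out-of-fold predicted probabilities
    threshold: hypothetical confidence cutoff
    """
    labels = np.asarray(labels)
    pred = oof_probs.argmax(axis=1)
    conf = oof_probs.max(axis=1)
    flagged = (pred != labels) & (conf >= threshold)
    corrected = labels.copy()
    corrected[flagged] = pred[flagged]   # relabel, don't remove
    return corrected, flagged
```

Out-of-fold predictions are essential here: in-sample predictions would just echo the (possibly wrong) training labels.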

Cleanlab relabeling vs baseline

Metric    vs original labels    vs corrected labels
Top-1     95.8%                 98.1%
Top-3     97.9%                 99.1%
Top-5     98.5%                 99.4%

Most of the model’s “errors” were actually correct predictions being penalised by wrong ground truth.

The production model

Final approach: run cleanlab on the entire dataset (train + val + test), relabel all 212 flagged samples (1.6% of the data), train a fresh DistilBERT on the full corrected dataset for 6 epochs.

Training loss by epoch

Epoch    1       2       3       4       5       6
Loss     2.36    0.21    0.07    0.04    0.03    0.02

Estimated production performance: ~98% top-1, ~99% top-3, ~99.4% top-5.

The full picture

#      Experiment               Top-1    Top-3    Notes
1      TF-IDF + LinearSVC       ~91%              Before label merging
2      Label merging            ~96%              +5pp — biggest single gain
3      DistilBERT               95.8%    98.1%    Baseline for all further work
4–5    ClinicalBERT variants    ~96%              No improvement
6      Longformer 1024          ~96%              Longer context didn't help
7      Hierarchical             <96%              More complexity, no gain
8      LLM relabeled            86.2%    95.4%    LLM labels diverge too much
9      Soft distillation        95.3%    97.5%    Soft loss didn't converge
10     Cleanlab remove          95.9%    97.7%    Small top-1 gain
11     Cleanlab relabel         98.1%    99.1%    True performance (corrected eval)

Key takeaways

Label quality > model architecture. The two biggest accuracy gains came from label merging (+5pp) and cleanlab corrections (+2pp). Model changes contributed +0pp.

Simple models can be enough. DistilBERT — a general-purpose, distilled transformer — matched or beat every domain-specific and larger model I tried.

LLM labels are not automatically better. An LLM doesn't know your domain's labeling conventions. Its "corrections" may be internally consistent but operationally wrong.

Your model might be better than your metrics say. If ~1.6% of your labels are wrong, your accuracy metric has a ~1.6% noise floor. Cleanlab helped me see past that.

Invest in data quality tooling early. I wish I'd run cleanlab before any model experiments. It would have saved me from chasing model improvements that were actually label noise.