Performance with standard supervised learning, 95% CIs shown in parentheses
Model | Validation precision | Validation recall | Validation F1 | Train F1 | Test precision | Test recall | Test F1 |
Random baseline | 0.499 | 0.104 | 0.161 | 0.161 | 0.501 | 0.102 | 0.158 |
Conventional supervision | |||||||
Naïve Bayes (multilabel) | 0.284 (0.237 to 0.325) | 0.140 (0.135 to 0.192) | 0.175 (0.161 to 0.222) | 0.999 | 0.234 (0.169 to 0.276) | 0.113 (0.087 to 0.158) | 0.139 (0.106 to 0.185) |
Naïve Bayes (multiclass) | 0.372 (0.298 to 0.399) | 0.327 (0.314 to 0.398) | 0.300 (0.266 to 0.342) | 0.696 | 0.178 (0.146 to 0.232) | 0.238 (0.213 to 0.294) | 0.181 (0.154 to 0.226) |
SVM (multilabel) | 0.107 (0.112 to 0.132) | 1.000 (1.000 to 1.000) | 0.184 (0.192 to 0.223) | 0.181 | 0.102 (0.095 to 0.124) | 1.000 (1.000 to 1.000) | 0.177 (0.166 to 0.211) |
SVM (multiclass) | 0.200 (0.171 to 0.244) | 0.159 (0.157 to 0.211) | 0.154 (0.142 to 0.196) | 0.696 | 0.217 (0.145 to 0.263) | 0.169 (0.140 to 0.227) | 0.164 (0.129 to 0.213) |
Nearest centroid (multiclass) | 0.349 (0.297 to 0.395) | 0.270 (0.254 to 0.327) | 0.278 (0.247 to 0.325) | 0.694 | 0.307 (0.180 to 0.355) | 0.205 (0.150 to 0.276) | 0.219 (0.151 to 0.278) |
BERT conventional (multiclass) | 0.467 (0.434 to 0.549) | 0.577 (0.546 to 0.654) | 0.480 (0.447 to 0.550) | 0.696 | 0.484 (0.414 to 0.575) | 0.509 (0.434 to 0.610) | 0.452 (0.390 to 0.525) |
Distant supervision | |||||||
Naïve Bayes (multilabel), ICPC-2 | 0.626 (0.515 to 0.687) | 0.234 (0.196 to 0.278) | 0.323 (0.268 to 0.362) | 0.979 | 0.590 (0.427 to 0.656) | 0.285 (0.206 to 0.384) | 0.378 (0.274 to 0.456) |
Naïve Bayes (multiclass), ICPC-2 | 0.516 (0.466 to 0.569) | 0.590 (0.541 to 0.639) | 0.512 (0.462 to 0.549) | 1.00 | 0.511 (0.412 to 0.611) | 0.524 (0.449 to 0.628) | 0.481 (0.404 to 0.567) |
Nearest centroid, ICPC-2 | 0.718 (0.565 to 0.765) | 0.416 (0.373 to 0.463) | 0.444 (0.384 to 0.489) | 1.00 | 0.520 (0.400 to 0.615) | 0.362 (0.298 to 0.448) | 0.386 (0.303 to 0.467) |
Conventional BERT, CKS | 0.603 (0.553 to 0.653) | 0.584 (0.530 to 0.640) | 0.550 (0.494 to 0.593) | 0.927 | 0.551 (0.477 to 0.649) | 0.562 (0.483 to 0.691) | 0.508 (0.429 to 0.594) |
BERT NSP, CKS | 0.364 (0.333 to 0.394) | 0.816 (0.767 to 0.865) | 0.462 (0.424 to 0.488) | 0.291 | 0.257 (0.215 to 0.331) | 0.598 (0.525 to 0.711) | 0.306 (0.257 to 0.371) |
BERT MLM, CKS | 0.600 (0.547 to 0.640) | 0.615 (0.566 to 0.673) | 0.567 (0.512 to 0.604) | 0.792 | 0.481 (0.409 to 0.574) | 0.536 (0.469 to 0.639) | 0.467 (0.397 to 0.548) |
For conventional supervision, the ‘train’ and ‘test’ columns report results for classifiers trained on the whole 80% training split, and the validation columns report 5-fold cross-validation over that training set. For distant supervision, the OIAM training set was repurposed as a validation set because it was not used to train the models in this setup.
CKS, Clinical Knowledge Summaries; ICPC-2, International Classification of Primary Care-2; MLM, masked language modelling; NSP, next sentence prediction; OIAM, One in a Million; SVM, support vector machine.
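The evaluation protocol described in the footnote for conventional supervision (an 80% training split, with 5-fold cross-validation over that split) can be sketched as follows. This is a minimal illustration only: it assumes a random, unstratified split, and the function names, the seed and the dataset size of 100 are hypothetical rather than taken from the study.

```python
import random


def train_test_split_indices(n, train_frac=0.8, seed=0):
    """Shuffle range(n) and split it into train and test index lists.

    Assumption: a plain random split; the study may have split differently.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_frac))
    return idx[:cut], idx[cut:]


def k_fold(indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation.

    Each index appears in exactly one validation fold.
    """
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, validation


# Hypothetical dataset of 100 documents.
train_idx, test_idx = train_test_split_indices(100)

# Validation scores would come from these five folds of the training split.
folds = list(k_fold(train_idx, k=5))

# The final classifier would then be fitted on all of train_idx and
# scored on train_idx (train column) and test_idx (test column).
```

The held-out test indices never enter any fold, so the validation and test columns are computed on disjoint data.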