Table 2

Performance with standard supervised learning; 95% CIs are shown in parentheses

	Validation			Train	Test
Model	Precision	Recall	F1	F1	Precision	Recall	F1
Random baseline	0.499	0.104	0.161	0.161	0.501	0.102	0.158
Conventional supervision
Naïve Bayes (multilabel)	0.284 (0.237 to 0.325)	0.140 (0.135 to 0.192)	0.175 (0.161 to 0.222)	0.999	0.234 (0.169 to 0.276)	0.113 (0.087 to 0.158)	0.139 (0.106 to 0.185)
Naïve Bayes (multiclass)	0.372 (0.298 to 0.399)	0.327 (0.314 to 0.398)	0.300 (0.266 to 0.342)	0.696	0.178 (0.146 to 0.232)	0.238 (0.213 to 0.294)	0.181 (0.154 to 0.226)
SVM (multilabel)	0.107 (0.112 to 0.132)	1.000 (1.000 to 1.000)	0.184 (0.192 to 0.223)	0.181	0.102 (0.095 to 0.124)	1.000 (1.000 to 1.000)	0.177 (0.166 to 0.211)
SVM (multiclass)	0.200 (0.171 to 0.244)	0.159 (0.157 to 0.211)	0.154 (0.142 to 0.196)	0.696	0.217 (0.145 to 0.263)	0.169 (0.140 to 0.227)	0.164 (0.129 to 0.213)
Nearest centroid (multiclass)	0.349 (0.297 to 0.395)	0.270 (0.254 to 0.327)	0.278 (0.247 to 0.325)	0.694	0.307 (0.180 to 0.355)	0.205 (0.150 to 0.276)	0.219 (0.151 to 0.278)
BERT conventional (multiclass)	0.467 (0.434 to 0.549)	0.577 (0.546 to 0.654)	0.480 (0.447 to 0.550)	0.696	0.484 (0.414 to 0.575)	0.509 (0.434 to 0.610)	0.452 (0.390 to 0.525)
Distant supervision
Naïve Bayes (multilabel), ICPC-2	0.626 (0.515 to 0.687)	0.234 (0.196 to 0.278)	0.323 (0.268 to 0.362)	0.979	0.590 (0.427 to 0.656)	0.285 (0.206 to 0.384)	0.378 (0.274 to 0.456)
Naïve Bayes (multiclass), ICPC-2	0.516 (0.466 to 0.569)	0.590 (0.541 to 0.639)	0.512 (0.462 to 0.549)	1.000	0.511 (0.412 to 0.611)	0.524 (0.449 to 0.628)	0.481 (0.404 to 0.567)
Nearest centroid, ICPC-2	0.718 (0.565 to 0.765)	0.416 (0.373 to 0.463)	0.444 (0.384 to 0.489)	1.000	0.520 (0.400 to 0.615)	0.362 (0.298 to 0.448)	0.386 (0.303 to 0.467)
Conventional BERT, CKS	0.603 (0.553 to 0.653)	0.584 (0.530 to 0.640)	0.550 (0.494 to 0.593)	0.927	0.551 (0.477 to 0.649)	0.562 (0.483 to 0.691)	0.508 (0.429 to 0.594)
BERT NSP, CKS	0.364 (0.333 to 0.394)	0.816 (0.767 to 0.865)	0.462 (0.424 to 0.488)	0.291	0.257 (0.215 to 0.331)	0.598 (0.525 to 0.711)	0.306 (0.257 to 0.371)
BERT MLM, CKS	0.600 (0.547 to 0.640)	0.615 (0.566 to 0.673)	0.567 (0.512 to 0.604)	0.792	0.481 (0.409 to 0.574)	0.536 (0.469 to 0.639)	0.467 (0.397 to 0.548)

For conventional supervision, the ‘train’ and ‘test’ results are for classifiers trained on the whole 80% training split; the validation results come from 5-fold cross-validation over that training set. For distant supervision, the OIAM training set was not used to train the models, so it was repurposed as a validation set.
CKS, Clinical Knowledge Summaries; ICPC-2, International Classification of Primary Care-2; MLM, masked language modelling; NSP, next sentence prediction; OIAM, One in a Million; SVM, support vector machine.
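
The evaluation protocol described in the table note can be illustrated with a minimal Python sketch. It shows one way to generate 5-fold cross-validation splits, compute precision, recall and F1 for a single binary label, and attach a 95% interval to F1. The function names (`k_fold_indices`, `precision_recall_f1`, `bootstrap_f1_ci`) are hypothetical. The percentile bootstrap is an assumption, because the note does not state how the CIs were obtained, and the sketch does not show how scores were averaged across labels or classes.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for 0/1 labels; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed method, not from the source)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample items with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        _, _, f1 = precision_recall_f1(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
        scores.append(f1)
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

With these definitions, a classifier that assigns the positive label to every item reaches a recall of 1.0 at low precision, which is the pattern visible in the SVM (multilabel) row; whether that is what happened there is not stated in the source.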