Table 2

Performance comparison of text vectorisation and classification methods on both training and test datasets (each metric is shown with its 95% CI)

Training set

| Vectorisation | Classifier | P | R | F1 | Acc. | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| TF-IDF | SVC | 0.96 (0.88 to 1.00) | 1.00 (1.00 to 1.00) | 0.98 (0.94 to 1.00) | 0.99 (0.96 to 1.00) | 1.00 (1.00 to 1.00) |
| TF-IDF | KNN | 1.00 (1.00 to 1.00) | 0.37 (0.18 to 0.56) | 0.54 (0.32 to 0.71) | 0.76 (0.66 to 0.86) | 0.95 (0.91 to 0.98) |
| TF-IDF | RF | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) |
| Word2Vec | SVC | 0.00 (0.00 to 0.00) | 0.00 (0.00 to 0.00) | 0.00 (0.00 to 0.00) | 0.61 (0.50 to 0.71) | 0.65 (0.49 to 0.80) |
| Word2Vec | KNN | 0.83 (0.60 to 1.00) | 0.37 (0.19 to 0.55) | 0.51 (0.29 to 0.70) | 0.73 (0.63 to 0.83) | 0.77 (0.66 to 0.87) |
| Word2Vec | RF | 0.95 (0.82 to 1.00) | 0.67 (0.48 to 0.83) | 0.78 (0.63 to 0.90) | 0.86 (0.77 to 0.93) | 0.96 (0.91 to 0.99) |
| Doc2Vec | SVC | 1.00 (1.00 to 1.00) | 0.85 (0.71 to 0.96) | 0.92 (0.83 to 0.98) | 0.94 (0.89 to 0.99) | 1.00 (0.99 to 1.00) |
| Doc2Vec | KNN | 1.00 (1.00 to 1.00) | 0.41 (0.23 to 0.60) | 0.58 (0.36 to 0.75) | 0.77 (0.67 to 0.87) | 0.97 (0.93 to 0.99) |
| Doc2Vec | RF | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) | 1.00 (1.00 to 1.00) |

Test set

| Vectorisation | Classifier | P | R | F1 | Acc. | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| TF-IDF | SVC | 0.62 (0.38 to 0.87) | 0.71 (0.46 to 0.93) | 0.67 (0.44 to 0.84) | 0.67 (0.47 to 0.80) | 0.82 (0.62 to 0.97) |
| TF-IDF | KNN | 1.00 (1.00 to 1.00) | 0.29 (0.07 to 0.55) | 0.44 (0.13 to 0.70) | 0.67 (0.50 to 0.83) | 0.73 (0.56 to 0.89) |
| TF-IDF | RF | 0.80 (0.50 to 1.00) | 0.57 (0.30 to 0.83) | 0.67 (0.42 to 0.86) | 0.73 (0.57 to 0.87) | 0.80 (0.64 to 0.94) |
| Word2Vec | SVC | 0.00 (0.00 to 0.00) | 0.00 (0.00 to 0.00) | 0.00 (0.00 to 0.00) | 0.53 (0.37 to 0.70) | 0.77 (0.56 to 0.93) |
| Word2Vec | KNN | 0.57 (0.14 to 1.00) | 0.29 (0.07 to 0.54) | 0.38 (0.10 to 0.62) | 0.57 (0.40 to 0.73) | 0.58 (0.37 to 0.77) |
| Word2Vec | RF | 0.62 (0.25 to 1.00) | 0.36 (0.13 to 0.62) | 0.45 (0.13 to 0.69) | 0.60 (0.43 to 0.77) | 0.57 (0.32 to 0.79) |
| Doc2Vec | SVC | 1.00 (1.00 to 1.00) | 0.50 (0.25 to 0.80) | 0.67 (0.40 to 0.87) | 0.77 (0.60 to 0.90) | 0.86 (0.72 to 0.96) |
| Doc2Vec | KNN | 0.33 (0.00 to 1.00) | 0.07 (0.00 to 0.25) | 0.12 (0.00 to 0.35) | 0.50 (0.33 to 0.67) | 0.57 (0.40 to 0.77) |
| Doc2Vec | RF | 0.89 (0.62 to 1.00) | 0.57 (0.31 to 0.82) | 0.70 (0.40 to 0.88) | 0.77 (0.60 to 0.90) | 0.90 (0.76 to 0.99) |
  • Acc, accuracy; AUC, area under the curve; F1, F1-score; KNN, K-nearest neighbours; P, precision; R, recall; RF, random forest; SVC, support vector classification; TF-IDF, term frequency-inverse document frequency.
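For readers who wish to reproduce this kind of comparison, the sketch below shows one way the reported metrics and their 95% CIs could be computed for a single vectorisation-classifier pair (TF-IDF with SVC), using scikit-learn and a percentile bootstrap. This is an illustrative assumption rather than the authors' implementation: the function names `bootstrap_ci` and `evaluate_tfidf_svc`, the bootstrap settings, and the data arguments are placeholders, and the same pattern would be repeated for the Word2Vec/Doc2Vec and KNN/RF combinations.

```python
# Minimal sketch (assumed implementation, not the paper's code) of computing
# P, R, F1, Acc. and AUC with percentile-bootstrap 95% CIs for TF-IDF + SVC.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def bootstrap_ci(metric, y_true, y_hat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one metric (95% CI by default)."""
    rng = np.random.default_rng(seed)
    y_true, y_hat = np.asarray(y_true), np.asarray(y_hat)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # skip one-class resamples (AUC undefined)
            continue
        scores.append(metric(y_true[idx], y_hat[idx]))
    return np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])


def evaluate_tfidf_svc(train_texts, y_train, test_texts, y_test):
    """Fit TF-IDF + SVC on the training split and report test-set metrics with CIs."""
    vectoriser = TfidfVectorizer()
    X_train = vectoriser.fit_transform(train_texts)
    X_test = vectoriser.transform(test_texts)

    clf = SVC(probability=True).fit(X_train, y_train)
    y_pred = clf.predict(X_test)                # hard labels for P, R, F1, Acc.
    y_prob = clf.predict_proba(X_test)[:, 1]    # class-1 scores for AUC

    rows = [("P", precision_score, y_pred),
            ("R", recall_score, y_pred),
            ("F1", f1_score, y_pred),
            ("Acc.", accuracy_score, y_pred),
            ("AUC", roc_auc_score, y_prob)]
    for name, metric, y_hat in rows:
        lo, hi = bootstrap_ci(metric, y_test, y_hat)
        print(f"{name}: {metric(y_test, y_hat):.2f} ({lo:.2f} to {hi:.2f})")
```

Training-set figures such as those in table 2 would be obtained the same way, with the metrics computed on predictions for the training split instead of the held-out test split.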