Just as with a diagnostic test or a prediction rule, clinicians should be told the accuracy and reproducibility of algorithm outputs. A process of internal (or in-sample) validation should have tested and refined the algorithm on datasets resampled from the original training datasets,29 either by bootstrapping (repeated random sampling with replacement) or cross-validation (the dataset is split into multiple subsets, or ‘folds’, each used in turn as the test set; hence the term k-fold cross-validation, where k is the number of folds, usually 5 or 10).
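As a minimal illustration only, the following Python sketch shows 5-fold cross-validation using scikit-learn; the synthetic dataset and the logistic regression model are placeholders rather than any published algorithm.

```python
# Minimal sketch of 5-fold cross-validation (k=5) with scikit-learn.
# The dataset and model below are synthetic placeholders for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is held out once as the internal test set.
auroc_per_fold = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("AUROC per fold:", np.round(auroc_per_fold, 3))
print(f"Mean AUROC: {auroc_per_fold.mean():.3f}")
```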
This is followed by a process of external (out-of-sample) validation on previously unseen data, preferably taken from a temporally or geographically different population. This step, which is often omitted, is crucial as it often reveals overfitting, where the algorithm has learnt the features of the training dataset too closely, including its minor random fluctuations, and consequently may not perform well on new datasets. For classification tasks, which are the most common, metrics of discrimination should be reported (box 2), and the chosen sensitivity/specificity thresholds justified in terms of maximising clinical utility.30 For regression-based prediction tasks, clinicians should ask whether an algorithm performs better than existing regression models, as it may not,31 and whether replication studies of the same algorithm by independent investigators have yielded the same performance results.32
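To illustrate out-of-sample evaluation and the comparison with an existing regression model, the sketch below uses synthetic data and scikit-learn; the ‘external’ set is simulated here by a simple hold-out split, whereas in practice it would come from a different time period or site.

```python
# Minimal sketch of out-of-sample validation: a machine learning model is
# compared against a conventional logistic regression on data neither model
# has seen. All data are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
# In practice the 'external' set would be a temporally or geographically
# distinct cohort, not a random split of the same data.
X_train, X_external, y_train, y_external = train_test_split(
    X, y, test_size=0.3, random_state=1
)

ml_model = GradientBoostingClassifier().fit(X_train, y_train)
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, m in [("ML model", ml_model), ("Existing regression", baseline)]:
    auroc = roc_auc_score(y_external, m.predict_proba(X_external)[:, 1])
    print(f"{name}: external AUROC = {auroc:.3f}")
```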
Box 2: Performance measures for machine learning algorithms
Area under receiver operating characteristic curve (AUROC)
For binary outcomes involving numerical samples (such as disease or event present or absent), the receiver operating characteristic (ROC) curve plots the true positive (TP) rate (sensitivity) against the false positive rate (1 minus specificity). An AUROC of 1.0 represents perfect prediction, an AUROC of 0.5 represents discrimination no better than chance, and an AUROC equal to or above 0.8 is preferred (an illustrative calculation is sketched below).
For binary outcomes involving imaging data, a modification of the ROC is the free-response ROC (FROC).* A FROC curve that follows a 45° diagonal line indicates the algorithm is useless, while the steeper and more convex the curve, the greater the accuracy.
In situations where outcomes are multidimensional rather than binary, or where data are highly skewed with disproportionately large numbers of true negatives, other methods such as the volume under the ROC surface and the false discovery rate-controlled area under the ROC curve have been suggested; values equal to or above 0.8 are again preferred.**
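As a minimal illustration of the standard binary case described above, the following sketch computes an AUROC and the points of the ROC curve with scikit-learn; the outcomes and predicted probabilities are invented.

```python
# Minimal sketch of AUROC calculation for a binary outcome.
# The outcomes and predicted probabilities below are invented for illustration.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                        # observed outcomes
y_prob = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.7, 0.8, 0.45, 0.9]  # predicted probabilities

auroc = roc_auc_score(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points tracing the ROC curve
print(f"AUROC = {auroc:.2f}")  # 1.0 = perfect; 0.5 = no better than chance
```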
Confusion matrix
A confusion matrix is a contingency table which yields several metrics, with optimal performance represented by values approaching 100% or 1.0.
Positive predictive value (PPV) or precision: the proportion of predicted positive cases that are true positives (TP) rather than false positives (FP): PPV = TP/(TP + FP).
Negative predictive value (NPV): the proportion of predicted negative cases that are true negatives (TN) rather than false negatives (FN): NPV = TN/(TN + FN).
Sensitivity (Sn) or recall: the proportion of actual positive cases that are correctly identified: Sn = TP/(TP + FN).
Specificity (Sp): the proportion of actual negative cases that are correctly identified: Sp = TN/(TN + FP).
Accuracy: the proportion of the total number of predictions that are correct: accuracy = (TP + TN)/(TP + FP + TN + FN).
F1 score: the harmonic mean of precision (PPV) and recall (sensitivity), which balances the two given that improving one often comes at the expense of the other. It is reported as a single score from 0 to 1 using the formula: F1 = 2 × TP/(2 × TP + FP + FN). The higher the score, the better the performance.
Matthews correlation coefficient: this coefficient takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes: MCC = (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]. A coefficient of +1 represents perfect prediction, 0 no better than random, and −1 total disagreement between prediction and actual outcome.
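A minimal worked sketch of these confusion matrix metrics, computed in Python from invented cell counts, is shown below.

```python
# Confusion matrix metrics computed from illustrative (invented) cell counts.
from math import sqrt

TP, FP, TN, FN = 80, 20, 880, 20  # hypothetical counts

ppv = TP / (TP + FP)                        # positive predictive value (precision)
npv = TN / (TN + FN)                        # negative predictive value
sensitivity = TP / (TP + FN)                # recall
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + FP + TN + FN)
f1 = 2 * TP / (2 * TP + FP + FN)            # harmonic mean of PPV and recall
mcc = (TP * TN - FP * FN) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)                                           # Matthews correlation coefficient

print(f"PPV {ppv:.2f}, NPV {npv:.2f}, Sn {sensitivity:.2f}, Sp {specificity:.2f}")
print(f"Accuracy {accuracy:.2f}, F1 {f1:.2f}, MCC {mcc:.2f}")
```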
Precision-recall (PR) curve
The PR curve is a graphical plot of PPV (or precision) against sensitivity (or recall), showing the trade-off between the two measures across different classification thresholds. The area under the PR curve is a better measure of accuracy than the AUROC for classification tasks involving highly imbalanced datasets (ie, very few positive cases and large numbers of negative cases). An area under the PR curve (AUPRC) equal to or above 0.5 is preferred. Ideally, algorithm developers should report both AUROC and AUPRC, along with figures of the actual curves.
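The sketch below illustrates a PR curve and its area on a synthetic imbalanced dataset using scikit-learn; average precision is used here as a common summary of the area under the PR curve.

```python
# Minimal sketch of a precision-recall curve and its area (AUPRC) on a
# synthetic, deliberately imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# weights=[0.95] makes positive cases rare, mimicking class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, probs)  # points on the PR curve
auprc = average_precision_score(y_te, probs)                # summary of the PR curve
print(f"AUPRC = {auprc:.2f}")
```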
Regression metrics
Various metrics can be used to measure the performance of algorithms performing regression functions (ie, predicting a continuous outcome). They include the mean absolute error (the mean of the absolute differences between actual and predicted values), the mean squared error (calculated by squaring the differences between actual and predicted values, summing them, and dividing by the total number of instances) and the root mean squared error (the square root of the mean squared error). In all cases, values closer to 0 indicate better performance.
Another commonly used metric is the coefficient of determination (R2), which represents how much of the variation in the algorithm’s output (the Y, or dependent, variable) is explained by variation in its inputs (the X, or independent, variables). An R2 of 0 means prediction is not possible from the input variables, and an R2 of 1 means completely accurate prediction with no unexplained variability. Generally, R2 should be above 0.6 for the algorithm to be useful.
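A minimal sketch of these regression metrics, computed on invented actual and predicted values with scikit-learn, is shown below.

```python
# Regression metrics computed on invented actual and predicted values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([3.0, 5.0, 7.5, 9.0, 11.0])     # observed continuous outcomes
predicted = np.array([2.5, 5.5, 7.0, 9.5, 10.0])  # model predictions

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)                               # square root of the MSE
r2 = r2_score(actual, predicted)                  # coefficient of determination

print(f"MAE {mae:.2f}, MSE {mse:.2f}, RMSE {rmse:.2f}, R2 {r2:.2f}")
```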
*See Moskowitz CS. Using free-response receiver operating characteristic curves to assess the accuracy of machine diagnosis of cancer. JAMA 2017;318:2250–2251.
**See Yu T. ROCS: Receiver operating characteristic surface for class-skewed high-throughput data. PLoS One 2012;7:e40598.