The search strategy identified a total of 921 records; after duplicate removal and title and abstract screening, 114 full-text studies were retrieved, of which 39 studies16–54 met the selection criteria for inclusion in the final analysis (figure 2).
Study characteristics
Study characteristics are summarised in online supplemental table 1. Studies originated from the USA (n=12),17 19–23 25 41 43 50 51 54 Austria (n=9),24 28–31 33 39 47 48 China (n=6),26 32 35 49 52 53 Germany (n=3),37 45 46 South Korea (n=3),27 40 44 Canada (n=3),30 36 38 Brazil (n=1),16 Japan (n=1),34 Spain (n=1)18 and one study was labelled as international.42 Over the 6-year distribution of publications to June 2022, most studies were published in 2021 (n=10) and the first half of 2022 (n=12), indicating considerable growth in research in this area since the publication of previous reviews of studies published up to 2019.11 12 Study designs comprised retrospective cohort studies (n=25), prospective cohort studies (n=9), secondary analyses of trial data (n=2), prospective pilot studies (n=2) and a retrospective case-control study (n=1). Studies mostly used data from EHRs alone to develop their models (n=21), with the remainder using specified clinical assessments (eg, nursing assessment, n=8), compiled clinical databases (eg, data repository or open-access database, n=6), data from a clinical quality improvement registry (n=1), data from both EHRs and clinical assessments (n=1), data from EHRs and a clinical database (n=1) and data solely from electrocardiographs (n=1).
The median sample size of training datasets was 2389 (IQR: 371–27 377) participants, of whom, where reported as a percentage, a median of 20% (IQR: 20%–25%) was used as a ‘hold-out’ sample for internal validation. External validation and deployment studies had a median of 4765 (IQR: 2429–11 355) and 5887 (IQR: 3456–10 975) participants, respectively. Mean participant age ranged from 54.4 to 84.4 years. Hospital inpatients were treated in surgical wards (n=14), medical wards (n=10), intensive care units (ICU) (n=7) or a combination of all three settings (n=8). The reported reference standards for verifying delirium cases in the training dataset comprised the confusion assessment method for the Intensive Care Unit (CAM-ICU) (n=10), International Classification of Diseases codes (n=14), the CAM (n=7) and the Diagnostic and Statistical Manual of Mental Disorders (n=3). Several alternative screening methods, such as the 4 A’s Test (n=2), were used infrequently, and three studies reported no information on the reference standard used. The prevalence of delirium ranged from 2.0% to 53.6% in training and internal validation datasets, and from 10% to 39% in external validation studies. Delirium prevalence was 1.5%28 and 31.2%31 for the two deployment studies that reported data on this outcome. Length of stay ranged from an average of 1.9 to 13.6 days, but was not reported in 27 studies (69%).
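To make the ‘hold-out’ internal validation design concrete, a minimal sketch in Python using scikit-learn is shown below; the synthetic feature matrix, 20% hold-out fraction and ~20% outcome prevalence are illustrative stand-ins for the EHR-derived datasets used in the included studies, not any study’s actual data.

```python
"""Minimal sketch of a stratified 20% 'hold-out' split for internal
validation, using synthetic data in place of EHR-derived features."""
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2389, 20))            # 2389 = median training sample size reported
y = (rng.random(2389) < 0.20).astype(int)  # ~20% delirium prevalence (illustrative)

# Stratification preserves the outcome prevalence in both partitions,
# which matters when delirium prevalence is low.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
print(X_train.shape, X_holdout.shape)      # (1911, 20) (478, 20)
```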
Model characteristics
Thirty of the 39 publications described the training and internal validation of a delirium model,17 18 21–26 30 32–41 43 44 46–54 with investigators of 6 of these studies (20%) externally validating their model in a subsequent paper.16 19 20 27 29 42 Investigators of three studies (10%) implemented and evaluated their model in real-time clinical workflows,28 31 45 but no publications described monitoring or optimising a deployed model.
Figure 3 depicts the number of publications that used each type of model across each stage of application maturity. In total, random forest models were the most common (n=11), followed by logistic regression (n=6), gradient boosting (n=5) and artificial neural networks (n=4). Decision trees, L1-penalised regression and natural language processing models were each described in two papers, and a further seven papers each described a different model unique to that study.
Figure 3 Number of publications by machine learning method. If a study describes multiple models, only the best-performing (area under the receiver operating characteristic curve) model is shown. LEM, learning from examples module 2; LR, logistic regression; RBF, radial basis function; RF, random forest; SAINTENS, self-attention and intersample attention transformer; SVM, support vector machine.
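For readers less familiar with these model families, the sketch below instantiates scikit-learn equivalents of the most commonly reported algorithms in figure 3; the hyperparameters are placeholders for illustration and do not reflect the configurations of any included study.

```python
"""Illustrative scikit-learn counterparts of the most common model
families in figure 3; hyperparameters are placeholders only."""
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "random forest": RandomForestClassifier(n_estimators=500),
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(),
    "artificial neural network": MLPClassifier(hidden_layer_sizes=(64, 32)),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    # L1-penalised regression is logistic regression with an L1 penalty:
    "L1-penalised regression": LogisticRegression(penalty="l1", solver="liblinear"),
}
```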
Performance metrics of each model at their different stages of validation, when reported, are listed in online supplemental table 2. In the absence of any universal task-agnostic standard, we regarded an AUROC >0.7, sensitivity and specificity ≥80%, positive predictive value (PPV) ≥30%, negative predictive value (NPV) ≥90%, Brier scores <0.20 and calibration plots showing high concordance as acceptable accuracy thresholds for clinical application. For internal validation, omitting two studies for which the AUROC statistic was not reported,40 44 the median AUROC for the remaining models was 0.85 (IQR: 0.78–0.90). For external validation and deployment studies, the reported median AUROC scores were 0.75 (IQR: 0.74–0.81) and 0.92 (IQR: 0.89–0.93), respectively.
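As a brief illustration of how the two overall accuracy thresholds above are computed in practice, the following Python sketch calculates AUROC and the Brier score with scikit-learn; the labels and predicted probabilities are synthetic and purely illustrative.

```python
"""Sketch of the discrimination (AUROC) and overall calibration (Brier)
metrics used as acceptability thresholds, on illustrative data."""
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(1)
y_true = (rng.random(500) < 0.20).astype(int)  # illustrative binary delirium labels
# Scores loosely correlated with the labels, purely for illustration:
y_prob = np.clip(0.5 * y_true + 0.5 * rng.random(500), 0.0, 1.0)

auroc = roc_auc_score(y_true, y_prob)      # regarded as acceptable if >0.7
brier = brier_score_loss(y_true, y_prob)   # regarded as acceptable if <0.20
print(f"AUROC={auroc:.2f}, Brier={brier:.2f}")
```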
Stratified by algorithm type, the median AUROC (models with >1 publication) for training and internal validation studies was highest for random forest models (0.91, IQR: 0.88–0.91). In order of decreasing performance were natural language processing (AUROC: 0.85, IQR: 0.83–0.91); decision trees (AUROC: 0.83, IQR: 0.78–0.89); artificial neural networks (AUROC: 0.81, IQR: 0.76–0.86); gradient boosting (AUROC: 0.81, IQR: 0.77–0.85) and logistic regression models (AUROC: 0.80, IQR: 0.78–0.82).
For external validation, a gradient boosting algorithm performed best (AUROC: 0.86), followed by random forest models (AUROC: 0.78, IQR: 0.75–0.80) and L1-penalised regression (AUROC: 0.75, IQR: 0.75–0.75). For prospective studies of deployed models, the best performance was observed in one study using natural language processing (AUROC: 0.94),45 with random forest models achieving a median AUROC of 0.89 (IQR: 0.87–0.90). The AUROC performance metrics for all models, stratified by stage of maturity, are presented graphically in figure 4.
Figure 4 Graphical representation of AUROC performance metrics stratified by stage of development. Son et al44 and Oh et al40 did not report AUROC but are included in the analysis as they reported other performance metrics. AUROC, area under the receiver operating characteristic curve.
The median sensitivity and specificity for training studies were 75% (IQR: 64.1%–82.3%) and 82.2% (IQR: 73.3%–90.4%), respectively. For external validation studies, median sensitivity and specificity dropped to 73% (IQR: 67.5%–81.7%) and 69% (IQR: 48%–72%), respectively. In deployment studies, however, median sensitivity and specificity were 87.1% (IQR: 80.6%–93.5%) and 86.4% (IQR: 84.3%–88.5%), respectively. The PPV and NPV of included ML models were reported in only 10 of 39 studies (26%), ranging from 5.8% to 91.6% and from 90.6% to 99.5%, respectively.
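The four operating-point metrics above all derive from the same 2×2 confusion matrix; a short worked example with illustrative counts is given below. Note that PPV in particular is sensitive to outcome prevalence, which may partly explain its wide range across studies.

```python
"""Worked example: sensitivity, specificity, PPV and NPV from a 2x2
confusion matrix (labels and predictions are illustrative)."""
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])  # observed delirium status
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 0])  # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # proportion of true cases detected
specificity = tn / (tn + fp)   # proportion of non-cases correctly ruled out
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
print(sensitivity, specificity, ppv, npv)  # 0.75 0.833... 0.75 0.833...
```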
Only 14 of the 39 studies (35.9%) reported calibration metrics, and these showed considerable variation. Using calibration plots, four studies reported poor calibration and an equal number reported reasonable calibration, while the remainder employed alternative calibration methods with variable results (see online supplemental table 1). The Brier score was reported in only five studies (13%) and ranged from 0.14 to 0.22.
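For reference, a calibration (reliability) plot of the kind reported by these studies can be produced with scikit-learn’s calibration_curve, as sketched below on synthetic predictions; a well-calibrated model tracks the diagonal, and deviations correspond to the over- or under-estimation of delirium risk.

```python
"""Sketch of a calibration (reliability) plot on illustrative
predictions; real studies plotted observed vs predicted delirium risk."""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
y_prob = rng.random(1000)                           # illustrative risk scores
y_true = (rng.random(1000) < y_prob).astype(int)    # labels consistent with scores,
                                                    # so calibration is near-perfect here

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted risk")
plt.ylabel("Observed delirium fraction")
plt.legend()
plt.show()
```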
Clinical application
Three articles from two investigator groups subjected their prototype model to prospective validation using live data in a form reflecting its future application to clinical workflows.28 31 45 Sun et al45 trained three separate models to predict delirium, acute kidney injury and sepsis. Their delirium model performed slightly worse on live admission data from three hospitals (AUROC decreased by 3.6%); when deployed at discharge in another participating hospital, using data separate from the training set, performance dropped by a further 0.8%. Sun et al reported user feedback only for the acute kidney injury model.
Jauk et al28 implemented their delirium prediction model in an Austrian hospital system for 7 months and thereafter for an additional month in the trauma surgery department of another affiliated hospital.31 The prediction model performed somewhat worse on prospective data (AUROC: 0.86) than it did on the retrospective data used in the training and internal validation study33 (AUROC: 0.91). In addition, predictions of the random forest model used in this study correlated strongly with nurses' ratings of delirium risk in a sample of internal medicine patients (correlation coefficient (r)=0.81 for blinded and r=0.62 for non-blinded comparisons). In the external validation study, the model achieved an AUROC above 0.85 at all three prediction times (on admission: 0.863; first evening: 0.851; second evening: 0.857). However, when the model was retrained using local data, the AUROC exceeded 0.92 at all three prediction times, and the model correctly identified all 29 patients deemed at high risk of delirium by a senior physician (sensitivity=100%, specificity=90.6%). In a qualitative survey, the 13 health professionals involved in the project perceived the ML application as useful and easy to use.