Review

Navigating the machine learning pipeline: a scoping review of inpatient delirium prediction models

Abstract

Objectives Early identification of inpatients at risk of developing delirium and implementation of preventive measures could avoid up to 40% of delirium cases. Machine learning (ML)-based prediction models may enable risk stratification and targeted intervention, but establishing their current evolutionary status requires a scoping review of recent literature.

Methods We searched ten databases up to June 2022 for studies of ML-based delirium prediction models. Eligibility criteria comprised: use of at least one ML prediction method in an adult hospital inpatient population; publication in English; and reporting of at least one performance measure (area under receiver-operator curve (AUROC), sensitivity, specificity, positive or negative predictive value). Included models were categorised by their stage of maturation and assessed for performance, utility and user acceptance in clinical practice.

Results Among 921 screened studies, 39 met eligibility criteria. In-silico performance was consistently high (median AUROC: 0.85); however, only six articles (15.4%) reported external validation, revealing degraded performance (median AUROC: 0.75). Three studies (7.7%) of models deployed within clinical workflows reported high accuracy (median AUROC: 0.92) and high user acceptance.

Discussion ML models have potential to identify inpatients at risk of developing delirium before symptom onset. However, few models were externally validated and even fewer underwent prospective evaluation in clinical settings.

Conclusion This review confirms a rapidly growing body of research into using ML for predicting delirium risk in hospital settings. Our findings offer insights for both developers and clinicians into strengths and limitations of current ML delirium prediction applications aiming to support but not usurp clinician decision-making.

Introduction

Delirium is a common but underdiagnosed state of disturbed attention and cognition that afflicts one in four older hospital inpatients.1 It is independently associated with a longer length of hospital stay, mortality, accelerated cognitive decline2 and new-onset dementia.1 Since older people are particularly vulnerable to severe illness from COVID-19 infection, delirium emerged as a frequent acute geriatric syndrome during the pandemic.3 Predicting who is likely to develop delirium before symptom onset may facilitate the targeted implementation of preventive strategies that can avoid up to 40% of cases.4

Risk stratification models enable clinicians to identify patients at high risk of an adverse event and intervene where appropriate.5 The advent of wearables, genomics, and dynamic datasets within electronic health records (EHRs) provides big data to which machine learning (ML) can be applied to individualise clinical risk prediction.6 ML is a subset of artificial intelligence that uses advanced computer programmes to learn patterns and associations within large datasets and develop models (or algorithms), which can then be applied to new data to rapidly produce predictions or classifications, including diagnoses.7 Across developed nations, more than 150 ML applications are approved for use in routine clinical practice, and this number is projected to rise exponentially over the coming years.6 8

The key stages of the ML pipeline that models must traverse, from initial in-silico (computer-based) development to real-world deployment, comprise the following6 (figure 1): (1) data collection; (2) data preparation; (3) feature selection and engineering; (4) model training; (5) model validation, both internal and external; (6) deployment of the model within a working application; and (7) post-deployment monitoring and optimisation of the application. During the development phase (stages 1–3), researchers collect, clean and transform data into computable formats and select relevant features as model inputs. The model is then iteratively improved through several training cycles against static, retrospective datasets (stage 4). In stage 5, the model undergoes two processes of validation: internal validation for accuracy and reproducibility against a random sample from the original training dataset (‘hold-out’ sample); and external validation, whereby researchers validate the model on a new external dataset derived from previously unencountered patients using the same performance metrics. In stage 6, the model is subjected to prospective validation using live (or near-live) dynamic data in a form reflecting its future real-world deployment, integrated into a prototype application, and evaluated for its feasibility in clinical workflows. It is then assessed for its clinical utility within clinical trials, which compare application-guided patient care and outcomes with the current standard of care. Finally, stage 7 entails monitoring the effectiveness and safety of the model over its life cycle using surveillance data.
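As a minimal, illustrative sketch of stages 4 and 5 only, the Python code below (using scikit-learn) trains a classifier on a retrospective dataset, internally validates it on a random 20% ‘hold-out’ sample and then scores a separate ‘external’ cohort with the same metric. The random forest, the split proportion and the synthetic arrays are assumptions made for illustration and do not describe any of the included models.

```python
# Illustrative sketch of pipeline stages 4-5 (assumptions: synthetic data,
# a random forest classifier, a 20% hold-out split; not any included model).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # stand-in for engineered EHR features
y = rng.integers(0, 2, size=1000)      # stand-in for delirium labels (0/1)

# Stage 4 and internal validation (stage 5a): train, then score a hold-out sample
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("internal AUROC:", roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1]))

# External validation (stage 5b): same metric on a cohort from another site/period
X_ext = rng.normal(size=(400, 20))     # stand-in for an external cohort
y_ext = rng.integers(0, 2, size=400)
print("external AUROC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```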

Figure 1

Machine learning pipeline.

ML models have enormous potential in facilitating more accurate risk stratification, preventive intervention and avoidance of incident delirium, but external validation, prospective evaluation and clinical adoption remain limited,6 and analysis of the clinical impact of deployed models on patient care is rarely performed.9 10 Previous systematic reviews of delirium prediction models have been limited to in-silico models focusing on performance metrics using static retrospective data,11 12 and the studies within these reviews are limited to those published before 2019. The objectives of this review were to: (1) provide a more contemporary overview of research on all ML delirium prediction models designed for use in the inpatient setting; (2) characterise them according to their stage of development, validation and deployment; and (3) assess the extent to which their performance and utility in clinical practice have been evaluated.

Methods

This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews guidelines13 and is registered within the Open Science Framework (OSF) database (osf.io/8r5cd). A scoping review methodology was selected as it allows us to map the broad and emerging ML evidence base in a flexible but systematic manner.14

Literature search

The search strategy was developed by two authors (TS, LSH) and reviewed by a third author (IAS) and a librarian. We searched PubMed, EMBASE, IEEE Xplore, Scopus, Web of Science, CINAHL, PsycInfo, Cochrane, OSF pre-prints and the aiforhealth.app machine learning research dashboard between inception and 14 June 2022, using a mixture of medical subject headings (MeSH) and keywords related to delirium and ML (for the exact search terms, see online supplemental appendix 1). Additional studies were identified by perusing the reference lists of retrieved articles.

Study selection

Retrieved studies were imported into EndNote 20 and screened for relevance and duplicates in Covidence. Two reviewers (TS, IT) independently screened the titles and abstracts, and two authors (TS, LSH) reviewed the full-text articles. Disagreements between screening authors were resolved by discussion or settled by a third reviewer (IAS). We considered full-length original studies published in peer-reviewed journals, pre-prints and conference proceedings. Eligible studies had to fulfil all of the following criteria: use of at least one ML method that predicts delirium; applied to an adult hospital inpatient population; published in English; and reporting at least one of the following performance measures (area under the receiver-operator curve (AUROC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV)). Studies were excluded if they were: editorials, position statements, letters to the editor, conference abstracts or press releases; conducted in non-hospital settings; or did not report any model performance metrics.

Data extraction and synthesis

One reviewer (TS) independently performed data extraction using a preagreed form designed in Covidence. The following data items were extracted: title; author; publication year; country (where data were collected); study aim and design; clinical setting; population characteristics; ML modelling method(s); reference standard used to diagnose delirium; frequency of delirium; data source and type; evolutionary stage and respective sample size; model performance measures (comprising, where reported, AUROC, sensitivity, specificity, PPV, NPV, Brier score and calibration plot concordance); primary outcome measures; comparison to standard care; principal discharge diagnosis; and length of stay. Qualitative information on user acceptance of deployed models was also recorded where reported.

We defined a model as being in the ‘development and internal validation’ stage if the dataset used for validating the model came from the same patient population as the training dataset. An ‘external validation’ study was one in which the model was validated using a dataset from a population temporally or geographically separate from that used to provide the original training data. Finally, we labelled a study as being at the ‘deployment’ stage if the model was evaluated in a routine clinical setting.

Corresponding authors were contacted for studies that did not report the reference standard used to define delirium in their dataset. Two authors (LSH, IT) cross-checked the data extracted for a random sample of 25% (n=10) of studies, and disagreement was managed through discussion.

A narrative approach was taken to synthesise the data extracted from the selected studies, including tabular and graphical representations, summarising the number of studies in each stage, year and country published, performance metrics, algorithm type, data type and stage of development. Descriptive statistics for continuous variables comprised mean and SD for normally distributed data, and median and IQR for non-normally distributed data. All analyses and visualisations were done within R.15 As this was a scoping review, no attempt was made to assess the quality of individual study design or methods.

Results

The search strategy identified a total of 921 records; after duplicate removal and title and abstract screening, 114 full-text studies were retrieved, of which 39 studies16–54 met the selection criteria for inclusion in the final analysis (figure 2).

Figure 2

Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow chart.

Study characteristics

Study characteristics are summarised in online supplemental table 1. Studies originated from the USA (n=12),17 19–23 25 41 43 50 51 54 Austria (n=9),24 28–31 33 39 47 48 China (n=6),26 32 35 49 52 53 Germany (n=3),37 45 46 South Korea (n=3),27 40 44 Canada (n=3),30 36 38 Brazil (n=1),16 Japan (n=1),34 Spain (n=1)18 and one study was labelled as international.42 Publications spanned 6 years to June 2022, with most studies published in 2021 (n=10) and the first half of 2022 (n=12), indicating considerable growth in research in this area since the publication of previous reviews of studies published up to 2019.11 12 Study designs comprised retrospective cohort studies (n=25), prospective cohort studies (n=9), secondary analyses of trial data (n=2), prospective pilot studies (n=2) and a retrospective case-control study (n=1). Studies mostly used data from EHRs alone to develop their models (n=21), with the remainder including specified clinical assessments (eg, nursing assessment, n=8), compiled clinical databases (eg, data repository or open-access database, n=6), data from a clinical quality improvement registry (n=1), data from both EHRs and clinical assessments (n=1), data from EHRs and a clinical database (n=1) and data solely from electrocardiographs (n=1).

The median sample size of training datasets was 2389 (IQR: 371–27 377) participants, of whom, where reported as a percentage, a median of 20% (IQR: 20%–25%) was used as a ‘hold-out’ sample for internal validation. External validation and deployment studies had a median of 4765 (IQR: 2429–11 355) and 5887 (IQR: 3456–10 975) participants, respectively. Mean participant age ranged from 54.4 to 84.4 years. Hospital inpatients were treated in surgical wards (n=14), medical wards (n=10), intensive care units (ICU) (n=7) or a combination of all three settings (n=8). The reported reference standards for verifying delirium cases in the training dataset comprised the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU) (n=10), International Classification of Diseases codes (n=14), the CAM (n=7) and the Diagnostic and Statistical Manual (n=3). Several alternative screening methods, such as the 4 A’s Test (n=2), were used infrequently, and three studies reported no information as to what reference standard was used. The prevalence of delirium in training and internal validation datasets ranged from 2.0% to 53.6%, and from 10% to 39% in external validation studies. Delirium prevalence was 1.5%28 and 31.2%31 for the two deployment studies that reported data on this outcome. Mean length of stay ranged from 1.9 to 13.6 days but was not reported in 27 studies (69%).

Model characteristics

Thirty of the thirty-nine publications described the training and internal validation of a delirium model,17 18 21–26 30 32–41 43 44 46–54 with investigators of 6 of these studies (20%) externally validating their model in a subsequent paper.16 19 20 27 29 42 Investigators of three studies (10%) implemented and evaluated their model in real-time clinical workflows,28 31 45 but no publications described monitoring or optimising a deployed model.

Figure 3 depicts the numbers of publications that used each type of model across each stage of application maturity. In total, random forest models were the most common (n=11), followed by logistic regression (n=6), gradient boosting (n=5) and artificial neural networks (n=4). Decision tree, L1-penalised regression and natural language processing models were each described in two papers, with another seven papers describing models unique to the study.
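For orientation only, the hedged Python sketch below shows how several of the model families in figure 3 might be compared on a single dataset using cross-validated AUROC; the synthetic data, the chosen estimators and their hyperparameters are illustrative assumptions and are not drawn from any included study.

```python
# Hedged comparison of common model families by cross-validated AUROC on one
# synthetic dataset; estimators and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# ~15% positive class, loosely mirroring typical delirium prevalence
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.85],
                           random_state=0)
candidates = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                    random_state=0),
}
for name, clf in candidates.items():
    auroc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUROC = {auroc:.2f}")
```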

Figure 3

Number of publications by machine learning method. If a study describes multiple models, only the best-performing (area under receiver-operator curve) model is shown. LEM, learning from examples module 2; LR, logistic regression; RBF, radial basis function; RF, random forest; SAINTENS, self-attention and intersample attention transformer; SVM, support vector machine.

Performance metrics of each model at their different stages of validation, when reported, are listed in online supplemental table 2. In the absence of any universal task-agnostic standard, we regarded values of AUROC>0.7, of sensitivity and specificity ≥80%, of PPV ≥30% and NPV ≥90%, of Brier scores <0.20 and calibration plots showing high concordance as being acceptable accuracy thresholds for clinical application. For internal validation, omitting two studies for which the AUROC statistic was not reported,40 44 the median AUROC for the remaining models was 0.85 (IQR: 0.78–0.90). For external validation and deployment studies, the reported median AUROC scores were 0.75 (IQR: 0.74–0.81) and 0.92 (IQR: 0.89–0.93), respectively.
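To make these thresholds concrete, the hedged Python sketch below computes the discrimination metrics referred to above (AUROC, sensitivity, specificity, PPV and NPV) from predicted probabilities and observed labels; the synthetic data and the 0.5 decision threshold are assumptions for illustration, not values taken from any included study.

```python
# Hedged sketch: computing the discrimination metrics named above from
# predicted probabilities and observed labels (synthetic data, 0.5 threshold).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)                              # observed delirium
y_prob = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, 500), 0, 1)  # model output

auroc = roc_auc_score(y_true, y_prob)               # threshold-free discrimination
tn, fp, fn, tp = confusion_matrix(y_true, y_prob >= 0.5).ravel()
sensitivity = tp / (tp + fn)   # proportion of delirium cases flagged
specificity = tn / (tn + fp)   # proportion of non-cases correctly ruled out
ppv = tp / (tp + fp)           # probability a flagged patient develops delirium
npv = tn / (tn + fn)           # probability an unflagged patient does not
print(f"AUROC={auroc:.2f} sens={sensitivity:.2%} spec={specificity:.2%} "
      f"PPV={ppv:.2%} NPV={npv:.2%}")
```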

Stratified by algorithm type, the median AUROC (models with >1 publication) for training and internal validation studies was highest for random forest models (0.91, IQR: 0.88–0.91). In order of decreasing performance were natural language processing (AUROC: 0.85, IQR: 0.83–0.91); decision trees (AUROC: 0.83, IQR: 0.78–0.89); artificial neural networks (AUROC: 0.81, IQR: 0.76–0.86); gradient boosting (AUROC: 0.81, IQR: 0.77–0.85) and logistic regression models (AUROC: 0.80, IQR: 0.78–0.82).

Regarding external validation, a gradient boosting algorithm performed best (AUROC: 0.86), followed by random forest models (AUROC: 0.78, IQR: 0.75–0.80) and L1-penalised regression (AUROC: 0.75, IQR: 0.75–0.75). For prospective studies of deployed models, the best performance was observed in one study using natural language processing, with an AUROC score of 0.94,45 with random forest models achieving a median AUROC score of 0.89 (IQR: 0.87–0.90). The AUROC performance metrics for all models, stratified by stage of maturity, are presented graphically in figure 4.

Figure 4

Graphical representation of AUROC performance metrics stratified by stage of development. Son et al44 and Oh et al40 did not report AUROC but are included in the analysis as they reported other performance metrics. AUROC, area under receiver-operator curve.

The median sensitivity and specificity for training studies were 75% (IQR: 64.1%–82.3%) and 82.2% (IQR: 73.3%–90.4%), respectively. For external validation studies, median sensitivity and specificity dropped to 73% (IQR: 67.5%–81.7%) and 69% (IQR: 48%–72%), respectively. However, in deployment-level studies, median sensitivity and specificity were 87.1% (IQR: 80.6%–93.5%) and 86.4% (IQR: 84.3%–88.5%), respectively. PPV and NPV were reported in only 10 of 39 studies (26%) and ranged from 5.8% to 91.6% and from 90.6% to 99.5%, respectively.

Of the total, only 14 studies (35.9%) reported calibration metrics, which showed considerable variation. Using calibration plots, four studies reported poor calibration and an equal number reported reasonable calibration, while the remainder employed alternative calibration methods with variable results (see online supplemental table 1). The Brier score was reported for only five studies (13%) and ranged from 0.14 to 0.22.
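For readers less familiar with calibration reporting, the hedged Python sketch below computes the two measures mentioned above, the Brier score and a decile-based calibration (reliability) curve, on synthetic data; it does not reproduce the calibration analysis of any included study.

```python
# Hedged sketch of the calibration measures mentioned above: the Brier score
# and a decile-based calibration (reliability) curve, on synthetic data.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, 500), 0, 1)

print("Brier score:", round(brier_score_loss(y_true, y_prob), 3))  # lower is better
obs, pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, o in zip(pred, obs):
    # perfect calibration: observed event rate equals mean predicted risk per bin
    print(f"mean predicted risk {p:.2f} -> observed rate {o:.2f}")
```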

Clinical application

Three articles from two research groups subjected their prototype models to prospective validation using live data in a form reflecting their future application to clinical workflows.28 31 45 Sun et al45 trained three separate models to predict delirium, acute kidney injury and sepsis. Their delirium model performed slightly worse on live data from three hospitals at admission (AUROC decreased by 3.6%), and when deployed in another participating hospital using data separate from the training set, performance at discharge dropped by a further 0.8%. Sun et al reported user feedback only for the acute kidney injury model.

Jauk et al28 implemented their delirium prediction model in an Austrian hospital system for 7 months and thereafter for an additional month in the trauma surgery department of another affiliated hospital.31 The prediction model performed somewhat worse on prospective data (AUROC: 0.86) than it did on the retrospective data used in the training and internal validation study33 (AUROC: 0.91). In addition, predictions of the random forest model used in this study correlated strongly with nurses' ratings of delirium risk in a sample of internal medicine patients (correlation coefficient (r)=0.81 for blinded and r=0.62 for non-blinded comparison). In the external validation study, the model achieved an AUROC value above 0.85 across three prediction times (on admission: 0.863; first evening: 0.851; second evening: 0.857). However, when the model was re-trained using local data, the AUROC value exceeded 0.92 for all three prediction times, and it correctly predicted all 29 patients who were deemed at high risk of delirium by a senior physician (sensitivity=100%, specificity=90.6%). In a qualitative survey, the 13 health professionals involved in the project perceived the ML application as useful and easy to use.

Discussion

This scoping review examined contemporary research on ML models for predicting delirium in adult inpatient settings and identified an additional 22 studies published since late 2019, the search end date of previous reviews.11 12 We have mapped the development and implementation stage and associated performance metrics of these new models according to a six-stage evolutionary ML pipeline. Importantly, we included three novel implementation studies which demonstrated good predictive accuracy and user acceptance, underscoring the potential clinical utility of ML models for delirium prediction.

However, our review reveals several limitations in the existing research that future studies need to address. First, training data in most studies comprised routinely collected data obtained retrospectively from EHRs, which, while providing vast quantities of data for training complex models, suffer from inaccuracies and omissions relating to key predictor variables. Only a quarter of studies18 26 28 29 31 32 35 37 40 44 45 in this review sourced prespecified, prospectively collected data; consequently, missing or incomplete data relevant to model optimisation, which could not always be remedied using imputation methods, emerged as a critical limitation for many studies. For instance, the EHR-derived models of Zhao et al53 lacked microbiological, radiological and biomarker data relevant to delirium, limiting their predictive accuracy. Similarly, missing information about medication use and frailty indices posed a limiting factor in several other studies.17 22 49–51 Many studies also did not have access to demographic data of their study population, such as socioeconomic status, gender and race.26 53 Reliance on data sources that contain missing data and are unrepresentative of target populations weakens model performance and introduces biases, generating models that may exacerbate healthcare inequities.7
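To make the imputation point concrete, the hedged Python sketch below shows median imputation of missing EHR-derived predictors with scikit-learn; the feature names and values are hypothetical, and, as the comment notes, no imputation strategy can recover predictors that were never collected in the first place.

```python
# Hedged sketch of median imputation for missing EHR-derived predictors using
# scikit-learn; feature names and values are hypothetical. Note that imputation
# cannot recover predictors that were never collected (eg, frailty indices).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

ehr = pd.DataFrame({
    "age": [81, 67, np.nan, 74],
    "serum_sodium": [138, np.nan, 129, 141],
    "num_anticholinergic_drugs": [np.nan, 2, 3, np.nan],
})
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(ehr), columns=ehr.columns)
print(imputed)
```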

Second, similar to the findings of previous reviews,11 12 most models described in our scoping review did not mature past the stage of internal validation. Only six studies validated their model on an external dataset,16 19 20 27 29 42 despite evidence that models that perform well on ‘hold-out’ training data usually show lower performance when applied to noisier datasets from different institutions due to model overfitting.5

Third, of all 39 included studies, only those of Jauk et al28 31 and Sun et al45 subjected their models to a prospective evaluation using live data in clinical practice. The extent to which clinicians will adopt a model depends on their trust in its predictive accuracy and utility and the ease with which it can be integrated into clinical workflows.7 Sun and colleagues45 demonstrated their deep learning model performed equally well in training and prospective validation studies.29 In a subsequent case study, the authors demonstrated an instance where their application correctly predicted postoperative delirium in a patient with a negative preoperative CAM-ICU, demonstrating its clinical utility in a surgical ward.55 In addition, they found ML applications could be particularly useful for the early detection of delirium in wards where delirium screening is often not performed and delirium is underdiagnosed.1

Similarly, Jauk and colleagues28 analysed 5530 predictions over 7 months of deployment, finding that their model's performance was reliable and attracted high satisfaction ratings from a senior physician. In a later qualitative study, the 47 nurses and physicians associated with the project rated the delirium prediction model as useful, easy to use and interpretable, without increasing workload.56 These favourable findings were replicated in a follow-up study in which the random forest model was implemented in a separate hospital network.31 However, cross-hospital evaluations underscored the need to re-train the model with local data to mitigate declines in performance when applied to new clinical settings.31 45 Importantly, neither of these models has been subjected to clinical trials to establish impacts on patient care or outcomes.

Our review has some limitations. As our study was a scoping exercise, and in the absence of an agreed risk of bias assessment tool for ML prediction studies, we chose not to critically appraise the quality of individual studies. For similar reasons, and given the heterogeneity of the data source, model type and performance metrics reported in included studies, quantitative meta-analysis was not performed.

Conclusion

Prediction models derived using ML methods can potentially identify individuals at risk of developing delirium before symptom onset to whom preventive strategies can be targeted, which may, in turn, reduce incident delirium and improve patient outcomes. This scoping review identified all publications describing ML-based delirium prediction models over the last 5 years, evaluated their stage in the ML evolution pipeline, and assessed their performance and utility. Relatively few were subject to external validation, which, when performed, showed degraded model performance. In addition, while few studies underwent prospective evaluation in real-world clinical settings, performance and user acceptance seemed promising in those that did. However, given the limitations of current delirium prediction models, they should not be seen as substitutes for expert clinician judgement.