Original Research

Machine learning for outcome predictions of patients with trauma during emergency department care

Abstract

Objectives To develop and evaluate a machine learning model for predicting mortality of patients with trauma in US emergency departments.

Methods This was a retrospective prognostic study using deidentified patient visit data from 2007 to 2014 from the National Trauma Data Bank. The predictive model was built on patient demographics, vital signs, comorbid conditions, arrival mode and hospital transfer status. The mortality prediction model was evaluated on its sensitivity, specificity, area under the receiver operating characteristic curve (AUC), positive and negative predictive value, and Matthews correlation coefficient.

Results Our final dataset consisted of 2 007 485 patient visits (36.45% female, mean age of 45), 8198 (0.4%) of which resulted in mortality. Our model achieved AUC and sensitivity-specificity gap of 0.86 (95% CI 0.85 to 0.87), 0.44 for children and 0.85 (95% CI 0.85 to 0.85), 0.44 for adults. The all ages model characteristics indicate it generalised, with an AUC and gap of 0.85 (95% CI 0.85 to 0.85), 0.45. Excluding fall injuries weakened the child model (AUC 0.85, 95% CI 0.84 to 0.86) but strengthened adult (AUC 0.87, 95% CI 0.87 to 0.87) and all ages (AUC 0.86, 95% CI 0.86 to 0.86) models.

Conclusions Our machine learning model demonstrates similar performance to contemporary machine learning models without requiring restrictive criteria or extensive medical expertise. These results suggest that machine learning models for trauma outcome prediction can generalise to patients with trauma across the USA and may be able to provide decision support to medical providers in any healthcare setting.

Summary

What is already known?

  • Machine learning methods such as XGBoost and Deep Neural Networks are capable of accurately predicting patient outcomes in complex clinical settings.

  • Previous works have demonstrated good performance for predicting hospitalisation or critical outcomes (which includes either intensive care unit admission or patient death).

What does this paper add?

  • This study presents a new predictive Deep Neural Network which can generate effective and high-fidelity outcome prediction models for patients with trauma across a broader population than previously demonstrated.

  • With the size of the dataset used, we were able to limit the predicted outcome to patient mortality, which is a relatively rare but highly relevant event in the emergency department.

Introduction

Trauma is a leading cause of death in the USA, and each year, thousands of trauma physicians and other front-line healthcare personnel face a critical triage decision: which patients should be prioritised to prevent major complications or death?1 In 2018 alone, traumatic injuries caused over 240 000 mortalities in the USA.2 3 Evidence-based tools such as Injury Severity Score (ISS) can mislead medical professionals into undertriaging patients or incorrectly classifying a patient’s condition as unsurvivable, and regression models are often limited by restrictive model criteria.4–7 A regression line cannot capture the highly non-linear decision boundary required for accurate patient triage, and with the annual increase of emergency department (ED) visits outpacing the growth of the US population most years,8 a more useful prognostic tool will be necessary to achieve better patient outcomes and resource utilisation.9

Many researchers over the past 30 years have sought to improve the clinical decision-making process for patient care. McGonigal et al demonstrated the groundbreaking capabilities of neural networks using only Revised Trauma Score, ISS and patient age to provide more accurate predictions than contemporary logistic regression models.10 Marble and Healy produced a more sophisticated model which could identify sepsis with almost 100% accuracy.11 These studies were only valid for a small subset of patients, though—they narrowed their focus to specific patient conditions. Significant advancements in machine learning (ML) techniques have been made since these papers’ publication, and more effort than ever is pushing towards modelling techniques that generalise across all patients, regardless of age or injury mechanism.

Several recent papers have demonstrated the power of ML in predicting patient outcomes in the hospital and ED, but these were formulated without an abundance of nationally representative data sets, with models restricted to certain age groups, or without the verification of model performance across different injury mechanisms.12–14 These issues created a gap in clinical understanding about the models’ generalisability across patient demographics and conditions. There is, therefore, a need to study the capabilities of ML on a sufficiently large and diverse national dataset with a focus on generalisability across clinical scenarios. To the best of our knowledge, no study we found has used ML solely to predict ED death, despite the clinical relevance of such a risk assessment tool in prioritising and triaging critical patients.

With a large dataset that captures patient visit information from across the United States, we hypothesised that an all ages, injury-invariant, generalisable ML model could predict patient mortality in the ED better than current practices. The model’s generalisability across different age groups was validated by examining contemporary mortality prediction models, comparing key performance metrics and analysing performance characteristics across injury types to ensure model invariance.

Methods

Study setting

This retrospective study used 2007–2014 National Trauma Data Bank (NTDB) data. The American College of Surgeons (ACS) collects trauma registry data from hospitals across the USA every year and compiles it into the NTDB. The ACS has created the National Trauma Data Standard (NTDS) Data Dictionary, which ensures the quality and validity of data used by researchers.12

Study samples

From 5.8 million patient visits captured in the data, we selected visits by patients with trauma that had complete ED vitals, a known mode of arrival and transfer status, and a valid outcome (ie, excluding dispositions that were ‘not applicable’, ‘not known/recorded’ or ‘left against medical advice’). Visits not meeting these criteria were removed from the dataset.

Predictor variables

We considered the 77 predictors for mortality shown in table 1, all of which are typically available at the time of patient admission and triage in the ED. Predictors came from the following categories: demographics, ED vitals, comorbidities, injury intent, injury type, injury mechanism, arrival mode and transfer status.4 13 14 Over the years, the NTDS has added and removed certain comorbidities. Chronic conditions not represented across all years were removed from the dataset. NTDS External Injury Codes were transformed into injury type, mechanism and intent based on the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code recorded for the patient.

Table 1: Predictor and trauma outcome variables

Outcome variables

The outcome variable being predicted was patient mortality. Patients with a disposition of ‘deceased/expired,’ ‘expired’ or ‘discharged/transferred to hospice care’ were treated as positive cases for patient mortality, as each is explicitly mortality or an expectation of such shortly after discharge.15 16 All other valid outcomes were treated as negative for patient mortality. These included general admission to the hospital, admission to a specialised unit within the hospital (intensive care unit (ICU), step-down, etc), transfer to another hospital or discharge from the ED.

Model generation

Data were preprocessed before being passed to the model for training or prediction. Two separate, non-overlapping datasets were constructed; one contained hospital outcomes and the other contained ED outcomes. For each, a training set was created using 70% of the data available, and the remaining 30% was retained as a test set. As there are relatively few mortalities, we used stratification to sample from the pool of mortality and non-mortality cases individually, ensuring each class is represented proportionally. Categorical data were given binary encodings for each variable in a category (one-hot encoding) and numerical data were standardised.17
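The split and encoding steps described above can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions, not the pipeline used in the study: the function names (`stratified_split`, `one_hot`, `standardise`), the toy records and the fixed random seed are all hypothetical.

```python
import random
from statistics import mean, pstdev


def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


def one_hot(values):
    """Encode a categorical column as one binary indicator per category."""
    categories = sorted(set(values))
    codes = [[1 if v == c else 0 for c in categories] for v in values]
    return codes, categories


def standardise(values):
    """Scale a numerical column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Toy data: 100 visits, 10 of which end in mortality (hypothetical numbers).
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)

arrival_codes, arrival_categories = one_hot(["ambulance", "walk-in", "ambulance"])
scaled_sbp = standardise([90.0, 120.0, 150.0])
```

Because each class is shuffled and cut separately, the mortality rate is the same in the training and test sets of this toy example, which is the property stratification is meant to preserve when positive cases are rare.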

We trained an XGBoost model to predict ISS and add it as a feature in the data, as ISS can be very useful in determining the immediacy of a patient’s condition but is not typically available on patient check-in. Our custom PyTorch model architecture shown in figure 1 was composed of four distinct layers: a single input layer, two hidden layers with 300 and 100 neurons, respectively, and a final output layer which predicted patient mortality.18 Batch normalisation and drop-out layers were used to prevent overfitting. The final architecture was made specifically for this study and applied to three different age groupings: children, adults and all ages.
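A network of the shape described above might look like the following PyTorch sketch. Only the hidden-layer widths (300 and 100), the use of batch normalisation and dropout, and the single mortality output come from the text. The ReLU activations, the dropout rate of 0.5, the ordering of layers within each block, the sigmoid on the output and the input width of 78 are assumptions made for illustration, and `build_mortality_net` is a hypothetical name.

```python
import torch
from torch import nn


def build_mortality_net(n_features: int, dropout: float = 0.5) -> nn.Sequential:
    """Feed-forward net: input -> 300 -> 100 -> 1 logit (assumed layout)."""
    return nn.Sequential(
        nn.Linear(n_features, 300),
        nn.BatchNorm1d(300),
        nn.ReLU(),            # activation is an assumption
        nn.Dropout(dropout),  # dropout rate is an assumption
        nn.Linear(300, 100),
        nn.BatchNorm1d(100),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(100, 1),    # one logit per visit
    )


model = build_mortality_net(n_features=78)  # 78 is a placeholder width
model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    # Four all-zero feature vectors stand in for preprocessed visits.
    probs = torch.sigmoid(model(torch.zeros(4, 78)))
```

The real input width depends on how many columns the one-hot encoding produces, so it is left as a parameter here.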

Figure 1

Model architecture and sample training loss. The model consisted of three layers and used both batch normalisation and dropout to smooth training loss and prevent overfitting of the model to the training set. Figure by JDC. ISS, Injury Severity Score.

Because of the scarcity of ED mortality data points, we tried pretraining each model with a coarse learning rate on hospital outcomes to boost its discriminatory capabilities.19 Then, using a finer learning rate, the model was trained again on the dataset containing ED outcomes. The pretrained model’s performance was compared with one with no pretraining, and the better of the two was selected.

Models were evaluated on sensitivity, specificity, sensitivity-specificity gap, area under receiver operating characteristic curve (AUC), positive predictive value (PPV), negative predictive value (NPV) and Matthews correlation coefficient (MCC). The sensitivity-specificity gap is the summed distance of these two values from their ideal value of 1 and indicates how far the model is from perfect predictive capability. It is calculated as shown in equation 1.

\[ \text{gap} = (1 - \text{sensitivity}) + (1 - \text{specificity}) \]

Equation 1: sensitivity-specificity gap
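As a worked check, the gaps reported in this paper are consistent with adding the shortfalls of sensitivity and specificity from 1: for the adults-only model, a sensitivity of 0.76 and a specificity of 0.80 give 0.24 + 0.20 = 0.44. The sketch below computes the gap alongside the other confusion-matrix metrics; the function name `classification_metrics` and the counts are hypothetical and were chosen only to reproduce those two rates.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Confusion-matrix metrics for a binary mortality classifier."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Summed shortfall from a perfect classifier (0 is ideal, 2 is worst).
        "gap": (1 - sensitivity) + (1 - specificity),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts giving sensitivity 0.76 and specificity 0.80.
m = classification_metrics(tp=76, fp=200, tn=800, fn=24)
```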

MCC is a balanced measure between true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) whereby the only means of improving the metric is reducing the total number of misclassifications. It is mathematically identical to Pearson correlation coefficient and is calculated as shown in equation 2.

\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

Equation 2: MCC
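The behaviour of MCC on imbalanced data can be illustrated with a short sketch. It assumes the standard definition of MCC, which ranges from -1 to 1; the function name `mcc`, the convention of returning 0.0 when a margin of the confusion matrix is empty, and all counts are assumptions for the example.

```python
from math import sqrt


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Standard Matthews correlation coefficient in [-1, 1]."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Undefined when any margin is empty; 0.0 is a common convention.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical test sets of 1000 visits with 50 deaths each.
perfect = mcc(tp=50, fp=0, tn=950, fn=0)
chance = mcc(tp=5, fp=95, tn=855, fn=45)       # flags 10% of visits at random
all_negative = mcc(tp=0, fp=0, tn=950, fn=50)  # never predicts mortality
```

The last case is 95% accurate yet scores 0, which shows why a balanced measure is informative when mortality is rare.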

Model verification

To validate model performance, we collected results from other modern ML-based outcome prediction tools and compared our performance metrics. Goto et al and Raita et al’s models predict either patient mortality or admission to the ICU with ED check-in data,20 21 while Hong et al used triage data to predict patient hospitalisation.22 These papers did not report a value for MCC. Because no contemporary ML study has tried to generalise to all ages before, we segmented the models into different age groupings. To evaluate model competence in predicting outcomes irrespective of the nature of a patient’s injury, injury mechanisms determined by the reported external injury code were systematically filtered out of the data before training and testing the model.

To verify the model architecture’s effectiveness in learning to predict mortality, we created a second set of models which predicted the overall outcome of a patient, whether in the hospital or in the ED. This was an important step in verifying the model due to the scarcity of ED mortality data points and relative abundance of hospital deaths.

Statistical analysis of excluded patients

Because the dataset was reduced from 5.8 million patient visits to two million, we compared the baseline characteristics of patients who met our inclusion criteria with those of the patients we excluded. We applied Student’s t-test to patient age, Glasgow Coma Score (GCS) total and ISS, and the χ2 test to patient gender and presence of comorbidities. For each variable, we calculated a p value and used an alpha level of 0.05 to determine whether included and excluded patients were statistically similar. All tests returned a p value of zero, indicating that included and excluded patients follow different distributions. This might imply that the data for excluded patients were not missing at random.
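A p value that prints as zero is unsurprising at this sample size, as the following sketch of a χ2 test on a 2×2 table suggests. It is an illustration only: the function name `chi_square_2x2` and every count are hypothetical, no continuity correction is applied, and the p value uses the exact survival function for one degree of freedom.

```python
from math import erfc, sqrt


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Rows: included / excluded visits. Columns: female / male.
# The same 36.5% versus 39% split is tested at two sample sizes.
small_stat, small_p = chi_square_2x2(365, 635, 390, 610)
large_stat, large_p = chi_square_2x2(730_000, 1_270_000, 1_482_000, 2_318_000)
```

With 2000 visits the difference is not significant at the 0.05 level, whereas with 5.8 million visits the same proportions give a p value that underflows to zero in double precision.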

Results

From 2007 to 2014, 5.8 million unique visits by patients with trauma were recorded in the NTDB, with two million unique visits meeting our inclusion criteria. The data that met these criteria comprised 300 847 children and 1 706 638 adults. Table 1 shows characteristics of the child and adult populations with respect to the selected predictors and outcomes. From these data, the hospital outcome dataset contained 1 765 545 unique visits, and the ED outcome dataset retained the remaining 245 940.

Model benchmarking

For children, our model achieved similar performance to Goto’s Deep Neural Network (DNN),20 with an improvement in PPV (0.09; 95% CI 0.08 to 0.10), as shown in table 2. Across all other metrics, our model’s performance characteristics fell within the CIs given by the Goto DNN. Additionally, the size of our dataset allows for our 95% CI to be much narrower than the comparison models for children.

Table 2: Predictor and trauma outcome variables

The adults-only model showed similar performance to the comparison models. The sensitivity (0.76; 95% CI 0.76 to 0.76) was higher than the Hong Triage DNN22 (0.70) and fell just below the Raita DNN21 (0.80; 95% CI 0.77 to 0.83) while still achieving high specificity (0.80; 95% CI 0.80 to 0.80). The sensitivity-specificity gap (0.44) demonstrated that the model was balanced similarly to the comparison models. The all ages model’s performance metrics were generally in line with those from our child and adult only models.

Performance across injury mechanisms

The models for all ages and adults-only both saw an increase in predictive performance across all metrics when excluding fall injuries from the test set. Table 3 shows the adult model without falls exhibited better AUC (0.87; 95% CI 0.87 to 0.87), specificity (0.84; 95% CI 0.83 to 0.85), sensitivity-specificity gap (0.39), PPV (0.16; 95% CI 0.15 to 0.17) and MCC (0.659; 95% CI 0.652 to 0.666) while maintaining similar sensitivity (0.77; 95% CI 0.76 to 0.78) and NPV (0.989; 95% CI 0.988 to 0.990). The model for children was weaker when fall injuries were excluded, with a lower sensitivity (0.71; 95% CI 0.70 to 0.72) and MCC (0.569; 95% CI 0.553 to 0.585). These results revealed that the model was invariant to all injury mechanisms in the NTDS except for fall injuries, which might require additional predictors.

Table 3: Comparison of performance with and without fall injuries

Architecture verification

The second set of models, which predicted patients’ overall outcome, outperformed the ED-only models in most respects. For children, it achieved superior AUC (0.91; 95% CI 0.91 to 0.91), sensitivity (0.80; 95% CI 0.80 to 0.80), specificity (0.92; 95% CI 0.92 to 0.92), sensitivity-specificity gap (0.28), NPV (0.998; 95% CI 0.998 to 0.998) and MCC (0.746; 95% CI 0.742 to 0.750), as shown in table 4.

Table 4: Model performance for varying outcome predictions

Similarly, for adults, the hospital and ED model achieved stronger AUC (0.89; 95% CI 0.89 to 0.89), sensitivity (0.79; 95% CI 0.79 to 0.79), specificity (0.84; 95% CI 0.84 to 0.84), sensitivity-specificity gap (0.37), PPV (0.12; 95% CI 0.12 to 0.12), NPV (0.993; 95% CI 0.993 to 0.993) and MCC (0.689; 95% CI 0.687 to 0.691).

The hospital and ED all-ages model achieved AUC (0.90; 95% CI 0.90 to 0.90), sensitivity-specificity gap (0.36), and MCC (0.711; 95% CI 0.709 to 0.713). These general performance characteristics were between the corresponding metrics for child and adult models, indicating it had generalised for both children and adults.

Discussion

Implementation of our ML architecture on the NTDB provided innovative predictive capabilities that generalise to all trauma age groups and most types of injuries. With our dataset of approximately two million unique visits, we created a single neural network architecture and trained unique models for children, adults and all ages. Our models for children and adults achieved similar performance to the comparison models across most metrics, reinforcing the notion that such performance is possible across a more diverse set of patients than previously tested. These results suggest that our models could generalise well across all ages. However, fall injuries have the potential to confound the model, suggesting that the outcome of fall injuries might require more information than the included predictors provide.

It is important to note that one study not referenced in table 2, the Trauma Quality Improvement Programme (TQIP),23 24 has built a logistic regression model for child patient mortality that achieved an AUC of 0.996—almost perfect predictive power—but featured much narrower inclusion criteria than this study. Whereas the TQIP report limited their observations to victims of blunt, penetrating, or abuse-related injuries with at least one Abbreviated Injury Score (AIS) of two or greater, we imposed none of these criteria.25

Our study has advantages over the prior publications on ML trauma outcome prediction that we found, featuring over two million unique patient encounters from across the USA. While previous studies showed the capabilities of ML as a prognostic tool, none captured the diverse healthcare settings across the USA, demonstrated invariance across injury mechanisms, or focused solely on patient mortality, and only one confirmed that additional data would not improve its model further.20–22 ML is a data-driven technique, requiring a multitude of unique data points to maximise the model’s predictive power. With our large, diverse set of trauma data, we are confident that our model is optimised for its current architecture, and the narrow CIs indicate it might generalise to patients with trauma across the USA.

While testing model invariance across injury mechanisms, we discovered that excluding fall injuries noticeably affected the model’s predictive capabilities. It is well known that adult fall injuries, especially in the elderly population, can result in hip fractures, leading to complications and death. Current triage guidelines acknowledge the complex nature of ground-level falls on the elderly,26 and at least one study has demonstrated that AIS and GCS are unreliable measures for assessing these patients’ mortality risk levels.27 Although removing these injuries improved the performance of the adult model, the child model achieved slightly worse performance, indicating the model could discern the seriousness of a child’s fall-related injury well. Further investigation will be necessary to find the predictors and ML architecture to overcome this confounding factor.

Our model architecture verification process indicates that the architecture of figure 1 can make predictions highly correlated with a patient’s true outcome. The challenge in achieving reliable results for ED only cases lies in the scarcity of ED mortality data points, not the modelling approach. Widening the inclusion criteria may allow for more training examples to be retained, but it will come at the cost of data richness.

The main limitation of this study is the need for a complete set of patient vitals. Our dataset had approximately 5.8 million unique patient visits, but only two million met our inclusion criteria. While this is sufficient for training, it signifies that there are many clinical scenarios our model cannot handle. However, our study is broader than similar works, as medical research often limits its inclusion criteria to a small subset of patient characteristics. This specialisation improves model performance but makes it irrelevant to many patients in actual clinical settings. Our methods only filter out patients missing important information, such as vitals or demographics; we do not filter by age, injury mechanism or any other categorical value, and we demonstrate the viability of this approach in predicting patient outcomes. The NTDB provides a variety of pertinent facility-related information, but our study excluded it to ensure a fair comparison to contemporary works, as they did not have access to facility variables. Some facilities, like level 1 trauma centres, will be better equipped than others to handle certain types of patients, and that reality is not captured in this study.28–30 Instead, we based our ML on patient demographics and injury characteristics so prehospital emergency medical services could use the prediction to guide field triage of patients with trauma. Finally, the deidentified nature of the data used means our model can only analyse the outcomes of individual visits rather than the patients themselves. A longitudinal study would likely benefit the model, as it could learn the patterns which contribute to patient deterioration over the long term rather than during a single visit.

Future work

Further research into the patient characteristics, model architecture or preprocessing pipeline that would allow the model to differentiate between fatal and survivable fall injuries is a necessary next step. This will address the performance loss observed when patients who have suffered a fall injury are included in the test set. Additionally, data related to healthcare facilities should be integrated into the predictive model, as this will help discern whether the patient should receive the care necessary to prevent mortality. Finally, the predictor variables selected for this study should be pruned to only include those which aid the model’s performance. This will result in fewer excluded patients and, therefore, more examples of patient mortality for the model to learn from.

Conclusion

A predictive model for mortality of patients with trauma was developed from approximately two million unique visits to US EDs, and it achieved similar performance characteristics to contemporary models. However, predictors used in this study did not allow the model to fully differentiate between fatal and survivable fall injuries, as the model saw a significant performance boost when fall injuries were removed from the dataset. Future work will need to determine the predictors or processing methods needed to overcome this confounding factor. Ultimately, this study demonstrates that ML models can make predictions highly correlated with a trauma patient’s true outcome. As a result, healthcare workers in the ED may use them as a risk assessment aid when determining the urgency of a patient’s condition. This approach has the potential to reduce the burden on healthcare personnel, prevent overutilisation of resources due to overtriage and improve the quality of care available to those who truly need it to reduce mortality risk.