Original research

Long short-term memory model identifies ARDS and in-hospital mortality in both non-COVID-19 and COVID-19 cohort

Abstract

Objective To identify the risk of acute respiratory distress syndrome (ARDS) and in-hospital mortality using a long short-term memory (LSTM) framework in a mechanically ventilated (MV) non-COVID-19 cohort and a COVID-19 cohort.

Methods We included MV ICU patients between 2017 and 2018 and reviewed patient records for ARDS and death. Using active learning, we enriched this cohort with MV patients from 2016 to 2019 (MV non-COVID-19, n=3905). We collected a second validation cohort of hospitalised patients with COVID-19 in 2020 (COVID+, n=5672). We trained an LSTM model using 132 structured features on the MV non-COVID-19 training cohort and validated on the MV non-COVID-19 validation and COVID-19 cohorts.

Results At a model score cut-off of 0.9, the LSTM model had a sensitivity of 86% and a specificity of 57% on the MV non-COVID-19 validation cohort. The model identified risk a median of 10 hours before ARDS and 9.4 days before death. On the COVID-19 cohort, sensitivity was lower (70%) and specificity was higher (84%) than on the MV non-COVID-19 validation cohort. For the COVID+ cohort and MV COVID+ patients, the model identified the risk of in-hospital mortality 2.4 days and 1.54 days before death, respectively.

Discussion Our LSTM algorithm identified the risk of ARDS or death accurately and in a timely manner in MV non-COVID-19 and COVID+ patients. By alerting clinicians to the risk of ARDS or death, the model can improve the implementation of evidence-based ARDS management and facilitate goals-of-care discussions in high-risk patients.

Conclusion The LSTM algorithm identifies the risk of ARDS or death in hospitalised patients.

What is already known on this topic

  • Acute respiratory distress syndrome (ARDS) is commonly under-recognised in clinical settings, which can lead to delays in evidence-based management.

What this study adds

  • A long short-term memory algorithm trained on mechanically ventilated patients can identify the risk of ARDS development or in-hospital mortality using structured electronic health record data, without chest X-ray analysis. SARS-CoV-2 infection has a noted high incidence of ARDS. The model, trained on mechanically ventilated non-COVID-19 patients, performed well on COVID-19 patients: 1.82 patients needed to be evaluated to identify 1 patient at risk of ARDS or death in the hospital.

How this study might affect research, practice or policy

  • Identifying the risk of ARDS early, regardless of COVID-19 status, can improve compliance with evidence-based management and allow prognostication.

Introduction

Acute respiratory distress syndrome (ARDS) affects nearly a quarter of all acute respiratory failure patients requiring mechanical ventilation and contributes to high morbidity and mortality in critically ill patients.1 ARDS is consistently under-recognised, leading to delays in implementing evidence-based best practices, such as the use of lung-protective ventilation strategies.2 3 The onset of the COVID-19 pandemic overwhelmed the healthcare system in the USA, and patients with severe to critical SARS-CoV-2 infections had a high incidence of ARDS and high mortality. This was especially true early in the pandemic, before early steroids and other immunosuppressants were adopted for treatment.4 5 An electronic health record (EHR)-based decision support system that accurately identifies patients with ARDS can improve the management and escalation of these critically ill patients.6 Different machine learning techniques, such as L2-regularised logistic regression, artificial neural networks and XGBoost gradient boosted tree models, have leveraged EHR data to identify or predict ARDS, with robust statistical discrimination reported.7–9 In one study, Zeiberg et al applied L2-regularised logistic regression to structured EHR data from the first 7 days of hospitalisation in a single-centre population, with the gold standard diagnosis of ARDS established by two-physician chart review. Despite the rarity of ARDS (2.5%) in the testing cohort, the area under the receiver operating curve (AUROC) was 0.81.7 Other investigations used the Medical Information Mart for Intensive Care (MIMIC) databases.10 11 These relied on diverse data sources, such as free-text entries, diagnostic codes and radiographic reports, for both the diagnosis and prediction of ARDS.10 11

We aimed to train a deep learning model, using a long short-term memory (LSTM) framework and an active learning method on a historic dataset from a mechanically ventilated (MV) non-COVID-19 cohort, to identify patients at risk of ARDS or in-hospital mortality. We validated the model on an MV non-COVID-19 cohort, a COVID+ cohort and an MV COVID+ subgroup.

Materials and methods

The study was conducted at Montefiore Medical Center, encompassing three hospital sites.

Cohort assembly

MV non-COVID-19 cohorts

Non-COVID-19 cohort 1 was constructed between 1 January 2017 and 31 August 2018 (figure 1). We included MV adults (older than 18 years) in the ICU. Each patient’s chart was reviewed for ARDS.

Figure 1 Cohort assembly and model training. ARDS, acute respiratory distress syndrome; LSTM, long short-term memory; MV, mechanically ventilated.

Ground truth labelling: ARDS gold-standard identification

We defined ARDS using the Berlin criteria: hypoxaemia (arterial oxygen tension (PaO2) to fractional inspired oxygen (FiO2) ratio (PFR) ≤300 with positive pressure ventilation ≥5 cmH2O), bilateral infiltrates on chest radiographs by independent review and the presence of ARDS risk factors (sepsis, shock, pancreatitis, aspiration, pneumonia, drug overdose and trauma/burn) not solely due to heart failure.12 We used the first date and time of PFR ≤300 with confirmed bilateral infiltrates within 24 hours as the time of ARDS presentation (ToP of ARDS).

Active learning

We used the ‘active learning’ technique to add adult MV patients from July 2016 to December 2016 and from September 2018 to December 2019 (AL-cohort).13 A preliminary recurrent neural network was developed using the LSTM model and trained with the original non-COVID-19 cohort 1. Next, we applied the preliminary model to the AL-cohort. We used pool-based sampling and uncertainty techniques to identify records from the AL-cohort to be reviewed and labelled by clinicians.13 The uncertainty technique selects patients whose scores are very close to the cut-off, that is, the patients about whom the model is least confident. We chose a cut-off of 0.80 and selected all records with a score between 0.75 and 0.85. We created MV non-COVID-19 cohort 2 from the AL-cohort records with the highest 1% of scores, the lowest 1% of scores and medium scores. This allowed us to enrich MV non-COVID-19 cohort 2 with patients who had ARDS or who died in the hospital.
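The uncertainty-based selection step can be illustrated with a minimal sketch. The record identifiers and scores are hypothetical, and treating the 0.75 and 0.85 bounds as inclusive is an assumption, since the text does not state how boundary scores were handled.

```python
def select_uncertain(scores, low=0.75, high=0.85):
    """Return IDs of records whose preliminary-model score lies in [low, high].

    Inclusive bounds are an assumption made for this sketch.
    """
    return sorted(rid for rid, s in scores.items() if low <= s <= high)


# Hypothetical preliminary-model scores keyed by record ID.
pool_scores = {"r1": 0.12, "r2": 0.78, "r3": 0.80, "r4": 0.86, "r5": 0.97, "r6": 0.75}

# Records sent to clinicians for review and labelling.
to_review = select_uncertain(pool_scores)
```

Records scoring far from the cut-off (such as `r1` and `r5` here) are left out of this band because the preliminary model is already confident about them.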

COVID-19 validation cohort

The COVID-19 cohort included all hospitalised adult patients, with and without mechanical ventilation, who had a positive SARS-CoV-2 transcription-mediated amplification assay from 1 March 2020 to 17 April 2020.

Training and validation cohort splitting

MV non-COVID-19 cohorts 1 and 2 were combined as the MV non-COVID-19 cohort. We randomly selected 80% of patients (MV non-COVID-19 training cohort) for training and validation, to learn model parameters and find optimal hyperparameters. The trained model was then validated on the remaining 20% of the non-COVID-19 cohort (MV non-COVID-19 validation cohort), the COVID-19 cohort and the MV COVID-19 cohort separately (figure 1).
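A random 80/20 split of this kind might be sketched as follows. Splitting at the patient level, the fixed seed and the function name are assumptions for illustration; the integer IDs simply stand in for the 3905 patients of the combined cohort.

```python
import random


def split_patients(patient_ids, train_fraction=0.8, seed=0):
    """Shuffle unique patient IDs and split them into training and held-out sets."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical integer IDs standing in for the combined MV non-COVID-19 cohort.
train_ids, val_ids = split_patients(range(3905))
```

Splitting by patient rather than by hourly snapshot keeps all timestamps from one patient on the same side of the split.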

EHR data collection and processing

Clinical data were collected through automated abstraction of EHR data. Raw EHR data for each admission were abstracted, sampled and validated (online supplemental table 2).

Sampling

Raw longitudinal EHR data were sampled every hour. Because different variables were recorded at different timestamps and with different frequencies, sampling was necessary to aggregate the longitudinal data into hourly snapshots. If a variable was recorded multiple times within 1 hour, we computed the minimum and maximum of all recorded measurements. If it was not recorded at all within the 1-hour time frame, we considered it ‘missing’. For a variable recorded exactly once during an hour, the minimum and maximum were the same.
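The hourly aggregation rule can be sketched for a single variable as below. Aligning the hourly windows to the admission time, rather than to clock hours, is an assumption, and the readings are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_min_max(measurements, admit_time, n_hours):
    """Aggregate (timestamp, value) pairs into hourly (min, max) snapshots.

    Hours with no recorded value are returned as None, meaning 'missing'.
    """
    buckets = defaultdict(list)
    for ts, value in measurements:
        hour = int((ts - admit_time).total_seconds() // 3600)
        if 0 <= hour < n_hours:
            buckets[hour].append(value)
    return [
        (min(buckets[h]), max(buckets[h])) if h in buckets else None
        for h in range(n_hours)
    ]


# Hypothetical readings of one variable during the first 3 hours of admission.
admit = datetime(2018, 1, 1, 8, 0)
readings = [
    (datetime(2018, 1, 1, 8, 10), 92),
    (datetime(2018, 1, 1, 8, 40), 110),
    (datetime(2018, 1, 1, 10, 5), 101),
]
snapshots = hourly_min_max(readings, admit, n_hours=3)
```

In this example the first hour holds two readings, the second hour is missing and the third hour holds a single reading, so its minimum and maximum coincide.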

Data validation

Data validation was performed by range checking (online supplemental table 2). If the recorded measure was outside the valid range, we discarded it and treated it as a missing value.
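Range checking of this kind is simple to express in code. The variable names and limits below are hypothetical placeholders, not the ranges listed in the supplemental table.

```python
# Hypothetical valid ranges; the study's actual limits are in its supplemental table.
VALID_RANGES = {"heart_rate": (20, 300), "spo2": (50, 100)}


def validate(name, value, valid_ranges=VALID_RANGES):
    """Return the value if it lies inside the valid range, otherwise None (missing)."""
    low, high = valid_ranges[name]
    return value if low <= value <= high else None


checked = [validate("heart_rate", v) for v in (72, 999, 20)]
```

An out-of-range reading is discarded rather than clipped, so it later flows through the same missing-data handling as an hour with no recording.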

Missing data

Missing data were handled by ‘forward imputing’, in which the most recent value fills the missing value. If no earlier data were available for imputation, we used normal values, taking the lower bound of the normal range as the minimum and the upper bound as the maximum for those timestamps. Each timestamp is represented by a feature vector of size 132.
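A minimal sketch of forward imputation with a normal-range fallback is shown below. It operates on the hourly (minimum, maximum) snapshots of one variable; the normal range of 60 to 100 is a placeholder, not a value from the study.

```python
def forward_impute(hourly, normal_low, normal_high):
    """Fill missing hourly (min, max) snapshots with the most recent observed one.

    Hours before the first observation fall back to the normal range.
    """
    filled, last = [], None
    for snapshot in hourly:
        if snapshot is not None:
            last = snapshot
        filled.append(last if last is not None else (normal_low, normal_high))
    return filled


# None marks an hour with no valid measurement.
hourly = [None, (92, 110), None, (101, 101), None]
imputed = forward_impute(hourly, normal_low=60, normal_high=100)
```

Only values from the past are carried forward, so a snapshot never borrows information from a later hour.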

Model training

An LSTM network is a type of recurrent neural network that can capture the temporal information of sequential data.14 We used the EHR data, including the previous 12 hours, as the network inputs to train a model that generates a predictive score for every patient at every hour. The network consisted of an LSTM unit with 10 filters, followed by a dropout layer with a 50% keep probability.15 The network ended with a linear layer and a sigmoid activation function to output a score from 0 to 1, which is interpreted as the probability of developing ARDS or in-hospital mortality.
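To make the data flow concrete, the sketch below runs a forward pass of a network of this shape in plain Python. It is illustrative only: the weights are random and untrained, reading '10 filters' as 10 hidden units is an assumption, and the study's actual framework and parameter layout are not stated in the text.

```python
import math
import random

N_FEATURES, N_HOURS, N_UNITS = 132, 12, 10  # 10 hidden units is an assumed reading


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(seed=0):
    """Random, untrained weights for the four LSTM gates and the linear head."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # Gates: input (i), forget (f), candidate (g), output (o).
    gates = {
        name: {"W": mat(N_UNITS, N_FEATURES), "U": mat(N_UNITS, N_UNITS), "b": [0.0] * N_UNITS}
        for name in "ifgo"
    }
    head = {"w": [rng.uniform(-0.1, 0.1) for _ in range(N_UNITS)], "b": 0.0}
    return gates, head


def affine(p, x, h):
    """Compute W x + U h + b for one gate."""
    return [
        sum(w * xi for w, xi in zip(w_row, x)) + sum(u * hi for u, hi in zip(u_row, h)) + b
        for w_row, u_row, b in zip(p["W"], p["U"], p["b"])
    ]


def risk_score(window, gates, head):
    """Score one 12-hour window of hourly feature vectors (oldest first)."""
    h = [0.0] * N_UNITS  # hidden state
    c = [0.0] * N_UNITS  # cell state
    for x in window:
        i = [sigmoid(v) for v in affine(gates["i"], x, h)]
        f = [sigmoid(v) for v in affine(gates["f"], x, h)]
        g = [math.tanh(v) for v in affine(gates["g"], x, h)]
        o = [sigmoid(v) for v in affine(gates["o"], x, h)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    # Dropout only acts during training; at inference it is the identity.
    logit = sum(w * hk for w, hk in zip(head["w"], h)) + head["b"]
    return sigmoid(logit)


# Hypothetical window of already-normalised features.
rng = random.Random(1)
window = [[rng.random() for _ in range(N_FEATURES)] for _ in range(N_HOURS)]
gates, head = init_params()
score = risk_score(window, gates, head)
```

Sliding this 12-hour window forward one hour at a time would yield one score per patient per hour, which is the form of output the evaluation relies on.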

Model evaluation

We applied the model hourly to the MV non-COVID-19 validation cohort and the COVID-19 cohort to produce a score at each timestamp, which indicates the probability of ARDS development or death. For each cohort, we calculated the AUROC. We also calculated the sensitivity, specificity, positive predictive value (PPV), negative predictive value and F1 score at different risk thresholds (cut-offs). We selected the score cut-off with the highest F1 score and generated a confusion matrix at that cut-off. The warning time is the first time the score exceeds the predefined cut-off. For each patient, we continued scoring until the score exceeded the cut-off or the patient was discharged. We evaluated model timeliness in four groups (ARDS and death, ARDS and no death, no ARDS and death, and no ARDS and no death) by comparing the actual ToP of ARDS or time of death with the warning time.
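The threshold-based evaluation can be sketched as follows. The scores and labels are invented, and summarising each patient by a single score (for example, the peak hourly score) is an assumption of this sketch.

```python
def confusion_metrics(scores, labels, cutoff):
    """Confusion-matrix metrics when a patient is flagged if score > cutoff."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > cutoff and y == 1)
    fp = sum(1 for s, y in pairs if s > cutoff and y == 0)
    fn = sum(1 for s, y in pairs if s <= cutoff and y == 1)
    tn = sum(1 for s, y in pairs if s <= cutoff and y == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def best_f1_cutoff(scores, labels, cutoffs):
    """Return the candidate cut-off with the highest F1 score."""
    return max(cutoffs, key=lambda c: confusion_metrics(scores, labels, c)["f1"])


def first_warning_hour(hourly_scores, cutoff):
    """Index of the first hour whose score exceeds the cut-off, or None."""
    for hour, s in enumerate(hourly_scores):
        if s > cutoff:
            return hour
    return None


# Hypothetical per-patient scores and outcome labels (1 = ARDS or death).
scores = [0.95, 0.92, 0.85, 0.60, 0.91, 0.30, 0.88, 0.10]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
chosen = best_f1_cutoff(scores, labels, cutoffs=[0.5, 0.8, 0.9])
metrics = confusion_metrics(scores, labels, chosen)
warning = first_warning_hour([0.2, 0.5, 0.85, 0.95], chosen)
```

Subtracting the event time (ToP of ARDS or death) from the warning time then gives the lead time used to judge timeliness.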

Feature importance

Feature importance identifies the subset of features that are most relevant to the accuracy of the model. We used local interpretable model-agnostic explanations (LIME)16 to determine the importance of each variable to the accuracy of the model. Feature importance values were determined using LIME for 200 randomly sampled patients in each cohort and then averaged across all samples.
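The averaging step, as distinct from LIME itself, can be sketched as below. The per-patient weights are invented, and counting a feature absent from an explanation as zero is an assumption.

```python
def mean_abs_importance(explanations):
    """Average absolute local weights across patients and rank the features.

    explanations: list of dicts mapping feature name -> signed local weight.
    Features missing from an explanation count as zero (an assumption).
    """
    totals = {}
    for explanation in explanations:
        for feature, weight in explanation.items():
            totals[feature] = totals.get(feature, 0.0) + abs(weight)
    n = len(explanations)
    return sorted(
        ((feature, total / n) for feature, total in totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical local explanations for two sampled patients.
explanations = [
    {"lactate": 0.4, "age": -0.1},
    {"lactate": -0.2, "age": 0.3},
]
ranking = mean_abs_importance(explanations)
```

Taking absolute values ranks features by how strongly they influence the score but discards the sign, so the ranking alone does not show whether a feature raises or lowers risk.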

Results

Cohort description

MV non-COVID-19 cohort 1 included 3278 patients (online supplemental table 1 and figure 1). MV non-COVID-19 cohort 2 was derived from active learning and consisted of 627 patients (online supplemental table 1). We combined MV non-COVID-19 cohorts 1 and 2 to create the MV non-COVID-19 cohort (n=3905, table 1). The COVID-19 cohort included 5672 patients (table 1). Online supplemental table 3 shows the descriptive statistics of all variable fields in the MV non-COVID-19 and COVID-19 cohorts.

Table 1 Cohort characteristics

Model diagnostics

MV non-COVID-19 validation cohort

Based on the highest F1 score, we chose a model score cut-off of 0.90. The model diagnostics are presented in table 2 and figure 2. The model warned of patient risk at a median of 10 hours (IQR −75 to 4) before ARDS and −225 hours or 9 days (IQR −461 to 101 hours) before death in the hospital (table 3). In ARDS survivors, the majority of the patients had ARDS risk identified before intubation and before ARDS diagnosis (table 3). For ARDS non-survivors, the model warned at 1 hour (IQR −38 to 9) before intubation, −20 hours (IQR −115 to 0.3) before ARDS and at −314 hours (IQR −589 to −128 hours) before death (table 3).

Table 2 Model diagnostics

Table 3 Timeliness of model

Figure 2 Model diagnostics, AUROC, PPV with sensitivity and NNE with sensitivity. AUROC, area under the receiver operating curve; MV, mechanically ventilated; NNE, number needed to evaluate; PPV, positive predictive value.

COVID-19 cohort and MV COVID-19 subcohort

Using the same cut-off of 0.9, we applied the model to the COVID-19 cohort and the MV COVID-19 subcohort. The model diagnostics are presented in table 2 and figure 2. When the model was applied to the COVID-19 cohort, the PPV was lower and more patients needed to be screened compared with the MV non-COVID-19 validation cohort. In the MV COVID-19 subcohort, in which patients had a high prevalence of ARDS and in-hospital mortality, the PPV and number needed to evaluate were much lower than in the MV non-COVID-19 validation cohort.
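The link between PPV and the number needed to evaluate (NNE) can be made explicit with a small helper. It assumes the conventional definition of NNE as the reciprocal of PPV, which the text does not spell out, and the PPV values used are illustrative rather than taken from table 2.

```python
def number_needed_to_evaluate(ppv):
    """NNE under the conventional definition NNE = 1 / PPV (assumed here)."""
    if not 0.0 < ppv <= 1.0:
        raise ValueError("PPV must be in the interval (0, 1]")
    return 1.0 / ppv


# Illustrative PPVs only, not values from the study.
nne_values = [number_needed_to_evaluate(p) for p in (0.5, 0.25, 1.0)]
```

Under this definition a lower PPV always means more flagged patients must be evaluated to find one true case.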

In the COVID-19 cohort, the model warned the patient was likely to have ARDS or in-hospital mortality 3 hours after intubation and at ToP ARDS (table 3). Among the non-survivors, the model warned 2.4 days before in-hospital mortality (IQR 4.7–0.83) in COVID-19 patients, and 1.54 days before in-hospital mortality (IQR 3.6–0.46) in MV COVID-19 patients (table 3).

Feature importance

We randomly selected 200 encounters from each of the MV non-COVID-19 and COVID-19 cohorts and performed LIME (online supplemental figure 1). The top contributors were similar in the two cohorts. Lactate level was the most important variable to the model in discriminating the clinical outcome. The model consistently used lactate, age, cryoprecipitate transfusion, dopamine, bicarbonate level and epinephrine as important input variables (online supplemental figure 1).

Discussion

From a cohort of pre-COVID-19 pandemic patients on mechanical ventilation, we developed and validated an LSTM model to identify patients at risk for ARDS or in-hospital mortality. This model was successfully integrated into the EHR and identified patients at risk for ARDS or in-hospital mortality among all hospitalised adults, with and without COVID-19 infection and regardless of mechanical ventilation status. The model was also able to warn well before the events of ARDS or death in both the MV non-COVID-19 and COVID-19 cohorts. The timeliness of the model allows clinicians to modify management and implement evidence-based practices promptly.

This is the first utilisation of an LSTM network for identifying the risk of ARDS and in-hospital mortality. The LSTM is a recurrent neural network that uses feedback layers to capture temporal aspects such as sequences and trends. This approach is well suited to this study because past events and the progression of patient status are often valuable in determining the probability of ARDS or death. As in the real-world management of critically ill patients, physiological observations at each time point are taken into account, and their change, progression or regression informs how subsequent information is processed. This makes the approach well suited to dynamically changing situations in which patients progressing to ARDS or in-hospital mortality must be monitored and identified. LSTM models have been used to predict heart failure, transfusion needs in the ICU, and mortality in the neonatal ICU, all with better predictive utility than traditional logistic regression models.17–19 We chose to include both ARDS diagnosis and in-hospital mortality as our patient-centred outcomes of interest, instead of ARDS or in-hospital mortality alone as in previous ARDS prediction studies.6 7 20 Identifying the risk of ARDS or in-hospital mortality has real clinical implications when managing patients and mitigates the ambiguity that can exist in ARDS clinical diagnosis based on shifting diagnostic criteria.7 8 20–22

This cohort is one of the largest validated ARDS gold standards developed by manual chart review and active learning from a single centre. We did not rely on ICD-10 diagnosis codes or radiology reports to identify ARDS. Instead, we followed the Berlin criteria using PFR, independent review of chest X-rays for the presence of bilateral infiltrates and risk factors of ARDS in the patients’ charts. Our model performed similarly to previously reported models using other machine learning methods, with reported AUROCs ranging from 0.71 to 0.90.7 9–11 21 We did not use chest X-ray interpretation as an input variable, as in Zeiberg et al.7 Other large-scale ARDS identification studies used natural language processing of radiology reports and diagnostic codes; in clinical settings, this approach would delay ARDS recognition and rely heavily on clinician decisions.9 11 Using chest radiographs for the diagnosis of ARDS has its limitations, as studies show high interobserver variability despite training.12 23 In addition, radiology report turn-around times can range from 15 min to 26 hours, depending on the study location, availability of staff and hospital resources.24 25 This reliance on chest radiograph interpretations may delay ARDS diagnosis.

Despite the different clinical characteristics of the study cohorts (MV non-COVID-19 patients versus non-MV COVID-19 patients), the important features in risk identification were broadly consistent between the cohorts, with lactate, age, cryoprecipitate transfusion, dopamine, bicarbonate level and epinephrine as important input variables. LIME can directly associate model features with increased or decreased risk of ARDS or death in an individual, on a patient-by-patient level.26 27 We randomly sampled 200 patients in each cohort and averaged the absolute LIME values to understand which features were generally used. This does not provide a clinical explanation or rationale for why features may relate to higher or lower scores. Instead, it sheds light on the features that the model needs as input data to predict a score accurately, whether they add to or subtract from the risk. Norepinephrine was the most commonly used vasopressor in both cohorts; intriguingly, it did not contribute to the model consideration. The model rarely used vasopressors such as dopamine and epinephrine to discriminate the outcome of ARDS and/or in-hospital mortality. Oxygen support devices were also not deemed important on average; we postulate that this is because our gold standard labelling required mechanical ventilation for ARDS identification, making oxygen support devices less important in the discrimination.

In clinical practice, ARDS is underdiagnosed, which exposes patients to detrimental management, such as high tidal volume ventilation, and delays the implementation of helpful evidence-based practices.2 3 28–31 We used continuous data at 1-hour intervals starting at hospital admission to identify the early risk of an adverse outcome. Indeed, in the non-COVID-19 cohort, we identified ARDS hours before intubation and at ToP ARDS. The majority of patients (56.5%) were identified before ARDS diagnosis in the MV non-COVID-19 cohort, and a substantial proportion (43%) were identified before diagnosis in the COVID+ cohort. Implemented and delivered as a clinical decision support system, early recognition would allow clinicians to initiate treatment such as low tidal volume ventilation (LTVV) as early as possible, when it may more positively impact outcomes.3

Furthermore, the model identified the risk of in-hospital mortality 9 days in advance in the non-COVID-19 cohort and 2 days in advance in the COVID-19 cohort. This has significant implications for triaging patients during surge capacity. In the MV non-COVID-19 cohort, there was no concern for ventilator or ICU resource allocation. Early identification of risk for death would alert the clinician to implement aggressive management and allow the treating physician to consider early palliation intervention/conversation. In the setting of a high volume surge of respiratory illness, such as the onset of the COVID-19 pandemic, where the incidences of ARDS and death are high, identifying adverse outcomes days in advance could help the clinician in making necessary triage decisions for resource allocation.32–34

Our study has some limitations. First, our cohorts were constructed from a single centre in the Bronx, and the patients’ characteristics may not be generalisable to other centres and populations. However, our medical centre consists of three hospitals ranging from community and academic to tertiary transplant centres, thus spanning a wide spectrum of disease severity. In addition, we validated the algorithm in the COVID-19 cohort regardless of the respiratory support type, demonstrating consistent model performance across different cohorts. Second, although we were able to determine feature importance using LIME on 200 samples from each cohort, we were unable to discern the direction of association with the risk of ARDS or death. We cannot tell whether the individual variables increase or decrease the risk of ARDS or death, despite their importance to the overall model. However, the consistency between the validation cohorts in the features used to determine risk is reassuring. Ultimately, the variables that we included in the model are known to be clinically associated with ARDS or death; therefore, the direction of influence on risk assessment is less germane. The strength of our study lies in the predictive nature of this algorithm and the timeliness of its predictions. Using longitudinal data from admission allowed the LSTM model to learn from the progression of the patient’s clinical status over time. The model was also flexible enough to achieve similar diagnostic performance in patients with different clinical characteristics.

In conclusion, our LSTM model identified risk for ARDS and in-hospital mortality in patients with or without COVID-19, regardless of mechanical ventilator support. The model identified patients early, which implies that management changes can be implemented early.