Decision support systems in the healthcare domain should be reliable, interpretable and robust. We therefore complemented the aforementioned results with a thorough study of interpretability at both the patient and cohort levels, and with an assessment of robustness on disease-specific subcohorts.
Phenotype persistency
We determined that it was beneficial to propagate phenotypes forward in time. A clinical expert marked each phenotype as transient or persistent according to whether it typically persists throughout the ICU stay. Transient phenotypes (eg, fever, cough and dyspnoea) are propagated until a new clinical note appears, whereas persistent phenotypes (eg, diabetes and cancer) are propagated until the end of the stay. We performed an ablation study and observed that phenotype propagation was more beneficial for RF than for LSTM. The RF models with phenotype propagation achieved 4.6% higher AUC-ROC for in-hospital mortality, 2.5% higher AUC-ROC for decompensation and 3.4% higher kappa for LOS than RF without phenotype propagation. By contrast, LSTM with phenotype propagation achieved 1.4% higher AUC-ROC for in-hospital mortality, comparable results for decompensation and 1.1% lower kappa for LOS. We hypothesise that LSTM, by design, can better capture temporal relationships, given a large amount of data to learn from, and therefore gains less from explicit propagation. The results are presented in section A.7 (online supplemental appendix).
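The propagation rule described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name `propagate_phenotypes`, the hourly time grid and the toy notes are all assumptions introduced for the example.

```python
from typing import Dict, List, Set


def propagate_phenotypes(
    notes: Dict[int, Set[str]],
    persistent: Set[str],
    stay_hours: int,
) -> List[Set[str]]:
    """Return the phenotypes considered active at each hour of the stay.

    `notes` maps the hour at which a note became available to the
    phenotypes mentioned in that note (hypothetical input format).
    """
    active_persistent: Set[str] = set()
    active_transient: Set[str] = set()
    timeline: List[Set[str]] = []
    for hour in range(stay_hours):
        if hour in notes:
            mentioned = notes[hour]
            # Persistent phenotypes accumulate and last until the end of stay.
            active_persistent |= mentioned & persistent
            # Transient phenotypes last only until the next note replaces them.
            active_transient = mentioned - persistent
        timeline.append(active_persistent | active_transient)
    return timeline


# Toy example: two notes during a 6-hour stay.
notes = {1: {"fever", "diabetes"}, 4: {"cough"}}
timeline = propagate_phenotypes(
    notes, persistent={"diabetes", "cancer"}, stay_hours=6
)
for hour, phenotypes in enumerate(timeline):
    print(hour, sorted(phenotypes))
```

In this toy stay, fever is dropped once the second note arrives at hour 4, whereas diabetes remains active until the final hour.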
Phenotype importance
To elucidate the contribution of phenotypical features to prediction performance, the most important features were studied using the SHAP values.13 Because computing SHAP values is computationally expensive, this and all subsequent interpretability analyses based on SHAP values were conducted on the RF models rather than on the LSTM models. An illustration of our investigation is shown in figure 1, where we present the top predictive features for in-hospital mortality and physiological decompensation. The figure confirms that phenotypical features are beneficial for in-hospital mortality prediction, given that 13 of the 20 most important features are phenotypes. This is explained by the fact that mortality forecasts need to rely on information that provides accurate insight into the long-term future.
Figure 1 Top features for in-hospital mortality and physiological decompensation. Features are sorted in decreasing importance according to their mean absolute SHAP values. Each row presents a condensed summary of the feature’s impact on the prediction. Each data sample is represented as a single dot in each row, and its colour on a particular row represents the value of that sample for that feature, with blue corresponding to lower values or absence and red corresponding to higher ones or presence. The SHAP value (horizontal position of a dot) measures the contribution of that feature, for a given sample, towards the prediction (right corresponding to mortality or decompensation and left corresponding to survival or out of decompensation risk). For instance (in A, top row), because the vertical axis clearly splits patients by colour, manifesting HP:0012531 pain consistently leads to lower chances of dying. SHAP, Shapley additive explanation.
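The ranking used in figure 1 (features ordered by mean absolute SHAP value) can be sketched as follows. The sketch assumes that a matrix of per-sample SHAP values has already been computed by an explainer; the function name, the feature names and the numbers are toy placeholders, not results from the study.

```python
from typing import List, Sequence, Tuple


def rank_by_mean_abs_shap(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
) -> List[Tuple[str, float]]:
    """Sort features by the mean of |SHAP value| over all samples."""
    n_samples = len(shap_values)
    scores = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_values) / n_samples
        scores.append((name, mean_abs))
    # Largest mean absolute contribution first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy SHAP matrix: one row per sample, one column per feature (made-up values).
toy_shap = [
    [0.25, -0.5, 0.0],
    [-0.25, 0.5, 0.125],
    [0.25, -0.5, -0.125],
    [-0.25, -0.5, 0.0],
]
ranking = rank_by_mean_abs_shap(
    toy_shap, ["feature_a", "feature_b", "feature_c"]
)
for name, score in ranking:
    print(name, score)
```

Taking the absolute value before averaging means that a feature that pushes predictions strongly in either direction ranks high, even if its signed contributions cancel out across patients.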
Contrary to bedside measurements, which may not correlate well with future outcomes owing to their dynamic nature, phenotypes are highly informative, given that they capture, for instance, comorbidities, which are essential for predicting mortality.23 Furthermore, another study24 including 230 000 ICU patients found that combining comorbidities with acute physiological measurements yielded the best results, outperforming traditional mortality scores (APACHE-II25 and SAPS-II26).
Interestingly, the top-ranking feature for mortality prediction is whether the patient experiences pain. We also observed that the second top-ranking feature is constitutional symptoms (HP:0025142). Because this feature is the phenotype obtained after aggregating all of its children, it should be interpreted not as a textual mention of the broad term in the patient’s EHR but rather as a mention of any of its children (most notably generalised pain). Consequently, the second feature again highlights the importance of pain.
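The aggregation of child phenotypes into a broader parent term can be sketched as follows. This is a simplified illustration under stated assumptions: each term has a single parent and the hierarchy is acyclic, whereas a real ontology may give a term several parents. The term names and the `parent_of` mapping are hypothetical placeholders.

```python
from typing import Dict, Set


def aggregate_to_ancestors(
    mentioned: Set[str], parent_of: Dict[str, str]
) -> Set[str]:
    """Mark every ancestor of a mentioned term as present as well."""
    expanded = set(mentioned)
    for term in mentioned:
        current = term
        # Walk up the (assumed acyclic, single-parent) hierarchy.
        while current in parent_of:
            current = parent_of[current]
            expanded.add(current)
    return expanded


# Hypothetical toy hierarchy: two specific terms under one broad term.
parent_of = {
    "specific_term_a": "broad_term",
    "specific_term_b": "broad_term",
}
features = aggregate_to_ancestors({"specific_term_a"}, parent_of)
print(sorted(features))
```

Under this scheme the broad term is switched on whenever any of its children is mentioned, which is why its importance reflects the children rather than a literal mention of the broad term.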
Although not decisive, some initial evidence suggests that pain management improves outcomes in the ICU.27 Alternatively, the ability to report pain can be interpreted as a proxy for a high level of consciousness, which has been correlated with better outcomes in the ICU.28
The other top-ranking phenotypes, such as atrial arrhythmia, nausea and vomiting, cover most of the body systems (ie, heart, lungs, gastrointestinal tract, central nervous system, coagulation, infection and kidneys), which are typically assessed using clinically validated scores, for example, APACHE-II and SAPS-II.
Our study also showed that, although phenotypical features are not as important for decompensation as for in-hospital mortality (only 3 of the top 20 features for this task were phenotypes), they are nonetheless useful because they provide a more accurate estimation of the predicted risk. Given that this task is concerned with predicting mortality within the next 24 hours, bedside measurements become more informative because of their temporal correlation (illustrated in section A.8, online supplemental appendix). Nevertheless, bedside measurements can be ambiguous or provide an incomplete picture of the patient’s status without the data found in clinical notes. For example, for one patient, neoplasm of the respiratory system (HP:0100606) was found to be the top feature, and although this phenotype was persistent, it increased the risk of decompensation appropriately, providing an overall better estimation. This patient’s case is illustrated in section A.9 (online supplemental appendix).
Similarly, the top features for long lengths of stay (more than 1 week) are presented in section A.10 (online supplemental appendix), where 10 of the 20 top features are phenotypes.
Calibration
Calibration describes how closely the probabilities predicted by a machine learning model match the outcome frequencies observed in real patients. To measure model calibration, we used the Brier score29 (the lower, the better). Our investigation of the respective calibration curves (see figure 2 and section A.11, online supplemental appendix) shows that phenotypes from unstructured notes improve model calibration across set-ups, especially for physiological decompensation and in-hospital mortality, which means that the distribution predicted by the models is closer to the real distribution of patients.
Figure 2 Calibration curves with LSTM for (A) physiological decompensation and (B) in-hospital mortality. Each calibration curve is presented with its Brier score (the lower the better). Note that, overall, the inclusion of phenotypical features from unstructured data helps with calibration. LSTM in legend refers to using structured features only. Ours, NCR, CB: phenotypical features from our phenotyping model, neural concept recogniser and ClinicalBERT, respectively. LSTM, long short-term memory; SAPS, Simplified Acute Physiology Score.
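For reference, the Brier score of binary predictions is the mean squared difference between the predicted probability and the observed outcome, and a calibration curve compares the mean predicted probability with the observed event rate within probability bins. The following is a minimal sketch with equal-width bins and toy data; the function names and the binning choice are assumptions for illustration and do not reproduce the study's pipeline.

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_curve(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 5
) -> List[Tuple[float, float]]:
    """Return (mean predicted probability, observed rate) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, observed))
    return curve


# Toy labels and predicted probabilities (made-up values).
y_true = [0, 0, 1, 1]
y_prob = [0.0, 0.5, 0.5, 1.0]
print(brier_score(y_true, y_prob))
print(calibration_curve(y_true, y_prob, n_bins=2))
```

A perfectly calibrated model yields points on the diagonal, where the mean predicted probability in each bin equals the observed event rate.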
Prognosis analysis
Beyond producing clinically relevant explanations at the cohort level, with the help of SHAP values, we characterised a patient’s disease trajectory and retrospectively discovered when and why the patient was the most vulnerable. For example, the fragment of a patient’s LOS forecast in figure 3 shows that, 41 hours after admission, the estimated probability of an LOS longer than 14 days was 69%, mainly because the patient scored 1 on the Glasgow Coma Scale verbal response. One hour later, when a clinical note became available, worrisome phenotypes appeared (including oedema, hypotension and abnormality of the respiratory system). Consequently, the estimated probability increased to 88%.
Figure 3 Illustrative case for an ICU length of stay of more than 14 days. Time course of the normalised predicted probability for a stay of more than 14 days and feature heatmap for a representative segment of the ICU stay. Each row of the heatmap represents one of the top features. At each time step, a feature can contribute positively (red) or negatively (blue) to predicting a stay of more than 14 days. Black horizontal bars at the right of each row represent the importance of the features. Note that a new clinical note that becomes available at the 42nd hour (vertical dashed line) leads to an increase in confidence of a longer stay owing to new features. Given the appearance of oedema, hypotension and abnormality of the respiratory system, the probability of a long stay increases from 69% to 88%. ICU, intensive care unit.
Cohort study
To assess the robustness of the proposed approach, we evaluated its performance on cohorts of patients with different diseases, especially under-represented diseases. The test set was split into four disease-specific cohorts for patients with cardiovascular disease, diabetes, cancer and depression. The performance of the best LSTM models (using structured and phenotypical features) was reported individually for each cohort in each ICU task.
For in-hospital mortality and physiological decompensation, we observed comparable performance across the four cohorts, with an AUC-ROC range of 0.780–0.826 for in-hospital mortality and 0.792–0.820 for physiological decompensation. In contrast, for LOS, we observed lower kappa values of 0.321 and 0.330 for the small cohorts with cancer and depression, respectively, as opposed to 0.413 and 0.424 for the larger cohorts with cardiovascular diseases and diabetes. We hypothesised that the nature of diseases has strong implications for in-hospital mortality and physiological decompensation, whereas LOS can be influenced by more factors that require larger data samples to model their interactions. The results are presented in section A.12 (online supplemental appendix).
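A per-cohort evaluation of this kind can be sketched as follows for a binary task. The example computes AUC-ROC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half. The record format, function names and toy values are assumptions for illustration; in practice a patient may belong to several cohorts and would then contribute one record to each.

```python
from typing import Dict, List, Sequence, Tuple


def auc_roc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Pairwise AUC-ROC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def auc_by_cohort(
    records: Sequence[Tuple[str, int, float]]
) -> Dict[str, float]:
    """Group (cohort, label, score) records by cohort and score each group."""
    grouped: Dict[str, Tuple[List[int], List[float]]] = {}
    for cohort, label, score in records:
        labels, scores = grouped.setdefault(cohort, ([], []))
        labels.append(label)
        scores.append(score)
    return {c: auc_roc(labels, scores) for c, (labels, scores) in grouped.items()}


# Toy records (cohort, outcome, predicted risk); values are made up.
records = [
    ("diabetes", 1, 0.9), ("diabetes", 0, 0.2),
    ("diabetes", 1, 0.4), ("diabetes", 0, 0.6),
    ("cancer", 1, 0.8), ("cancer", 0, 0.3),
]
per_cohort = auc_by_cohort(records)
print(per_cohort)
```

With small cohorts such as the toy cancer group, a single pair determines the whole estimate, which illustrates why metrics on small subcohorts are less stable.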