Discussion
Adapted current scoring systems
The variables used within existing scoring systems featured in online supplemental table 1 were analysed to explore potential reasons for their differing predictive ability of mortality in patients with COVID-19.
The APACHE II score was found to have the highest AUROC values, followed by the mEI and SOFA systems. APACHE II may outperform other scores in mortality prediction because it considers both age and comorbidities, whereas scores such as CURB-65 assess only age and SOFA involves neither. Notably, however, the cut-off value for APACHE II is much lower when applied to patients with COVID-19 than under normal ICU conditions; while GCS is an important component of APACHE II, the nervous system is typically less impacted than the respiratory system in COVID-19 infection.7
Considering that the effects of COVID-19 on respiratory function are more marked than its cardiovascular impacts,12 it is unsurprising that most of the studies listed in online supplemental table 1 show respiratory parameters such as RR in CURB-65 to be independently more indicative of mortality than blood pressure and confusion, which are more related to haemodynamics. qSOFA’s focus on blood pressure and mental state may explain its lower AUROC and poorer predictive performance. Cetinkal et al,13 however, argue that as previous studies reveal worse clinical outcomes in patients with cardiac injury, non-respiratory variables in the CHA2DS2-VASc system such as older age, DM, HTN and previous cardiovascular disease are valuable parameters for mortality risk stratification. Nevertheless, AUROC values found for CHA2DS2-VASc remain at the low end compared with other existing scoring systems, despite modifications tailored to COVID-19 being added to form the m-CHA2DS2-VASc scale. Even this version, with an AUROC higher by 0.06, offers predictive ability similar to univariate NLR and inferior to troponin increase.
Ortiz et al12 demonstrated that A-DROP, a modified version of CURB-65, provides more accurate mortality prediction than the Pneumonia Severity Index (PSI), CURB-65, CRB-65, SMART-COP, qSOFA and National Early Warning Score 2 (NEWS2). Its superior discrimination may be due to its more accurate respiratory function evaluation (oxygen saturation [SpO2] ≤90% / arterial oxygen tension [PaO2] <60 mm Hg in A-DROP vs respiratory rate ≥30/min in CURB-65). The modified age cut-off (male >70 / female >75 in A-DROP vs age >65 in CURB-65) may also contribute to A-DROP’s advantage when applied to COVID-19, considering the median age of COVID-19 non-survivors is 69 years.14
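The contrast between the two sets of criteria can be made concrete with a few predicates. The following is a minimal sketch of only the respiratory and age items as quoted in the comparison above, not a full implementation of either score. The function names and the example patient are hypothetical, and the A-DROP respiratory item is assumed to flag low oxygenation (SpO2 at or below 90%, or PaO2 below 60 mm Hg).

```python
from typing import Optional


def curb65_respiratory(rr_per_min: float) -> bool:
    """CURB-65 respiratory item as quoted in the text: respiratory rate >= 30/min."""
    return rr_per_min >= 30


def adrop_respiratory(spo2_pct: Optional[float] = None,
                      pao2_mmhg: Optional[float] = None) -> bool:
    """A-DROP respiratory item (assumption: low oxygenation is the abnormal direction)."""
    low_spo2 = spo2_pct is not None and spo2_pct <= 90
    low_pao2 = pao2_mmhg is not None and pao2_mmhg < 60
    return low_spo2 or low_pao2


def curb65_age(age: int) -> bool:
    """CURB-65 age item as quoted in the text: age > 65."""
    return age > 65


def adrop_age(age: int, sex: str) -> bool:
    """A-DROP age item as quoted in the text: male > 70, female > 75."""
    threshold = 70 if sex == "male" else 75
    return age > threshold


# Hypothetical patient: 68-year-old man, RR 24/min, SpO2 88%.
patient = {"age": 68, "sex": "male", "rr": 24, "spo2": 88}

print("CURB-65 respiratory item:", curb65_respiratory(patient["rr"]))
print("A-DROP respiratory item: ", adrop_respiratory(spo2_pct=patient["spo2"]))
print("CURB-65 age item:        ", curb65_age(patient["age"]))
print("A-DROP age item:         ", adrop_age(patient["age"], patient["sex"]))
```

In this hypothetical case the patient is hypoxaemic without being tachypnoeic, so only the oxygenation-based item is triggered, while the age item is triggered only under the lower CURB-65 threshold.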
Ultimately, although APACHE II, SOFA, PSI and CURB-65 are well established in clinical practice, their requirement for sophisticated patient information makes rapid assessment impossible, and rapid assessment is an important benefit when triaging patients with COVID-19 in often overrun hospitals. Wang et al’s study7 on MEWS suggests that this system can overcome the issue of efficiency, as it is a simple and rapid assessment that can be performed within minutes of patient admission while maintaining equal predictive ability.
Intriguingly, Gupta et al15 evaluated 22 prognostic models (including the aforementioned systems) and concluded that they should not be recommended for routine clinical implementation, because none of them offered incremental value over univariable predictors for risk stratification of COVID-19 mortality; among these univariable predictors, patient age is a strong predictor of mortality. Similarly, Bradley et al16 concluded that CURB-65, NEWS2 and qSOFA all underestimate the mortality of patients with COVID-19.
COVID-19 scoring systems
To maximise the accuracy and effectiveness of mortality prediction models, novel scores should focus on identifying features that are COVID-19-specific. Examples of complications that are highly associated with COVID-19 include hypercoagulability and inflammation.17 18 However, only 27% of new prognostic scores included in this review incorporated CRP, an important inflammatory marker. Similarly, thrombocytopenia has been associated with higher rates of mortality,19 which reflects the importance of including platelet count in prognostic models, but only 16% of new scores took this into account.
Interestingly, the three prediction models with the highest AUROC values have all used D-dimer and lymphocyte count to predict mortality. This could reflect the importance of these two parameters in COVID-19 pathophysiology. However, these are all single-centre studies tested on significantly smaller sample sizes compared with other models with lower AUROC values. Models tested on a larger population, for instance, Mancilla-Galindo et al’s18 national cohort study with a sample size of 83 779 (AUROC=0.8000), could be more representative and generalisable.
The most common parameter used in novel prediction models for mortality of patients with COVID-19 is age, followed by lymphocyte count, D-dimer, oxygen saturation, CRP and platelet count. Other less common parameters include RR, lactate dehydrogenase, NLR, PCT and blood urea nitrogen.
Fumagalli et al19 report age as the strongest predictor of severe outcomes and mortality. Similarly, Mei et al’s20 21 prognostic model included age as one of five indicators of mortality and reports a strong association between advanced age and death from COVID-19.
There seems to be no association between the number of parameters and the prognostic power and accuracy of a scoring system. Several mortality prediction models with a small number of parameters have had higher AUROC values; for example, the model of Liu et al22 had an AUROC value of 0.9940 with only three variables, compared with an AUROC value of 0.7750 for Mancilla-Galindo et al18 (COVID-GRAM) with 10 parameters.
The most common comorbidities for predicting mortality are HTN, DM, obesity, cardiovascular disease, chronic kidney disease, smoking and malignancy.
Single parameters
COVID-19 has a different clinical picture to pneumonia and influenza, providing an avenue to explore what routinely available clinical information best predicts mortality. We explored blood parameters not currently extensively implemented into existing COVID-19 mortality prediction models, which are represented in online supplemental table 4.
We discuss the feasibility of introducing the below blood tests and imaging modalities into routine practice for risk stratification of patients with COVID-19.
Cardiac biomarkers
Cardiac biomarkers were the most common parameters studied in our literature search. High-sensitivity cardiac troponins (hs-cTnI and hs-cTnT) are markers of myocardial injury that are currently used primarily in the prognostication of acute coronary syndrome. They have been shown to be independently associated with all-cause mortality in patients with COVID-19 (p<0.05) after accounting for age, sex and comorbidities, as shown in online supplemental table 4. Despite evidence that 50% of patients with confirmed COVID-19 have elevated cardiac biomarkers at the time of hospital admission, current studies are limited to single centres and sample sizes of fewer than 500 patients.22 Cao et al23 retrospectively observed 244 patients and incorporated hs-cTnI into a model of empirical prognostic factors. A proposed cut-off (>20 ng/L serum hs-cTnI levels) yielded an AUROC increase from 0.65 to 0.71 (p<0.01) and demonstrated the feasibility of using this parameter to increase predictive performance.24
Inflammatory biomarkers
Liu et al25 confirmed the independent association of PCT with mortality in a cohort of 1525 patients through retrospective analysis. Owing to the large cohort and continued follow-up of PCT levels throughout hospital stay, this study provides stronger evidence for the inclusion of PCT in scoring systems; this inclusion has begun, but PCT is still among the less commonly used parameters. Fois et al26 used the same study design and identified the systemic inflammation index (SII) as an independent predictor of mortality. However, the study quality was poor, with only 119 patients and a large number of different inflammation indexes studied in different combinations. It is unclear whether implementation of SII offers any clinical utility, considering that deranged lymphocyte count is already widely established as a useful predictor.20
Renal and hepatic function biomarkers
Esposito et al27 identified estimated glomerular filtration rate (using a baseline of 60 mL/min/1.73 m²), and Fu et al24 identified cholestasis and hypoproteinaemia, as independent predictors of mortality. Interestingly, as with cardiac biomarkers, these were predictors even after accounting for pre-existing comorbidities. The obvious benefit to clinical practice of renal and hepatic function markers is that they are routinely measured on hospital admission and are straightforward for clinicians to score in a system. Replication in large-scale multicentre studies is needed before determining the diagnostic validity of such parameters in the stratification of patients with COVID-19 in a statistical or AI model. It must be acknowledged that additional parameters must be externally validated to determine their AUROC values and appropriate cut-offs.6
Lung imaging
Trabulus et al,28 Francone et al29 and Xu et al30 examined the relationship between chest CT findings and mortality, with all three studies reporting independent associations with mortality (p<0.05). Two studies31 32 used a methodology involving an overall severity score for each scan and proposed defined cut-offs that yielded the best predictive value. These cut-offs are of value for clinicians to allocate scans a high/medium/low rating, which can be used to triage patients with COVID-19. However, both these studies have limitations in their methodology and design, which need to be addressed before implementation of CT severity into scoring systems. In the study by Gao et al,31 follow-up was limited to 24 days; follow-up of at least 28 days is recommended to better reflect the clinical course of COVID-19 in most cases.7 In addition, both severity score studies were retrospective in nature and therefore susceptible to incomplete clinical records and to bias in the interpretation of CT by different radiologists. Chest CT, while highly sensitive, is not a first-line test because resources are too limited to scan all COVID-19-positive hospital admissions. Routine implementation of admission CT scans would also carry a radiation burden to patients, which is arguably unnecessary if alternative parameters conferring equal predictive power without additional risk of iatrogenic effects could be used. Perhaps chest CT is more appropriate in the discharge process of clinically stable, triaged patients with COVID-19 than as a first-line test within an admission scoring system.
AI in predicting mortality
Between ML and DL models, it is unclear which branch of AI modelling would be superior in predicting mortality due to the similar AUROC values. These similar values can be accounted for by limitations in the study methods.
Among the AI modelling papers, Meng et al33 and Vaid et al10 were the only studies that conducted external validation. External validation is an important step to verify the effectiveness of the model in an independent patient population. Internal validation uses the same cohort to test the model, which can lead to overfitting and an inaccurately high AUROC. The models created by Bertsimas et al,34 Gao et al31 and Meng et al35 gathered training set data from multiple centres, whereas the other models used single-centre data. Therefore, these models are likely to be more applicable to the general population.
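The risk of judging a model on the cohort it was built from can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reproduction of any reviewed model: the cohorts are simulated, the single "marker" and its distribution are hypothetical, and a nearest-neighbour rule is used only because it overfits by construction. AUROC is computed as the probability that a randomly chosen non-survivor receives a higher score than a randomly chosen survivor, with ties counted as one half.

```python
import random


def auroc(scores, labels):
    """Probability that a random non-survivor (label 1) outscores a random
    survivor (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate_cohort(n, rng):
    """Hypothetical cohort: one noisy marker, higher on average in non-survivors."""
    cohort = []
    for _ in range(n):
        died = 1 if rng.random() < 0.3 else 0
        marker = rng.gauss(1.0 if died else 0.0, 1.0)
        cohort.append((marker, died))
    return cohort


def nearest_neighbour_score(train, x):
    """Deliberately overfit 'model': outcome of the closest training patient."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def evaluate(train, test):
    """AUROC of the nearest-neighbour rule built on `train`, scored on `test`."""
    scores = [nearest_neighbour_score(train, x) for x, _ in test]
    labels = [y for _, y in test]
    return auroc(scores, labels)


rng = random.Random(0)
development = simulate_cohort(300, rng)  # cohort used to build the model
external = simulate_cohort(300, rng)     # independent cohort, same process

apparent = evaluate(development, development)  # tested on its own cohort
validated = evaluate(development, external)    # tested on unseen patients

print(f"apparent AUROC (same cohort): {apparent:.2f}")
print(f"external AUROC (new cohort):  {validated:.2f}")
```

Because each development patient is its own nearest neighbour, the apparent AUROC is perfect regardless of how informative the marker really is; only the second figure reflects performance in patients the model has not seen.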
As COVID-19 has only been prevalent for a year, not many models have had the chance to be prospectively tested. Vaid et al10 produced the only model that was prospectively tested. This is important as it demonstrates the model’s real-world performance. Many models with a large number of incorporated parameters included patients with missing values, which then had to be estimated. This may be useful in clinical practice as not all patients have every test carried out.
It is important to recognise that COVID-19 management and treatment guidelines are constantly being updated, which influences mortality rates. As AI models use dynamic data,10 reporting of model AUROC in earlier stages of the pandemic may not have been as accurate.
Limitations
There are inherent limitations to this review. Most studies included were single centre and retrospective, whereas multicentre, prospective research may provide more insight. Although AUROC scores are universally accepted outcome measures of the accuracy of prediction models,6 they are limited in their clinical interpretability as they lack a direct link to individual patient outcomes. Thus, future reviews could use further performance metrics alongside AUROC to assess the accuracy of different models.
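As an illustration of what complementary metrics might look like, the sketch below computes sensitivity and specificity at a chosen cut-off, together with the Brier score, which measures how close predicted risks are to observed outcomes. The choice of these particular metrics is an assumption of this sketch rather than a recommendation drawn from the reviewed studies, and the predicted risks, outcomes and the 0.5 cut-off are hypothetical.

```python
def confusion_counts(probs, outcomes, cutoff):
    """Count true/false positives and negatives when risk >= cutoff predicts death."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, outcomes):
        predicted_death = p >= cutoff
        if predicted_death and y == 1:
            tp += 1
        elif predicted_death and y == 0:
            fp += 1
        elif not predicted_death and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(probs, outcomes, cutoff=0.5):
    """Proportion of deaths correctly flagged, and of survivors correctly cleared."""
    tp, fp, tn, fn = confusion_counts(probs, outcomes, cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted risk and outcome (0 is perfect)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical predicted mortality risks and observed outcomes (1 = died).
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(probs, outcomes, cutoff=0.5)
brier = brier_score(probs, outcomes)

print(f"sensitivity: {sens:.2f}")
print(f"specificity: {spec:.2f}")
print(f"Brier score: {brier:.2f}")
```

Unlike AUROC, the first two figures map directly onto how many patients would be correctly or incorrectly triaged at a given threshold, and the third penalises risk estimates that are confidently wrong.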