Original Research

Promising algorithms to perilous applications: a systematic review of risk stratification tools for predicting healthcare utilisation

Abstract

Objectives Risk stratification tools that predict healthcare utilisation are extensively integrated into primary care systems worldwide, forming a key component of anticipatory care pathways in which high-risk individuals are targeted by preventative interventions. Existing work broadly focuses on comparing model performance in retrospective cohorts, with little attention paid to efficacy in reducing morbidity when deployed in different global contexts. We review the evidence supporting the use of such tools in real-world settings, from retrospective dataset performance to pathway evaluation.

Methods A systematic search was undertaken to identify studies reporting the development, validation and deployment of models that predict healthcare utilisation in unselected primary care cohorts, comparable to their current real-world application.

Results Among 3897 articles screened, 51 studies were identified evaluating 28 risk prediction models. Half of the models underwent external validation, yet only two were validated internationally. No association between validation context and model discrimination was observed. The majority of real-world evaluation studies reported no change, or indeed significant increases, in healthcare utilisation within targeted groups, with only one-third of reports demonstrating some benefit.

Discussion While model discrimination appears satisfactorily robust to application context, there is little evidence to suggest that accurate identification of high-risk individuals can be reliably translated into improvements in service delivery or morbidity.

Conclusions The evidence does not support further integration of care pathways with costly population-level interventions based on risk prediction in unselected primary care cohorts. There is an urgent need to independently appraise the safety, efficacy and cost-effectiveness of risk prediction systems that are already widely deployed within primary care.

What is already known on this topic

  • Risk prediction models that stratify primary care populations according to their likelihood of accessing healthcare resources are generally considered to perform well within contexts similar to those in which they were derived. It is unclear how they perform when deployed in wider global contexts, and indeed whether their application can be harnessed to reduce resource demands.

What this study adds

  • We find that most models have not been studied in a sufficient diversity of contexts to appraise the robustness of their predictions; however, those that have appear to retain their discriminatory ability. The real-world application of these models to reduce healthcare resource use in unselected cohorts has produced disappointing results, with as much evidence suggesting a harmful effect as a beneficial one in this context.

How this study might affect research, practice or policy

  • Our results call into question the common, and costly, practice of commissioning population health management strategies based on risk stratification of whole primary care populations without a concrete understanding of the associated risks.

Introduction

Risk stratification tools that predict healthcare resource use are widely used in primary care settings.1–6 These tools are integral to population health management (PHM) strategies around the world, enabled by the availability of routinely collected data from sources such as electronic health records.7 Risk stratification tools typically use predictive models, developed through statistical or machine learning (ML) techniques, to generate an individual risk score for some measure of resource use. These scores form a key component of anticipatory care pathways, where those at the highest risk may be targeted for specific interventions aimed at reducing future morbidity.8–11 The process by which these tools are ideally developed and deployed within healthcare systems is summarised in figure 1.

Figure 1

An infographic describing an idealised process for developing and deploying a risk prediction tool within a healthcare system. In black is the deployment cycle, linking risk prediction tools and their associated population health management measures to a lifecycle of evidence generation, impact evaluation and monitoring for negative consequences that are fed back into the model and intervention.
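To make the prediction and case-selection steps of this cycle concrete, the sketch below shows the typical mechanics: a model is fitted to routinely collected predictors, every patient receives an individual risk score, and the highest-risk decile is flagged for a PHM intervention. This is a minimal illustration using synthetic data and a generic logistic regression; it does not reproduce any of the models reviewed here, and all variable names and thresholds are hypothetical.

```python
# Illustrative sketch of risk stratification for anticipatory care.
# Synthetic data and a generic model; not any of the reviewed tools.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for routinely collected predictors
# (e.g. age, comorbidity count, prior admissions).
X = rng.normal(size=(10_000, 3))
# Synthetic outcome: 1 = admitted to hospital within 12 months.
linpred = X @ np.array([0.8, 0.5, 0.3]) - 2
y = (rng.random(10_000) < 1 / (1 + np.exp(-linpred))).astype(int)

model = LogisticRegression().fit(X, y)
risk_scores = model.predict_proba(X)[:, 1]  # individual risk of admission

# Anticipatory care pathways typically target the top of the risk
# distribution, e.g. the highest-risk 10% of the practice population.
threshold = np.quantile(risk_scores, 0.9)
high_risk = risk_scores >= threshold
print(f"{high_risk.sum()} patients flagged for a PHM intervention")
```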

A growing body of literature describes the development and validation of risk stratification tools in the primary care setting, reporting acceptable discriminatory power for the majority of models.1 2 12 13 However, existing work broadly focuses on assessing model performance within retrospective datasets, with little attention paid to efficacy in real-world settings, where the clinical impact of deploying these algorithms within a population is assessed. Commercial literature asserts the efficacy of interventions based on algorithmic case selection in improving key outcomes, such as hospital admission rates, but suffers from a lack of transparency in data and methodology.14 15

Predictive models that appear accurate in development are increasingly found to be ineffective or unsafe when deployed in clinical pathways. Predictive performance may be diminished when models are translated to demographically and culturally distinct populations, or when they are deployed using electronic health data with differing characteristics. Differences in how healthcare resources are used in local settings, alongside biases embedded within such technologies, may produce varying clinical effectiveness, ranging from inconsistent intervention thresholds and variation in the clinical interventions deployed to sociotechnical variation across end-users and processes.16–20 Consequently, where an algorithm is deployed into an untested context without real-world evidence for a comparable integrated pathway, there are risks both to patient safety and of exacerbating healthcare inequalities through unfairness in prediction or intervention allocation.

With extensive integration of risk stratification into pathways within primary care systems worldwide, it is of paramount importance to establish the current evidence base on which these care-defining interventions can be appraised. We therefore systematically review the available literature concerning risk stratification tools for predicting future healthcare utilisation in primary care populations. We present three aims: (1) to update existing evidence for algorithmic solutions, with attention paid to predictive performance and risk of bias in dataset evaluation, as well as real-world clinical outcomes; (2) to describe the transfer of algorithms from initial development to testing and deployment in different global contexts; and (3) to evaluate risks in cross-context transfer and application. Based on our findings, we provide recommendations for the responsible evaluation and deployment of predictive risk stratification tools.

Methods

Search strategy

A systematic search of the MEDLINE, Embase and Global Health databases was carried out on 18 July 2023 via the Ovid platform. PRISMA guidelines were followed throughout the conduct and reporting of this review.21 A combination of keywords and MeSH terms was used to curate relevant literature, details of which are available in online supplemental material.

Inclusion and exclusion criteria

We defined our inclusion criteria using the Population, Intervention, Control and Outcome method. The population of our analysis was selected to be comparable to the populations in which these models are currently in use. We therefore included only papers that applied algorithms to unselected primary care populations, where deployment was to the entire patient population for a given organisation without selection of particular groups. Prestratified populations, such as specific disease groups, or groups previously identified as high risk for healthcare utilisation, were excluded. Age-stratified populations were permitted as this is a pragmatic selection criterion adopted by the majority of predictive modelling work. Publications applying algorithms to historic research study datasets or specifically designed questionnaires (ie, not routinely collected or ‘real-world data’22) were also excluded.

Our intervention was defined as the application of a risk stratification model to an appropriate population in the process of derivation or validation, or to perform case selection as part of a PHM strategy. Models reliant on non-routinely collected data, such as questionnaire results, were excluded.

Outcomes included measures of predictive performance across five main categories: access to primary care services; emergency department attendance; healthcare costs; hospital admissions; and readmission. Studies examining risk of readmission were included provided that the study population was not limited to patients with a recent admission. A group formed of those who had recently been admitted would, by definition, no longer be considered unselected and would thus violate our population criteria. Composite (eg, admissions and mortality as a single endpoint) and component (eg, respiratory admissions instead of total admissions) outcomes were excluded. We also considered clinical impact assessments related to a real-world evaluation.

Study selection and quality appraisal

Titles and abstracts were screened by two reviewers (CO/JZ) according to the criteria set out above, with all conflicts decided by a third (JM). Eligible publications were read in full and assessed for exclusions not apparent in the title or abstract, and for methodological quality.23 Risk of bias was assessed using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).24

Data extraction

We extracted information regarding model characteristics, study design and context, predictive performance, and measures of clinical impact from any associated intervention where evaluation took place in a real-world setting. Due to significant heterogeneity in study design and reporting, a meta-analysis was not conducted. C-statistics were used as the primary outcome for model performance. A subset of papers did not report discrimination but instead reported goodness of fit using the coefficient of determination (R2), which was extracted where available. Impact evaluations were described using the terminology and significance testing employed in the original paper, commonly expressed as the absolute difference (AD) between groups or as odds ratios (ORs).
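For reference, the sketch below shows how these extracted measures are computed with standard library calls; the inputs are hypothetical placeholders, not values from any included study.

```python
# Illustrative computation of the performance and impact measures
# extracted in this review. All data are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, r2_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                     # observed admissions
y_score = np.array([0.1, 0.3, 0.7, 0.2, 0.8, 0.6, 0.4, 0.9])    # predicted risks

# C-statistic: probability that a randomly chosen admitted patient is
# assigned a higher predicted risk than a randomly chosen non-admitted one.
c_statistic = roc_auc_score(y_true, y_score)

# R2 (coefficient of determination), reported by studies modelling a
# continuous outcome such as healthcare cost.
observed_costs = np.array([120.0, 80.0, 400.0, 95.0, 510.0, 330.0, 150.0, 620.0])
predicted_costs = np.array([150.0, 90.0, 380.0, 110.0, 480.0, 300.0, 170.0, 590.0])
r_squared = r2_score(observed_costs, predicted_costs)

# Impact evaluations commonly reported ORs or ADs between groups.
# Toy 2x2 counts: admitted / not admitted in each arm of a trial.
adm_int, no_adm_int = 30, 70   # intervention arm
adm_ctl, no_adm_ctl = 20, 80   # control arm
odds_ratio = (adm_int / no_adm_int) / (adm_ctl / no_adm_ctl)
absolute_difference = adm_int / 100 - adm_ctl / 100

print(f"C-statistic {c_statistic:.2f}, R2 {r_squared:.2f}, "
      f"OR {odds_ratio:.2f}, AD {absolute_difference:.2f}")
```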

Model appraisal

Models that appeared in multiple studies were qualitatively appraised by comparing their derivation methodology to subsequent external validation or clinical evaluation studies. For each model we report: the context of its original development; contexts in which the model’s predictive performance has been tested; and contexts in which the model’s real-world impacts have been assessed. Results were synthesised separately as the outcome of either internal or external validation. Internal validation was defined as any measure of predictive performance within the same population in which the model was derived, and external as any validation using data from a separate population.
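The sketch below illustrates the internal/external distinction used in our synthesis, under the assumption of a simple held-out split for internal validation; the populations, model and case-mix shift are all synthetic.

```python
# Schematic sketch: internal validation measures discrimination within the
# derivation population (here via a held-out split); external validation
# applies the frozen model to data from a separate population.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def synthetic_population(n, shift=0.0):
    """Generate a toy population; `shift` mimics case-mix differences."""
    X = rng.normal(loc=shift, size=(n, 3))
    p = 1 / (1 + np.exp(-(X @ np.array([0.8, 0.5, 0.3]) - 2)))
    return X, (rng.random(n) < p).astype(int)

X_dev, y_dev = synthetic_population(5_000)             # derivation population
X_ext, y_ext = synthetic_population(5_000, shift=0.5)  # separate population

X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

internal_c = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
external_c = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"internal C-statistic {internal_c:.2f}, external {external_c:.2f}")
```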

Results

Systematic review

Our review identified 3897 publications eligible for screening after duplicates were removed (figure 2). Of these, 3636 were excluded on the basis of their title or abstract alone leaving 261 that were sought for retrieval. Full texts could not be retrieved for 10 publications, thus 251 were reviewed in full. A total of 51 publications met our criteria and were included in our final analysis (online supplemental table 1).25–75 Further detail about the identified models, along with our risk of bias analysis, can be found in online supplemental materials.

Figure 2

A PRISMA flow diagram showing the process of study selection for our analysis. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

The majority of studies were based in the USA (23), with the remainder set in the UK (10), Spain (9), Canada (2), Italy (2), New Zealand (2), Australia (1), Ireland (1) and Israel (1). Population sizes ranged from 96 to 5.4 million with a median value of 94 264 (IQR 12 800–434 027). Hospital admission was the most commonly predicted outcome (34), followed by healthcare costs (14), emergency department attendance (9), access to primary care services (8), mortality (5) and readmission (2).

19 studies reported the derivation and internal validation of a risk stratification model, with 32 describing validation of a model in a separate population dataset. 10 studies reported the results of implementing PHM measures based on case selection by a risk stratification model in a real-world clinical pathway. These included five randomised controlled trials (RCTs), three prospective cohort studies and two retrospective cohort studies. The PHM strategies used were case management (8), telemonitoring (4) and care coordination (3).

We identified 28 risk stratification tools across all studies. 42 studies examined a single model, whereas 9 studied the comparative efficacy of several models. The Johns Hopkins ACG System was the most studied algorithm (20), followed by the Charlson Comorbidity Index (10), Hierarchical Condition Categories (8), the Chronic Illness and Disability Payment System (3), RxRisk (3), the Elder Risk Assessment Index (2), the Patients At Risk of Rehospitalisation algorithm (2) and QAdmissions (2). Of the remainder, four were proprietary ML algorithms.

Results of internal and external validation studies

The derivation characteristics of each of the 28 discovered models are compared with the results of subsequent validation studies in online supplemental table 2.25–84 The results of internal validation studies echoed previous reviews, with C-statistics for various outcomes ranging from 0.67 to 0.90. Notably, three of the highest C-statistics within internal validation samples were displayed by models derived using ML techniques: 0.84,67 0.8542 and 0.90.55

Half (14) of the discovered models underwent external validation. Of these, only the Charlson Comorbidity Index and the Johns Hopkins ACG System were validated internationally. Model performance in external validation studies generally resembled internal validation performance for each model, with C-statistics ranging from 0.53 to 0.88. Accounting for heterogeneity in study design and reporting, there was no evident association between validation context and model discrimination, with models broadly displaying consistent predictive performance when transported to external datasets.

Results of real-world evaluation studies

Two studies reported the implementation of risk stratification tools into care pathways within the same population used for development. The Nairn Case Finder73 and the Predictive RIsk Stratification Model (PRISM)25 algorithms were used to identify those that might benefit from case management, both in the hope of reducing hospital admissions. In a prospective stepped-wedge clinical trial conducted across more than 230 000 patients in 32 primary care practices, the practice resource allocation intervention linked to PRISM resulted in significantly increased hospital admissions (OR 1.44 (95% CI 1.39 to 1.50), p<0.001), as well as increased emergency presentations, time in hospital, and primary care workload. The intervention guided by the Nairn Case Finder significantly reduced hospital admissions (AD=42.5%, p=0.002) in a population of 96 high-risk patients from a single locality, when matched 1:1 on risk score to patients in a separate control population.

Eight of the discovered models were deployed as tools for case selection as part of a PHM strategy in a separate context from development. The Johns Hopkins ACG System was deployed in two separate studies, whereas each of the other models was deployed only once. Healthcare utilisation measures were not significantly influenced by interventions guided by the Hierarchical Condition Category71 and PacifiCare’s Medicare Risk Programme37 models. Similarly equivocal evidence for the efficacy of interventions linked with the Johns Hopkins ACG System was observed, with one study showing no benefit31 and the other demonstrating benefit in groups selected by the model (OR 0.91 (95% CI 0.86 to 0.96)) but reciprocal harm in non-prioritised groups (OR 1.19 (95% CI 1.09 to 1.30)).32 Interventions linked with the Elder Risk Assessment Index30 and QAdmissions48 algorithms led to significant increases in mortality (AD 10.8%, p=0.008) and hospital admissions (difference in difference 79.8 (95% CI 21.2 to 138.4), p=0.01), respectively.

Significant reductions in hospital admissions were achieved through interventions guided by the Combined Predictive Model (AD=−0.9, p<0.001),39 the Patients At Risk of Rehospitalisation algorithm (AD=−0.3, p<0.001)39 and the SCAN Health Plan Model (AD=11.5%, p=0.02).51 Figure 3 summarises the main findings of this review, describing only the models that underwent external validation or real-world evaluation.

Figure 3

An infographic summarising the validation characteristics of the identified models that underwent external validation or real-world testing. Models that underwent more extensive validation processes are represented by larger boxes. Each box contains aggregated data for all of the external validation and real-world evaluation studies for each model. Validation countries are represented by flags with the number of studies based in each country overlying. R2 and C-statistics are displayed as ranges for all of the outcome measures tested for each model for illustrative purposes only. A&E, accident and emergency department; PPV, positive predictive value; RCT, randomised controlled trial; RR, risk ratio.

Discussion

Main results

Our review identifies 28 risk stratification tools designed to predict healthcare utilisation in an unselected primary care population. The discriminatory ability of half of the discovered models was validated in an external cohort. However, only two, the Charlson Comorbidity Index and Johns Hopkins ACG System, were validated in a different country from their derivation dataset. No evident association between validation context and model discrimination was observed. Models derived using ML techniques displayed the best predictive performance; however, none of these models underwent external validation.

The results of real-world evaluation studies present equivocal evidence for the efficacy of these population-level interventions. The majority of publications reported no change, or indeed significant increases, in healthcare utilisation within groups targeted by the intervention, with only one-third of reports demonstrating some benefit.

Comparison with the literature

We corroborate the results of previous reviews by observing that the discriminatory power of a variety of risk stratification tools is robust to external validation.1 2 12 13 We add that the context of model validation appears to have minimal impact on predictive performance and highlight a scarcity of literature appraising the impact of deploying these models to guide PHM strategies despite extensive integration of risk stratification into pathways within primary care systems worldwide.3–6

Our finding that deployment of these models is not consistently associated with reductions in healthcare utilisation is perhaps unsurprising. PHM strategies applied to unselected primary care cohorts, with case selection achieved through a variety of different means, have frequently been shown to increase costs without an associated reduction in morbidity.9 85–87 A single 2014 meta-analysis, aggregating a heterogeneous group of strategies as a single intervention, demonstrated marginal reductions in resource use within a relevant cohort.88 However, these findings were subject to substantial heterogeneity (I²=58%–85%) and, while ostensibly the target population of this analysis was patients generally at high risk of healthcare resource use, the majority of included studies reported interventions targeted at specific disease cohorts. There is broad consensus that PHM strategies designed specifically for those with certain chronic conditions significantly reduce morbidity.89–94 Taken with our findings, the available evidence indicates that the success of PHM strategies in specific disease groups may not be generalisable to unselected cohorts, and this remains the case when predictive modelling is employed to augment case selection.

The findings of our analysis of peer-reviewed literature stand in stark contrast to the impact statements of commercial suppliers of care systems that employ risk stratification. One such statement compared resource use statistics of product users against standardised national trends in an unadjusted analysis, finding significant reductions in every parameter.15 However, as is often the case for statements within product literature, a lack of transparency relating to the methods of data collection and analysis makes verifying these claims impossible.

Interpretation

We propose that the discouraging results of studies deploying risk stratification tools to guide PHM strategies primarily result from a mismatch between theoretical model development and the complexities of real-world pathways. Risk stratifying patients by their likelihood of resource use alone almost invariably creates a diverse intervention cohort in which individual clinical need is likely to be heterogeneous. This is likely why population-level interventions have failed to replicate the results of successful programmes targeting specific chronic conditions. Presently, there is a paucity of evidence to guide best practice once high-risk users are identified, and no recommendations can be made about the efficacy of any single intervention over another. The results of real-world evaluation studies therefore present a cautionary tale of designing clinical pathways on the principle of simply flagging high-risk patients without a concrete understanding of how this translates into practice.

We did not observe an effect of validation context on algorithmic performance. This is most likely due to the low number of comparable values obtained for each model, heterogeneity in study design, and a predictably small absolute effect size. Diminished performance when algorithms are deployed in new environments is a highly replicable finding, and our results should not be interpreted as contradicting this established premise. However, this finding does imply that poor predictive performance is unlikely to be the primary reason for the failure of these algorithms to produce consistent results.

Limitations

It is important to put these findings within the context of our methodological constraints. Primarily, our analysis was limited by the heterogeneity of the included studies. Model performance was variably reported in terms of C-statistics and R2 values, which cannot be directly compared. Real-world evaluation studies suffered from a lack of uniformity of intervention, as many reported the results of a bespoke system designed by the study authors. This prevented direct comparison of the efficacy of particular intervention categories within our study cohort, as their results could not be appropriately aggregated. While our analysis identified several models with sufficient diversity of validation to demonstrate robust performance in a variety of contexts, this sample was small, and no strong conclusions can be drawn about the scale of algorithmic drift when such models are transported to new datasets. Finally, the majority of included publications were observational or cohort studies, with only a small number of RCTs identified.

Implications

The integration of risk stratification into pathways that define care decisions for millions of individuals around the world is already well established. Our findings suggest an absence of clinical impact, and indeed a signal of harm in a third of cases, raising several important considerations. First, this presents clear implications for patient safety, particularly in the absence of regular independent appraisal of the personal and system-wide effects. In addition to aggregate population health impacts, this includes the impact on individuals of incorrect stratification, and of negative biases through poorly calibrated algorithms. Second, the effects on provider workload of instituting and enacting these often time-consuming PHM interventions must be considered in the calculation of risk versus benefit. Finally, the absence of established benefits calls into question the cost-effectiveness of these programmes, particularly when used in healthcare systems where resources are constrained.

We therefore propose the following recommendations:

  1. Deployment of individual-level risk prediction, with impact on clinical care pathways, must be subject to the same controls as other medical technologies. This would require matching their use to a responsible lifecycle of evidence generation, impact evaluation and monitoring for negative consequences. Such a lifecycle should include pre hoc evaluation, in the form of local testing, and controlled trials for integrated pathways, as well as post hoc analyses of economic impact and healthcare outcomes in targeted and non-targeted groups. The first step in this process may be agreement on an auditable validation framework, such as BS 30440 developed by the British Standards Institution, to permit a more systematic approach to evaluation of such products.

  2. National bodies involved in the procurement of commercial risk stratification services must review the cost-effectiveness and systemic implications of adjusting the likelihood of individuals within the population they serve accessing care based on personal predicted risk.

  3. Regulatory bodies, including the Medicines and Healthcare products Regulatory Agency and the US Food and Drug Administration, must either confirm that risk stratification algorithms fall within their purview and are thus subject to the same regulation as other technologies defined as a ‘Software as a Medical Device’, or clarify why these algorithms do not fall into this category.

Conclusion

While model performance appears to generalise in most evaluations, there is little evidence to suggest that the identification of high-risk individuals can be translated to improvements in service delivery or morbidity. The available evidence does not support further integration of these types of risk prediction into population healthcare pathways. There is an urgent need to independently appraise the safety, efficacy and cost-effectiveness of risk prediction systems that are already widely deployed within primary care.